uclnlp / jack

Jack the Reader
MIT License

Produce SNLI results on full dataset #128

Closed isabelleaugenstein closed 7 years ago

tdmeeste commented 7 years ago

With the changes I merged from the preprocess branch, we now get 82% dev set accuracy for the conditional model (no attention). Still need to do proper param sweep and further testing, but we are getting there and I'll first update the automated tests.

Note that dropout wasn't being disabled (keep_prob=1) at evaluation time in the hooks, for which I added a quick fix in training_pipeline.py. Any suggestions for a more solid fix?
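A common pattern for a more solid fix is to make dropout itself a no-op whenever the model is not training, instead of relying on the hooks to feed keep_prob=1. A minimal sketch (function and argument names are illustrative, not the actual training_pipeline.py code):

```python
import numpy as np

def dropout(x, keep_prob, is_training, rng=None):
    """Inverted dropout that is guaranteed to be a no-op at evaluation time.

    Hypothetical helper, not JTR's actual code.
    """
    if not is_training or keep_prob >= 1.0:
        return x  # evaluation: behave exactly as if keep_prob == 1
    rng = rng if rng is not None else np.random.default_rng()
    mask = (rng.random(x.shape) < keep_prob).astype(x.dtype)
    # Scale surviving units by 1/keep_prob so the expected activation is unchanged.
    return x * mask / keep_prob

x = np.ones((4, 3))
eval_out = dropout(x, 0.5, is_training=False)
```

Pushing the train/eval switch into the op means the hooks no longer need to know about keep_prob at all.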

tdmeeste commented 7 years ago

The above was only dev set accuracy, and without the scaling down from 300 to 100 dimensions at the RNN input (unlike Tim's implementation).

My last results (still on an old version of JTR) gave 80.9% accuracy on the dev set and 80.6% on the test set, including that down-scaling. Other params were: p_keep=0.5, L2=3.e-6, batch_size=1024, epochs=100, lower_case=False, RNN_input_dim=100, learning_rate=0.001.
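For reference, the 300-to-100 down-scaling mentioned above is just a learned linear projection applied to the word embeddings before they enter the RNN. A minimal sketch with the hyperparams from this comment (variable names and shapes are illustrative, not JTR's actual API):

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, rnn_input_dim = 300, 100  # RNN_input_dim=100 as above

# Learned projection matrix (here just randomly initialized for illustration).
W = rng.normal(0.0, 0.1, (embedding_dim, rnn_input_dim))

# A batch of embedded sequences: batch_size=1024, sequence length 20.
batch = rng.normal(size=(1024, 20, embedding_dim))

# Project every 300-d embedding down to 100-d before the RNN consumes it.
projected = batch @ W
```

The projection cuts the RNN's input weight matrices by a factor of three, which is where most of the speed/parameter savings come from.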

As mentioned: no hyperparam tuning yet, but I think we can already mention this in the demo paper.
I'll adapt the snli script for jack, reproduce that, and then do some hyperparam tuning.

rockt commented 7 years ago

That would be great, thanks!

tdmeeste commented 7 years ago

For now, let's report these hyperparams, with the 0.81 accuracy, all right @isabelleaugenstein? I guess we can update that for the camera-ready version, if need be? Then we'll also have a script in JTR to reach the accuracies reported in the paper.

isabelleaugenstein commented 7 years ago

Ok, let's do that then. Thanks!

tdmeeste commented 7 years ago

Currently working on the SNLI baseline in the new jack implementation:

tdmeeste commented 7 years ago

In the branch snli_new, I made some changes to increase efficiency and accuracy for SNLI experiments. For now, only testing the bidirectional conditional model.
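As a reminder, the conditional model encodes the hypothesis with an RNN whose initial state is the premise encoder's final state. A toy sketch with a vanilla tanh RNN cell (illustrative only; the actual model uses (bi-)LSTMs and learned parameters):

```python
import numpy as np

def rnn_encode(xs, h0, Wx, Wh, b):
    # Run a simple tanh RNN over a sequence; return the final hidden state.
    h = h0
    for x in xs:
        h = np.tanh(x @ Wx + h @ Wh + b)
    return h

rng = np.random.default_rng(0)
d_in, d_h = 100, 64
Wx = rng.normal(0.0, 0.1, (d_in, d_h))
Wh = rng.normal(0.0, 0.1, (d_h, d_h))
b = np.zeros(d_h)

premise = rng.normal(size=(12, d_in))     # 12 embedded premise tokens
hypothesis = rng.normal(size=(8, d_in))   # 8 embedded hypothesis tokens

h_premise = rnn_encode(premise, np.zeros(d_h), Wx, Wh, b)
# Conditioning: the hypothesis encoder starts from the premise's final state.
h_cond = rnn_encode(hypothesis, h_premise, Wx, Wh, b)
```

In the bidirectional variant, the same conditioning is applied in both reading directions and the resulting states are combined before classification.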

Result of a (limited) hyperparam sweep: the settings with the best dev set accuracy (0.833) lead to a test set accuracy of 0.831 (compared to 0.809 in @rockt's paper for the uni-directional conditional baseline).

There are quite a few changes that may require discussion before integration into master, but I want to move on, so I'll stick to snli_new for now. Some changes involve a much simpler Vocab, written for efficiency: it only loads symbols encountered in the data (with or without pretrained embeddings); running time is 6 hours for 2 simulations in parallel on a single GTX 1080. It is also adapted for JTR: no more double encoding of the data (previously done in both setup_from_data and dataset_generator), and it allows pruning the vocabulary (min_freq / max_size).
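The pruning behaviour described above (keep only symbols seen in the data, optionally filtered by min_freq / max_size) can be sketched like this; class and argument names are illustrative, not the actual snli_new Vocab:

```python
from collections import Counter

class Vocab:
    """Toy vocabulary with frequency/size pruning (illustrative, not JTR's Vocab)."""
    UNK = "<UNK>"

    def __init__(self, tokens, min_freq=1, max_size=None):
        counts = Counter(tokens)
        # Keep symbols seen at least min_freq times, most frequent first.
        kept = [t for t, c in counts.most_common() if c >= min_freq]
        if max_size is not None:
            kept = kept[:max_size]
        self.sym2id = {self.UNK: 0}
        for t in kept:
            self.sym2id[t] = len(self.sym2id)

    def __call__(self, token):
        # Pruned or unseen symbols all map to the single UNK id.
        return self.sym2id.get(token, 0)

    def __len__(self):
        return len(self.sym2id)

vocab = Vocab("a a a b b c".split(), min_freq=2)
```

Building the id map once from the data also removes the need to encode the corpus twice, since every later lookup is a plain dict access.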

Next steps will be at the model level (other suggestions?).

rockt commented 7 years ago

Wow, that's really cool! Compared to my ICLR model I think the main differences are

Regarding next step suggestions, I'd prioritize as follows...

pminervini commented 7 years ago

About the "Exotic models" part: I'm currently working on integrating https://github.com/erickrf/multiffn-nli/ - on paper it should reach >88% accuracy, but let's see.