Produce SNLI results on full dataset

tdmeeste commented 7 years ago

With the changes I merged from the preprocess branch, we now get 82% dev set accuracy for the conditional model (no attention). Still need to do proper param sweep and further testing, but we are getting there and I'll first update the automated tests.

Note that dropout wasn't working with keep_prob=1 at evaluation time in the hooks, for which I added a quick fix in training_pipeline.py. Any suggestions for a more solid fix?

tdmeeste commented 7 years ago

The above was only dev set accuracy, and without the scaling down of 300 to 100 dimensions at the RNN input (unlike Tim's implementation).

My last results (still on an old version of JTR), gave got 80.9% accuracy on dev set, and 80.6% on test set, including that down-scaling. Other params were: p_keep=0.5, L2=3.e-6, batch_size=1024, epochs=100, lower_case=False, RNN_input_dim=100, learning_rate=0.001.

As mentioned: no hyperparam tuning yet, but I think we can already mention this in the demo paper.
I'll adapt the snli script for jack, reproduce that, and then do some hyperparam tuning.

rockt commented 7 years ago

That would be great, thanks!

tdmeeste commented 7 years ago

For now, let's report these hyperparams, with the 0.81 accuracy, all right @isabelleaugenstein? I guess we can update that for the camera-ready version, if need be? Then we'll also have a script in JTR to reach the accuracies reported in the paper.

isabelleaugenstein commented 7 years ago

Ok, let's do that then. Thanks!

tdmeeste commented 7 years ago

Currently working on the SNLI baseline in the new jack implementation:

[x] issues with various recent modifications with Vocab (rewrite Vocab, adapt NeuralVocab)
[x] solve inconsistent or ignored arguments in train_reader.py
[x] update simple_mcqa.SingleSupportFixedClassInputs module with missing arguments
[x] update pipeline
[x] dropout issue
[x] run baseline experiments to reach same numbers as in old quebap implementation
[x] hyperparam sweep for literature results

tdmeeste commented 7 years ago

In the branch snli_new, I made some changes to increase efficiency and accuracy for SNLI experiments. For now, only testing the bidirectional conditional model.

Result of a (limited) hyperparam sweep: settings with best dev set accuracy (0.833) leads to test set accuracy of 0.831 (compared to 0.809 in @rockt's paper for the uni-directional conditional baseline).

Fixed settings:
- 100 epochs;
- 300D word2vec embeddings are transformed to 100D representations for the RNN;
- L2 regularization weight 1.e-5;
- batch size 1024.
Tuned settings:
- out-of-vocab embedding initialization (normal / uniform): uniform (minor impact)
- dropout (0.3 / 0.4 / 0.5): 0.3
- initial Adam learning rate (0.01 / 0.005): 0.005
- normalize embeddings (True / False): True
Some observations:
- normalizing pre-trained embeddings + initialization of out-of-vocab embeddings with expected unit norm: seems to help consistently (+ 1 percentage point)
- using pre-trained embeddings for test tokens not seen during training doesn't make a noticeable difference (at least when the vocabulary built on training data is not pruned).

There are quite some changes that may require discussion before integration in master, but I want to move on, so I'll stick to snli_new for now. Some changes involve a much simpler Vocab, written for efficiency: only loads symbols encountered in the data (with pretrained embeddings or not); running time is 6 hours for 2 simulations in parallel on a single GTX 1080. Adapted for JTR: no more double encouding of the data (previously done in setup_from_data and dataset_generator), and allows pruning the vocabulary (min_freq / max_size).

Next steps will be on the model level (other suggestions?)

test the attentional baseline
more exotic models
new snli corpus
jackathon: merge with master

rockt commented 7 years ago

Wow, that's really cool! Compared to my ICLR model I think the main differences are

Bidirectional!
Much larger batch size
Larger dropout
Unit norm initialization (cool idea!)

Regarding next step suggestions, I'd prioritize as follows...

New SNLI corpus
Attention
Exotic models

pminervini commented 7 years ago

About the "Exotic models" part - I'm currently working on integrating https://github.com/erickrf/multiffn-nli/ - on paper it should reach >88%, but let's see

uclnlp / jack

Produce SNLI results on full dataset #128