tuzhucheng / MP-CNN-Variants

Variants of Multi-Perspective Convolutional Neural Networks

Training with SemEval STS data set #1

Closed · hinhmd closed this issue 7 years ago

hinhmd commented 7 years ago

Hi, I want to use your model on the SemEval STS data set. The training set has 22,592 sentence pairs. I ran out of memory on the GPU (2× NVIDIA Tesla K80, 24 GB GDDR5): "RuntimeError: cuda runtime error (2): out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:66". In the model: n_feat = n_feat_h + n_feat_v + EXT_FEATS  # n_feat = 44427

Do I need to reduce n_feat or get more GPU memory to be able to train on STS? Do I need to change anything else? Thanks.

tuzhucheng commented 7 years ago

Hi @hinhmd, thanks for creating this issue!

I am actually working on integrating more datasets into this model too. Currently I'm integrating TrecQA and WikiQA. Would you like me to add STS as well? The goal is to have it done in one week at most, ideally in ~3 days.

Currently, to reduce memory usage with the master branch version of the code, the easiest option is probably to specify your own custom word vectors file with fewer dimensions using --word-vectors-file, and to adjust the places in the code that assume 300-dimensional word vectors. I actually have another version of the code that does not batch-load all the data into GPU memory at once and can therefore handle very large datasets, but it is on a local branch. I will aim to have that ready in a day and will update you.
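Another way to shrink the word vectors file is to keep only the words that actually appear in your dataset. A minimal sketch of that idea is below; the file paths and whitespace tokenization are assumptions for illustration, not the repo's actual code:

```python
# Sketch: shrink a GloVe-style vectors file to your dataset's vocabulary.
# File paths and whitespace tokenization are assumptions for illustration.

def build_vocab(paths):
    vocab = set()
    for path in paths:
        with open(path, encoding='utf-8') as f:
            for line in f:
                vocab.update(line.lower().split())
    return vocab

vocab = build_vocab(['sts/a.toks', 'sts/b.toks'])

with open('glove.840B.300d.txt', encoding='utf-8') as src, \
        open('glove.sts-subset.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        # The first token of each line is the word itself.
        if line.split(' ', 1)[0] in vocab:
            dst.write(line)
```

The resulting smaller file could then be passed via --word-vectors-file.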

hinhmd commented 7 years ago

Thanks for your quick response. I simply tried replacing the SICK dataset with the STS dataset. Here is the result after 26 epochs:

INFO - pearson_r   spearman_r       KL-divergence loss
INFO - test   0.98916   0.988053986236   0.07485739517211915

In the meantime, I'm looking forward to your improved version.

tuzhucheng commented 7 years ago

Cool, which year's STS dataset are you using this on?

What's the state-of-the-art on STS?

tuzhucheng commented 7 years ago

If you need this in a hurry, you can check out this class, which is on a separate branch: https://github.com/tuzhucheng/MP-CNN-Variants/blob/trec-qa-dataset/dataset.py#L264.

Notice that it has copy_to_gpu_once = False, which should save you some memory. You can check out that branch and also set copy_to_gpu_once = False for your new dataset.
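The idea is to keep the full dataset in host RAM and copy only the current batch to the GPU. A minimal self-contained sketch of that pattern, with a placeholder model and random data standing in for the repo's actual classes:

```python
import torch
import torch.nn as nn

# Sketch of the copy_to_gpu_once = False pattern: the full dataset stays
# in host RAM and only the current batch is copied to the GPU.
model = nn.Linear(300, 1).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())

cpu_data = torch.randn(22592, 300)     # entire dataset lives in host RAM
cpu_targets = torch.randn(22592, 1)

for start in range(0, cpu_data.size(0), 64):
    batch = cpu_data[start:start + 64].cuda()      # only one batch on GPU
    target = cpu_targets[start:start + 64].cuda()
    optimizer.zero_grad()
    loss = criterion(model(batch), target)
    loss.backward()
    optimizer.step()
```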

In a few days I'll update the master branch with improved code that uses torchtext for all existing datasets and new ones - STS is on my list too.
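For reference, loading a sentence-pair TSV with torchtext looks roughly like the sketch below. The three-column layout (sentence_a, sentence_b, score) and the field names are assumptions, and the exact Field arguments vary by torchtext version:

```python
import torch
from torchtext import data

# Sketch: sentence-pair TSV loading with torchtext.
# The column layout (sentence_a, sentence_b, score) is an assumption.
TEXT = data.Field(sequential=True, lower=True, batch_first=True)
SCORE = data.Field(sequential=False, use_vocab=False, dtype=torch.float)

train, val, test = data.TabularDataset.splits(
    path='sts', format='tsv',
    train='2015.train.tsv', validation='2015.val.tsv', test='2015.test.tsv',
    fields=[('sentence_a', TEXT), ('sentence_b', TEXT), ('score', SCORE)])

# Build the vocabulary from the training split and attach GloVe vectors.
TEXT.build_vocab(train, vectors='glove.840B.300d')

# Batch similar-length sentences together to minimize padding.
train_iter = data.BucketIterator(
    train, batch_size=64, sort_key=lambda ex: len(ex.sentence_a))
```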

hinhmd commented 7 years ago

I use the 2015 STS data set (2015.test.tsv, 2015.train.tsv, 2015.val.tsv). The torchtext version runs faster and there is no problem with memory.

However, after 45 epochs I have not yet hit the early stopping condition (0.0002). I still have not configured anything to combat overfitting.

tuzhucheng commented 7 years ago

OK, I'll integrate those into this repo and do some hyperparameter tuning. I think you might want to try --epsilon 0.01 instead of the default --epsilon 1e-8.
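Assuming --epsilon maps to the Adam optimizer's eps term (its 1e-8 default matches Adam's), the change corresponds to something like this sketch:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(300, 1)  # placeholder model for illustration

# eps sits in the denominator of Adam's parameter update; raising it from
# the 1e-8 default to 0.01 damps steps when the second-moment estimates
# of the gradients are tiny, which can stabilize noisy training.
optimizer = optim.Adam(model.parameters(), lr=1e-3, eps=0.01)
```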

hinhmd commented 7 years ago

Well, with 39 epochs and --epsilon 0.01:

INFO - Evaluation metrics for test
INFO - pearson_r   spearman_r       KL-divergence loss
INFO - test   0.987094   0.987157394081   0.06778954442342122

tuzhucheng commented 7 years ago

That seems really high compared to the results at http://alt.qcri.org/semeval2015/task2/index.php?id=results, doesn't it?

hinhmd commented 7 years ago

Yeah. I simply replaced the SICK data set with STS 2015 and set the number of classes to 6 (I do not use the id.txt file). I hope you will integrate it and get a better version.
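For reference, the KL-divergence setup used for SICK turns each real-valued score into a sparse target distribution (Tai et al. 2015); with STS scores in [0, 5] that gives 6 classes. A sketch of the conversion, as illustrative code rather than the repo's own:

```python
import torch

def score_to_dist(score, num_classes=6):
    """Turn a real-valued score in [0, num_classes - 1] into a sparse
    probability distribution over num_classes bins (Tai et al. 2015),
    e.g. 3.7 -> [0, 0, 0, 0.3, 0.7, 0] for 6 classes."""
    dist = torch.zeros(num_classes)
    floor = int(score)
    if floor == score:
        dist[floor] = 1.0
    else:
        dist[floor] = floor + 1 - score
        dist[floor + 1] = score - floor
    return dist
```

In this setup, the model's log-softmax output over the classes is trained against this distribution with a KL-divergence loss, and a real-valued prediction can be recovered as the expectation over class indices.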

Thank you for taking the time to help me. I plan to use this model for a small demo, or to calculate the similarity between texts. Could you please give me some advice? Thank you.

tuzhucheng commented 7 years ago

No problem! I am also just getting into this, so I'm not sure what advice would be helpful. What kind of advice are you looking for?

tuzhucheng commented 7 years ago

I do not know how well CNNs work on longer documents.

In a real application, you might want to export the PyTorch model to ONNX and run it in Caffe2, which is more suited for production. I do not believe ONNX currently supports the embedding layer, so you might want to export the embedding layer to some serialization format like Avro or protobuf, or use a subset of word vectors from the original GloVe file.
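A minimal sketch of the export call, using a placeholder network that starts after the embedding lookup so the exported graph takes pre-embedded sentence tensors as input (shapes and file names are illustrative):

```python
import torch
import torch.nn as nn

# Placeholder network that starts *after* the embedding lookup, so the
# exported graph takes pre-embedded sentence tensors as input.
model = nn.Sequential(nn.Conv1d(300, 128, kernel_size=3), nn.ReLU())
dummy = torch.randn(1, 300, 32)  # (batch, embedding dim, sentence length)

# Export the traced graph; Caffe2 can then load the .onnx file.
torch.onnx.export(model, dummy, 'mpcnn.onnx', export_params=True)
```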

hinhmd commented 7 years ago

Thank you again.