Hi @hinhmd, thanks for creating this issue!
I am actually working on integrating more datasets into this model too. Currently I'm working on integrating TrecQA and WikiQA. Would you like me to do that for you? The goal is to have it done in one week at most, but ideally in ~3 days.
Currently, to reduce memory usage with the master branch version of the code, the easiest option is probably to specify your own custom word vectors file with fewer dimensions via --word-vectors-file and to adjust the places that assume 300-dimensional word vectors accordingly. I actually have another version of the code that does not batch-load all data into GPU memory at once and can handle very large datasets, but it is on a local branch. I will aim to have that ready in a day and will update you.
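If it helps, here is a minimal sketch (assuming torchtext is installed and that --word-vectors-file accepts the standard GloVe plain-text format) of writing out 100-dimensional GloVe vectors to pass to the script:

```python
# Sketch: dump 100-d GloVe vectors in the plain "word v1 v2 ..." text format,
# assuming that is what --word-vectors-file expects.
from torchtext.vocab import GloVe

vectors = GloVe(name='6B', dim=100)  # far smaller than the default 300-d vectors

with open('glove.6B.100d.custom.txt', 'w', encoding='utf-8') as f:
    for word, idx in vectors.stoi.items():
        row = vectors.vectors[idx]
        f.write(word + ' ' + ' '.join('%.5f' % v for v in row.tolist()) + '\n')
```

You would then also change the hard-coded 300 in the model to 100.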
Thanks for your quick response.
Simply put, I tried replacing the SICK dataset with the STS dataset. Here is the result after 26 epochs:
INFO - pearson_r spearman_r KL-divergence loss
INFO - test 0.98916 0.988053986236 0.07485739517211915
I'm looking forward to your improved version.
Cool, which year's STS dataset are you using this on?
What's the state-of-the-art on STS?
If you need this in a hurry, you can check out this class, which is on a separate branch: https://github.com/tuzhucheng/MP-CNN-Variants/blob/trec-qa-dataset/dataset.py#L264. Notice how it has copy_to_gpu_once = False - this should save you some memory. You can check out that branch and also set copy_to_gpu_once = False for your new dataset.
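Roughly, what copy_to_gpu_once = False buys you is that batches stay in host memory and are only moved to the GPU inside the training loop. This is not the actual code on that branch, just an illustration of the pattern (names and signatures are placeholders):

```python
def train_epoch(model, loader, criterion, optimizer, device):
    # Each mini-batch is copied to the GPU on demand instead of keeping the
    # whole dataset resident in GPU memory.
    model.train()
    for sent_a, sent_b, ext_feats, label in loader:   # tensors start on the CPU
        sent_a, sent_b = sent_a.to(device), sent_b.to(device)
        ext_feats, label = ext_feats.to(device), label.to(device)
        optimizer.zero_grad()
        loss = criterion(model(sent_a, sent_b, ext_feats), label)
        loss.backward()
        optimizer.step()
```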
In a few days I'll update the master branch with improved code that uses torchtext for all existing datasets and new ones - STS is on my list too.
I use the 2015 STS dataset (2015.test.tsv, 2015.train.tsv, 2015.val.tsv). The torchtext version runs faster and there is no problem with memory. However, after 45 epochs I have not yet hit the early stopping criterion (0.0002), and I have not yet configured anything to combat overfitting.
OK, I'll integrate those into this repo and do some hyperparameter tuning. I think you might want to try --epsilon 0.01 instead of the default --epsilon 1e-8.
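For reference, --epsilon just maps to the eps term in Adam's update denominator; under the hood it amounts to something like this (the model and learning rate here are placeholders):

```python
import torch.nn as nn
import torch.optim as optim

# eps is the small constant Adam adds to the denominator for numerical stability.
model = nn.Linear(300, 1)                                      # placeholder model
optimizer = optim.Adam(model.parameters(), lr=1e-3, eps=0.01)  # default eps is 1e-8
```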
Well, with 39 epochs and --epsilon 0.01:
INFO - Evaluation metrics for test
INFO - pearson_r spearman_r KL-divergence loss
INFO - test 0.987094 0.987157394081 0.06778954442342122
That seems really high compared to the results at http://alt.qcri.org/semeval2015/task2/index.php?id=results, doesn't it?
Yeah. I simply replaced the sick dataset with sts 2015, setting the number of classes to 6 (I do not use the id.txt file). I hope you will integrate it and produce a better version.
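In case it is useful, this is roughly the usual way to turn a real-valued STS score in [0, 5] into a 6-class target distribution for the KL-divergence loss (my sketch of the standard SICK-style encoding; the repo may implement it differently):

```python
import torch

def score_to_distribution(score, num_classes=6):
    # Spread the score over the two nearest integer classes, e.g. 3.6 -> 0.4
    # probability on class 3 and 0.6 on class 4 (sketch only).
    dist = torch.zeros(num_classes)
    floor = int(score)
    if floor == score:
        dist[floor] = 1.0
    else:
        dist[floor] = floor + 1 - score
        dist[floor + 1] = score - floor
    return dist

print(score_to_distribution(3.6))
# tensor([0.0000, 0.0000, 0.0000, 0.4000, 0.6000, 0.0000])
```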
Thank you for taking the time to help me. I plan to use this model for a small demo, or to calculate the similarity between texts. Could you please give me some advice? Thank you.
No problem! I am also just getting into this, so I'm not sure what advice would be helpful. What kind of advice are you looking for?
How should I use glove to embed the text in the real application? I also do not know about the effectiveness of CNNs on longer documents.
In a real application, you might want to export the PyTorch model to ONNX and run it in Caffe2, which is more suited for production. I do not believe ONNX currently supports the embedding layer, so you might want to export the embedding layer to some serialization format like Avro or protobuf, or use a subset of word vectors from the original GloVe file.
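For the subset idea, something along these lines would work (paths and the vocabulary source are placeholders): keep only the lines of the original GloVe file whose word appears in your application's vocabulary.

```python
def filter_glove(glove_path, vocab, out_path):
    # Copy only the lines of the original GloVe text file whose word is in vocab.
    with open(glove_path, encoding='utf-8') as src, \
         open(out_path, 'w', encoding='utf-8') as dst:
        for line in src:
            if line.split(' ', 1)[0] in vocab:
                dst.write(line)

# Example (hypothetical paths and vocab):
# filter_glove('glove.840B.300d.txt', my_app_vocab, 'glove.subset.300d.txt')
```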
Again, thank you.
Hi, I want to use your model on the SemEval STS dataset. The training set has 22592 pairs of sentences. I got an out-of-memory error on the GPU (2x NVIDIA Tesla K80, 24 GB GDDR5):
"RuntimeError: cuda runtime error (2): out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:66".
And in the model: n_feat = n_feat_h + n_feat_v + EXT_FEATS  # n_feat = 44427
Do I need to reduce n_feat or increase the GPU's memory to be able to train on STS? Do I need to change anything else? Thanks.