nlx-group / Replicating-Bogdanova-et-al.-2015-Duplicate-Question-Detection

Replicating Bogdanova et al., 2015 Duplicate Question Detection

Replication on Quora dataset? #2

Open andra-pumnea opened 6 years ago

andra-pumnea commented 6 years ago

Hi! I am trying to run an experiment on the Quora dataset. I am using the dataset split provided by https://github.com/zhiguowang/BiMPM and created a quora.w2v file, similar to askubuntu.w2v and meta.w2v. I got the following error:

Using Theano backend.
INFO:Reading training sentence pairs from data/quora/train.tsv:
/ 298204 Elapsed Time: 0:10:34
/home/andrada.pumnea/anaconda3/lib/python3.6/site-packages/bs4/__init__.py:219: UserWarning: "b'.'" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup. 'Beautiful Soup.' % markup)
| 384347 Elapsed Time: 0:13:40
INFO:...read 384348 pairs in 820.31 seconds.
INFO:...class distribution: 0 = 245042 (63.8%) | 1 = 139306 (36.2%)
INFO:Reading validation sentence pairs from data/quora/dev.tsv:
| 9999 Elapsed Time: 0:00:21
INFO:...read 10000 pairs in 21.21 seconds.
INFO:...class distribution: 0 = 5000 (50.0%) | 1 = 5000 (50.0%)
INFO:Reading testing sentence pairs from data/quora/test.tsv:
| 9999 Elapsed Time: 0:00:21
INFO:...read 10000 pairs in 21.26 seconds.
INFO:...class distribution: 0 = 5000 (50.0%) | 1 = 5000 (50.0%)
INFO:Vectorizing data:
INFO:...fitted tokenizer in 14.60 seconds;
INFO:...found 103831 unique tokens;
INFO:Load embeddings from models/quora2.w2v:
INFO:...read 36111 word embeddings in 2.82 seconds;
INFO:...created embedding matrix with shape (103832, 200);
INFO:...cached matrix in file models/quora2.w2v.min.cache.npy.
INFO:Creating CNN model:
INFO:...model created.
INFO:Compiling model:
INFO:...model 0105d13fe81945018824e64905d8f7ad compiled with optimizer: <keras.optimizers.SGD object at 0x7fd9dd23cef0>, lr (sgd-only): 0.005, loss: mse.

Model summary:


Layer (type)                      Output Shape         Param #     Connected to
====================================================================================
input_1 (InputLayer)              (None, None)         0
____________________________________________________________________________________
input_2 (InputLayer)              (None, None)         0
____________________________________________________________________________________
embedding_1 (Embedding)           (None, None, 200)    20766400    input_1[0][0]
                                                                    input_2[0][0]
____________________________________________________________________________________
convolution1d_1 (Convolution1D)   (None, None, 300)    180300      embedding_1[0][0]
                                                                    embedding_1[1][0]
____________________________________________________________________________________
globalmaxpooling1d_1 (GlobalMaxPo (None, 300)           0          convolution1d_1[0][0]
                                                                    convolution1d_1[1][0]
____________________________________________________________________________________
activation_1 (Activation)         (None, 300)           0          globalmaxpooling1d_1[0][0]
                                                                    globalmaxpooling1d_1[1][0]
____________________________________________________________________________________
merge_1 (Merge)                   (None, 1)             0          activation_1[0][0]
                                                                    activation_1[1][0]
====================================================================================
Total params: 20946700


INFO:Train on 384348 samples, validate on 10000 samples
INFO:Epoch 1/1
 2% (11127 of 384348) |###                     | Elapsed Time: 0:23:50 ETA: 13:16:51
Parameter 8 to routine SGEMM NTCSGEMV SGER was incorrect
Floating point exception (core dumped)

I am using Ubuntu 16.04.3.

Any idea why it happened and how it can be fixed?
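A quick way to see which BLAS numpy and Theano pick up (a diagnostic snippet I put together for this issue, not part of the repository code):

# Diagnostic only: show the BLAS/LAPACK libraries numpy was built against and the
# BLAS flags Theano uses (an empty value means Theano falls back to numpy's dot).
import numpy as np
import theano

np.show_config()
print(theano.config.blas.ldflags)
print(theano.config.device)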

joaoantonioverdade commented 6 years ago

At first sight, I would say it is a memory problem: for such a large dataset there may not be enough memory on the machine. Watch the memory usage while training, or try an incremental data approach.
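As a rough sketch of the incremental idea (hypothetical code, not what this repository does; it assumes train.tsv has the two questions and the label in tab-separated columns, and reuses the already-fitted tokenizer, the maxlen value, and the compiled model from the existing code):

# Hypothetical sketch: read train.tsv in chunks, vectorize each chunk with the
# already-fitted tokenizer, and train on it, so the full dataset never has to
# sit in memory at once. Column order (q1, q2, label) is an assumption.
import numpy as np
from keras.preprocessing.sequence import pad_sequences

def read_chunks(path, size=10000):
    with open(path, encoding='utf-8') as f:
        rows = []
        for line in f:
            rows.append(line.rstrip('\n').split('\t'))
            if len(rows) == size:
                yield rows
                rows = []
        if rows:
            yield rows

for rows in read_chunks('data/quora/train.tsv'):
    q1 = pad_sequences(tokenizer.texts_to_sequences([r[0] for r in rows]), maxlen=maxlen)
    q2 = pad_sequences(tokenizer.texts_to_sequences([r[1] for r in rows]), maxlen=maxlen)
    y = np.array([int(r[2]) for r in rows])
    model.fit([q1, q2], y, nb_epoch=1, batch_size=32)  # Keras 1.x uses nb_epoch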

andra-pumnea commented 6 years ago

I tried with a smaller sample of the Quora dataset (24k/6k/1k) and it still crashes with the same error:

Parameter 8 to routine SGEMM NTCSGEMV SGER was incorrect
Floating point exception (core dumped)

joaoantonioverdade commented 6 years ago

Does it run with the datasets we provide? Are the required libraries installed (requirements.txt)?

andra-pumnea commented 6 years ago

It runs with the provided datasets, and I also installed the requirements. These are the packages installed in my environment:

beautifulsoup4==4.5.3
certifi==2018.4.16
chardet==3.0.4
h5py==2.7.1
idna==2.7
Keras==1.1.0
nltk==3.2
numpy==1.12.0
progressbar2==3.12.0
pymystem3==0.1.5
python-utils==2.3.0
PyYAML==3.12
requests==2.19.1
scipy==1.1.0
six==1.11.0
Theano==0.8.2
urllib3==1.23

And this is the dataset I'm trying to run it on: https://drive.google.com/open?id=1-TV22E2ZY-NqGHIYiFa5r1eF6bWOs1ar

I generated my own quora.w2v with the following command:

./word2vec -train data.txt -output vec.txt -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 0 -iter 3
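For reference, a quick way to sanity-check the generated file (assuming it is the standard word2vec text format with a "vocab_size dim" header line; this is just a throwaway check, not repository code):

# Check that the word2vec text output is well-formed: a header line with
# "vocab_size dim", then one word plus 200 floats per line.
with open('vec.txt', encoding='utf-8') as f:
    vocab_size, dim = map(int, f.readline().split())
    malformed = sum(1 for line in f if len(line.split()) != dim + 1)
print(vocab_size, dim, 'malformed lines:', malformed)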

Any clue as to why I am getting this error?

joaoantonioverdade commented 6 years ago

Using a different dataset and embeddings falls outside the scope of this repository and our work.

Nevertheless, if it runs with the provided dataset and embeddings, I would say the problem must be in the new dataset or the new embeddings.

There are some issues with the train.tsv you provided.

I managed to train with 1000 samples from the dataset you provided by using the meta.w2v embeddings and changing the code to accept a smaller vocabulary.

Check whether all of the vocabulary from the dataset is represented in the embeddings, and check that there are no encoding problems.
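Something like this gives a rough idea of the coverage (a quick sketch, not repository code; the file paths are placeholders, it assumes the word2vec text format and that train.tsv has the two questions in the first two tab-separated columns, and it uses naive whitespace tokenization rather than the tokenizer the code actually applies):

# Rough vocabulary-coverage check: how many whitespace tokens from the dataset
# appear in the embeddings. A UnicodeDecodeError anywhere here already points
# to an encoding problem in one of the files.
with open('models/quora.w2v', encoding='utf-8') as f:
    f.readline()                                   # skip the "vocab_size dim" header
    embedded = {line.split(' ', 1)[0] for line in f}

vocab = set()
with open('data/quora/train.tsv', encoding='utf-8') as f:
    for line in f:
        for question in line.rstrip('\n').split('\t')[:2]:
            vocab.update(question.lower().split())

covered = len(vocab & embedded)
print('%d of %d tokens have an embedding (%.1f%%)' % (covered, len(vocab), 100.0 * covered / len(vocab)))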