andra-pumnea opened 6 years ago
At first sight, I would say it is a memory problem: for such a large dataset there is not enough memory on the machine. Watch the memory usage while training, or try an incremental data approach.
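If it helps, here is a minimal sketch of the incremental idea with the Keras 1.x API. The argument names q1, q2, and labels are placeholders for the vectorized question pairs and targets (matching the model's two inputs); this only keeps one chunk on the compute side at a time, so for truly incremental loading the chunks would have to be read from disk instead:

def train_incrementally(model, q1, q2, labels, nb_epoch=1, chunk=1024):
    # Feed the two-input model fixed-size chunks via train_on_batch instead
    # of one giant fit() call over the whole dataset at once.
    for epoch in range(nb_epoch):
        loss = None
        for start in range(0, len(labels), chunk):
            end = start + chunk
            loss = model.train_on_batch([q1[start:end], q2[start:end]],
                                        labels[start:end])
        print("epoch %d/%d done, last chunk loss: %s" % (epoch + 1, nb_epoch, loss))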
I tried with a smaller sample of the Quora dataset (24k/6k/1k) and it still crashes with the same error:
Parameter 8 to routine SGEMM NTCSGEMV SGER was incorrect
Floating point exception (core dumped)
Does it run with the datasets we provide? Are the required libraries installed (requirements.txt)?
It runs with the provided datasets, and I also installed the requirements. These are the packages installed in my environment:
beautifulsoup4==4.5.3 certifi==2018.4.16 chardet==3.0.4 h5py==2.7.1 idna==2.7 Keras==1.1.0 nltk==3.2 numpy==1.12.0 progressbar2==3.12.0 pymystem3==0.1.5 python-utils==2.3.0 PyYAML==3.12 requests==2.19.1 scipy==1.1.0 six==1.11.0 Theano==0.8.2 urllib3==1.23
And this is the dataset I'm trying to run it on: https://drive.google.com/open?id=1-TV22E2ZY-NqGHIYiFa5r1eF6bWOs1ar
I generated my own quora.w2v with the following command: ./word2vec -train data.txt -output vec.txt -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 0 -iter 3
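Before pointing the code at the new vectors, a quick structural check of the text-format output can rule out a truncated or malformed file. This sketch only assumes the file produced by the command above (vec.txt):

with open("vec.txt", encoding="utf-8") as f:
    # The first line declares "<word count> <dimension>"; every following
    # row should contain a token plus exactly <dimension> floats.
    n_words, dim = map(int, f.readline().split())
    rows = 0
    for line in f:
        parts = line.split()
        if len(parts) != dim + 1:
            print("malformed row %d: starts with %r" % (rows + 1, parts[:1]))
        rows += 1

print("header declares %d vectors of dim %d; file contains %d rows" % (n_words, dim, rows))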
Any clue why I am getting this error?
Using a different dataset and embeddings falls outside the scope of this repository and our work.
Nevertheless, if it runs with the provided dataset and embeddings, I would say the problem must lie in the new dataset or the new embeddings.
There are some issues with the train.tsv you provided. I managed to train with 1000 samples from that dataset by using the meta.w2v embeddings and changing the code to accept a smaller vocabulary.
Check whether all the vocabulary from the dataset is represented in the embeddings, and check that there are no encoding problems.
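A rough way to measure that coverage, using the paths from the training log in this thread (models/quora2.w2v and data/quora/train.tsv). Two approximations to adjust to the repo's real preprocessing: simple lowercase whitespace tokenization, and the assumption that columns 1 and 2 of the TSV hold the two question texts:

import csv

emb_vocab = set()
with open("models/quora2.w2v", encoding="utf-8") as f:
    next(f)  # skip the "<count> <dim>" header line
    for line in f:
        emb_vocab.add(line.split(" ", 1)[0])

data_vocab = set()
with open("data/quora/train.tsv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        for text in (row[1], row[2]):
            data_vocab.update(text.lower().split())

missing = data_vocab - emb_vocab
print("%d of %d dataset tokens have no embedding (%.1f%%)"
      % (len(missing), len(data_vocab), 100.0 * len(missing) / max(len(data_vocab), 1)))

For scale: the log below reports 103831 unique tokens against only 36111 embeddings read, so a large gap would not be surprising here.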
Hi! I am trying to run an experiment on the Quora dataset. I am using the dataset split provided by https://github.com/zhiguowang/BiMPM and created a quora.w2v file analogous to askubuntu.w2v and meta.w2v. I got the following error:
Using Theano backend.
INFO:Reading training sentence pairs from data/quora/train.tsv:
/ 298204 Elapsed Time: 0:10:34
/home/andrada.pumnea/anaconda3/lib/python3.6/site-packages/bs4/__init__.py:219: UserWarning: "b'.'" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup. 'Beautiful Soup.' % markup)
| 384347 Elapsed Time: 0:13:40
INFO:...read 384348 pairs in 820.31 seconds.
INFO:...class distribution: 0 = 245042 (63.8%) | 1 = 139306 (36.2%)
INFO:Reading validation sentence pairs from data/quora/dev.tsv:
| 9999 Elapsed Time: 0:00:21
INFO:...read 10000 pairs in 21.21 seconds.
INFO:...class distribution: 0 = 5000 (50.0%) | 1 = 5000 (50.0%)
INFO:Reading testing sentence pairs from data/quora/test.tsv:
| 9999 Elapsed Time: 0:00:21
INFO:...read 10000 pairs in 21.26 seconds.
INFO:...class distribution: 0 = 5000 (50.0%) | 1 = 5000 (50.0%)
INFO:Vectorizing data:
INFO:...fitted tokenizer in 14.60 seconds;
INFO:...found 103831 unique tokens;
INFO:Load embeddings from models/quora2.w2v:
INFO:...read 36111 word embeddings in 2.82 seconds;
INFO:...created embedding matrix with shape (103832, 200);
INFO:...cached matrix in file models/quora2.w2v.min.cache.npy.
INFO:Creating CNN model:
INFO:...model created.
INFO:Compiling model:
INFO:...model 0105d13fe81945018824e64905d8f7ad compiled with optimizer: <keras.optimizers.SGD object at 0x7fd9dd23cef0>, lr (sgd-only): 0.005, loss: mse.
Model summary:
Layer (type)                               Output Shape        Param #     Connected to
input_1 (InputLayer)                       (None, None)        0
input_2 (InputLayer)                       (None, None)        0
embedding_1 (Embedding)                    (None, None, 200)   20766400    input_1[0][0], input_2[0][0]
convolution1d_1 (Convolution1D)            (None, None, 300)   180300      embedding_1[0][0], embedding_1[1][0]
globalmaxpooling1d_1 (GlobalMaxPooling1D)  (None, 300)         0           convolution1d_1[0][0], convolution1d_1[1][0]
activation_1 (Activation)                  (None, 300)         0           globalmaxpooling1d_1[0][0], globalmaxpooling1d_1[1][0]
merge_1 (Merge)                            (None, 1)           0           activation_1[0][0], activation_1[1][0]
Total params: 20946700
INFO:Train on 384348 samples, validate on 10000 samples
INFO:Epoch 1/1
2% (11127 of 384348) |### | Elapsed Time: 0:23:50 ETA: 13:16:51
Parameter 8 to routine SGEMM NTCSGEMV SGER was incorrect
Floating point exception (core dumped)
I am using Ubuntu 16.04.3.
Any idea why this happens and how it can be fixed?
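One hedged diagnostic, a guess from the symptoms rather than a confirmed cause: BLAS aborts with "Parameter N to routine ... was incorrect" when a matrix-multiply call receives an invalid dimension argument, and question pairs that tokenize to an empty sequence are one way a degenerate shape can reach it. A filter like the following, applied after vectorization and before training, would rule that out (seqs1, seqs2, and labels are placeholders for the vectorized pairs and targets):

# Drop pairs where either side tokenized to an empty sequence, so no
# zero-length input reaches padding and the BLAS-backed layers.
keep = [i for i, (a, b) in enumerate(zip(seqs1, seqs2)) if len(a) > 0 and len(b) > 0]
print("dropping %d pairs with an empty tokenized side" % (len(seqs1) - len(keep)))
seqs1 = [seqs1[i] for i in keep]
seqs2 = [seqs2[i] for i in keep]
labels = [labels[i] for i in keep]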