npow / ubottu

Next Utterance Classification
http://arxiv.org/abs/1506.08909

Test and dev data sizes #12

Closed basma-b closed 6 years ago

basma-b commented 6 years ago

In the blobs file you provide, the sizes of the test and dev sets are 355170 and 356096 respectively, although the original corpus has 189200 and 195600 samples. Can you please explain how you got those numbers?

npow commented 6 years ago

Where did you download the "original corpus" from?

basma-b commented 6 years ago

From the link you mentioned

wget http://cs.mcgill.ca/~npow1/data/ubuntu_blobs.tgz
tar zxvf ubuntu_blobs.tgz

and then when you do

import cPickle
train_data, val_data, test_data = cPickle.load(open('dataset.pkl', 'rb'))
print(len(train_data['c'])) # 1000192
print(len(val_data['c'])) # 356096
print(len(test_data['c'])) # 355170

So the numbers are not what they are supposed to be. For training, I understand getting 1000192 instead of 1M because of the _pad_tobatch function you have in preprocess.py, but for test and dev I don't see why.
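
For illustration, the training-set size is exactly what you get by padding up to a multiple of the batch size; here is a quick check, assuming a batch size of 256 (an assumption for this sketch, not a value read from preprocess.py):

import math

n_train = 1000000
batch_size = 256  # assumed for illustration; the actual value comes from the training config

padded_size = int(math.ceil(n_train / float(batch_size))) * batch_size
print(padded_size)            # 1000192, which matches len(train_data['c'])
print(padded_size - n_train)  # 192 extra samples added by the padding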

basma-b commented 6 years ago

The original corpus is here: https://github.com/rkadlec/ubuntu-ranking-dataset-creator. The numbers I mentioned are in the README.md file, just above the results section.

npow commented 6 years ago

Those blobs were generated using v1 of the dataset, which can be found here: http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/

The version from rkadlec is v2.

basma-b commented 6 years ago

But using the data from rkadlec I can't get the same results as those reported in your paper.

Dual Encoder LSTM model:

1 in 2: recall@1: 0.868730970907
1 in 10: recall@1: 0.552213717862, recall@2: 0.72099120433, recall@5: 0.924285351827

Dual Encoder RNN model:

1 in 2: recall@1: 0.776539210705
1 in 10: recall@1: 0.379139142954, recall@2: 0.560689786585, recall@5: 0.836350355691

TF-IDF model:

1 in 2: recall@1: 0.749260042283
1 in 10: recall@1: 0.48810782241, recall@2: 0.587315010571, recall@5: 0.763054968288

I get worse results. Can you please post the data with which your code reproduces the paper's results?
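
For reference, here is a minimal sketch of how the 1-in-k recall@m numbers above are typically computed, assuming each test example consists of one ground-truth response followed by k-1 sampled distractors (names and data below are illustrative, not taken from this repo):

import numpy as np

def recall_at_m(scores, k, m):
    """1-in-k recall@m: fraction of groups in which the ground-truth
    response (position 0 of each group of k candidates) is ranked in
    the top m by the model's scores."""
    scores = np.asarray(scores).reshape(-1, k)
    # Count, per group, how many candidates score strictly higher than the ground truth.
    higher = (scores > scores[:, [0]]).sum(axis=1)
    return float((higher < m).mean())

# Sanity check with a random scorer: recall@1 should be roughly 1/k.
rng = np.random.RandomState(0)
print(recall_at_m(rng.rand(1000 * 10), k=10, m=1))  # ~0.10
print(recall_at_m(rng.rand(1000 * 2), k=2, m=1))    # ~0.50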

npow commented 6 years ago

You can reproduce the results using the blobs from Joelle's website (linked above): http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/ubuntu_blobs.tgz

npow commented 6 years ago

Maybe you need to use different hyperparameters. In rkadlec's paper they were able to obtain better results.

basma-b commented 6 years ago

OK, I will use these blobs and hopefully get the same results as in the paper. Please update your README so people stop downloading the old blobs and use Joelle's instead.

npow commented 6 years ago

They are the same blobs, just hosted on Joelle's site instead of mine, which is no longer active.

basma-b commented 6 years ago

But you said above that they are V1? I need data that lets me reproduce the results reported here (https://arxiv.org/pdf/1506.08909.pdf) using V2.

npow commented 6 years ago

The results in our paper (the one you linked to) were obtained on the V1 dataset. If you run the code in this repo with the blobs from Joelle's site, you will be able to reproduce our results. You won't be able to get exactly the same results as ours using V2, but I would expect that if you tune the hyperparameters you can get the same or better results, as shown here: https://arxiv.org/abs/1510.03753

basma-b commented 6 years ago

OK, I see. So if I understand correctly, the paper https://arxiv.org/abs/1510.03753 uses the latest version, V2? It wouldn't make sense to compare results on V2 with results from V1, as they did in Table 1.

npow commented 6 years ago

No, I just double-checked and they are using V1. My point was just that you will probably need to do some tuning to get better results using V2.

basma-b commented 6 years ago

"To match the original setup of [1] we use the same training data. The dataset in binary format is available at http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_blobs.tgz [accessed 25.9.2015]"

Yes, they used V1. Surprisingly, almost all papers citing your paper report the results you obtained on V1 of the corpus but use V2 to test their own models.

This ACL paper, for example (http://www.aclweb.org/anthology/P17-1046), gives different details about your Ubuntu corpus in Section 5.1 (0.5M test and dev samples, which matches neither V1 nor V2) but at the same time reports your V1 results in Table 3.

Really confusing

npow commented 6 years ago

Yeah, that's a bit weird. That ACL paper created their own train/val/test set using V1:

This .zip file includes the datasets (training/testing/validation) used in the experiments of the paper:
Incorporating Loose-Structured Knowledge into LSTM with Recall-Gate for Conversation Modeling.

The datasets are extracted from the corpus: http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/
Negative sampling is conducted to produce a balanced training set and 1:9 validation/testing sets following the paper of Lowe et al. (2015).

The details of the datasets are given below:
1. train.txt: 1 million training samples (pos:neg = 1:1)
2. valid.txt: 50,000 samples for validation (pos:neg = 1:9)
3. test.txt: 50,000 samples for testing (pos:neg = 1:9)
4. vocab.txt: Vocabulary of the datasets.
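
As a rough illustration, the 1:9 validation/test construction described in that README is plain negative sampling over responses; this is a generic sketch, not their actual script, and the function name and example data are made up:

import random

def make_1_in_k_eval(pairs, k=10, seed=0):
    """For each (context, true_response) pair, keep the ground truth and
    draw k-1 responses from other dialogues as distractors (pos:neg = 1:(k-1))."""
    rng = random.Random(seed)
    all_responses = [r for _, r in pairs]
    examples = []
    for context, true_response in pairs:
        negatives = []
        while len(negatives) < k - 1:
            candidate = rng.choice(all_responses)
            if candidate != true_response:
                negatives.append(candidate)
        examples.append((context, true_response, 1))            # label 1 for the ground truth
        examples.extend((context, neg, 0) for neg in negatives)  # label 0 for each distractor
    return examples

pairs = [
    ("how do I install pip ?", "sudo apt-get install python-pip"),
    ("xorg keeps crashing", "check /var/log/Xorg.0.log for errors"),
    ("how do I see open ports ?", "try netstat -tlnp"),
]
print(len(make_1_in_k_eval(pairs, k=10)))  # 3 contexts x 10 candidates each = 30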

npow commented 6 years ago

Which papers have you seen using V2?

basma-b commented 6 years ago

Yes, it's weird to compare results on different datasets.

[1] https://arxiv.org/pdf/1605.05110.pdf
[2] https://arxiv.org/pdf/1605.00090.pdf

These two papers use the dataset with 50,000 test and dev samples, but look at what they say in [1] in the first paragraph of page 7: they are still comparing results that are not comparable.

npow commented 6 years ago

I don't see a problem there. In [1] they run all the experiments on their own train/val/test split, which is consistent. Our model was included as a baseline, but they didn't quote the results we reported in our paper.

basma-b commented 6 years ago

Anyway, thank you @npow for making things clear. I'm closing the issue.