Where did you download the "original corpus" from?
From the link you mentioned
wget http://cs.mcgill.ca/~npow1/data/ubuntu_blobs.tgz
tar zxvf ubuntu_blobs.tgz
and then when you do
import cPickle  # Python 2; in Python 3 this would be pickle (loading may need encoding='latin1')
train_data, val_data, test_data = cPickle.load(open('dataset.pkl', 'rb'))
print(len(train_data['c'])) # 1000192
print(len(val_data['c'])) # 356096
print(len(test_data['c'])) # 355170
So the numbers are not what they should be. For training I understand getting 1000192 instead of 1M because of the _pad_tobatch function you have in preprocess.py, but for test and dev I don't see why.
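(For reference, 1000192 is exactly 3907 × 256, the first multiple of 256 above 1M, so I assume _pad_tobatch pads a split up to a multiple of the batch size, something like the sketch below; the batch size of 256 and the way the extra examples are chosen are guesses on my part.)

```python
# Rough sketch of what a pad-to-batch-size step presumably does
# (batch size of 256 is assumed; preprocess.py may pick the duplicates differently).
def pad_to_batch_size(examples, batch_size=256):
    remainder = len(examples) % batch_size
    if remainder == 0:
        return examples
    # Duplicate examples from the start until the length is a multiple of batch_size.
    return examples + examples[:batch_size - remainder]

train = list(range(1000000))          # stand-in for the 1M training examples
print(len(pad_to_batch_size(train)))  # 1000192
```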
The original corpus is here: https://github.com/rkadlec/ubuntu-ranking-dataset-creator. The numbers I mentioned are in its README.md, just above the results section.
Those blobs were generated using v1 of the dataset, which can be found here: http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/
The version from rkadlec is v2.
But using the data from rkadlec I can't get the same results as in your paper:
Dual Encoder LSTM model:
1 in 2: recall@1: 0.868730970907; 1 in 10: recall@1: 0.552213717862, recall@2: 0.72099120433, recall@5: 0.924285351827
Dual Encoder RNN model:
1 in 2: recall@1: 0.776539210705; 1 in 10: recall@1: 0.379139142954, recall@2: 0.560689786585, recall@5: 0.836350355691
TF-IDF model:
1 in 2: recall@1: 0.749260042283; 1 in 10: recall@1: 0.48810782241, recall@2: 0.587315010571, recall@5: 0.763054968288
I get worse results. Can you please post the data with which, using your code, we can reproduce the same results?
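In case the metric itself is part of the mismatch, this is how I compute the 1-in-10 recalls (my own implementation, not the one in your repo; I assume candidate 0 is the ground truth and the other 9 are distractors):

```python
def recall_at_k(scores_per_context, k):
    """scores_per_context: one list of 10 model scores per context,
    where index 0 is the ground-truth response and 1..9 are distractors."""
    hits = 0
    for scores in scores_per_context:
        # Rank of the ground truth = 1 + number of distractors scored strictly higher.
        rank = 1 + sum(s > scores[0] for s in scores[1:])
        if rank <= k:
            hits += 1
    return hits / float(len(scores_per_context))

# Example usage on toy scores for two contexts:
toy = [[0.9, 0.1, 0.3, 0.2, 0.05, 0.4, 0.6, 0.7, 0.0, 0.1],
       [0.2, 0.8, 0.1, 0.3, 0.25, 0.15, 0.05, 0.12, 0.0, 0.1]]
print(recall_at_k(toy, 1), recall_at_k(toy, 2), recall_at_k(toy, 5))  # 0.5 0.5 1.0
```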
You can reproduce the results using the blobs from Joelle's website (linked above): http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/ubuntu_blobs.tgz
Maybe you need to use different hyperparameters. In rkadlec's paper they were able to obtain better results
OK, I will use these blobs and hopefully I'll get the same results as in the paper. Please update your README so people stop downloading the old blobs and use Joelle's instead.
They are the same blobs, just hosted on Joelle's site instead of mine which is no longer active.
But you said above that they are V1? I need data to get the results reported here (https://arxiv.org/pdf/1506.08909.pdf) using V2.
The results in our paper (the one you linked to) were obtained using the V1 dataset. If you run the code in this repo with the blobs from Joelle's site, you will be able to reproduce our results. You won't be able to get exactly the same results as us using V2, but I would expect that if you tune the hyperparameters you can get the same or better results, as shown here: https://arxiv.org/abs/1510.03753
OK, I see. So if I understand correctly, the paper https://arxiv.org/abs/1510.03753 uses the latest version, V2? It wouldn't make sense to compare results on V2 with results from V1, as they did in Table 1.
No, I just double-checked and they are using V1. My point was just that you will probably need to do some tuning to get better results using V2.
From their paper: "To match the original setup of [1] we use the same training data. The dataset in binary format is available at http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_blobs.tgz [accessed 25.9.2015]"
Yes, they used V1. Surprisingly, almost all papers citing your paper report the results you obtained on V1 of the corpus but use V2 to test theirs.
This ACL paper, for example (http://www.aclweb.org/anthology/P17-1046), gives different details about your Ubuntu corpus in Section 5.1 (0.5M test and dev, which is in neither V1 nor V2), yet at the same time reports your V1 results in Table 3.
Really confusing
Yeah that's a bit weird. That ACL paper created their own train/val/test set using V1.
This .zip file includes the datasets (training/testing/validation) used in the experiments of the paper:
Incorporating Loose-Structured Knowledge into LSTM with Recall-Gate for Conversation Modeling.
The datasets are extracted from the corpus: http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/
Negative sampling is conducted to produce a balanced training set and 1:9 validation/testing sets, following the paper of Lowe et al. (2015).
The details of the datasets are given below:
1. train.txt: 1 million training samples (pos:neg=1:1)
2. valid.txt: 50,000 samples for validation (pos:neg=1:9)
3. test.txt: 50,000 samples for testing (pos:neg=1:9)
4. vocab.txt: Vocabulary of the datasets.
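(The "negative sampling" there is, as far as I understand it, just pairing each context's ground-truth response with randomly drawn responses, roughly like the sketch below; the response pool and the sampling details are assumptions on my part.)

```python
import random

def sample_candidates(pairs, response_pool, n_distractors=9, seed=0):
    """pairs: list of (context, true_response); response_pool: responses to draw distractors from.
    n_distractors=9 gives the 1:9 validation/test sets; n_distractors=1 gives the balanced
    1:1 training set. The real scripts may sample differently (e.g. excluding the true response)."""
    rng = random.Random(seed)
    rows = []
    for context, true_response in pairs:
        candidates = [true_response] + rng.sample(response_pool, n_distractors)
        rows.append((context, candidates))  # index 0 is the positive
    return rows
```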
Which papers have you seen using V2?
Yes, it's weird to compare results on different datasets: [1] https://arxiv.org/pdf/1605.05110.pdf [2] https://arxiv.org/pdf/1605.00090.pdf
These two papers use the dataset with 50,000 test and dev samples, but see what they say in [1], in the first paragraph of page 7: they are still comparing results that aren't comparable.
I don't see a problem there. In [1] they are running all the experiments on their own train/test split, which is consistent. Our model was included as a baseline, but they didn't quote the results we got in our paper.
Anyway thank you @npow for making things clear. I'm closing the issue.
In the blob file you provide, the sizes of test and dev are 355170 and 356096 respectively, although the original corpus has 189200 and 195600 samples. Can you please explain how you got these numbers?