pcgreat / SeqMatchSeq

Compare-Aggregate method for WikiQA (via PyTorch)

MAP performance on Dev #1

Open vikas95 opened 6 years ago

vikas95 commented 6 years ago

Hi,

I am not sure if I am running the code correctly, but I am only getting at most 62% MAP on the dev set after 10 epochs.

Can you suggest the changes needed to reach ~72% MAP performance...

Thanks...

pcgreat commented 6 years ago

I am afraid you are right. I used to reach ~72% with the given random seed on an old version of PyTorch, but with the new version of PyTorch I wasn't able to reproduce the result. My personal opinion is that the model is neither deep nor sophisticated, and for this kind of model, tuning hyperparameters tends to change the results a lot (although I don't think it's worth investing time tweaking an unstable model structure). If you want guaranteed decent accuracy on the answer selection task, I suggest you take a look at the transfer learning methods from reading comprehension. One of them is here: https://github.com/pcgreat/qa-transfer

UPDATE: If you use pytorch 0.1.12 on commit 7fe06d33ba9fae8688a0cff9724417717548cf8b, the accuracy is still ~72%. But if you are on pytorch 0.4, the accuracy is much lower, like 62%. I don't have time to investigate why that would happen, but if anyone figures out the reason, please let me know.
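A minimal sketch (not code from this repo) of making that dependency explicit at the top of the training script, so a mismatched environment fails fast instead of silently training to ~62%; the EXPECTED_VERSION constant is purely illustrative:

```python
import torch

# Reported ~72% MAP was obtained on PyTorch 0.1.12 at repo commit 7fe06d3;
# PyTorch 0.4 reportedly drops to ~62%. EXPECTED_VERSION is illustrative.
EXPECTED_VERSION = "0.1.12"

if not torch.__version__.startswith(EXPECTED_VERSION):
    raise RuntimeError(
        "Results were reported with PyTorch %s, but this environment has %s; "
        "MAP may be substantially lower." % (EXPECTED_VERSION, torch.__version__)
    )
```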

aneesh-joshi commented 6 years ago

I am getting similar MAP values as @vikas95. @pcgreat, QA-Transfer sounds interesting. The ~0.8 score is quite remarkable. But does it transfer well to general scenarios?

Are there any other models which might be interesting? I am looking for the "best" Similarity Learning model to implement. With models being unstable or non-reproducible, it's pretty hard to judge.

I was able to get 0.65 MAP with the MatchPyramid model. I couldn't reproduce the Bilateral Multi-Perspective Matching for Natural Language Sentences results.

nadiiach commented 6 years ago

For me the best is MAP: 0.5915335634383253, MRR: 0.6003997765902527

So what's the issue? Why does the paper claim MAP 0.7433 / MRR 0.7545? How is that possible? Yes, the model does not seem to be deep or sophisticated, but this is a non-trivial difference between what they claim and what we get. Did you use the same hyperparameters?
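For anyone cross-checking these numbers, MAP and MRR on WikiQA are usually computed per question over the ranked candidate answers, skipping questions with no correct answer. A minimal sketch (assuming Python 3 and plain lists of scores/labels; this is not the evaluation code from this repo or the paper):

```python
def map_mrr(questions):
    """questions: list of (scores, labels) pairs, one per question.
    scores are model scores; labels are 1 for correct answers, 0 otherwise.
    Questions with no positive label are skipped (usual WikiQA convention)."""
    ap_sum, rr_sum, n = 0.0, 0.0, 0
    for scores, labels in questions:
        if sum(labels) == 0:
            continue
        # Rank candidates by descending model score.
        ranked = sorted(zip(scores, labels), key=lambda x: -x[0])
        hits, precisions, rr = 0, [], 0.0
        for rank, (_, label) in enumerate(ranked, start=1):
            if label == 1:
                hits += 1
                precisions.append(hits / rank)
                if rr == 0.0:
                    rr = 1.0 / rank
        ap_sum += sum(precisions) / hits
        rr_sum += rr
        n += 1
    return ap_sum / n, rr_sum / n


# Toy example: one question, three candidates, the second one is correct.
print(map_mrr([([0.2, 0.9, 0.1], [0, 1, 0])]))  # -> (1.0, 1.0)
```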

aneesh-joshi commented 6 years ago

@N-A-D-I-A A few points of difference: I ran the original Lua implementation and found the author is claiming 0.743 on the dev set, not the test set! (I know this issue is about the dev set. But my point is, if you think you'll get great test accuracy with this model, the author's own implementation showed a MAP of 0.69 on WikiQA.)

nadiiach commented 6 years ago

@aneesh-joshi I see, so the original Lua implementation gives 0.69 on the test set? Sad. Thanks for letting me know!

aneesh-joshi commented 6 years ago

@N-A-D-I-A

@aneesh-joshi I see, so the original Lua implementation gives 0.69 on the test set?

Yes. At least when I ran it. The code has a seed set as well, so I doubt it has much to do with stochasticity.
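For what it's worth, "the seed is set" usually amounts to something like the sketch below (not the repo's actual code); even with all of these, results can still drift across PyTorch versions and hardware:

```python
import random

import numpy as np
import torch


def set_seed(seed=42):
    # Seed every RNG the training loop might touch. This makes a single
    # environment repeatable, but does not guarantee identical numbers
    # across library versions or hardware.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


set_seed(42)
```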

As @pcgreat said, it might be better to invest in QA-Transfer. It claims 0.83 (with a 12-model ensemble) and 0.79 (with a single model) for WikiQA. I have tried reproducing the results here. Sadly, I only managed to get 0.64.

nadiiach commented 6 years ago

@aneesh-joshi thank you for letting me know. It is really helpful. Weird that even transfer learning is not reproducible.

Ideally I would like to find a reproducible SOTA model for WikiQA that does not rely on additional datasets (e.g. no transfer learning). Please let me know if you are aware of any :)

aneesh-joshi commented 6 years ago

@N-A-D-I-A

Weird that even transfer learning is not reproducible

It's quite possible that the model actually works but I implemented it wrong. The problem with reproducibility is that different people have different libraries and library versions.

Ideally I would like to find a reproducible sota model

Haha! Wouldn't we all want that! Coincidentally, I am also looking for the very same thing on WikiQA. (with no success :( )

If you don't want dataset reliance, I suggest you look at BiMPM. It claims 0.71 on WikiQA. (don't know whether it's dev or test!) As usual, I was unable to reproduce it! Here and here are my implementations. (ignore the comments) Maybe I implemented it wrong(?) You should also take a look at the MatchZoo repo. They have several models and benchmarks (including BiMPM), but their best model gets only about 0.65 on WikiQA.

Also, consider that if you represent each sentence with an average of its word vectors (from pretrained GloVe) and take the cosine similarity between the two averages as the score, you get a MAP of 0.62. So most of these models are useless (if the implementations are fair and correct). You can read more at https://aneesh-joshi.github.io
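To make that baseline concrete, here is a rough sketch; the GloVe file path and the example sentences are placeholders, and this is not code from any of the repos discussed. Ranking candidate answers by this score is the averaged-GloVe-plus-cosine baseline:

```python
import numpy as np


def load_glove(path):
    """Load pretrained GloVe vectors from a plain-text file: word v1 v2 ..."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def sentence_vector(sentence, vectors, dim=300):
    """Average the vectors of the tokens that have an embedding."""
    words = [vectors[w] for w in sentence.lower().split() if w in vectors]
    return np.mean(words, axis=0) if words else np.zeros(dim, dtype=np.float32)


def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0


# Ranking candidate answers by this score is the averaged-GloVe baseline
# mentioned above (~0.62 MAP on WikiQA).
glove = load_glove("glove.840B.300d.txt")  # placeholder path
question = "how are glacier caves formed"
candidates = [
    "a glacier cave is a cave formed within the ice of a glacier",
    "the ice facade is approximately 60 m high",
]
q_vec = sentence_vector(question, glove)
scores = [cosine(q_vec, sentence_vector(c, glove)) for c in candidates]
```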

aneesh-joshi commented 6 years ago

@N-A-D-I-A
Apparently not. The SeqMatchSeq report is on the test split. I asked the author: https://github.com/shuohangwang/SeqMatchSeq/issues/11#issuecomment-410465669

pcgreat commented 6 years ago

If you use pytorch 0.1.12 on commit 7fe06d3, the accuracy is ~72%. But if you are on pytorch 0.4, the accuracy is much lower, like 62%. I don't have time to investigate why that happens, but if anyone figures out the reason, please let me know.