tomhosking / hrq-vae

Hierarchical Sketch Induction for Paraphrase Generation (Hosking et al., ACL 2022)
MIT License

Training/Dev/Test split: splitforgeneval vs. training-triples #2

Closed · guangsen-wang closed this issue 2 years ago

guangsen-wang commented 2 years ago

Hi, thanks for sharing the wonderful project and data. I am trying to use the released data for training my own T5-based paraphrasing model. However, there are multiple sets of train/dev/test.jsonl files under different folders. For example,

for paralex:

  1. wikianswers-para-splitforgeneval
  2. training-triples/wikianswers-triples-chunk-extendstop-realexemplars-resample-drop30-N5-R100/ (BTW, this name also does not match the one specified in the conf folder)

for qqp:

  1. qqp-splitforgeneval
  2. training-triples/qqp-clusters-chunk-extendstop-realexemplars-resample-drop30-N5-R100/

I also found what look like "overlaps" between the train and test sets under the same folder, for example:

grep 'Do astrology really work' qqp-splitforgeneval/test.jsonl
{"tgt": "Dose astrology really work?", "syn_input": "Dose astrology really work?", "sem_input": "Do astrology really work?", "paras": ["Dose astrology really work?"]}

VS.

grep 'Dose astrology really work?' qqp-splitforgeneval/train.jsonl
{"tgt": "Dose astrology really work?", "syn_input": "Dose astrology really work?", "sem_input": "Does Rashi prediction really work?", "paras": ["Dose astrology really work?", "Does astrology works?", "Do astrology really work?", "Does astrology really work, I mean the online astrology?"]}

My questions are:

  1. What is the relationship between qqp-splitforgeneval and training-triples?
  2. If I want to compare results with the paper, which set should I use, i.e. splitforgeneval or training-triples? (I do not need the "syn_input" utterances.)
  3. Is it safe to assume there are no overlaps among the train/dev/test sets under the same folder? For example, is it possible for a test "sem_input" to appear in train.jsonl under a different folder?

Thanks and I appreciate your help.

tomhosking commented 2 years ago

Hi, thanks for your interest in our project! And thanks for noticing that the dataset name differs from the one in the config; I will check that.

The four folders you list serve two different purposes: the training-triples versions contain the data used to train the model, while the *-splitforgeneval versions contain the splits used for evaluating generated paraphrases.

So, tl;dr: to evaluate your model, you should use qqp-splitforgeneval, with the sem_input as input and the paras as references.
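As an illustration, a minimal evaluation loop over a splitforgeneval file might look like the sketch below. The generate_paraphrase function is a placeholder for your own model, and NLTK's corpus BLEU with whitespace tokenisation is used purely for illustration; it is not necessarily the exact scoring setup from the paper.

```python
import json

from nltk.translate.bleu_score import corpus_bleu


def generate_paraphrase(text: str) -> str:
    # Placeholder: swap in your own T5 (or other) paraphrasing model here.
    return text


hypotheses, references = [], []
with open("qqp-splitforgeneval/test.jsonl") as f:
    for line in f:
        example = json.loads(line)
        # sem_input is the source sentence; paras are the reference paraphrases.
        hypotheses.append(generate_paraphrase(example["sem_input"]).split())
        references.append([p.split() for p in example["paras"]])

# Corpus-level BLEU against the multi-reference paras field.
print(f"BLEU: {corpus_bleu(references, hypotheses) * 100:.2f}")
```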

It should not be possible for the same sem_input to appear in both train and test. For Paralex, the clusters were created by comparing strings, so this should definitely not happen. For MSCOCO, I used the public train/dev/test splits, so again there should be no duplication. For QQP, I used the question IDs to do the clustering, so if the same question appears twice with different IDs then it is possible for it to appear twice. But please let me know if you find many more examples and I can double-check.

guangsen-wang commented 2 years ago

Thanks for the quick reply.

grep -f qqp-splitforgeneval/test_inputs.txt training-triples/qqp-clusters-chunk-extendstop-realexemplars-resample-drop30-N5-R100/train.jsonl produces a large number of utterances that appear in both the test and train sets (again, qqp-clusters-chunk-extendstop-realexemplars-resample-drop30-N5-R100 is not the same name as the one in the config, which has N26).
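To quantify this, a rough check along the following lines can be used (a sketch only, assuming the training triples use the same sem_input/syn_input/tgt fields shown in the examples above):

```python
import json


def load_sentences(path, fields):
    """Collect the values of the given JSONL fields into a set of strings."""
    sentences = set()
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            sentences.update(example[field] for field in fields)
    return sentences


test_inputs = load_sentences("qqp-splitforgeneval/test.jsonl", ["sem_input"])
train_sentences = load_sentences(
    "training-triples/qqp-clusters-chunk-extendstop-realexemplars-resample-drop30-N5-R100/train.jsonl",
    ["sem_input", "syn_input", "tgt"],
)

leaked = test_inputs & train_sentences
print(f"{len(leaked)} of {len(test_inputs)} test inputs also appear in the training data")
```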

Even under the same folder:

grep 'Would Muhammad Ali beat Bruce Lee?' qqp-splitforgeneval/test.jsonl
{"tgt": "Who would win a fight Bruce Lee or Muhammad Ali?", "syn_input": "Who would win a fight Bruce Lee or Muhammad Ali?", "sem_input": "Would Muhammad Ali beat Bruce Lee?", "paras": ["Who would win a fight Bruce Lee or Muhammad Ali?"]}
grep 'Would Muhammad Ali beat Bruce Lee?' qqp-splitforgeneval/train.jsonl
{"tgt": "Would Muhammad Ali beat Bruce Lee?", "syn_input": "Would Muhammad Ali beat Bruce Lee?", "sem_input": "Who would win in a fight, Bruce Lee or Muhammad Ali?", "paras": ["Who's better: Bruce Lee or Muhammad Ali?", "Would Muhammad Ali beat Bruce Lee?", "Who would win a fight Bruce Lee or Muhammad Ali?"]}

Am I missing something?

tomhosking commented 2 years ago

Thanks for bringing this to my attention - this shouldn't be happening! I will check the code that I used to construct the QQP splits and get back to you. The other two datasets look OK, though.

guangsen-wang commented 2 years ago

Thanks! For the Paralex dataset, there are also 236 utterances out of 27778 that appear in both wikianswers-para-splitforgeneval/test.jsonl and training-triples/wikianswers-triples-chunk-extendstop-realexemplars-resample-drop30-N5-R100/train.jsonl, such as:

grep 'what was the name of sacagawea parents' training-triples/wikianswers-triples-chunk-extendstop-realexemplars-resample-drop30-N5-R100/train.jsonl

grep 'what was the name of sacagawea parents' wikianswers-para-splitforgeneval/test.jsonl
{"tgt": "what is sacagawea fathers name ?", "syn_input": "what is sacagawea fathers name ?", "sem_input": "what was the name of sacagawea parents ?", "paras": ["sacagawea fathers name ?", "where did sacagawea family live ?", "sacagawea major childhood events ?", "what is sacagawea dads name ?", "what is sacagawea fathers name ?"]}

tomhosking commented 2 years ago

For QQP, I used some pre-existing train/dev splits and further split dev into dev+test - unfortunately, it looks like these splits had overlapping questions! I also can't find exactly where I sourced the splits from.

For Paralex, it's possible there was a bug in the code I used to build the clusters, which meant some clusters were not merged despite containing the same sentences. Thanks for drawing both of these to my attention!

Note that all the results reported in the paper used the same dataset splits - so the test scores are probably slightly higher than they should be (due to the train/test leak), but the leak will have affected all the models equally, so the overall conclusions are still valid.

If you want to create new splits for both datasets, I'd be happy to retrain my model and report the updated results. I can also share the code I used to construct the Paralex clusters, if that would be useful?

guangsen-wang commented 2 years ago

Hi, thanks so much for the clarification, I really appreciate it. Your clustering code for Paralex would definitely be helpful.

Just to be a little more precise, the numbers of train/test leaks are: Paralex 236/27778, MSCOCO 48/5000, QQP 1642/5225. The impact on Paralex and MSCOCO is therefore probably negligible. For QQP, however, the results are heavily biased, since almost a third of the test utterances appear in the training data. What I am planning to do is to remove all the test utterances from the training set and retrain my model:

  1. Discard every line in training-triples/qqp-clusters-chunk-extendstop-realexemplars-resample-drop30-N5-R100/train.jsonl and dev.jsonl that contains a test utterance (in sem_input, tgt or syn_input) - see the sketch below.
  2. Select all unique (sem_input, tgt) pairs as the new training data for my own model.
  3. Evaluate on the original test set.

Is this a fair setup compared to your model training pipeline? Thanks
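A minimal sketch of step 1, assuming the same sem_input/syn_input/tgt field names as in the examples above; here "test utterance" is taken to mean any of the sem_input, tgt or paras strings in the test file, and the .dedup.jsonl output names are just placeholders:

```python
import json

# Collect every test utterance that should be excluded from the training data.
test_utterances = set()
with open("qqp-splitforgeneval/test.jsonl") as f:
    for line in f:
        example = json.loads(line)
        test_utterances.update([example["sem_input"], example["tgt"], *example["paras"]])

TRIPLES_DIR = "training-triples/qqp-clusters-chunk-extendstop-realexemplars-resample-drop30-N5-R100"

for split in ("train", "dev"):
    kept, dropped = 0, 0
    with open(f"{TRIPLES_DIR}/{split}.jsonl") as fin, \
            open(f"{TRIPLES_DIR}/{split}.dedup.jsonl", "w") as fout:
        for line in fin:
            example = json.loads(line)
            # Drop any training line whose sem_input, syn_input or tgt matches a test utterance.
            if {example["sem_input"], example["syn_input"], example["tgt"]} & test_utterances:
                dropped += 1
            else:
                fout.write(line)
                kept += 1
    print(f"{split}: kept {kept}, dropped {dropped}")
```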

tomhosking commented 2 years ago

Yes, that sounds like a sensible approach. I have also added a 'deduped' version of the datasets here that you can use directly - I've removed any instances from the training data that overlap at all with dev or test. I'll also retrain my model on this dataset to check what impact the leak has.

tomhosking commented 2 years ago

I've remembered where the QQP splits came from originally - they're the splits provided by GLUE.

guangsen-wang commented 2 years ago

Thanks, really appreciate the effort. I will definitely try the 'deduped' qqp. Looking forward to your new results on this set as well.

tomhosking commented 2 years ago

The updated HRQ-VAE results on the deduped set are (BLEU/self-BLEU/iBLEU): 30.53/40.22/16.38. So the model does take a performance hit, but these scores are still higher than those of all the other comparison systems (even when they are trained on the leaky split), so I'm not concerned about the conclusions in the paper. But I agree it would be better to use the deduped splits going forward :)
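For reference, these scores appear to be consistent with the usual iBLEU weighting of alpha = 0.8, i.e. iBLEU = alpha * BLEU - (1 - alpha) * self-BLEU:

```python
# Quick sanity check of the reported scores with the usual alpha = 0.8 weighting.
bleu, self_bleu, alpha = 30.53, 40.22, 0.8
print(f"iBLEU = {alpha * bleu - (1 - alpha) * self_bleu:.2f}")  # iBLEU = 16.38
```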

hahally commented 2 years ago

Hello, when I trained on the MSCOCO dataset the results were much worse: BLEU was only 8.x, including for the BTMPG comparison model. Why might that be?

tomhosking commented 2 years ago

Hi @hahally, is your issue related to overlap between train/test splits or is it a different problem?

hahally commented 2 years ago

Thanks for the quick reply.

It may be a different problem. I am trying to reproduce the experimental results, but BLEU is always low on the MSCOCO data, while it is normal on the Quora data. I don't know why.

tomhosking commented 2 years ago

@hahally Please open a separate issue, and provide details of how you are running the model and performing evaluation. Thanks.