tomhosking / hrq-vae

Hierarchical Sketch Induction for Paraphrase Generation (Hosking et al., ACL 2022)
MIT License

Regarding evaluation with multiple references #9

Closed yafuly closed 1 year ago

yafuly commented 1 year ago

Hi, thanks for your great project!

I wonder how to evaluate with multiple references (e.g., MSCOCO).

BLEU (being precision-based) supports multiple references inherently, but what about ROUGE scores?

tomhosking commented 1 year ago

Hi, is there any particular reason you want to use ROUGE instead of BLEU? ROUGE was designed for summarization tasks and isn't generally used in other situations. Note that you should also use iBLEU instead of BLEU since we want output that is different to the input, and a simple copy baseline actually performs very well in terms of just BLEU.
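(For concreteness, here is a minimal sketch of how iBLEU could be computed with sacrebleu, assuming the usual formulation iBLEU = alpha * BLEU(outputs, references) - (1 - alpha) * BLEU(outputs, inputs); the function and variable names are illustrative, not code from this repo.)

```python
# Minimal iBLEU sketch (assumes sacrebleu is installed; names are illustrative).
import sacrebleu

def ibleu(outputs, references, inputs, alpha=0.8):
    # BLEU against the references: `references` is a list of reference streams,
    # each stream aligned with `outputs`.
    bleu_ref = sacrebleu.corpus_bleu(outputs, references).score
    # BLEU against the inputs, which penalises outputs that simply copy the input.
    bleu_src = sacrebleu.corpus_bleu(outputs, [inputs]).score
    return alpha * bleu_ref - (1 - alpha) * bleu_src
```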

yafuly commented 1 year ago

Thanks for your quick reply.

I'm actually new to paraphrase research (bear with me if I ask naive questions :) ). I noticed that a line of work on syntax-controlled paraphrasing uses the ROUGE score as one of its metrics. They also conducted experiments on the QQP-pos dataset. Now I understand that you used multi-reference BLEU scores.

I'm struggling to train a simple seq2seq (Transformer) baseline on your MSCOCO data, and I get very low validation BLEU scores (10.x). Do you have any suggestions?

Thanks again for helping.

tomhosking commented 1 year ago

ROUGE was originally proposed for summarization (Lin, 2004) - I haven't seen any evidence that it's better than BLEU for paraphrase evaluation.

I'm sorry but it's difficult to know what the problem might be without knowing a lot more detail about what setup you're running. Which files from the MSCOCO data are you using to train the model? What codebase are you using? One thing to check is whether your BLEU implementation is case sensitive, and whether you are lowercasing your inputs/outputs.
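(To illustrate the case-sensitivity point, here is a quick check with sacrebleu, which is case-sensitive by default but accepts a lowercase option; the example strings are made up.)

```python
# Quick check of how much casing affects BLEU (sacrebleu assumed installed).
import sacrebleu

hyps = ["A man riding a bike down the street ."]
refs = [["a man rides a bike down the street ."]]  # one reference stream

# Case-sensitive comparison (sacrebleu's default behaviour).
print(sacrebleu.corpus_bleu(hyps, refs).score)

# Lowercased comparison, roughly equivalent to lowercasing inputs/outputs first.
print(sacrebleu.corpus_bleu(hyps, refs, lowercase=True).score)
```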

yafuly commented 1 year ago

Yes, I know it's widely used in summarization tasks. These are the papers I mentioned that use ROUGE scores: https://aclanthology.org/2020.tacl-1.22.pdf and https://aclanthology.org/2021.emnlp-main.420.pdf

Below are the implementation details:
- Model config: Transformer base; batch size: 16000*4
- Codebase: Fairseq
- Data: mscoco-all
- BLEU: I computed case-sensitive BLEU using multi-bleu.perl

Regarding the MSCOCO data, I have several questions:
1. Are all texts lowercased?
2. How are the training instances constructed from the 5 captions per image? E.g., are they built by enumerating all possible two-caption combinations?
3. Is the valid set simply a random split of the aforementioned train set?

tomhosking commented 1 year ago

Neither of those papers compare the generated output to the input - this will give misleading results, because based on BLEU alone a copy baseline will outperform their system. So, I don't think their evaluation method is particularly good.

Is your batch size actually 64,000, or is that a typo? This is extremely large! I would try something more like 64.

The dataset mscoco-all does not contain any paraphrase pairs - you probably want to use mscoco-clusters. Or, just use the original dataset.

1. I believe the references are not lowercased, but our model used the bert-base-uncased vocab and so produced lowercase output. Therefore, for our results, we lowercased the references before calculating BLEU (using sacrebleu).
2. Yes, the full training set is given by all pairwise combinations from each cluster of 5.
3. The valid/test sets are the same as the original MSCOCO splits, except that we (randomly) select one paraphrase from each cluster to use as input and keep the other 4 as reference outputs. This is the split under mscoco-eval.
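(To illustrate point 2, here is a minimal sketch of how pairwise training instances could be built from a cluster of 5 captions; it is not the repo's actual preprocessing code, and the captions are made up.)

```python
# Build ordered (input, target) paraphrase pairs from caption clusters.
# Illustrative sketch only; not the repo's actual preprocessing script.
from itertools import permutations

def cluster_to_pairs(cluster):
    # Every ordered pair of distinct captions: a cluster of 5 yields 5*4 = 20 pairs.
    return [(src, tgt) for src, tgt in permutations(cluster, 2)]

clusters = [
    ["a man rides a bike", "a person on a bicycle", "someone cycling down a road",
     "a cyclist on the street", "a man riding his bike"],
]
train_pairs = [pair for cluster in clusters for pair in cluster_to_pairs(cluster)]
print(len(train_pairs))  # 20
```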

yafuly commented 1 year ago

I agree. I think more effort should be put into evaluating the semantic coherence between the input and the hypothesis, while also taking diversity into account.

Sorry, it's a typo. I meant a batch of at most 64,000 tokens (which is a common setting in machine translation). Are you suggesting that a smaller batch size works better for paraphrase generation?

Thanks for your detailed suggestions. Is it safe to use the full training set to train a seq2seq Transformer from scratch, without any initialization from pre-trained models? I wonder how the model learns a one-to-many correspondence in a vanilla seq2seq framework.

Yes, I see in your paper that the evaluation is conducted on the official valid split. Which validation split did you use during training?

tomhosking commented 1 year ago

Yes, I would suggest a smaller batch size - the captions are roughly 10 tokens each and I used a batch size of 64 samples, so try 500-1000 tokens per batch perhaps? It will of course depend on your exact model setup.

Yes, I had no issues training a Transformer from scratch, but it wasn't a particularly large model: 5 layers each for the encoder and decoder, and dimension 768.

My intuition is that the model does not learn a one-to-many correspondence - it learns to generate the most likely paraphrase. In order to capture the one-to-many nature of the problem, I think some sort of control mechanism or latent variable is required (e.g. the sketches that I used in my work).
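(As a loose illustration of the "control mechanism" idea, and not the HRQ-VAE sketch mechanism itself: one simple stand-in is to prepend a discrete control code to the encoder input so that the same source can be steered towards different targets. The codes below are made up.)

```python
# Toy illustration of conditioning a seq2seq model with a control code.
# This is NOT the HRQ-VAE sketch mechanism, just a simple stand-in.
def add_control_code(source, code):
    # Prepend a special token such as "<sketch_1>" so the decoder can be
    # steered towards one of several valid paraphrases of the same source.
    return f"<{code}> {source}"

source = "a man rides a bike down the street"
for code in ["sketch_0", "sketch_1", "sketch_2"]:
    print(add_control_code(source, code))
```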

Sorry, I forgot that the MSCOCO test set is not public. Yes, I used the validation split for testing, and split the training set randomly (probably 90/10, I can't remember exactly) to generate new train/valid splits.
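(For completeness, a minimal sketch of such a random 90/10 split over clusters; the exact ratio and seed used for the paper may differ, and the function name is illustrative.)

```python
# Random 90/10 split of training clusters into new train/valid sets.
# Illustrative only; the ratio/seed used for the paper may differ.
import random

def split_clusters(clusters, valid_fraction=0.1, seed=0):
    clusters = list(clusters)
    random.Random(seed).shuffle(clusters)
    n_valid = int(len(clusters) * valid_fraction)
    return clusters[n_valid:], clusters[:n_valid]  # (train, valid)
```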

I'll close this issue now since it's not directly related to HRQ-VAE - please feel free to open another if you have any problems running my model. Thanks.