Closed Paulmzr closed 5 months ago
Hi, thanks a lot :)
Yes, unfortunately the SimulEval tool changes very often, with many breaking changes, and I found it difficult to compare results between different versions (and, sometimes, between different commits of the same version). For example, between versions 1.0 and 1.1 the tool changed substantially and the agents had to be refactored to work with the new version. Therefore, I suggest explicitly stating the version and, if possible, also the commit of SimulEval in your work, and using the version reported in this repo if you are interested in replicating the results, as I did for EdAtt (version) and for the latest work, AlignAtt (version and commit).
In summary, it is not possible to compare results between the two versions, but this is due to the SimulEval tool itself, not to the specific agent.
Hope that I helped!
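To make the version pinning above actionable, here is a minimal sketch of how you might record the installed SimulEval version alongside your results. This assumes the package is installed under the name `simuleval`; the helper function name is my own, not part of SimulEval's API.

```python
# Sketch: record the exact SimulEval version alongside your results so that
# numbers produced with different tool versions are never silently compared.
from importlib.metadata import version, PackageNotFoundError

def tool_version(package: str) -> str:
    """Return the installed version of `package`, or 'unknown' if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "unknown"

# Log this string into your results file or experiment config.
print("SimulEval version:", tool_version("simuleval"))
```

To pin a specific commit rather than a release, you can install directly from the repository with `pip install git+https://github.com/facebookresearch/SimulEval.git@<commit>`, where `<commit>` is the hash reported in the paper or repo.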
Thanks for your kind suggestions :)
I'm still a bit confused. Which version, v1.0.2 or v1.1.4, do you think would be more reasonable for reproducing the EdAtt results and comparing with our work? In this EdAtt repo, v1.0.2 is recommended, but from the results above, it seems that the results reproduced with v1.1.4 are better.
I would say the same version that you are using to evaluate your policy. If you are using version 1.1, then the v1.1 results that you have obtained for EdAtt are the right ones.
Got it. Thanks for your reply.
Hi, thanks for your great work!
When I try to reproduce the EdAtt results, I find that the results are inconsistent between SimulEval v1.0.2 and v1.1.4.
I used the checkpoints provided by BugConformer for MuST-C en-de and the global cmvn file from the EdAtt repo.