Closed: sirmammingtonham closed this issue 4 years ago
So I ended up retraining the IE model, and after using it with the extractor I managed to replicate the results in the paper. It appears as though there might be some issue with the model linked on your fork of the data2text repo and the current version of Lua Torch, as the accuracy of extracted records is greatly reduced for some reason.
Regarding the question about RG#: yes, you divide "nodup correct" by the count of generated summaries. I am unsure about the issues you mention with the model linked on the repo; I may look into it later.
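For example, with made-up numbers (a small sketch; the values below are purely illustrative, not from the paper):

```python
# Hypothetical values: suppose extractor.lua reports "nodup correct" = 12345
# over 728 generated test summaries. RG # is just the per-summary average.
nodup_correct = 12345   # "nodup correct" from the extractor output (hypothetical)
num_summaries = 728     # number of generated summaries evaluated
rg_number = nodup_correct / num_summaries
print(round(rg_number, 2))  # -> 16.96, reported as RG #
```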
Hi ratishsp, Thanks for the great paper and especially for the accompanying code!
For some reason I'm having issues replicating the results from the paper when using the automatic eval scripts. The BLEU score is consistent, but I am getting much lower numbers for RG, CS, and CO (I've tested on the template, ws2017, and ncp+cc model outputs linked on the GitHub).
For example, when running `non_rg_metrics.py -test` on the `aaai_19_rotowire_test.txt` file linked on the GitHub, I get RG% = 0.778, CS P% = 0.286, CS R% = 0.014, and CO DLD% = 0.022. I'm getting a similar range of numbers for the rest of the models too. Just wondering if there is anything else that was done during the score calculation. I followed the "automatic evaluation using IE metrics" section in the README (using the updated IE model with the fix for number words and order of relations).
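For context, this is how I understand what `non_rg_metrics.py` is measuring (a rough sketch of my own in Python, not the repo's actual code; record extraction and tuple-file parsing are omitted):

```python
def cs_precision_recall(pred_records, gold_records):
    """Content Selection: precision/recall of unique records extracted from
    the generated summary vs. records extracted from the gold summary."""
    pred_set, gold_set = set(pred_records), set(gold_records)
    matched = len(pred_set & gold_set)
    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    return precision, recall

def dld(a, b):
    """Restricted Damerau-Levenshtein distance (adjacent transpositions)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def co_dld(pred_records, gold_records):
    """Content Ordering: normalized DLD between the two record sequences."""
    if not pred_records and not gold_records:
        return 1.0
    longest = max(len(pred_records), len(gold_records))
    return 1.0 - dld(pred_records, gold_records) / longest
```

If the metrics work roughly like this, the very low CS recall I'm seeing would mean the extractor recovers almost none of the gold records from the generated summaries, which is consistent with the extraction accuracy problem I mentioned.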
I also have a question about how RG # is calculated. extractor.lua gives a "nodup correct" number; do I divide that by the total number of generated summaries?
Thanks for the help, Ethan