ng-j-p / rouge-we

ROUGE summarization evaluation metric, enhanced with use of Word Embeddings
MIT License

evaluation result changes every time #1

Open tagucci opened 7 years ago

tagucci commented 7 years ago

When I run the example as mentioned in the README:

./ROUGE-WE-1.0.0.pl -x -n 2 -U -2 4 -e rouge_1.5.5_data/ -c 95 -a sample-config.xml

the ROUGE result is different each time.

# 1st time 
---------------------------------------------
1 ROUGE-1 Average_R: 0.22671 (95%-conf.int. 0.22671 - 0.22671)
1 ROUGE-1 Average_P: 0.26719 (95%-conf.int. 0.26719 - 0.26719)
1 ROUGE-1 Average_F: 0.24529 (95%-conf.int. 0.24529 - 0.24529)
---------------------------------------------
# 2nd time 
---------------------------------------------
1 ROUGE-1 Average_R: 0.26098 (95%-conf.int. 0.26098 - 0.26098)
1 ROUGE-1 Average_P: 0.30758 (95%-conf.int. 0.30758 - 0.30758)
1 ROUGE-1 Average_F: 0.28237 (95%-conf.int. 0.28237 - 0.28237)
---------------------------------------------
# 3rd time
---------------------------------------------
1 ROUGE-1 Average_R: 0.23381 (95%-conf.int. 0.23381 - 0.23381)
1 ROUGE-1 Average_P: 0.27556 (95%-conf.int. 0.27556 - 0.27556)
1 ROUGE-1 Average_F: 0.25297 (95%-conf.int. 0.25297 - 0.25297)
---------------------------------------------

How can I reproduce the exact same ROUGE evaluation result each time?

ng-j-p commented 7 years ago

Thanks for bringing this up. I apologize for not getting back to you on this earlier.

While I look into this, as a temporary fix, could you flush the intermediate directories each time you run an evaluation? I believe this should work; intermediate files created during the first run may be interfering with the results of later runs.
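For example, a small wrapper along these lines could flush leftover intermediate files before each run; the directory name below is only a placeholder, since the actual scratch location isn't named in this thread.

```python
# Hypothetical clean-run wrapper. TEMP_DIRS is a placeholder -- point it at
# whatever intermediate directories your evaluation actually produces.
import shutil
import subprocess

TEMP_DIRS = ["temp_rouge"]  # assumption: replace with the real scratch dirs

def clean_run():
    for d in TEMP_DIRS:
        shutil.rmtree(d, ignore_errors=True)  # flush leftovers from prior runs
    subprocess.run(
        ["./ROUGE-WE-1.0.0.pl", "-x", "-n", "2", "-U", "-2", "4",
         "-e", "rouge_1.5.5_data/", "-c", "95", "-a", "sample-config.xml"],
        check=True,
    )

if __name__ == "__main__":
    clean_run()
```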

I'll try to take a look and get back with a fix soon.

joewellhe commented 6 years ago

I read your paper and am very interested in your work. I would like to know what pre-processing you did before computing the ROUGE score. The Pearson correlation reported in your work is quite high; I tried it on the AESOP data, but my results are not as good as yours.

jpilaul commented 5 years ago

I am getting varying results as well. Have you fixed the issue?

colby-vickerson commented 5 years ago

The reason you are getting different results is a bug in the sub ngramWord2VecScore. It only calculates word2vec similarity between the first word in the model summary and each word in the peer summary. The dictionary seen_grams is filled after the first pass and never reset, so the condition ($seen_grams{$pt} <= $model_grams->{$t}) is never true again. model_grams is a hash, which is unordered in Perl, so a different key comes first each time it is iterated over. This is where the randomness comes into play: whichever word happens to come first dictates the ROUGE-WE score.
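To make the failure mode concrete, here is a rough Python sketch of the flawed loop as described above. The names mirror the Perl variables, but this is an illustration rather than the actual code, and w2v_sim stands in for the call to the word2vec web server.

```python
import random

def buggy_ngram_word2vec_score(model_grams, peer_grams, w2v_sim):
    """Illustrative re-creation of the bug.
    model_grams / peer_grams map each token to its count."""
    seen_grams = {}  # filled on the first pass and never reset -- the bug
    score = 0.0
    # Perl hash keys come back in an effectively random order, so a different
    # model token leads each run (simulated here with a shuffle).
    model_tokens = list(model_grams)
    random.shuffle(model_tokens)
    for t in model_tokens:
        for pt in peer_grams:
            seen_grams[pt] = seen_grams.get(pt, 0) + 1
            # True only while pt has been seen no more times than t occurs in
            # the model, i.e. essentially only for the first model token
            # (repeated words are the edge case noted in the footnote below).
            if seen_grams[pt] <= model_grams[t]:
                score += w2v_sim(t, pt)
    return score
```

Resetting seen_grams for each model token would presumably remove the nondeterminism, though the intended clipping semantics should be checked against the original ROUGE implementation.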

Example:

The screenshot below comes from running Rouge-WE in debug mode (-v arg added). You can see both the model and peer gram, as well as the ordering of the tokens. [screenshot]

This screenshot shows which combinations of words are being sent to the Python web server. It shows that only cat -> mat, cat -> ate, and cat -> food have word2vec calculated for them. This happens because cat is the first key in the unordered dictionary model_grams. [screenshot]

Run 2: Model = "the cat ate food", Peer = "the mat ate food"

I ran the code a second time, and here you can see the results are different for the same model and peer gram. You can also see that the ordering of model_grams is different: food comes first. [screenshot]

Sure enough, looking at the Python web server output shows that word2vec was only run on food -> mat, food -> food, and food -> ate. *This confirms that word2vec is only run on the first word in the model gram. [screenshot]

I would encourage you to stay away from this repo until the bugs are fixed. I am currently working on my own implementation of ROUGE-WE in Python that will run much faster because it does not rely on a web server.

*This is not 100% true; there are edge cases where the same word occurs multiple times in the model gram and additional combos get word2vec calculated for them.

jpilaul commented 5 years ago

Thanks Colby. Please keep us in the loop on your progress. Cheers

colby-vickerson commented 5 years ago

It would be helpful to get some feedback on how others think ROUGE-WE should be implemented. The current method (minus the bug) takes the sum of all word2vec scores and uses that for the WE part of ROUGE-WE. I was thinking that using the max might work better. In ROUGE, each ngram is compared to all other ngrams in the other summary; if a match is found a 1 is returned, if not a 0. ROUGE-WE would calculate the word2vec score for each ngram combo and take the max score. It would still use the dot product of per-word scores for ngrams. As long as the number of OOV (out-of-vocabulary) words in the ngram is less than n, the ngram would not be dropped; instead, perhaps an average vector would be used to represent OOV words (still hashing this out). Using the max score would mean that the ROUGE-WE score should always be at least as high as the ROUGE score, because the 0's (missing words) are replaced with the cosine similarity to the closest word in the other summary.

Would like to have a discussion on what others think the best implementation would be.
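For concreteness, here is a minimal Python sketch of the two aggregation strategies under discussion, shown for unigrams. cosine, we_score_sum, we_score_max, and emb are illustrative names, not functions in this repo; the n-gram case would compose per-word similarities as described above.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def we_score_sum(model_tokens, peer_tokens, emb):
    """Current behaviour (minus the bug): sum every pairwise similarity."""
    return sum(cosine(emb[m], emb[p])
               for m in model_tokens if m in emb
               for p in peer_tokens if p in emb)

def we_score_max(model_tokens, peer_tokens, emb):
    """Proposed alternative: a soft version of ROUGE's 0/1 match -- each
    model token contributes the similarity of its best peer match."""
    score = 0.0
    for m in model_tokens:
        sims = [cosine(emb[m], emb[p])
                for p in peer_tokens if m in emb and p in emb]
        if sims:
            score += max(sims)
    return score
```

With max, an exact match still contributes 1.0 (a word's cosine similarity with itself), so plain ROUGE matches are preserved while misses are softened rather than zeroed out.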

ng-j-p commented 5 years ago

Hi,

Thank you for taking the time and effort to continue development on this package. I have not been able to spend time on it.

I think you raised a valid suggestion. It is not clear, however, which approach would be better. The way I evaluated ROUGE-WE (https://arxiv.org/pdf/1508.06034.pdf) previously was to compare how well it correlates with actual pyramid/responsiveness/readability scores. Evaluation is time-consuming, of course. If it is possible, why not introduce this as a parameter and let the user decide between the two approaches?
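For illustration, one hypothetical way such a switch could look, building on the unigram sketch above (none of these names exist in the current Perl script):

```python
# Hypothetical dispatch between the two strategies sketched earlier
# (we_score_sum / we_score_max); purely illustrative, not part of ROUGE-WE.
def we_score(model_tokens, peer_tokens, emb, strategy="sum"):
    scorers = {"sum": we_score_sum, "max": we_score_max}
    if strategy not in scorers:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return scorers[strategy](model_tokens, peer_tokens, emb)
```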

Jun
