Open joewellhe opened 6 years ago
I have the exact same issue: I am not able to reproduce the high correlation scores between ROUGE and the human evaluations reported in the paper.
I get scores very similar to those reported by the OP.
Did you do any preprocessing, and if so, would it be possible to see it?
I read your paper "Better Summarization Evaluation with Word Embeddings for ROUGE" and am very interested in your work. I ran ROUGE on the same data you used, but my Pearson correlations are not as good as yours. For example, the Pearson correlation of ROUGE-2 with Pyr is 0.59 (computed with the MATLAB script provided by TAC), whereas your paper reports 0.96. How did you obtain such a high score? If you preprocessed the TAC data, could you tell me how?
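For anyone comparing numbers, here is a minimal sketch of the Pearson correlation being discussed, in plain Python. It is not the TAC MATLAB script or the paper's code. The lists `rouge2_by_system` and `pyramid_by_system` are hypothetical placeholders, not TAC results, and pairing one score per system is an assumption made only for illustration.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values only: replace with the real ROUGE-2 and Pyramid scores,
# listed in the same order (one entry per system in this sketch).
rouge2_by_system = [0.05, 0.07, 0.09, 0.11]
pyramid_by_system = [0.20, 0.31, 0.28, 0.45]

r = pearson(rouge2_by_system, pyramid_by_system)
print(f"Pearson r = {r:.3f}")
```

Running the same function on both sets of scores makes it easier to tell whether a gap like 0.59 versus 0.96 comes from the correlation computation or from the ROUGE scores themselves.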