bg428 opened this issue 2 years ago
Hi @AnanyaSaiB and @tanay2001 :)
First and foremost, I'd like to congratulate you on the awesome work you did for this paper! I've been trying to catch up with the literature on NLG evaluation and was excited to read your ideas and evaluation. I truly believe this is the way to go moving forward, and I can only imagine all the effort you put into this research 💯
If possible, I'd like to ask about a small detail. I was cross-referencing your table of results for abstractive summarization against the original results in Fabbri et al. 2020 (Table 2), and I found the numbers to be a bit different. I wonder whether there is any difference in how the correlations are computed between the two works. A couple of hypotheses I had: (1) are you using both the expert and the crowdsourced annotations when computing the Kendall tau, and (2) are you computing the correlation at the summary level or at the system level?
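For reference, here is how I understand the two levels, as a minimal sketch with made-up scores (the `kendall_tau` helper is a plain tau-a without tie correction, just for illustration):

```python
from itertools import combinations
from statistics import mean

def kendall_tau(x, y):
    """Kendall tau-a over paired score lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

# Toy data (made up): human and metric scores for 3 systems on 4 summaries.
# human[s][k] = human score of system k on summary s.
human  = [[4, 2, 1], [5, 3, 2], [3, 3, 1], [4, 1, 2]]
metric = [[0.8, 0.5, 0.3], [0.9, 0.4, 0.5], [0.6, 0.7, 0.2], [0.7, 0.2, 0.4]]

# Summary-level: correlate across systems within each summary, then average.
summary_level = mean(kendall_tau(h, m) for h, m in zip(human, metric))

# System-level: average each system's scores over summaries, then correlate once.
human_sys  = [mean(col) for col in zip(*human)]
metric_sys = [mean(col) for col in zip(*metric)]
system_level = kendall_tau(human_sys, metric_sys)

print(round(summary_level, 3), round(system_level, 3))  # → 0.75 1.0
```

As the toy numbers show, the two aggregation levels can give quite different tau values on the same scores, which is why I suspect this might explain the gap.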
I look forward to hearing back from you!
Thank you so much for your time!
Hi,
I am trying to reproduce the results reported in the paper. In figure 3, sub-figure (d) (Data-to-Text Generation), the correlation of BLEURT with "Correctness" is reported as 0.32. I computed the Kendall tau on the WebNLG 2020 data using the BLEURT-20 checkpoint given here and found the correlation to be 0.35.
Also, the deviations for BLEURT on "Change numeric values" and "Change names" in figure 4, sub-figure (e) are reported as -0.46 and -0.39, respectively, whereas I found the deviation to be -0.79. Could you please share the steps to reproduce the results in the paper, along with the checkpoints used for computing metrics such as BLEURT, BERTScore, etc.?
Thanks