li3cmz opened this issue 4 years ago
Yes, reproducing the performance of RUBER needs human annotation. But I'm sorry that I didn't save the annotations. You can annotate the responses yourself and check the correlation. In my work, I asked three students from BFS (Beijing Foreign School) to annotate the responses.
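If it helps, here is a minimal sketch of such a correlation check. The file names (`bert_ruber_scores.txt`, `human_annotations.txt`) are just placeholders for illustration, not files from this repo:

```python
# Minimal sketch: correlation between metric scores and human annotations.
# File names below are hypothetical placeholders, one score per line.
from scipy.stats import pearsonr, spearmanr

def load_scores(path):
    with open(path) as f:
        return [float(line.strip()) for line in f if line.strip()]

metric_scores = load_scores("bert_ruber_scores.txt")   # hypothetical metric output
human_scores = load_scores("human_annotations.txt")    # e.g. averaged over 3 annotators

pearson_r, pearson_p = pearsonr(metric_scores, human_scores)
spearman_r, spearman_p = spearmanr(metric_scores, human_scores)
print(f"Pearson:  {pearson_r:.4f} (p={pearson_p:.4f})")
print(f"Spearman: {spearman_r:.4f} (p={spearman_p:.4f})")
```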
But I can give you some suggestions:
Have you tried any other dataset? And are their correlations higher than 0.2?
Yes, I tried BERT-RUBER on the four benchmarks that I mentioned before. The correlations with human judgments are around 0.4, which is much better than BLEU, ROUGE, and Greedy Matching.
Actually, you can try 100 samples to check the performance (either 100 or 300 samples is appropriate). I'm so sorry that I didn't save the logs of the annotations.
Got it! Thank you for your help!
Okay, feel free to raise issues whenever you run into trouble with this project. 😄
Oh, I forgot something. The 0.4 correlation may not be very precise. Due to the differences among the datasets, the performance of RUBER and BERT-RUBER is not very stable, which is why I run each experiment 10 times and report the averaged results as the final performance.
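As a rough illustration, the averaging over runs could look like the sketch below; `run_experiment` is a hypothetical placeholder for training the score model once with a given seed and returning its correlation with human judgments:

```python
# Sketch: average the correlation over several runs with different seeds.
# run_experiment is a hypothetical placeholder, not a function from this repo.
import random
import numpy as np

def run_experiment(seed):
    random.seed(seed)
    np.random.seed(seed)
    # ... train the score model and compute its correlation with human scores
    return 0.0  # placeholder value

correlations = [run_experiment(seed) for seed in range(10)]
print(f"mean correlation over 10 runs: {np.mean(correlations):.4f}")
```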
Actually, you can simply compare the performance against the word-overlap-based and embedding-based metrics (BLEU, ROUGE, METEOR, BERTScore, and so on; I will push a new commit containing the other baseline metrics in a few days).
If BERT-RUBER performs much better than them (around a 10% correlation improvement, I guess), you can be confident that you have reproduced this learning-based metric.
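For a quick sanity check against those baselines, something like the following sketch could work. It uses `nltk` and the `bert-score` package rather than this repo's own scripts, and the example sentences are made up:

```python
# Sketch: compute a word-overlap baseline (BLEU) and BERTScore for one pair.
# Not the repo's official evaluation script; sentences are toy examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score

reference = "i am fine thank you"
candidate = "i am fine thanks"

smooth = SmoothingFunction().method1
bleu = sentence_bleu([reference.split()], candidate.split(), smoothing_function=smooth)

# BERTScore works on raw strings (lists of candidates and references)
P, R, F1 = bert_score([candidate], [reference], lang="en")

print(f"BLEU-4: {bleu:.4f}, BERTScore F1: {F1.mean().item():.4f}")
```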
OK, thanks for the detailed answers. Have you ever tried training on DailyDialog and testing directly on another dataset? How is its performance in that setting?
And can the context only come from one speaker?
I haven't verified the transfer-learning performance shown in RUBER; I will verify this aspect in the future. Actually, I don't think this experiment is essential: if the pretrained score model is released, I think its effectiveness will be proven.
You want to make sure that the learning-based metrics can be applied to multi-turn or multi-party dialogue systems. I think that is quite easy: you only need to find a good strategy for encoding the conversation context, and I think BERT is still useful for that.
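For example, one simple (hypothetical) strategy is to join the context turns with `[SEP]` and take the `[CLS]` vector from BERT as the context representation; the sketch below uses the Hugging Face `transformers` library, not code from this repo:

```python
# Sketch: encode a multi-turn context by joining turns with [SEP]
# and using the [CLS] embedding as a single context vector.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

context_turns = ["how was your day?", "pretty good, i went hiking.", "where did you go?"]
text = f" {tokenizer.sep_token} ".join(context_turns)

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

# [CLS] embedding: (batch, hidden) -> (hidden,)
context_vector = outputs.last_hidden_state[:, 0, :].squeeze(0)
print(context_vector.shape)  # torch.Size([768])
```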
It may include sample_300.txt, sample_300_tgt.txt, and pred.txt.
Looking forward to your reply~