gmftbyGMFTBY / RUBER-and-Bert-RUBER

Implementation of RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems

Could I get your 300 pairs to check whether I recover the model exactly? #2

Open li3cmz opened 4 years ago

li3cmz commented 4 years ago

They might include sample_300.txt, sample_300_tgt.txt, and pred.txt.

Looking forward to your reply~

gmftbyGMFTBY commented 4 years ago

Yes, reproducing the performance of RUBER needs the human annotations, but I'm sorry that I didn't save them. You can try to annotate the responses yourself and check the correlation. In my work, I asked three students from BFS (Beijing Foreign School) to annotate the responses.
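For reference, a minimal sketch of that correlation check with SciPy, assuming one human score and one metric score per line (the file names are placeholders, not files from this repo):

```python
# Minimal sketch: correlate metric scores with human annotations.
# File names are placeholders for your own annotated sample.
from scipy.stats import pearsonr, spearmanr

with open('human_scores.txt') as f:
    human = [float(line) for line in f]
with open('metric_scores.txt') as f:
    metric = [float(line) for line in f]

pearson, p1 = pearsonr(human, metric)
spearman, p2 = spearmanr(human, metric)
print(f'Pearson:  {pearson:.4f} (p={p1:.4f})')
print(f'Spearman: {spearman:.4f} (p={p2:.4f})')
```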

But I can give you some suggestions:

li3cmz commented 4 years ago

Have you tried it on any other datasets? And are their correlations higher than 0.2?

gmftbyGMFTBY commented 4 years ago

Yes, I tried BERT-RUBER on the four benchmarks that I mentioned before. The correlations with human judgments are around 0.4, which is much better than BLEU, ROUGE, and Greedy Matching.

gmftbyGMFTBY commented 4 years ago

Actually, you can try 100 samples to check the performance (100 or 300 samples are both appropriate). I'm so sorry that I didn't save the logs of the annotations.

li3cmz commented 4 years ago

Got it! Thank you for your help!

gmftbyGMFTBY commented 4 years ago

Okay, feel free to raise issues when you run into trouble with this project. 😄

gmftbyGMFTBY commented 4 years ago

Oh, I forgot something. The 0.4 correlation may not be very precise. Due to the differences among the datasets, the performance of RUBER and BERT-RUBER is not very stable, which is why I run each experiment 10 times and report the averaged results as the final performance.
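For what it's worth, the averaging itself is just something like this (`train_and_evaluate` is a hypothetical stand-in for one full training run that returns the correlation):

```python
# Sketch: average the correlation over 10 repeated runs.
import statistics

correlations = [
    train_and_evaluate(seed=seed)  # hypothetical: one full run -> correlation
    for seed in range(10)
]
print(f'mean={statistics.mean(correlations):.4f}, '
      f'stdev={statistics.stdev(correlations):.4f}')
```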

Actually, you can simply compare the performance with the word-overlap-based and embedding-based metrics (BLEU, ROUGE, METEOR, BERTScore, and so on; I will push a new commit that contains the other baseline metrics in a few days).

If the performance of BERT-RUBER is much better than them (around a 10% correlation improvement, I guess), you can be confident that you have reproduced this learning-based metric.
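As a rough sketch of such a comparison, here is one word-overlap baseline using NLTK's sentence-level BLEU (the variables `human`, `refs`, and `preds` are placeholders for your annotated sample):

```python
# Sketch: score a word-overlap baseline and correlate it with human judgments.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import pearsonr

smooth = SmoothingFunction().method1

def bleu_scores(refs, preds):
    # refs/preds: lists of reference and generated responses (plain strings).
    return [
        sentence_bleu([r.split()], p.split(), smoothing_function=smooth)
        for r, p in zip(refs, preds)
    ]

# human, refs, preds are placeholders for your 100-300 annotated pairs.
# bleu_corr, _ = pearsonr(human, bleu_scores(refs, preds))
# A correct BERT-RUBER reproduction should beat bleu_corr clearly.
```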

li3cmz commented 4 years ago

OK, thanks for the detailed answers. And have you ever tried training on DailyDialog and directly testing on another dataset? How is its performance?

li3cmz commented 4 years ago

And can the context only contain one speaker?

gmftbyGMFTBY commented 4 years ago
1. I didn't verify the transfer-learning performance that is shown in the RUBER paper; I will verify this aspect in the future. Actually, I think this experiment is not essential: if the pretrained score model is released, I think its effectiveness will be proven.

2. You want to make sure that the learning-based metrics can be applied to multi-turn or multi-party dialogue systems. I think it is very easy. You only need a good strategy for encoding the conversation context, and I think BERT is still useful; see the sketch after this list:

   - Simply feed the whole conversation context into BERT and obtain one sentence embedding.
   - Feed each utterance of the conversation context into BERT separately to obtain its embedding, then add the embeddings up.
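A minimal sketch of both strategies with the Hugging Face transformers library (mean pooling is just one reasonable pooling choice here; this is illustrative, not the code of this repo):

```python
# Sketch of the two context-encoding strategies (illustrative only).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def embed(text):
    # Mean-pool BERT's last hidden states into one sentence embedding.
    inputs = tokenizer(text, return_tensors='pt', truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # [1, seq_len, 768]
    return hidden.mean(dim=1).squeeze(0)            # [768]

context = ['How are you?', 'Fine, thanks.', 'What are you doing?']

# Strategy 1: feed the whole context to BERT in one pass.
ctx_concat = embed(' [SEP] '.join(context))

# Strategy 2: embed each utterance separately, then add the embeddings up.
ctx_sum = torch.stack([embed(u) for u in context]).sum(dim=0)
```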