gmftbyGMFTBY / RUBER-and-Bert-RUBER

Implementation of RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems

Could I get your 300 pairs to check whether I recover the model exactly? #2

Open li3cmz opened 4 years ago

li3cmz commented 4 years ago

They might include sample_300.txt, sample_300_tgt.txt, and pred.txt.

Looking forward to your reply~

gmftbyGMFTBY commented 4 years ago

Yes, reproducing the performance of RUBER needs the human annotations, but I'm sorry that I didn't save them. You can try to annotate the responses yourself and check the correlation. In my work, I asked three students from BFS (Beijing Foreign School) to annotate the responses.
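For reference, a minimal sketch of that correlation check with SciPy, assuming one human score and one metric score per line (the file names are placeholders, not files from this repo):

```python
# Minimal sketch: correlate metric scores with human annotations.
# File names are placeholders for your own annotated sample.
from scipy.stats import pearsonr, spearmanr

with open('human_scores.txt') as f:
    human = [float(line) for line in f]
with open('metric_scores.txt') as f:
    metric = [float(line) for line in f]

pearson, p1 = pearsonr(human, metric)
spearman, p2 = spearmanr(human, metric)
print(f'Pearson:  {pearson:.4f} (p={p1:.4f})')
print(f'Spearman: {spearman:.4f} (p={p2:.4f})')
```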

But I can give you some suggestions:

li3cmz commented 4 years ago

Have you tried it on any other datasets? And are their correlations higher than 0.2?

gmftbyGMFTBY commented 4 years ago

Yes, I tried BERT-RUBER on the four benchmarks that I mentioned before. The correlations with human judgments are around 0.4, which is much better than BLEU, ROUGE, and Greedy Matching.

gmftbyGMFTBY commented 4 years ago

Actually, you can try 100 samples to check the performance (100 or 300 samples are both appropriate). I'm so sorry that I didn't save the logs of the annotations.

li3cmz commented 4 years ago

Got it! Thank you for your help!

gmftbyGMFTBY commented 4 years ago

Okay, feel free to raise issues when you run into trouble with this project. 😄

gmftbyGMFTBY commented 4 years ago

Oh, I forgot something. The 0.4 correlation may not be very precise. Due to the differences among the datasets, the performance of RUBER and BERT-RUBER is not very stable, which is why I run each experiment 10 times and report the averaged results as the final performance.
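For what it's worth, the averaging itself is just something like this (`train_and_evaluate` is a hypothetical stand-in for one full training run that returns the correlation):

```python
# Sketch: average the correlation over 10 repeated runs.
import statistics

correlations = [
    train_and_evaluate(seed=seed)  # hypothetical: one full run -> correlation
    for seed in range(10)
]
print(f'mean={statistics.mean(correlations):.4f}, '
      f'stdev={statistics.stdev(correlations):.4f}')
```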

Actually, you can simply compare the performance with the word-overlap-based and embedding-based metrics (BLEU, ROUGE, METEOR, BERTScore, and so on; I will push a new commit that contains the other baseline metrics in a few days).

If the performance of BERT-RUBER is much better than them (around a 10% correlation improvement, I guess), you can be confident that you have reproduced this learning-based metric.
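As a rough sketch of such a comparison, here is one word-overlap baseline using NLTK's sentence-level BLEU (the variables `human`, `refs`, and `preds` are placeholders for your annotated sample):

```python
# Sketch: score a word-overlap baseline and correlate it with human judgments.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import pearsonr

smooth = SmoothingFunction().method1

def bleu_scores(refs, preds):
    # refs/preds: lists of reference and generated responses (plain strings).
    return [
        sentence_bleu([r.split()], p.split(), smoothing_function=smooth)
        for r, p in zip(refs, preds)
    ]

# human, refs, preds are placeholders for your 100-300 annotated pairs.
# bleu_corr, _ = pearsonr(human, bleu_scores(refs, preds))
# A correct BERT-RUBER reproduction should beat bleu_corr clearly.
```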

li3cmz commented 4 years ago

OK, thanks for the detailed answers. And have you ever tried training on DailyDialog and directly testing on another dataset? How is its performance?

li3cmz commented 4 years ago

And can the context only contain one speaker?

gmftbyGMFTBY commented 4 years ago
1. I didn't verify the transfer-learning performance that is shown in the RUBER paper; I will verify this aspect in the future. Actually, I think this experiment is not essential: if the pretrained score model is released, I think its effectiveness will be proven.

2. You want to make sure that the learning-based metrics can be applied to multi-turn or multi-party dialogue systems. I think it is very easy. You only need a good strategy for encoding the conversation context, and I think BERT is still useful; see the sketch after this list:

   - Simply feed the whole conversation context into BERT and obtain one sentence embedding.
   - Feed each utterance of the conversation context into BERT separately to obtain its embedding, then add the embeddings up.
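A minimal sketch of both strategies with the Hugging Face transformers library (mean pooling is just one reasonable pooling choice here; this is illustrative, not the code of this repo):

```python
# Sketch of the two context-encoding strategies (illustrative only).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def embed(text):
    # Mean-pool BERT's last hidden states into one sentence embedding.
    inputs = tokenizer(text, return_tensors='pt', truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # [1, seq_len, 768]
    return hidden.mean(dim=1).squeeze(0)            # [768]

context = ['How are you?', 'Fine, thanks.', 'What are you doing?']

# Strategy 1: feed the whole context to BERT in one pass.
ctx_concat = embed(' [SEP] '.join(context))

# Strategy 2: embed each utterance separately, then add the embeddings up.
ctx_sum = torch.stack([embed(u) for u in context]).sum(dim=0)
```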