salesforce / simpletod

Official repository for "SimpleTOD: A Simple Language Model for Task-Oriented Dialogue"

https://arxiv.org/abs/2005.00796

BSD 3-Clause "New" or "Revised" License

dst evaluation #8

Open libing125 opened 3 years ago

libing125 commented 3 years ago

The paper reports a joint accuracy of 56.45 on MultiWOZ, but I can't reproduce that result with the default cleaning method. I noticed type_2_noisy_annotations.json in the noisy_annotations directory. Did you replace the original DST annotations with the annotations in this file when evaluating?

ShaneTian commented 3 years ago

The results in the latest arXiv version and the official NeurIPS paper are 55.72, 55.76 and 57.47. Can you reproduce these results?

libing125 commented 3 years ago

Sorry, I can't reproduce that. I got 50.46 joint accuracy.

ShaneTian commented 3 years ago

> sorry, I can't reproduce that, I got 50.46 joint acc.

I saw similar results in https://github.com/salesforce/simpletod/issues/5#issuecomment-734714628. After ignoring both none and dontcare, the results are close to those in the paper, but that is unfair!

libing125 commented 3 years ago

I agree. Thank you!

zhizeng8 commented 3 years ago

Hello, I am also trying to reproduce the result. I noticed that many checkpoints are saved. Which checkpoint did you use, and how did you figure out which one is the best?

jshin49 commented 2 years ago

Based on this commit: https://github.com/salesforce/simpletod/commit/917f66afe7f37e75de246949423fc4470a2427c4

They were originally removing both none and dontcare during evaluation (which is probably the paper version). That scheme is unfair compared to the TRADE setting, which counts dontcare.

The current version is fixed, as the author mentioned.
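
To make the difference between the two evaluation settings concrete, here is a minimal sketch of joint accuracy computed with and without dropping none and dontcare. It is not the repository's evaluation script: the `(domain, slot, value)` tuple format and the names `filter_slots` and `joint_accuracy` are assumptions made for illustration only.

```python
from typing import AbstractSet, Iterable, Sequence, Set, Tuple

# Hypothetical representation of one belief-state entry: (domain, slot, value).
Slot = Tuple[str, str, str]


def filter_slots(belief: Iterable[Slot], ignored_values: AbstractSet[str]) -> Set[Slot]:
    """Return the belief state as a set, dropping slots whose value is ignored."""
    return {slot for slot in belief if slot[2] not in ignored_values}


def joint_accuracy(
    preds: Sequence[Iterable[Slot]],
    golds: Sequence[Iterable[Slot]],
    ignored_values: AbstractSet[str] = frozenset(),
) -> float:
    """Fraction of turns whose predicted belief state matches the gold one exactly."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must contain the same number of turns")
    if not golds:
        return 0.0
    correct = sum(
        filter_slots(pred, ignored_values) == filter_slots(gold, ignored_values)
        for pred, gold in zip(preds, golds)
    )
    return correct / len(golds)


# Toy data: the model misses the dontcare slot in the first turn.
golds = [
    [("hotel", "area", "centre"), ("hotel", "parking", "dontcare")],
    [("train", "day", "monday")],
]
preds = [
    [("hotel", "area", "centre")],
    [("train", "day", "monday")],
]

strict = joint_accuracy(preds, golds)                         # dontcare is counted
lenient = joint_accuracy(preds, golds, {"none", "dontcare"})  # both values are ignored
print(strict, lenient)
```

On this toy data the lenient setting scores higher because the missed dontcare slot is never checked, which is the kind of gap discussed above.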

kwonmha commented 2 years ago

Can someone tell me how to reproduce the DST results? (A joint accuracy of around 50 is also fine.)

I see no belief states generated when I run generate_dialogue.py as described in the README. dialogue_aggregated_pred_belief is empty.