Open libing125 opened 3 years ago
The results in the latest arXiv paper and the official NIPS paper are 55.72, 55.76 and 57.47. Can you reproduce these results?
Sorry, I can't reproduce that. I got 50.46 joint accuracy.
I saw similar results in https://github.com/salesforce/simpletod/issues/5#issuecomment-734714628
After ignoring both none and dontcare, the results are close to those in the paper, but it is unfair!
I agree. Thank you!
Hello, I am also trying to reproduce the result. I notice that many checkpoints are saved. Which checkpoint do you use, and how do you figure out which one is the best?
Based on this commit: https://github.com/salesforce/simpletod/commit/917f66afe7f37e75de246949423fc4470a2427c4
They were originally removing both none and dontcare during evaluation (which is probably the paper version). This scheme is unfair compared to the TRADE setting, which counts dontcare.
The current version is fixed, as the author mentioned.
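For anyone comparing the two settings, here is a minimal sketch of how the choice of ignored values changes joint accuracy. This is not the repository's evaluation script: the function names, the slot-to-value dict format, and the toy turns are all hypothetical, and I am assuming the stricter setting drops only none while the lenient one drops both none and dontcare.

```python
from typing import AbstractSet, Dict, List

# Hypothetical format: one dict per dialogue turn, mapping "domain-slot" to a value.
BeliefState = Dict[str, str]


def filter_state(state: BeliefState, ignored_values: AbstractSet[str]) -> BeliefState:
    """Drop every slot whose value is in ignored_values."""
    return {slot: value for slot, value in state.items() if value not in ignored_values}


def joint_accuracy(
    preds: List[BeliefState],
    golds: List[BeliefState],
    ignored_values: AbstractSet[str] = frozenset(),
) -> float:
    """Fraction of turns whose filtered predicted state equals the filtered gold state."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same number of turns")
    if not golds:
        return 0.0
    correct = sum(
        filter_state(p, ignored_values) == filter_state(g, ignored_values)
        for p, g in zip(preds, golds)
    )
    return correct / len(golds)


# Toy data (made up for illustration): the model misses a dontcare slot in turn 1.
golds = [
    {"hotel-area": "north", "hotel-parking": "dontcare"},
    {"restaurant-food": "thai"},
]
preds = [
    {"hotel-area": "north"},
    {"restaurant-food": "thai"},
]

# Assumed stricter setting: dontcare counts, so turn 1 is wrong.
strict = joint_accuracy(preds, golds, ignored_values={"none"})
# Assumed lenient setting: both none and dontcare are dropped, so turn 1 is right.
lenient = joint_accuracy(preds, golds, ignored_values={"none", "dontcare"})
print(strict, lenient)
```

On this toy data the lenient setting scores higher because the missed dontcare slot is no longer penalized, which is the kind of gap discussed above.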
Can someone tell me how to reproduce the DST results? (Joint acc around 50 is also fine.)
I see no belief states generated when I run generate_dialogue.py as in the README. dialogue_aggregated_pred_belief is empty.
The article reports a joint accuracy of 56.45 on MultiWOZ, but I can't reproduce the result with the default cleaning method. I noticed type_2_noisy_annotations.json in the noisy_annotations directory. Did you replace the original DST annotations with the annotations in this file when evaluating?