Closed: nikitacs16 closed this issue 3 years ago.
Thanks! We are checking this.
@nikitacs16 Hi, thanks for pointing it out! Now the issue has been fixed with the PR above.
BTW, this script is only used for checking the performance of the baseline. The official evaluation scripts used in DSTC9 are eval_file.py and eval_model.py under convlab/dst/dstc9. You may also want to check these two files.
Feel free to tell us if you have any other questions.
Could you give an overview of how to run the evaluation scripts eval_file.py and eval_model.py? Running python3 evaluate.py MultiWOZ-zh sumbt val is straightforward; is there a similar way to use eval_file.py or eval_model.py?
I also do not understand the directory structure expected by eval_model.py.
We use eval_file.py and eval_model.py for DSTC-9 evaluation, see https://github.com/ConvLab/ConvLab-2 for details.
Thanks for the reply.
As far as I understand, the gold-standard output and the predicted output should both be in the dataset's state format, and then eval_file.py can be run on these two files.
Is there an example script to convert the outputs from the SUMBT model into the format for eval_file.py?
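In case it helps frame the question: a minimal sketch of the kind of conversion being asked about, assuming the model emits flat "domain-slot" predictions and the evaluator expects a nested per-domain state dict. The function name and both formats here are assumptions for illustration, not ConvLab-2's actual schema.

```python
import json


def to_nested_state(flat_predictions):
    """Map flat predictions like {"hotel-pricerange": "cheap"} to a nested
    per-domain state dict like {"hotel": {"pricerange": "cheap"}}.
    (Hypothetical formats; check the actual schema used by eval_file.py.)"""
    state = {}
    for key, value in flat_predictions.items():
        domain, slot = key.split("-", 1)  # split only on the first hyphen
        state.setdefault(domain, {})[slot] = value
    return state


if __name__ == "__main__":
    flat = {"hotel-pricerange": "cheap", "hotel-area": "north", "train-day": "friday"}
    # Write the converted state out as JSON, one plausible on-disk format.
    print(json.dumps(to_nested_state(flat), sort_keys=True))
```

The real converter would need to follow whatever slot naming and file layout eval_file.py actually expects, which is why an official example script would be useful.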
Hi
I am able to reproduce the results with the evaluate.py file.
Thank you for looking into this!
Describe the bug
I have downloaded the Chinese BERT model and the pre-trained model for MultiWOZ_zh. There is a mismatch in the joint accuracy.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The reported joint accuracy is 45.1% for MultiWOZ-zh, while the code produces 34.5% instead.
Additional information
I have tried to reproduce the experiment, and my numbers are also in the 35% range. I reproduced the experiment for English MultiWOZ, and its joint accuracy matches the one posted in the README.
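For reference, joint (goal) accuracy is a strict metric: a turn counts as correct only when the entire predicted state matches the gold state exactly, so even one wrong slot value fails the whole turn. A minimal sketch of the metric (toy states, not ConvLab-2's evaluation code):

```python
def joint_accuracy(golds, preds):
    """Fraction of turns whose predicted state equals the gold state exactly.

    golds, preds: parallel lists of per-turn state dicts."""
    assert len(golds) == len(preds)
    correct = sum(1 for g, p in zip(golds, preds) if g == p)
    return correct / len(golds)


if __name__ == "__main__":
    golds = [{"hotel-area": "north"},
             {"hotel-area": "north", "train-day": "friday"}]
    preds = [{"hotel-area": "north"},
             {"hotel-area": "north", "train-day": "monday"}]  # one slot wrong
    print(joint_accuracy(golds, preds))  # 0.5: the second turn fails entirely
```

Because the metric is all-or-nothing per turn, a systematic issue such as a label-map or preprocessing mismatch can easily account for a gap like 45.1% vs. 34.5%.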
Thanks