thu-coai / ConvLab-2

ConvLab-2: An Open-Source Toolkit for Building, Evaluating, and Diagnosing Dialogue Systems
Apache License 2.0

[BUG] SUMBT MultiWOZ_zh results are not reproducible #185

Closed · nikitacs16 closed this issue 3 years ago

nikitacs16 commented 3 years ago

Describe the bug
I have downloaded the Chinese BERT model and the pre-trained SUMBT model for MultiWOZ_zh, but there is a mismatch in the joint accuracy.

To Reproduce
Steps to reproduce the behavior:

  1. Clone the repository and complete the installation.
  2. Download the pre-trained chinese-bert-wwm-ext model for MultiWOZ-zh and store it under ./pre-trained-models
  3. Download the translation-train model for MultiWOZ-zh and store it at ./convlab2/dst/sumbt/multiwoz_zh/pre-trained/pytorch_model.bin
  4. Run python3 evaluate.py MultiWOZ-zh sumbt val
  5. The output is as follows: {'Joint Acc': 0.3455532926001358, 'Turn Acc': 0.9451874481406302, 'Joint F1': 0.8130141242520675}

Expected behavior
The reported joint accuracy for MultiWOZ-zh is 45.1%, while the code produces 34.5% instead.

Additional information
I have tried to reproduce the experiment several times, and my numbers stay in the 35% range. I have also run the experiment for English MultiWOZ, and its joint accuracy is similar to the one posted in the README.
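
For reference, here is a minimal sketch of how the three printed metrics typically relate for DST. This is not ConvLab-2's actual code, and the flat {slot: value} state representation is an assumption; the exact definitions in evaluate.py may differ.

```python
# Minimal sketch (not ConvLab-2's code) of how the three printed metrics
# typically relate, assuming each turn's state is a flat {slot: value} dict.
def dst_metrics(gold_states, pred_states):
    joint_hits = slot_hits = slot_total = tp = fp = fn = 0
    for gold, pred in zip(gold_states, pred_states):
        # Joint Acc: the whole state must match exactly, which is why it is
        # much lower than per-slot ("Turn") accuracy.
        joint_hits += int(gold == pred)
        for slot, value in gold.items():
            slot_total += 1
            slot_hits += int(pred.get(slot) == value)
        # Joint F1 over non-empty (slot, value) pairs.
        gold_pairs = {(s, v) for s, v in gold.items() if v}
        pred_pairs = {(s, v) for s, v in pred.items() if v}
        tp += len(gold_pairs & pred_pairs)
        fp += len(pred_pairs - gold_pairs)
        fn += len(gold_pairs - pred_pairs)
    denom = 2 * tp + fp + fn
    return {"Joint Acc": joint_hits / len(gold_states),
            "Turn Acc": slot_hits / slot_total,
            "Joint F1": 2 * tp / denom if denom else 0.0}
```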

Thanks

zqwerty commented 3 years ago

Thanks! We are checking this.

function2-llx commented 3 years ago

@nikitacs16 Hi, thanks for pointing it out! Now the issue has been fixed with the PR above.

By the way, this script is only used for checking the baseline's performance. The official evaluation scripts used in DSTC9 are eval_file.py and eval_model.py under convlab2/dst/dstc9. You may want to check these two files as well.

Feel free to tell us if you have any other questions.

nikitacs16 commented 3 years ago

Could you give an overview of how to run the evaluation scripts eval_file.py and eval_model.py? Using python3 evaluate.py MultiWOZ-zh sumbt val is straightforward.

Is there a similarly direct way to use eval_file.py or eval_model.py?

I also do not understand the directory structure that eval_model.py expects.

zqwerty commented 3 years ago

We use eval_file.py and eval_model.py for the DSTC9 evaluation; see https://github.com/ConvLab/ConvLab-2 for details.

nikitacs16 commented 3 years ago

Thanks for the reply.

As far as I understand, the gold-standard output and the predicted output should both be in the dataset's state format, and eval_file.py can then be run on these two files.

Is there an example script to convert the outputs of the SUMBT model into the format expected by eval_file.py?
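
Something along these lines is what I am imagining. The tracker import path, the update() call, and the output schema below are all guesses to be checked against convlab2/dst/dstc9, not the actual DSTC9 format:

```python
# Rough sketch only: dump SUMBT's per-turn belief states to a JSON file so
# they can be compared against a gold file. The import path and the schema
# expected by eval_file.py are assumptions; check convlab2/dst/dstc9 for the
# authoritative format.
import json
from convlab2.dst.sumbt.multiwoz_zh.sumbt import SUMBTTracker  # assumed path

def dump_predictions(dialogues, out_path="pred_states.json"):
    """dialogues: {dialogue_id: [user utterance for each turn, ...]}"""
    tracker = SUMBTTracker()
    preds = {}
    for dial_id, user_turns in dialogues.items():
        tracker.init_session()              # reset state between dialogues
        turn_states = []
        for utt in user_turns:
            state = tracker.update(utt)     # word-level DST: consumes raw text
            turn_states.append(state["belief_state"])
        preds[dial_id] = turn_states
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(preds, f, ensure_ascii=False, indent=2)
```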

nikitacs16 commented 3 years ago

Hi

I am able to reproduce the results with the evaluate.py file.

Thank you for looking into this!