playma / LCSTS2.0-clean


The ROUGE result is always 2 points lower than your reports #3

Closed zhr1996 closed 5 years ago

zhr1996 commented 5 years ago

Hello @playma, thanks for sharing your model. We tested it, but our result is always 2 points lower than what you report. We are using LCSTS 2.0 and pyrouge to compute the scores. Could you tell us whether you applied any method to clean the data, for example to remove Chinese punctuation? And may we ask how you tokenized the raw data?

playma commented 5 years ago

What are your hyperparameters? I have updated the code in the GitHub repo:

https://github.com/playma/OpenNMT-py

You can use the same hyperparameters as in the LCSTS1.0 script and LCSTS2.0-clean_script directories.

zhr1996 commented 5 years ago

Thanks for your reply. Our hyperparameters are as follows:

```
python $OpenNMT_dir/translate.py \
    -model $model_path -tgt $trg \
    -src $src -output $output \
    -beam_size 5 \
    -min_length 0 -max_length 100 \
    -verbose -gpu 0 \
    -batch_size 64
```

zhr1996 commented 5 years ago

We suspect the difference comes from how we tokenize the source file. Our test file looks like this: [attached screenshot]

playma commented 5 years ago

I use words as the unit in the encoder, and characters as the unit in the decoder.
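The word-unit source / character-unit target split described above can be sketched as a small preprocessing step. This is a minimal illustration, not the author's actual pipeline: here the source side is assumed to be already word-segmented (the author uses jieba for that), and the target side is split into individual characters.

```python
# Sketch of the unit scheme described above (not the original pipeline):
# encoder input = space-separated words, decoder target = space-separated characters.

def to_word_units(words: list[str]) -> str:
    """Join pre-segmented words with spaces (encoder side)."""
    return " ".join(words)

def to_char_units(text: str) -> str:
    """Split a string into space-separated characters (decoder side)."""
    return " ".join(text)

# Hypothetical example; a real run would segment the source with jieba.
src_words = ["今天", "天气", "很", "好"]
tgt = "天气好"

print(to_word_units(src_words))  # word units for the encoder
print(to_char_units(tgt))        # character units for the decoder
```

With both sides written out this way, OpenNMT-py's standard whitespace-delimited preprocessing can consume the files directly.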

And please calculate the score with this script; I have confirmed it with the original author of LCSTS: https://github.com/playma/OpenNMT-py/blob/master/ROUGE_with_ranked.pl

zhr1996 commented 5 years ago

Thank you very much. We think we found the error. May we also ask what tool you used to tokenize the sentences?

playma commented 5 years ago

I use jieba as the word tokenizer. Enjoy your research!

zhr1996 commented 5 years ago

Thank you very much for your help. Best of luck with your research and work!