mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
Apache License 2.0

questions about sacrebleu with chinese result #147

Closed wingsyuan closed 3 years ago

wingsyuan commented 3 years ago

Hi, when I use multi-bleu-detok.perl (moses-scripts/scripts/generic/multi-bleu-detok.perl) and sacrebleu to score the translation, I get the following results:

multi-bleu-detok:
./tools/moses-scripts/scripts/generic/multi-bleu-detok.perl test1_translate_4k_nospace.zh < test1_st_4k_refer_nospace.zh
BLEU = 34.52, 59.4/36.6/29.2/24.3 (BP=0.979, ratio=0.980, hyp_len=9358, ref_len=9552)

sacrebleu:
cat test1_translate_4k_nospace.zh | sacrebleu test1_st_4k_refer_nospace.zh
BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.5.1 = 34.3 58.2/35.7/28.4/23.6 (BP = 1.000 ratio = 1.021 hyp_len = 9552 ref_len = 9358)

sacrebleu with -tok (zh, char):

cat test1_translate_4k_nospace.zh | sacrebleu -tok zh test1_st_4k_refer_nospace.zh
BLEU+case.mixed+numrefs.1+smooth.exp+tok.zh+version.1.5.1 = 56.0 75.8/61.2/50.2/42.1 (BP = 1.000 ratio = 1.028 hyp_len = 58659 ref_len = 57050)

cat test1_translate_4k_nospace.zh | sacrebleu -tok char test1_st_4k_refer_nospace.zh
BLEU+case.mixed+numrefs.1+smooth.exp+tok.char+version.1.5.1 = 59.9 77.2/64.4/54.6/47.2 (BP = 1.000 ratio = 1.034 hyp_len = 67682 ref_len = 65482)

Here are my questions:
1. Is the format of the translation output and the reference file I supplied correct? Should I use an untokenized reference file when scoring the result?
2. The scores differ. What is the difference between multi-bleu-detok.perl and sacrebleu?
3. What does the -tok option mean? How do I get a proper score for the translation result with sacrebleu?

The format of the content in test1_translate_4k_nospace.zh:
我们通过与荷兰癌症登记系统关联的方式收集了两组间癌症次数和肿瘤特征的数据. 在MRI扫描组中,如果适用,那么在筛查时或在6个月重复筛查时发现癌症. 间期癌包括乳腺钼靶检查前诊断为阴性的所有乳腺癌. 如果没有进行乳腺钼靶检查(例如因为年龄<75岁),则定义为乳腺钼靶筛查结果为阴性后24个月内1例诊断为癌症. 这一定义推动了后续乳房X线照相术检测到的癌症间期. 关键次要结局包括其他检查的完成率,MRI检查率,假阳性率,阳性预测值和肿瘤特征. 所有校正率的定义为,接受MRI筛查的所有女性中MRI检查结果阳性的参与者百分比. 在MRI中,BI-RADS评分为3,4或5分被视为阳性.

The format of the content in test1_st_4k_refer_nospace.zh:
我们通过荷兰癌症登记系统(NetherlandsCancerRegistry)建立了两组间肿瘤数量和肿瘤特征的数据. 在MRI邀请组,如果进行6个月的筛查或重复筛查,则检出癌症. 间期癌包括在下一次乳房X线检查结果为阴性的乳腺癌. 如果没有计划钼靶检查(例如年龄≥75岁),则间隔在乳腺钼靶检查后24个月内被确诊为癌症. 这一定义推测需要在之后的乳腺钼靶检查中检测到的癌症间隔期. 关键次要结局包括额外检查,MRI的癌症检出率,假阳性率,阳性预测值和肿瘤特征. 回顾性发生率的定义为所有接受MRI筛查的女性中结果阳性的参与者百分比. MRI检查结果为阳性.

ozancaglayan commented 3 years ago

Hi,

Yes, you should give detokenized files to sacreBLEU.

You get slightly different results from multi-bleu-detok.perl and sacreBLEU because you are passing the reference as the hypothesis to multi-bleu-detok.perl: the script takes the reference file as its argument and reads the hypothesis from standard input. This is also why hyp_len and ref_len are swapped between your two outputs. The correct command would be: multi-bleu-detok.perl test1_st_4k_refer_nospace.zh < test1_translate_4k_nospace.zh. You should hopefully now get the same numbers.
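As a quick sanity check on the swapped arguments, the brevity penalty and length ratio in the two outputs above can be reproduced from the reported lengths alone. The snippet below is a minimal sketch using the standard BLEU brevity penalty formula; the function name `brevity_penalty` is made up for illustration and is not part of sacreBLEU or the Moses scripts.

```python
import math


def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """Standard BLEU brevity penalty: 1 if the hypothesis is at least as
    long as the reference, otherwise exp(1 - ref_len / hyp_len)."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


# Lengths as reported by multi-bleu-detok.perl (files passed in swapped order).
swapped_bp = brevity_penalty(hyp_len=9358, ref_len=9552)
swapped_ratio = 9358 / 9552

# Lengths as reported by sacreBLEU with the default tokenizer.
correct_bp = brevity_penalty(hyp_len=9552, ref_len=9358)
correct_ratio = 9552 / 9358

print(f"swapped: BP={swapped_bp:.3f} ratio={swapped_ratio:.3f}")
print(f"correct: BP={correct_bp:.3f} ratio={correct_ratio:.3f}")
```

With the roles swapped, the shorter file is treated as the hypothesis and is penalized, which accounts for BP=0.979 in one output and BP=1.000 in the other.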

3. What does the -tok option mean? How do I get a proper score for the translation result with sacrebleu?

With default arguments, sacreBLEU and multi-bleu-detok.perl apply the so-called 13a tokenization (from mteval-v13a), which does not work well for Chinese because it does not segment unspaced Chinese text. That is why sacreBLEU offers a Chinese-specific tokenizer, which simply separates all Chinese characters with spaces and keeps using the 13a tokenizer for non-Chinese fragments such as BI-RADS in your final line. The char tokenizer, on the other hand, separates every character, including those in non-Chinese fragments. In your case, --tokenize zh is recommended.
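To make the difference between the two options concrete, here is a simplified sketch. It is not sacreBLEU's actual tokenizer code: `zh_like_tokens` and `char_tokens` are hypothetical helpers, the CJK range is reduced to the basic Unified Ideographs block, and the 13a step for non-Chinese fragments is left out.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block is treated as
# "Chinese" here; the real tokenizer covers more Unicode ranges.
_CJK_CHAR = re.compile(r"([\u4e00-\u9fff])")


def zh_like_tokens(text: str) -> list[str]:
    """Put spaces around each Chinese character, keep other runs intact."""
    return _CJK_CHAR.sub(r" \1 ", text).split()


def char_tokens(text: str) -> list[str]:
    """Treat every non-whitespace character as its own token."""
    return [ch for ch in text if not ch.isspace()]


sample = "BI-RADS评分为3"
print(zh_like_tokens(sample))  # Latin fragment stays one token
print(char_tokens(sample))     # Latin fragment is split letter by letter
```

Since the char variant also splits Latin words and numbers into single characters, it produces more tokens, which matches the larger hyp_len and ref_len reported with -tok char above.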

ozancaglayan commented 3 years ago

no answer.

rabeya-akter commented 11 months ago

How can I use custom tokenization for Bangla text?