kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Mismatch between "wer" and "per_utt" files #4798

Closed huangruizhe closed 1 year ago

huangruizhe commented 1 year ago

Hello, I found that the numbers of insertion/deletion/substitution errors do not match between the "wer" and "per_utt" files generated by score_kaldi_wer.sh.

I have included a tiny example to replicate the issue. There are 10 utterances in the following two files: hyp.txt ref.txt

cat hyp.text | \
    compute-wer --text --mode=present \
    ark:ref.text ark,p:- \
    > wer

Here is my result:

%WER 24.65 [ 35 / 142, 6 ins, 10 del, 19 sub ] %SER 80.00 [ 8 / 10 ] Scored 10 sentences, 0 not present in hyp.

cat hyp.text | \
    align-text --special-symbol="'***'" ark:ref.text ark:- ark,t:- |  \
    utils/scoring/wer_per_utt_details.pl --special-symbol "'***'" > per_utt

Then, I sum up the numbers of #csid in the "per_utt" file:

grep "#csid" per_utt | awk '{sum_c+=$3; sum_s+=$4; sum_i+=$5; sum_d+=$6;} END{print sum_i+sum_d+sum_s " / " sum_c+sum_d+sum_s ", " sum_i " ins, " sum_d " del, " sum_s " sub";}' 

The result is:

35 / 142, 3 ins, 7 del, 25 sub


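Notice that the two results are not exactly the same: the totals agree (35 errors over 142 reference words), but the breakdown differs (6 ins, 10 del, 19 sub from compute-wer versus 3 ins, 7 del, 25 sub from the per_utt sums).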

Is this a known/intended behavior? Thanks!

danpovey commented 1 year ago

Hm. There may be a difference between various versions of the Levenshtein alignment/edit-distance code in edit-distance-inl.h. I don't think it really matters though.

huangruizhe commented 1 year ago

Ok! Yeah, this is true. I will take the result from one of them. Thanks!

jtrmal commented 1 year ago

Yeah, it's just ambiguity of the alignment path: more than one alignment can reach the same minimum edit distance. y.
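
To make the path ambiguity concrete, here is a minimal Python sketch. It is not Kaldi's code and does not reproduce what edit-distance-inl.h or align-text actually do; `align_counts` and `prefer_sub` are hypothetical names, and the two-word reference and hypothesis are made up. It only shows that two backtraces through the same Levenshtein table can give the same total error count with a different ins/del/sub split. The numbers reported above have that shape: 3 ins + 3 del in one result correspond to 6 extra sub in the other, and both sum to 35.

```python
def align_counts(ref, hyp, prefer_sub):
    """Return (ins, dele, sub) for one minimum-cost alignment of hyp to ref.

    prefer_sub (hypothetical knob) only decides which move wins when
    several backtrace moves are tied; the total cost is always minimal.
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(diag, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Backtrace, collecting every move that lies on some optimal path.
    order = ["diag", "del", "ins"] if prefer_sub else ["del", "ins", "diag"]
    ins = dele = sub = 0
    i, j = n, m
    while i > 0 or j > 0:
        moves = set()
        if i > 0 and j > 0:
            step = int(ref[i - 1] != hyp[j - 1])
            if dist[i][j] == dist[i - 1][j - 1] + step:
                moves.add("diag")
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            moves.add("del")
        if j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            moves.add("ins")
        move = next(name for name in order if name in moves)
        if move == "diag":
            sub += int(ref[i - 1] != hyp[j - 1])  # a match costs nothing
            i, j = i - 1, j - 1
        elif move == "del":
            dele += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return ins, dele, sub


ref = "a b".split()  # made-up reference
hyp = "b c".split()  # made-up hypothesis
for prefer_sub in (True, False):
    ins, dele, sub = align_counts(ref, hyp, prefer_sub)
    print(f"{ins + dele + sub} / {len(ref)}, {ins} ins, {dele} del, {sub} sub")
```

Both lines report 2 errors over 2 reference words, so WER is unaffected; only the attribution of errors to ins/del/sub depends on which tied path the backtrace follows.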
