k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Possible bug in <UNK> pronunciation. #535

Closed videodanchik closed 1 year ago

videodanchik commented 2 years ago

Hi, when I was adapting the Librispeech recipe for my needs, I hadn't even noticed that I changed https://github.com/k2-fsa/icefall/blob/c74cec59e9f6d00e3a5838b4f8d4ace7e2303ad4/egs/librispeech/ASR/local/prepare_lang_bpe.py#L162 to: lexicon.append(("<UNK>", ["▁", sp.id_to_piece(sp.unk_id())])). I did it as part of my testing and experimenting: I noticed that sentencepiece outputs ▁ <unk> for unknown words, so I just added it and forgot about it (and everything seemed to work well), and only now realized that it is in the main script for generating the BPE lang. So is it a bug or not? If it is, it seems not to affect recipes with a low number of <UNK> in the supervision transcripts, but if there are a lot of <UNK>, like for example in the tedlium3 recipe, there might be issues, and I believe there were some...
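For reference, a minimal sketch of the change (the original line is paraphrased from prepare_lang_bpe.py and may differ slightly; the model path is a placeholder):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("data/lang_bpe_500/bpe.model")  # placeholder path

lexicon = []

# What the script does upstream: map the word <UNK> to the single <unk> piece.
# lexicon.append(("<UNK>", [sp.id_to_piece(sp.unk_id())]))

# The accidental change described above: prepend the word-boundary marker ▁,
# mirroring how sentencepiece tokenizes out-of-vocabulary words as ▁ <unk>.
lexicon.append(("<UNK>", ["▁", sp.id_to_piece(sp.unk_id())]))
```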

csukuangfj commented 2 years ago

We have done that before. Please have a look at https://github.com/k2-fsa/icefall/pull/183#issuecomment-1047583114

Update some results (decoding-method=greedy_search) when removing <unk> for training the bpe model: when encoding <unk> into a single id, the best result is 6.73%. When not encoding <unk> into a single id, the best result is 8.78%. If I don't remove <unk> for training the bpe model and don't encode <unk> into a single id, the best result is 11.11%.

The above results show that it is necessary to remove <unk> and encode it into a single id.

pkufool commented 2 years ago

We have done that before. Please have a look at #183 (comment)

Update some results (decoding-method=greedy_search) when removing <unk> for training the bpe model: when encoding <unk> into a single id, the best result is 6.73%. When not encoding <unk> into a single id, the best result is 8.78%. If I don't remove <unk> for training the bpe model and don't encode <unk> into a single id, the best result is 11.11%. The above results show that it is necessary to remove <unk> and encode it into a single id.

I think what he was talking about is different from https://github.com/k2-fsa/icefall/pull/183#issuecomment-1047583114. He converted <UNK> to [▁, <unk>], but in https://github.com/k2-fsa/icefall/pull/183#issuecomment-1047583114, <UNK> was converted to [<, un, k, >].

If it is, it seems not to affect recipes with a low number of <UNK> in the supervision transcripts, but if there are a lot of <UNK>, like for example in the tedlium3 recipe, there might be issues, and I believe there were some...

Did you try your changes in tedlium3? Does it make any difference?

csukuangfj commented 2 years ago

https://github.com/k2-fsa/icefall/pull/183#issuecomment-1047583114

when encoding <unk> into a single id, the best result is 6.73%.

Use the BPE token <unk> to replace <UNK>.


when not encoding <unk> into a single id, the best result is 8.78%.

Use two BPE tokens ▁ <unk> to replace <UNK>.


If I don't remove <unk> for training the bpe model and don't encode <unk> into a single id, the best result is 11.11%

Use ▁, < un k > to replace <UNK>.
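A quick way to check which of these encodings a given bpe.model produces is to query sentencepiece directly (a minimal sketch; the model path is a placeholder):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("data/lang_bpe_500/bpe.model")  # placeholder path

# If <UNK> was removed from the BPE training text and mapped explicitly,
# the single piece below is what replaces it; if <UNK> was left in the
# training text, encoding the literal string typically falls apart into
# pieces like ▁ < un k >.
print(sp.id_to_piece(sp.unk_id()))       # the single <unk> piece
print(sp.encode("<UNK>", out_type=str))  # how the model tokenizes the literal string
```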

videodanchik commented 2 years ago

We have done that before. Please have a look at #183 (comment)

Update some results (decoding-method=greedy_search) when removing <unk> for training the bpe model: when encoding <unk> into a single id, the best result is 6.73%. When not encoding <unk> into a single id, the best result is 8.78%. If I don't remove <unk> for training the bpe model and don't encode <unk> into a single id, the best result is 11.11%. The above results show that it is necessary to remove <unk> and encode it into a single id.

Ok, thanks, that was the PR I remember, and here are the problems I am referring to: https://github.com/k2-fsa/icefall/pull/183#issuecomment-1048550862. Everything that was done by @luomingshuang there is correct, but I'm saying that there might be a universal issue with all recipes that link to the librispeech preprocessing script, because you probably forgot to tell the model during training that each <UNK> is a separate word (and each word pronunciation should start with ▁; take a look at lexicon.txt in any lang_bpe_*). And BTW, if the BPE model by any chance fails to encode some particular word, it encodes it as ['▁', '<unk>'], which means that you can end up with a mixture of your artificial <UNK> encoded as ['<unk>'] and rare unknown words encoded as ['▁', '<unk>'] during training.
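To illustrate the mixture mentioned above, a hypothetical check (the model path and example word are made up; the exact output depends on how the BPE model was trained):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("data/lang_bpe_500/bpe.model")  # placeholder path

# A word containing a character the BPE model never saw in training may
# come back as ['▁', '<unk>'].
print(sp.encode("CAFÉ中", out_type=str))

# Meanwhile the artificial <UNK> written into the transcripts is mapped
# through the lexicon entry (e.g. to the single piece '<unk>'), so the two
# kinds of unknown words can end up with different token sequences.
```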

We have done that before. Please have a look at #183 (comment)

Update some results (decoding-method=greedy_search) when removing <unk> for training the bpe model: when encoding <unk> into a single id, the best result is 6.73%. When not encoding <unk> into a single id, the best result is 8.78%. If I don't remove <unk> for training the bpe model and don't encode <unk> into a single id, the best result is 11.11%. The above results show that it is necessary to remove <unk> and encode it into a single id.

I think what he was talking about is different from #183 (comment). He converted <UNK> to [▁, <unk>], but in #183 (comment), <UNK> was converted to [<, un, k, >].

If it is, it seems not to affect recipes with a low number of <UNK> in the supervision transcripts, but if there are a lot of <UNK>, like for example in the tedlium3 recipe, there might be issues, and I believe there were some...

Did you try your changes in tedlium3? Does it make any difference?

Right, so I'm working a bit on the conformer-ctc recipe for tedlium3, but I'm revising the recipe to leverage external language models from here: https://kaldi-asr.org/models/m5, and I'm eventually using icefall/egs/tedlium3/ASR/download/tedlium3/TEDLIUM.152k.dic for the lexicon to make a fair comparison with Kaldi.

videodanchik commented 2 years ago

#183 (comment)

when encoding <unk> into a single id, the best result is 6.73%.

Use the BPE token <unk> to replace <UNK>.

when not encoding <unk> into a single id, the best result is 8.78%.

Use two BPE tokens ▁ <unk> to replace <UNK>.

If I don't remove <unk> for training the bpe model and don't encode <unk> into a single id, the best result is 11.11%

Use ▁, < un k > to replace <UNK>.

Ok, I'm also removing <UNK> from the training texts for the BPE model. Then let me try to do my comparisons later on; maybe it's really not an issue.

luomingshuang commented 2 years ago

I suggest you try encoding <UNK> or <unk> into a single id directly, not with sp.encode(text).
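A sketch of what that suggestion could look like (illustrative only; the helper name and model path are made up, not the exact icefall code):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("data/lang_bpe_500/bpe.model")  # placeholder path

def encode_transcript(text: str) -> list:
    """Encode a transcript word by word, mapping <UNK> directly to the
    single <unk> id instead of passing it through sp.encode()."""
    ids = []
    for word in text.split():
        if word == "<UNK>":
            ids.append(sp.unk_id())
        else:
            ids.extend(sp.encode(word))
    return ids

print(encode_transcript("HELLO <UNK> WORLD"))
```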

videodanchik commented 1 year ago

See https://github.com/k2-fsa/icefall/pull/696 for more details.