Closed. videodanchik closed this issue 1 year ago.
We have done that before. Please have a look at https://github.com/k2-fsa/icefall/pull/183#issuecomment-1047583114

> Update some results (decoding-method=greedy_search) when removing `<unk>` for training the BPE model: when encoding `<unk>` into a single id, the best result is 6.73%; when not encoding it into a single id, the best result is 8.78%. If I don't remove `<unk>` for training the BPE model and don't encode it into a single id, the best result is 11.11%. The above results show that it is necessary to remove `<unk>` and encode it into a single id.
> We have done that before. Please have a look at #183 (comment)
I think what he was talking about is different from https://github.com/k2-fsa/icefall/pull/183#issuecomment-1047583114. He converted `<UNK>` to [`▁`, `<unk>`], but in https://github.com/k2-fsa/icefall/pull/183#issuecomment-1047583114, `<UNK>` was converted to [`<`, `un`, `k`, `>`].
> Because if it is, it seems not to affect recipes with a low number of `<UNK>` in supervision transcripts, but if there are a lot of `<UNK>`, like for example in the tedlium3 recipe, there might be issues and I believe there were some...

Did you try your changes in tedlium3? Does it make any difference?
https://github.com/k2-fsa/icefall/pull/183#issuecomment-1047583114

> when encoding `<unk>` into a single id, the best result is 6.73%.

Use the BPE token `<unk>` to replace `<UNK>`.

> when not encoding `<unk>` into a single id, the best result is 8.78%.

Use two BPE tokens `▁` `<unk>` to replace `<UNK>`.

> If I don't remove `<unk>` for training the BPE model and don't encode `<unk>` into a single id, the best result is 11.11%.

Use `▁`, `<`, `un`, `k`, `>` to replace `<UNK>`.
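To make the three schemes concrete, here is a minimal sketch of the token sequences involved, assuming a trained sentencepiece BPE model (`bpe.model` is a placeholder path, and the exact pieces in the last case depend on the trained model):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("bpe.model")  # placeholder path to the trained BPE model

# 1) <UNK> mapped directly to the single unk piece (6.73% above):
single_id = [sp.id_to_piece(sp.unk_id())]          # ['<unk>']

# 2) <UNK> mapped to a word boundary plus the unk piece (8.78% above):
two_tokens = ["▁", sp.id_to_piece(sp.unk_id())]    # ['▁', '<unk>']

# 3) <unk> left in the BPE training text and <UNK> encoded like any other
#    word (11.11% above); sp.encode() then splits it into several pieces,
#    e.g. something like ['▁', '<', 'un', 'k', '>'] for such a model:
split = sp.encode("<UNK>", out_type=str)

print(single_id, two_tokens, split)
```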
> We have done that before. Please have a look at #183 (comment)
Ok, thanks, that was the PR I remember, and here are the problems I am referring to: https://github.com/k2-fsa/icefall/pull/183#issuecomment-1048550862. Everything that was done by @luomingshuang there is correct, but I'm saying that there might be a universal issue with all recipes that link to the librispeech preprocessing script, because you probably forget to tell the model during training that each `<UNK>` is a separate word (and each word's pronunciation should start with `▁`; take a look at `lexicon.txt` in any `lang_bpe_*`). And by the way, if the BPE model by any chance fails to encode some particular word, it encodes it as `['▁', '<unk>']`, which means that you can end up with a mixture of your artificial `<UNK>` encoded as `['<unk>']` and rare BPE-unknown words encoded as `['▁', '<unk>']` during training.
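A small sketch of that mixture, assuming a trained BPE model (`bpe.model` is a placeholder path; the exact pieces depend on the model): regular words get pronunciations whose first piece carries the `▁` word-boundary marker, a word the model cannot segment is encoded by sentencepiece itself as `['▁', '<unk>']`, while the artificial `<UNK>` entry maps to the bare `['<unk>']`:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("bpe.model")  # placeholder path to the trained BPE model

# A regular word: its first piece starts with the ▁ word-boundary marker,
# which is how every other entry in lexicon.txt begins.
print(sp.encode("HELLO", out_type=str))      # e.g. ['▁HE', 'LLO']

# A word the BPE model cannot segment (here, a character outside the
# vocabulary) is encoded by sentencepiece as two pieces:
print(sp.encode("Ω", out_type=str))          # typically ['▁', '<unk>']

# The artificial <UNK> entry mapped to a single id has no ▁ at all:
print([sp.id_to_piece(sp.unk_id())])         # ['<unk>']
```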
> Did you try your changes in tedlium3? Does it make any difference?
Right, so I'm working a bit on the `conformer-ctc` recipe for `tedlium3`, but I'm revising the recipe to leverage the external language models from here: https://kaldi-asr.org/models/m5, and I'm eventually using `icefall/egs/tedlium3/ASR/download/tedlium3/TEDLIUM.152k.dic` for the lexicon to make a fair comparison with Kaldi.
Ok, I'm also removing `<UNK>` from the training texts for the BPE model. Ok then, let me try to do my comparisons later on; maybe it's really not an issue.
I suggest you try to encode `<UNK>` or `<unk>` into a single id, not with `sp.encode(text)`.
See https://github.com/k2-fsa/icefall/pull/696 for more details.
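A minimal sketch of that suggestion, only to illustrate the idea (it is not the actual change in #696): map a literal `<UNK>`/`<unk>` word in the transcript directly to the unk id and let `sp.encode()` handle everything else. `bpe.model` is a placeholder path, transcripts are assumed to be whitespace-separated words, and word-by-word encoding is assumed to match whole-sentence encoding, since sentencepiece does not merge pieces across whitespace by default:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("bpe.model")  # placeholder path to the trained BPE model

def encode_transcript(text: str) -> list:
    """Encode a transcript, mapping <UNK>/<unk> to the single unk id
    instead of letting sp.encode() split it into several pieces."""
    ids = []
    for word in text.split():
        if word.upper() == "<UNK>":
            ids.append(sp.unk_id())
        else:
            ids.extend(sp.encode(word, out_type=int))
    return ids

print(encode_transcript("HELLO <UNK> WORLD"))
```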
Hi, when I was adapting Librispeech for my needs, I hadn't even noticed that I changed https://github.com/k2-fsa/icefall/blob/c74cec59e9f6d00e3a5838b4f8d4ace7e2303ad4/egs/librispeech/ASR/local/prepare_lang_bpe.py#L162 to:

`lexicon.append(("<UNK>", ["▁", sp.id_to_piece(sp.unk_id())]))`

I just did it as part of my testing and experimenting, after noticing that for unknown words sentencepiece outputs `▁ <unk>`. I just added it and forgot about it (and everything seems to work well), and now I realized that it's in the main script for generating the BPE lang dir. So is it a bug or not? Because if it is, it seems not to affect recipes with a low number of `<UNK>` in supervision transcripts, but if there are a lot of `<UNK>`, like for example in the tedlium3 recipe, there might be issues and I believe there were some...
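For reference, a hedged sketch of what that modified line produces (`bpe.model` is a placeholder path): it gives `<UNK>` the two-token pronunciation `▁ <unk>`, i.e. the second of the schemes compared above, rather than the single-id `<unk>`:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("bpe.model")  # placeholder; prepare_lang_bpe.py loads the trained model here

unk_piece = sp.id_to_piece(sp.unk_id())   # usually '<unk>'

# The modified line gives <UNK> a two-token pronunciation:
print(("<UNK>", ["▁", unk_piece]))        # ('<UNK>', ['▁', '<unk>'])

# whereas the single-id variant discussed in the thread would be:
print(("<UNK>", [unk_piece]))             # ('<UNK>', ['<unk>'])
```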