Open huangruizhe opened 2 years ago
Let me wait till the others give opinions about this. I think it's definitely a reasonable option. My only question is how much we will actually be using lexicons in the case of BPE systems, since the main benefit of BPE is not having to use a lexicon.
I was wondering whether we should populate the lexicon by enumerating all possibilities of decomposing a word.
Instead of enumerating all possibilities, there are also works that use only the top-K possibilities.
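As a concrete sketch of the "enumerate all possibilities" idea (this is not icefall code; the token inventory below is a made-up toy), every decomposition of a word over a fixed token set can be found with a simple recursive search:

```python
# Sketch: enumerate every way to split a word into pieces that all belong
# to a given token inventory, via depth-first search over prefixes.
# `tokens` here is a hypothetical toy inventory, not a real BPE vocabulary.
def all_decompositions(word, tokens):
    """Return every decomposition of `word` into pieces from `tokens`."""
    if word == "":
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in tokens:
            for rest in all_decompositions(word[end:], tokens):
                results.append([piece] + rest)
    return results

tokens = {"a", "b", "c", "ab", "abc"}
print(all_decompositions("abc", tokens))
# → [['a', 'b', 'c'], ['ab', 'c'], ['abc']]
```

The number of decompositions can grow quickly with word length, which is one motivation for keeping only the top-K instead.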
For instance,
The authors show that it is helpful in improving the WERs, and that it is also helpful in training.
Note: there are two methods:
(1) Decode the sentence in multiple ways, i.e., sample the whole sentence.
(2) Decode each word in multiple ways, i.e., each word in a sentence is sampled independently.
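The two methods above can be contrasted with a toy sketch (the `candidates` table and both functions are hypothetical illustrations, not a real subword-regularization implementation; in practice the decompositions and their probabilities would come from the segmentation model):

```python
import itertools
import random

# Toy table: each word mapped to its possible decompositions (made up).
candidates = {
    "hello": [["he", "llo"], ["hell", "o"], ["hello"]],
    "world": [["wor", "ld"], ["world"]],
}

def sample_whole_sentence(sentence, rng):
    """Method (1): sample one decomposition of the whole sentence jointly."""
    words = sentence.split()
    joint = list(itertools.product(*(candidates[w] for w in words)))
    choice = rng.choice(joint)
    return [p for word_pieces in choice for p in word_pieces]

def sample_per_word(sentence, rng):
    """Method (2): sample each word's decomposition independently."""
    return [p for w in sentence.split() for p in rng.choice(candidates[w])]

print(sample_per_word("hello world", random.Random(0)))
```

With uniform toy probabilities the two schemes coincide; the distinction matters when the model assigns non-uniform scores to whole segmented sequences.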
Thanks for the reference. Yeah, I agree we do not need to enumerate all possibilities. Maybe just the top-K ones, with a constraint that every subword piece must appear in the token list. (Note that in the SentencePiece decomposition, the decomposition can contain tokens that are not in the token list -- I remember your code also checked for that.)
In this way, we can guarantee that every lexicon entry is decodable from ASR posterior.
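A minimal sketch of that constraint (the scoring rule here, preferring fewer pieces, is a toy stand-in for real segmentation probabilities, and the function name is made up): keep only decompositions whose pieces all appear in the token list, then take the top-K of what remains, so every lexicon entry is decodable from the ASR posterior.

```python
# Sketch: build lexicon entries from the top-k decompositions of a word,
# keeping only decompositions whose pieces all appear in the token list.
# Sorting by length is a toy score, not a real model probability.
def top_k_entries(word, decompositions, token_list, k):
    valid = [d for d in decompositions if all(p in token_list for p in d)]
    valid.sort(key=len)  # toy score: fewer pieces first
    return [(word, d) for d in valid[:k]]

token_list = {"a", "ab", "c", "abc"}
decomps = [["abc"], ["ab", "c"], ["a", "b", "c"]]  # "b" is not in the token list
print(top_k_entries("abc", decomps, token_list, 2))
# → [('abc', ['abc']), ('abc', ['ab', 'c'])]
```

The decomposition containing `"b"` is dropped because `"b"` is not in the token list, which is exactly the guarantee that every surviving entry is decodable.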
If you like, maybe I can do some experiments after the conference deadline, with your help and suggestions. It seems the two papers you pointed to are nice starting points to think about this.
@huangruizhe
Yes, you are very welcome to implement it.
https://github.com/k2-fsa/icefall/blob/395a3f952be1449cd7c92b896f4eb9a1c899e2c7/egs/librispeech/ASR/local/prepare_lang_bpe.py#L145-L152
Here, the lexicon is generated by
sp.encode(words, out_type=str)
. For each word, this will generate only one entry. However, the word-to-token mapping may not be one-to-one; there is an example here. So I was wondering whether we should populate the lexicon by enumerating all possibilities of decomposing a word. This is like the "polyphonic phenomena" (多音字, polyphonic characters) in a typical phonetic lexicon.
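In lexicon form, the polyphone analogy might look like this (all entries below are made up, purely to illustrate one word carrying several decompositions, like a word with several pronunciations in a phonetic lexicon):

```python
# Toy illustration: one word may get several lexicon entries, one per
# decomposition, formatted like lexicon.txt lines (word, then its pieces).
lexicon = [
    ("hello", ["▁he", "llo"]),
    ("hello", ["▁hell", "o"]),  # same word, a second "pronunciation"
    ("world", ["▁world"]),
]

lines = [f"{word}\t{' '.join(pieces)}" for word, pieces in lexicon]
print("\n".join(lines))
```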