k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Shall we allow the mapping from one word to multiple token sequences in the lexicon? #273

huangruizhe commented 2 years ago

https://github.com/k2-fsa/icefall/blob/395a3f952be1449cd7c92b896f4eb9a1c899e2c7/egs/librispeech/ASR/local/prepare_lang_bpe.py#L145-L152

Here, the lexicon is generated by sp.encode(words, out_type=str), which produces exactly one entry per word. However, the word-to-token mapping is not necessarily one-to-one. For example:

>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']

So I was wondering whether we should populate the lexicon by enumerating all possibilities of decomposing a word. This is analogous to polyphonic characters (多音字, characters with multiple pronunciations) in a typical phonetic lexicon.
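As a rough illustration, here is a minimal sketch of such a lexicon builder, assuming a unigram-type SentencePiece model (n-best segmentation is not supported for BPE-type models); make_lexicon and the cutoff k are hypothetical names, not icefall code:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='spm.model')

def make_lexicon(words, k=3):
    """Return (word, pieces) pairs, with up to k segmentations per word."""
    lexicon = []
    for word in words:
        # nbest_encode_as_pieces returns the k most probable
        # segmentations of the input under the unigram model.
        for pieces in sp.nbest_encode_as_pieces(word, k):
            lexicon.append((word, pieces))
    return lexicon

# make_lexicon(['New York'], k=3) might yield entries such as
# ('New York', ['▁', 'New', '▁York']) and
# ('New York', ['▁', 'N', 'e', 'w', '▁York']).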

danpovey commented 2 years ago

Let me wait until the others give their opinions about this. I think it's definitely a reasonable option. My only question is how much we will actually be using lexicons in the case of BPE systems, since the main benefit of BPE is not having to use a lexicon.

csukuangfj commented 2 years ago

I was wondering whether we should populate the lexicon by enumerating all possibilities of decomposing a word.

Instead of enumerating all possibilities, there is existing work that uses only the top-K possibilities.

For instance,

[screenshot: excerpt from the referenced paper]

The authors show that it is helpful in improving WERs.

[screenshot: WER results from the paper]

They also show that it is helpful in training.

[screenshot: training results from the paper]

Note: there are two methods:

(1) Decode the sentence in multiple ways, i.e., sample a segmentation of the whole sentence.
(2) Decode each word in multiple ways, i.e., sample a segmentation of each word in the sentence independently.
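As a minimal sketch of the difference between the two, assuming the same SentencePiece model as above ('spm.model' is a placeholder, and splitting the sentence on whitespace is a simplification):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='spm.model')
sentence = 'New York is big'

# (1) Sentence-level: sample one segmentation of the whole sentence.
sentence_pieces = sp.encode(sentence, out_type=str, enable_sampling=True,
                            alpha=0.1, nbest_size=-1)

# (2) Word-level: sample a segmentation of each word independently,
# then concatenate the pieces.
word_pieces = []
for word in sentence.split():
    word_pieces += sp.encode(word, out_type=str, enable_sampling=True,
                             alpha=0.1, nbest_size=-1)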

huangruizhe commented 2 years ago

Thanks for the reference. Yeah, I agree we do not need to enumerate all possibilities. Maybe just the top-k ones, with the constraint that every subword piece must appear in the token list. (Note that a SentencePiece decomposition can contain tokens that are not in the token list; I remember your code also checks for that.)

In this way, we can guarantee that every lexicon entry is decodable from the ASR posterior.
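A hypothetical filtering step along these lines (filter_decodable and token_list are illustrative names, not icefall's API; token_list would hold the tokens the ASR model can emit, e.g. those listed in tokens.txt):

def filter_decodable(segmentations, token_list):
    """Keep only segmentations whose pieces all appear in token_list."""
    return [pieces for pieces in segmentations
            if all(p in token_list for p in pieces)]

# Example: keep only the decodable n-best segmentations of a word.
# entries = filter_decodable(sp.nbest_encode_as_pieces('New York', 5),
#                            token_list)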

If you like, maybe I can do some experiments after the conference deadline, with your help and suggestions. The two papers you pointed to seem like nice starting points for thinking about this.

csukuangfj commented 2 years ago

@huangruizhe

Yes, you are very welcome to implement it.