A problem is encountered during the decoding process: the composed HLG has no aux_labels
Can you check that L.fst and G.fst share the same words.txt?
Perhaps you could also try printing the aux_labels inside compile_HLG at different stages of graph creation, and see when they disappear.
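For example (a minimal sketch; the helper name and call sites below are hypothetical, just to illustrate the instrumentation):

```python
import k2

# Illustration only: instrument compile_HLG to see where aux_labels vanish.
def dump_aux_labels(fsa: k2.Fsa, stage: str) -> None:
    print(f'{stage}: aux_labels = {fsa.aux_labels}')

# e.g. inside compile_HLG:
#   dump_aux_labels(L, 'L after loading')
#   dump_aux_labels(LG, 'after composing L with G')
#   dump_aux_labels(LG, 'after removing disambig symbols')
#   dump_aux_labels(HLG, 'after composing H with LG')
```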
@csukuangfj @danpovey Thanks for the help! The problem has been solved now. Below is a brief description of it.
There is a line in the lexicon generated by `local/wsj_prepare_dict.sh`:

```
#SHARP-SIGN SH AA1 R P S AY1 N
```

and the word `#SHARP-SIGN` then exists in my words.txt. As a consequence, `first_word_disambig_id = find_first_disambig_symbol(self.words)` returns a wrong answer, and in `compile_HLG` the aux_labels are removed by:

```python
LG.aux_labels.values()[LG.aux_labels.values() >= aux_labels_disambig_id_start] = 0
```

I have checked that `#SHARP-SIGN` does not appear in the text of train/test_eval92/test_dev93 of the WSJ dataset, so it is safe to simply remove this line from the lexicon.
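To see why, here is a toy example (the symbol table below is made up, but mirrors the situation: a sorted lexicon gives `#SHARP-SIGN` a low word ID, while the real disambiguation symbols are appended at the end):

```python
import k2

# Hypothetical words.txt: a normal word starting with '#' sorts before
# the real words; the disambig symbols #0, #1, ... come last.
symbols = k2.SymbolTable.from_str('''
<eps> 0
#SHARP-SIGN 1
ABOUT 2
ZERO 3
#0 4
#1 5
''')

# find_first_disambig_symbol matches any key starting with '#':
first = min(v for k, v in symbols._sym2id.items() if k.startswith('#'))
print(first)  # -> 1 (#SHARP-SIGN), not the intended 4 (#0)
# With aux_labels_disambig_id_start == 1, nearly every aux_label is
# zeroed out, which is why the HLG ends up with no aux_labels.
```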
The correct pattern for disambig symbols is given in https://github.com/k2-fsa/snowfall/blob/375dc19dfe54313e00ff1aa22c7f9cd6e9e38b20/snowfall/common.py#L238:

```python
def get_phone_symbols(symbol_table: k2.SymbolTable,
                      pattern: str = r'^#\d+$') -> List[int]:
```

But `find_first_disambig_symbol` in https://github.com/k2-fsa/snowfall/blob/master/snowfall/common.py#L357

```python
def find_first_disambig_symbol(symbols: k2.SymbolTable) -> int:
    return min(v for k, v in symbols._sym2id.items() if k.startswith('#'))
```

does not follow that pattern.
Maybe we should update `find_first_disambig_symbol` to allow normal words to start with `#`.
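One possible fix is to reuse the stricter pattern from `get_phone_symbols` (a sketch, not a tested patch):

```python
import re

import k2

def find_first_disambig_symbol(symbols: k2.SymbolTable) -> int:
    """Return the smallest ID among the true disambiguation symbols
    (#0, #1, ...), ignoring ordinary words that merely start with '#'."""
    return min(v for k, v in symbols._sym2id.items()
               if re.match(r'^#\d+$', k))
```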
Agreed, we should do that. We should check whether Piotr's issue with deletions in Gigaspeech might be due to that (would only be possible if his lexicon was not sorted, I'd think). @pzelasko
Oh, interesting. I will check it out.
Going forward, maybe we need some sort of `validate_lm` procedure / unit test? It could e.g. check that a single-arc FSA created for each vocab word, composed with the LM, yields a non-empty FSA, or sample random utterances and compose them, etc.
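For concreteness, a rough sketch of what the word-level check could look like with k2 (`validate_lm` and its exact signature are hypothetical; this assumes G is a word-level LM FSA whose labels are word IDs):

```python
from typing import List

import k2

def validate_lm(G: k2.Fsa, word_ids: List[int]) -> List[int]:
    """Hypothetical check: return the word IDs whose single-arc FSA,
    intersected with the LM G, yields an empty result."""
    G = k2.arc_sort(G)  # intersection expects arc-sorted inputs
    bad = []
    for w in word_ids:
        word_fsa = k2.arc_sort(k2.linear_fsa([w]))  # single-arc FSA
        result = k2.connect(k2.intersect(G, word_fsa))
        if result.num_arcs == 0:  # word unreachable in the LM
            bad.append(w)
    return bad
```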
Mm, validation is good... I'm not sure that I know what types of LMs we'll be using / allowing yet though.
But regardless of their type and implementation, surely they all have to support some common set of operations? I can think of a few, e.g. providing non-zero likelihoods for all the words in the vocab, or for some random utterances sampled from the training set?
Hi team!
I'm trying to implement MMI training on the WSJ dataset. A problem is encountered during the decoding process: the composed HLG has no aux_labels (all blank, like `[] [] [] []` and so on). Below is how the L and G are prepared. All shell scripts are from the standard Kaldi WSJ recipe (k2_prepare_lang.sh is from the snowfall recipe), without any modification.
and to build the HLG:
Other information:
1. I printed L.aux_labels and G.aux_labels; both L and G seem to have their aux_labels. However, HLG.aux_labels are all empty.
2. This HLG can still be used for decoding: the lattices obtained from it have many states and paths, but (of course) still no aux_labels, which means we cannot get the hypotheses.
3. Following the snowfall recipe for AISHELL, this problem is not observed, so I guess there could be some problem in the L and G building stage.

Thanks for the help! :)