k2-fsa / k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.
https://k2-fsa.github.io/k2
Apache License 2.0

WSJ: aux_labels disappear after HLG_compose #784

Closed zrsjta closed 2 years ago

zrsjta commented 3 years ago

Hi team!

I'm trying to implement MMI training on the WSJ dataset. A problem is encountered during the decoding process: the composed HLG has no aux_labels (they are all empty, like [] [] [] [] and so on).

Below is how L and G are prepared. All shell scripts are from the standard Kaldi WSJ recipe (k2_prepare_lang.sh is from the snowfall recipe), without any modification.

    local/wsj_prepare_dict.sh --dict_suffix _phone
    local/wsj_extend_dict.sh --dict-suffix "_phone" $wsj1/13-32.1

    if [ $unit == "char" ]; then
      local/wsj_prepare_char_dict.sh
      local/wsj_extend_char_dict.sh $wsj1/13-32.1 data/local/dict_char \
                                  data/local/dict_char_larger
    fi

    lang_dict=dict_${unit}_larger
    local/k2_prepare_lang.sh --position-dependent-phones false data/local/$lang_dict \
      "<UNK>" data/local/lang_tmp_nosp $lang || exit 1

    local/wsj_train_lms.sh --dict-suffix _phone
    gunzip -c data/local/local_lm/3gram-mincount/lm_pr6.0.gz > $lang/3grams.arpa
    python3 -m kaldilm \
      --read-symbol-table="data/lang_${unit}/words.txt" \
      --disambig-symbol='#0' \
      --max-order=3  \
      $lang/3grams.arpa > $lang/G.fst.txt

and to build the HLG:

        phone_ids = get_phone_symbols(self.phones) # will remove 0
        phone_ids_with_blank = [0] + phone_ids
        ctc_topo = k2.arc_sort(build_ctc_topo(phone_ids_with_blank))
        if not os.path.exists(self.lang / 'HLG.pt'):
            logging.debug("Loading L_disambig.fst.txt")
            with open(self.lang / 'L_disambig.fst.txt') as f:
                L = k2.Fsa.from_openfst(f.read(), acceptor=False)
            logging.debug("Loading G.fst.txt")
            with open(self.lang / 'G.fst.txt') as f:
                G = k2.Fsa.from_openfst(f.read(), acceptor=False)
            first_phone_disambig_id = find_first_disambig_symbol(self.phones)
            first_word_disambig_id = find_first_disambig_symbol(self.words)
            HLG = compile_HLG(L=L,
                              G=G,
                              H=ctc_topo,
                              labels_disambig_id_start=first_phone_disambig_id,
                              aux_labels_disambig_id_start=first_word_disambig_id)
            torch.save(HLG.as_dict(), self.lang / 'HLG.pt')
        else:
            logging.debug("Loading pre-compiled HLG")
            d = torch.load(self.lang / 'HLG.pt')
            HLG = k2.Fsa.from_dict(d)
        print(HLG.aux_labels)

Other information:

(1) I printed L.aux_labels and G.aux_labels; it seems that both L and G have their aux_labels. However, HLG.aux_labels is entirely empty.

(2) This HLG can still be used for decoding: the lattices obtained from it have many states and paths, but (of course) still no aux_labels, which means we cannot recover the hypotheses.

(3) Following the snowfall recipe for AISHELL, this problem is not observed, so I guess there is some problem in the L and G building stage.

Thanks for the help! :)

csukuangfj commented 3 years ago

A problem is encountered during the decoding process: the composed HLG has no aux_labels

Can you check that L.fst and G.fst share the same words.txt?
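
For example, a quick sketch of that check (the paths are assumptions based on the scripts above; substitute the actual values of $lang and data/lang_${unit}):

    # Hypothetical paths standing in for $lang (used to build L) and
    # data/lang_${unit} (passed to kaldilm when building G).
    lang_for_L = 'data/lang_char'
    lang_for_G = 'data/lang_char'
    with open(f'{lang_for_L}/words.txt') as f:
        words_for_L = f.read()
    with open(f'{lang_for_G}/words.txt') as f:
        words_for_G = f.read()
    assert words_for_L == words_for_G, 'L and G use different words.txt!'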

danpovey commented 3 years ago

Perhaps you could also try printing the aux_labels inside compile_HLG at different stages of graph creation, and see when they disappear.
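
For instance, something along these lines (a sketch; the sequence of operations and variable names are assumptions about compile_HLG's internals, not its actual code):

    # Hypothetical instrumentation for compile_HLG: report aux_labels
    # after each graph operation to spot the stage where they vanish.
    def report_aux(name, fsa):
        # aux_labels may be a plain tensor or a ragged tensor depending
        # on the stage; printing works either way.
        print(f'{name}: aux_labels = {fsa.aux_labels}')

    L = k2.arc_sort(L)
    G = k2.arc_sort(G)
    LG = k2.compose(L, G)
    report_aux('after compose(L, G)', LG)
    LG = k2.determinize(k2.connect(LG))
    report_aux('after connect + determinize', LG)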

zrsjta commented 3 years ago

@csukuangfj @danpovey Thanks for the help! The problem has been solved now. Below is a brief description of it.

There is a line in the lexicon generated by local/wsj_prepare_dict.sh: #SHARP-SIGN SH AA1 R P S AY1 N, so #SHARP-SIGN ends up in my words.txt. As a consequence, first_word_disambig_id = find_first_disambig_symbol(self.words) returns a wrong answer, and in compile_HLG the aux_labels are removed by:

    LG.aux_labels.values()[LG.aux_labels.values() >= aux_labels_disambig_id_start] = 0

I have checked that #SHARP-SIGN does not appear in the text of train/test_eval92/test_dev93 of the WSJ dataset, so it is safe to just remove this line from the lexicon.
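
For illustration, a minimal sketch of the failure mode (the symbol ids below are made up):

    # '#SHARP-SIGN' is an ordinary word, but startswith('#') treats it as a
    # disambig symbol, so min() returns its id instead of the id of '#0'.
    sym2id = {'<eps>': 0, 'A': 1, '#SHARP-SIGN': 2, 'ZERO': 3, '#0': 4, '#1': 5}
    first = min(v for k, v in sym2id.items() if k.startswith('#'))
    print(first)  # 2, not 4: every aux_label >= 2 then gets zeroed out,
                  # wiping the word sequence from HLG.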

csukuangfj commented 3 years ago

The correct pattern for disambig symbols is given in https://github.com/k2-fsa/snowfall/blob/375dc19dfe54313e00ff1aa22c7f9cd6e9e38b20/snowfall/common.py#L238

def get_phone_symbols(symbol_table: k2.SymbolTable,
                      pattern: str = r'^#\d+$') -> List[int]:

But find_first_disambig_symbol in https://github.com/k2-fsa/snowfall/blob/master/snowfall/common.py#L357

def find_first_disambig_symbol(symbols: k2.SymbolTable) -> int:
    return min(v for k, v in symbols._sym2id.items() if k.startswith('#'))

does not follow that pattern.

Maybe we should update find_first_disambig_symbol so that normal words are allowed to start with #.
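
One possible fix would be to reuse the pattern from get_phone_symbols (a sketch, not a merged patch):

    import re

    def find_first_disambig_symbol(symbols: k2.SymbolTable) -> int:
        # Match only true disambig symbols (#0, #1, ...), ignoring
        # ordinary words that merely start with '#'.
        pattern = re.compile(r'^#\d+$')
        return min(v for k, v in symbols._sym2id.items()
                   if pattern.match(k))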

danpovey commented 3 years ago

Agreed, we should do that. We should check whether Piotr's issue with deletions in Gigaspeech might be due to that (would only be possible if his lexicon was not sorted, I'd think). @pzelasko

pzelasko commented 3 years ago

Oh, interesting. I will check it out.

Going forward, maybe we need some sort of validate_lm procedure / unit test? It could, e.g., check that a single-arc FSA created for each vocab word, composed with the LM, yields a non-empty FSA, or sample random utterances and compose them, etc.

danpovey commented 3 years ago

Mm, validation is good... I'm not sure that I know what types of LMs we'll be using / allowing yet though.


pzelasko commented 3 years ago

But regardless of their type and implementation, surely they all have to be able to execute some common set of operations? I can think of some, e.g. providing non-zero likelihoods for all the words in the vocab, or for some random utterances sampled from the training set.
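
A sketch of the per-word vocabulary check along those lines, assuming G is a word-level LM loaded as a k2.Fsa and word_ids lists every non-epsilon, non-disambig word id (the function name is illustrative, not an existing API):

    import k2

    def validate_lm_vocab(G, word_ids):
        # Intersect a single-arc FSA for each word with the LM; if the
        # trimmed result has no arcs, G assigns that word no path at all.
        G = k2.arc_sort(G)
        missing = []
        for w in word_ids:
            word_fsa = k2.linear_fsa([w])
            result = k2.connect(k2.intersect(word_fsa, G))
            if result.num_arcs == 0:
                missing.append(w)
        return missing  # words the LM cannot produce at all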