k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Compile train graph align data #136

Open ChangXiangshi opened 2 years ago

ChangXiangshi commented 2 years ago

I don't know how to generate a training graph for aligning data. Are there examples?

csukuangfj commented 2 years ago

Yes, please see https://github.com/k2-fsa/icefall/blob/ec591698b07090ff7840868e6b76195d16c246a2/egs/librispeech/ASR/conformer_ctc/ali.py#L211-L212

https://github.com/k2-fsa/icefall/blob/ec591698b07090ff7840868e6b76195d16c246a2/icefall/bpe_graph_compiler.py#L76-L93

csukuangfj commented 2 years ago

Note: Please use the latest k2, i.e., v1.11, to compute framewise alignments, as it fixes a bug: https://github.com/k2-fsa/k2/pull/877

ChangXiangshi commented 2 years ago

OK, I'll try.

ChangXiangshi commented 2 years ago

@csukuangfj It seems to support only BPE or character-unit alignments. How can I get alignments at different levels? In Kaldi I generate transition-id alignments, which can be mapped to pdf-ids, phones, and words.

csukuangfj commented 2 years ago

There are no transition ids or pdf ids here; those exist only in HMM/GMM-based systems.

If your modelling units are wordpieces, you can get alignments for BPE tokens and words. (Note: to get word alignments, you need to do some extra work. Since a BPE token starting with _ marks the beginning of a word, you can recover word alignment information from the BPE tokens.)

If your modelling units are characters, you can get alignments for characters and words. (To get word alignments, you have to attach an attribute to the alignment graph indicating which arcs begin a word.)

If your modelling units are phones, you can get alignments for phones and words, similar to the character case.
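To illustrate the wordpiece case above, here is a minimal sketch (not icefall code) of recovering word-level alignments from a BPE-token alignment. It assumes a list of `(start_frame, token)` pairs and uses SentencePiece's word-boundary marker "▁" (rendered as "_" above); the function name `word_alignment` is hypothetical.

```python
def word_alignment(token_ali):
    """token_ali: list of (start_frame, token) pairs from a BPE-level
    alignment.  A token starting with the SentencePiece marker "▁"
    begins a new word.  Returns a list of (word, start_frame) pairs."""
    words = []
    for frame, tok in token_ali:
        if tok.startswith("▁") or not words:
            words.append([tok.lstrip("▁"), frame])  # start a new word
        else:
            words[-1][0] += tok  # continue the current word
    return [(w, f) for w, f in words]
```

The character and phone cases are analogous once each arc carries a word-start attribute in place of the "▁" prefix.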

armusc commented 2 years ago

Hi, just a question, even if I'm not sure I'm asking in the right place.

It seems to me you're getting your best WER results (at least on LibriSpeech) with word-piece models rather than phone-based or char-based ones (char-based being basically a graphemic phonetization, right? i.e., the modelling unit is a letter of the word).

But if your decoding lexicon contains words that are OOV with respect to the training transcripts on which the BPE model was trained, you might decompose those OOV words into word pieces that are not modelled by the BPE model, which is basically never the case with char- or phone-based models.

Is the BPE model something recommended for large training corpora and vocabulary sizes? (I'm asking because I have never trained word-piece models before.)

csukuangfj commented 2 years ago

> you might decompose those OOV words into word pieces that are not modelled by the BPE model

The lexicon built from the BPE model maps every OOV word to the token (i.e., word piece) `<unk>`. All pieces of a BPE model, including `<unk>`, are used during training, I think.
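The OOV handling described above can be pictured with a toy lexicon lookup (the names `lookup` and the sample entries are hypothetical, purely for illustration):

```python
def lookup(word, lexicon):
    """Map a word to its BPE pieces; any OOV word maps to <unk>."""
    return lexicon.get(word, ["<unk>"])

# Toy lexicon, conceptually built by running the BPE model over the vocabulary.
lexicon = {"HELLO": ["▁HE", "LLO"], "WORLD": ["▁WOR", "LD"]}
```

Any word absent from the lexicon, no matter how it might have decomposed into pieces, is represented by the single token `<unk>` during graph construction.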


> is the BPE model something recommended for large training corpora and vocabulary sizes?

Sorry, I cannot say too much about this. As far as I know, ESPnet and SpeechBrain both use wordpieces as modelling units. One practical advantage: you don't need to be an expert to build a lexicon; it can be learned from data when using a BPE model.

armusc commented 2 years ago

> There are no transition ids or pdf ids here; those exist only in HMM/GMM-based systems.
>
> If your modelling units are wordpieces, you can get alignments for BPE tokens and words. (Note: to get word alignments, you need to do some extra work. Since a BPE token starting with _ marks the beginning of a word, you can recover word alignment information from the BPE tokens.)

I can see that. Besides that, it is also necessary to eliminate duplicate tokens that are contiguous in time (i.e., with no blank in between, as in CTC loss computation). I noticed this while using rescoring with the transformer decoder.
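The de-duplication being described is the standard CTC collapse: merge consecutive repeats first, then drop blanks, so that only blank-separated repeats survive as distinct tokens. A minimal sketch (token IDs are made up, with 0 as the blank):

```python
def ctc_collapse(frame_tokens, blank=0):
    """Collapse a framewise token sequence CTC-style: consecutive
    duplicates are merged first, then blanks are removed."""
    out = []
    prev = None
    for t in frame_tokens:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# In [3, 3, 0, 3, 5, 5, 0] the two runs of 3 are separated by a blank,
# so they remain two distinct tokens after collapsing.
```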

csukuangfj commented 2 years ago

> I can see that. Besides that, it is also necessary to eliminate duplicate tokens that are contiguous in time (i.e., with no blank in between, as in CTC loss computation). I noticed this while using rescoring with the transformer decoder.

Do you mean there are no blanks between repeated symbols in the following tokens?

https://github.com/k2-fsa/icefall/blob/273e5fb2f3ac2620bafdffe2689b8b3ee10173d3/icefall/decode.py#L855

armusc commented 2 years ago

When I do get_alignments(best_path) on the best_path (for each attention and LM scale) from the transformer decoder, I sometimes get repeated tokens with no blanks in between, and to retrieve the correct word I have to count the token only once.

csukuangfj commented 2 years ago

https://github.com/k2-fsa/icefall/blob/273e5fb2f3ac2620bafdffe2689b8b3ee10173d3/icefall/utils.py#L227

How do you invoke get_alignments?

armusc commented 2 years ago

> https://github.com/k2-fsa/icefall/blob/273e5fb2f3ac2620bafdffe2689b8b3ee10173d3/icefall/utils.py#L227
>
> How do you invoke get_alignments?

Just like it is shown in conformer_ctc/ali.py: `ali_ids = get_alignments(best_path)`

In decode_one_batch:

```python
if best_path_dict is not None:
    for lm_scale_str, best_path in best_path_dict.items():
        hyps = get_texts(best_path)
```

csukuangfj commented 2 years ago

> just like it is shown in conformer_ctc/ali.py

get_alignments requires two arguments. How do you call it?

There is no get_texts() in conformer_ctc/ali.py

armusc commented 2 years ago

> just like it is shown in conformer_ctc/ali.py
>
> get_alignments requires two arguments. How do you call it?
>
> There is no get_texts() in conformer_ctc/ali.py

I should have been more specific: get_texts and decode_one_batch are in conformer_ctc/decode.py. The part I quoted comes from the point in that function where rescore_with_attention_decoder is called; that is where I called get_alignments to extract the token (and word) alignments from best_path (for each attention scale and lm_scale in the list).

But you're right, I haven't updated icefall in the last couple of months (k2 and lhotse, yes, but not icefall); before, get_alignments took only best_path as its argument. I'll download the latest version, sorry.