lumaku / ctc-segmentation

Segment an audio file and obtain utterance alignments. (Python package)
Apache License 2.0
320 stars 29 forks

What is the difference between <blank> and self-transition? #9

Closed houwenxin closed 3 years ago

houwenxin commented 3 years ago

Thank you for providing this useful toolkit! I am new to it and am learning it. As far as I know, <blank> in CTC means continuing the last character, so what does the self-transition mean? Can I treat them as the same?

lumaku commented 3 years ago

Thanks! I hope that this code is useful for you.

CTC was modeled after HMMs; that is why a neural network trained with CTC is described in HMM terms. HMMs consist of states, transitions, and observation probabilities. To better understand how CTC works, I can really recommend reading about HMMs and the introduction to CTC in Alex Graves' dissertation.

Let's view it from a theoretical perspective and say that a neural network trained with CTC outputs characters in the form of state probabilities. For each character, the network has one output neuron. To make such a method work with neural networks, an additional state has to be added: the "blank" state, for when no other character occurs. When training a CTC network, blank states are put in-between characters. For CTC segmentation, this is optional.
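To illustrate the role of the blank state, here is a minimal sketch of the standard CTC collapse rule (the frame labels and the `ctc_collapse` helper are hypothetical, not part of this library): repeated states are merged, then blank states are removed.

```python
def ctc_collapse(frames, blank="-"):
    """Collapse a per-frame CTC label sequence into the output characters:
    merge repeated labels, then drop the blank label."""
    collapsed = []
    previous = None
    for label in frames:
        if label != previous:  # a state change, i.e. not a self-transition
            collapsed.append(label)
        previous = label
    return [c for c in collapsed if c != blank]

# Blanks between the two "l"s keep them from being merged into one:
print(ctc_collapse(list("--hh-e-ll-ll--o-")))  # ['h', 'e', 'l', 'l', 'o']
```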

So the output of the neural network indicates which hidden "character" state the sequence is in at a given time step. If the state does not change within the next time step, that's a self-transition. For CTC, a self-transition ideally happens in a blank state. For the model that CTC segmentation uses, a self-transition can also happen at a certain character.
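As a small illustration of this (the probabilities below are made up, not from a real network), you can read off self-transitions wherever the most probable state stays the same from one frame to the next:

```python
import numpy as np

# Hypothetical per-frame state probabilities for states ["-", "a", "b"]
# (blank first); each row is one time step.
probs = np.array([
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.7,  0.1 ],  # "a"
    [0.1, 0.8,  0.1 ],  # "a" again -> self-transition in a character state
    [0.8, 0.1,  0.1 ],  # blank
    [0.7, 0.1,  0.2 ],  # blank again -> self-transition in the blank state
    [0.1, 0.1,  0.8 ],  # "b"
])
states = probs.argmax(axis=1)
self_transitions = [
    (t, int(states[t]))
    for t in range(1, len(states))
    if states[t] == states[t - 1]
]
print(self_transitions)  # [(2, 1), (4, 0)]
```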

The CtcSegmentationParameters() configuration object has some options that modify the state sequences (e.g., blanks between characters or not) and how the transition probabilities are calculated (e.g., self-transitions in a blank state are free). You can adapt these to your needs.
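A minimal configuration sketch might look like this; the `char_list` values are made-up placeholders, and the commented-out option names are from memory, so please verify them against the class docstring of your installed version:

```python
from ctc_segmentation import CtcSegmentationParameters

# Hypothetical token inventory of a CTC network, blank first.
char_list = ["<blank>", "a", "b", "c"]

config = CtcSegmentationParameters()
config.char_list = char_list
config.index_duration = 0.04  # seconds per CTC output frame (network-dependent)
# Further options that change the state sequence / transition costs, e.g.:
# config.blank = 0                         # index of the blank token
# config.replace_spaces_with_blanks = True # insert blanks at word boundaries
```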

houwenxin commented 3 years ago

Great, thank you for your patience and the detailed explanation! Maybe I will look into your code for details of the calculation.

houwenxin commented 3 years ago

Would it be possible to use this algorithm for Frame-Label alignment? For example, if the result is " a b c d", can we regard the 2nd frame as the frame duration of "a", "b " as the frame duration of "b" (which is longer)?

lumaku commented 3 years ago

For each time step, i.e. frame, a current most probable state is given in the variable state_list which is returned by ctc_segmentation(). Also, timings give you the frame index at which a character occurs - you could use the frame index differences from the start to the end character of the word to measure its duration.

```python
timings, char_probs, state_list = ctc_segmentation(config, lpz, ground_truth_mat)
```
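Following that suggestion, a word's duration can be sketched as the difference between the timings of its first and last character (the `timings` values and character indices below are made up for illustration):

```python
def word_duration(timings, start_char_idx, end_char_idx):
    """Duration (in the units of `timings`, e.g. seconds) between the
    first and the last character of a word."""
    return timings[end_char_idx] - timings[start_char_idx]

# Hypothetical timings, one entry per ground-truth character:
timings = [0.0, 0.25, 0.40, 0.75, 1.0]
# Duration of a word spanning characters 1..3:
print(word_duration(timings, 1, 3))  # 0.5
```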

From our experiments (see paper), the timings of the CTC network are good. CTC segmentation works well for words, but I am not so sure whether it works as well for single characters: in regular RNN-based networks, subsampling is used to the extent that a token "occurs" at almost every frame or every second frame, i.e., the time resolution is not that high. Furthermore, the information of characters within a spoken word is not always distributed linearly.

houwenxin commented 3 years ago

Great, thank you very much! Maybe a word-based vocabulary will lead to better alignment, but some papers show that a character-based or subword-based vocabulary helps improve recognition performance to some extent. Is that true? I think the recognition performance may also affect the alignment performance.

lumaku commented 3 years ago

Experiments indicate that, to some extent, better ASR performance of the CTC-based network yields more accurate alignments.

In the description I gave you, I assumed that your CTC-based network uses shorter tokens, such as characters. It depends on many factors (ASR system, dataset, language, ...) whether you get better speech recognition performance with longer/shorter or more/fewer tokens. It is easier for your ASR system to classify when all of the tokens have a similar probability, so subword units can improve the performance.

houwenxin commented 3 years ago

Okay, I think I have learned a lot, thank you very much~