Closed houwenxin closed 3 years ago
Thanks! I hope that this code is useful for you.
CTC was modeled after HMMs, which is why a neural network trained with CTC is described in HMM terms. HMMs consist of states, transitions and observation probabilities. To get to know how CTC works, I can really recommend reading about HMMs and the introduction to CTC in Alex Graves' dissertation.
Let's view it from a theoretical perspective and say that a neural network trained with CTC outputs characters in the form of state probabilities. For each character, the network has one output neuron. To make such a method work with neural networks, one additional state has to be added: the "blank" state, for when no other character occurs. When training a CTC network, blank states are put in between characters. For CTC segmentation, this is optional.
So the output of the neural network indicates which hidden "character" state the sequence is in at a given time step. If the state does not change within the next time step, that's a self-transition. For CTC, a self-transition ideally happens in a blank state. For the model that CTC segmentation uses, a self-transition can also happen at a certain character.
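To make the state view above concrete, here is a minimal sketch in plain Python (no toolkit required) of how a frame-wise path of most probable states is collapsed into a label sequence: repeated states are self-transitions and are merged, then blank states are dropped. The state indices and the example path are made up for illustration.

```python
# Minimal sketch of CTC's state view, with plain Python.
# States: one per character plus a "blank" state (here index 0).
# A self-transition means the most probable state stays the same
# across consecutive frames.

def collapse_ctc_path(state_path, blank=0):
    """Collapse a frame-wise state path into a label sequence:
    merge repeated states (self-transitions), then drop blanks."""
    collapsed = []
    prev = None
    for s in state_path:
        if s != prev:   # transition to a different state: keep it
            collapsed.append(s)
        prev = s        # same state as before -> self-transition, merged
    return [s for s in collapsed if s != blank]

# Hypothetical frame-wise most probable states for "cat",
# with _ = blank state 0 and c=1, a=2, t=3:
#   c c _ a a a _ t
path = [1, 1, 0, 2, 2, 2, 0, 3]
print(collapse_ctc_path(path))  # [1, 2, 3] -> "cat"
```

Note that the self-transition on the two `c` frames is merged into a single `c`, which is exactly why CTC prefers self-transitions to happen in the blank state: a repeated character like "ll" needs a blank in between to survive the collapse.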
The CtcSegmentationParameters() module has some options that modify the state sequences (e.g., whether blanks are placed between characters) and how the transition probabilities are calculated (e.g., whether a self-transition in a blank state is free). You can adapt these to your needs.
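As a rough sketch, adapting the configuration might look like the following. The attribute names used here (char_list, min_window_size, index_duration) are assumptions from memory and may differ between package versions, so please check the README of the ctc-segmentation version you have installed.

```python
# Hedged sketch of adjusting CtcSegmentationParameters before alignment.
# Attribute names below are assumptions that may differ between package
# versions -- consult the package README for your installed version.
from ctc_segmentation import CtcSegmentationParameters

config = CtcSegmentationParameters()
config.char_list = ["·", "a", "b", "c"]  # network output tokens, blank first
config.min_window_size = 8000            # minimum alignment window, in frames
config.index_duration = 0.04             # seconds per CTC output index
```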
Great, thank you for your patience and the detailed explanation! Maybe I will look into your code for details of the calculation.
Would it be possible to use this algorithm for frame-label alignment? For example, if the result is "
For each time step, i.e. frame, the current most probable state is given in the variable state_list, which is returned by ctc_segmentation().
Also, timings gives you the frame index at which a character occurs; you could use the difference between the frame indices of the start and end characters of a word to measure its duration.
timings, char_probs, state_list = ctc_segmentation(config, lpz, ground_truth_mat)
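To illustrate the word-duration idea, here is a small self-contained sketch. The timings array and the frame shift below are made-up numbers for illustration; in practice, timings comes from the ctc_segmentation() call above and the seconds-per-index value depends on your model's subsampling.

```python
# Sketch: estimating a word's duration from per-character timings.
# All numbers here are hypothetical; in practice `timings` is returned
# by ctc_segmentation(config, lpz, ground_truth_mat).

frame_shift = 0.04  # seconds per frame index (hypothetical; model-dependent)

# Hypothetical per-character frame indices for the utterance "hi there":
# chars:     h   i  ' '  t   h   e   r   e
timings = [10, 14, 18, 22, 25, 29, 33, 36]

def word_duration(timings, start_char_idx, end_char_idx, frame_shift):
    """Duration in seconds of a word spanning characters [start, end]."""
    return (timings[end_char_idx] - timings[start_char_idx]) * frame_shift

# Duration of "there" (character indices 3..7):
print(round(word_duration(timings, 3, 7, frame_shift), 2))  # 0.56
```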
From our experiments (see paper), the timings of the CTC network are good. CTC segmentation works well for words, but I am not so sure whether it works that well for single characters: in regular RNN-based networks, subsampling is used to the extent that a token "occurs" at almost every frame or every second frame, i.e. the time resolution is not that high. Furthermore, the information of characters within a spoken word is not always distributed linearly.
Great, thank you very much! Maybe word-based vocabulary will lead to better alignment, but some papers show that character-based or subword-based vocabulary will help improve the recognition performance to some extent. Is that true? I think the recognition performance may also affect the alignment performance.
Experiments indicate that, to some extent, the better the ASR performance of the CTC-based network, the higher the accuracy of the alignments.
In the description I gave you, I assumed that your CTC-based network uses shorter tokens, such as characters. It depends on many factors (ASR system, dataset, language, ...) whether you get better speech recognition performance with longer/shorter or more/fewer tokens. Classification is easier for your ASR system when all of the tokens have a similar probability, so subword units can improve the performance.
Okay, I think I have learned a lot, thank you very much~
Thank you for providing this useful toolkit! I am new to it and am still learning it. As far as I know, in CTC a repeated output means continuing the last character; what does self-transition mean, then? Can I treat them as the same?