lumaku / ctc-segmentation

Segment an audio file and obtain utterance alignments. (Python package)
Apache License 2.0

score_min_mean_over_L question and data preparation #6

Closed ekmb closed 3 years ago

ekmb commented 3 years ago

Hello.

Could you please help with a few questions:

  1. Have you tried different values for score_min_mean_over_L (https://github.com/lumaku/ctc-segmentation/blob/master/ctc_segmentation/ctc_segmentation.py#L42)? Could you please explain the intuition behind the value and how it relates to the frame duration?

  2. The paper says: "To perform CTC-segmentation on the Librivox corpus, we combined the audio files with the corresponding ground truth text pieces from Project Gutenberg-DE [6]." Do you mean that you combined all Librivox audio pieces into one large audio file? Or did you use the original Librivox audio segments and cut the text into matching pieces? If so, did you cut the text manually or automatically, with some text overlap? Also, have you observed how the algorithm performs when phrases in the middle of the audio have no corresponding transcript?

Thank you.

lumaku commented 3 years ago

score_min_mean_over_L is the value L in formula (3) of the paper. It affects the confidence score: character probabilities are accumulated over each block of L frames, and the minimum over the utterance is taken, which is roughly the probability of the least probable word. A lower L makes the score more sensitive to errors in the transcription, but also to errors of the ASR model. 30 should be a good value, so that roughly three characters fall into a single partition (assuming 4x subsampling).
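To make the idea concrete, here is a minimal sketch of such a score. It is not the package's code: the function name `min_mean_over_l` is made up for illustration, it averages per-frame log-probabilities within non-overlapping blocks of L frames (an assumption based on the parameter name), and the frame-duration arithmetic assumes a 10 ms frame shift before subsampling, which depends on the ASR front end.

```python
import numpy as np

# Assumption: 10 ms frame shift before subsampling (depends on the ASR front end).
FRAME_SHIFT_MS = 10
SUBSAMPLING = 4
L = 30
# Audio duration covered by one block of L output frames, in milliseconds.
block_duration_ms = FRAME_SHIFT_MS * SUBSAMPLING * L


def min_mean_over_l(char_log_probs, L=30):
    """Illustrative score: lowest mean log-probability over blocks of L frames.

    Hypothetical helper, not part of ctc_segmentation. Blocks are
    non-overlapping; the last block may be shorter than L.
    """
    values = np.asarray(char_log_probs, dtype=float)
    if values.size == 0:
        raise ValueError("char_log_probs must not be empty")
    block_means = [values[i:i + L].mean() for i in range(0, values.size, L)]
    return float(min(block_means))


# One badly matching stretch of 30 frames in an otherwise good utterance.
example = [-0.1] * 30 + [-2.0] * 30 + [-0.2] * 30
score_small_l = min_mean_over_l(example, L=30)  # isolates the bad block
score_large_l = min_mean_over_l(example, L=90)  # bad block is averaged out
```

With the smaller L the bad stretch dominates the score, while the larger L dilutes it with the surrounding frames. That is the sensitivity trade-off described above.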

Librivox: We ran inference on only one audio file at a time, usually a chapter or a book. The text of that chapter or book was split automatically at sentence endings to derive utterances, without overlap.
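A rough sketch of what splitting at sentence endings without overlap could look like. This is only an illustration, not the preprocessing used for the paper: the helper `split_into_utterances` is hypothetical, and the simple rule will split wrongly after abbreviations such as "Dr.".

```python
import re


def split_into_utterances(text):
    """Split text after '.', '!' or '?' when followed by whitespace.

    Hypothetical helper for illustration. Each character of the input ends
    up in at most one utterance, so the pieces do not overlap.
    """
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [piece for piece in pieces if piece]


chapter = "Es war einmal ein König. Hatte er eine Tochter? Ja! Sie hieß Anna."
utterances = split_into_utterances(chapter)
```

Each resulting string would then be one entry of the ground truth text list passed to the segmentation.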

Shorter phrases in the middle of the audio should have no big impact. I'd expect larger unrelated segments to degrade the results, but I have not tested this. You can inspect the state_list to see how well the unrelated part was "ignored".