lumaku / ctc-segmentation

Segment an audio file and obtain utterance alignments. (Python package)
Apache License 2.0

Timing squeezed in the beginning #18

Closed: sarapapi closed this issue 2 years ago

sarapapi commented 2 years ago

Dear authors, I tried to use your library to align a true-cased text containing punctuation, but I have a problem with the timings obtained: they all seem squeezed towards the beginning. I set index_duration to 0.04 since I extract features every 10 ms and apply a subsampling factor of 4 at the beginning. My tokenized textual predictions look like the following:

▁But ▁if ▁you ▁could ▁take ▁a ▁pill ▁ <eol> ▁or ▁a ▁vaccine , ▁ <eob> ▁and ▁just ▁like ▁getting ▁over ▁a ▁cold , ▁ <eob> ▁you ▁could ▁heal ▁your ▁wind ▁faster ? ▁ <eob>

where <eob> and <eol> are treated as special characters in my vocabulary. I use <eob> as a split token, i.e., a sentence is split whenever <eob> is found in the text (see the sketch below). The timings obtained are:

0.04-1.52
1.52-2.24
2.24-5.44

The first thing that is not correct is the total duration: the segment is 6.75 s long, but the last value returned by the CTC segmentation is 5.44 s. The other thing is that the intervals between the timings returned by your library are 1.48 s, 0.32 s, and 2.20 s, whereas if I listen to the audio and measure them myself they are roughly 1.9 s, 2 s, and 1.7 s. I observe the same phenomenon in other examples: all the timings seem squeezed towards the beginning of the sentence. I have used both prepare_text and prepare_token_list, but that is not the cause of the problem. Do you have any hint on where the problem is? Thank you in advance
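
For reference, a minimal sketch of the split at <eob> (illustrative plain Python; the helper name is made up and is not part of the ctc_segmentation API):

# Illustrative helper: split a token sequence into utterances at <eob>.
def split_at_eob(tokens):
    utterances, current = [], []
    for token in tokens:
        if token == "<eob>":
            if current:
                utterances.append(current)
            current = []
        else:
            current.append(token)
    if current:
        utterances.append(current)
    return utterances

# e.g. split_at_eob(prediction_string.split()) -> one token list per <eob>-delimited block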

lumaku commented 2 years ago

It looks like there is almost one second missing from your timing. Some ASR models apply padding to input sequences, which can have a larger impact on short audio, but that usually adds no more than 400 ms to the beginning or end of the audio. Your estimate of the timing parameters may not be accurate enough. Please check:

  1. What is the length of your audio in samples? -> speech_len
  2. What is the length of your network output, i.e. lpz? -> lpz_len
  3. What is your sample rate, e.g. 16000? -> fs

Then, calculate index_duration as follows:

samples_to_frames_ratio = speech_len / lpz_len    # audio samples per CTC output frame
index_duration = samples_to_frames_ratio / fs     # duration of one CTC output frame in seconds
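
For example, a minimal sketch of reading these values off the actual arrays instead of using an assumed subsampling factor (speech and lpz are placeholder names for the raw waveform and the CTC network output; the CtcSegmentationParameters config object is taken from this package, but double-check the name against your installed version):

from ctc_segmentation import CtcSegmentationParameters

fs = 16000                                   # sample rate of the raw audio in Hz
speech_len = speech.shape[0]                 # raw waveform length in samples
lpz_len = lpz.shape[0]                       # number of CTC output frames
config = CtcSegmentationParameters()
config.index_duration = (speech_len / lpz_len) / fs   # seconds per CTC output frame
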
sarapapi commented 2 years ago

I noticed that my model tended to generate a lot of blanks, so I discouraged their production at inference time. I had a look at the probability distribution of the tokens of the sentence I reported to you, and they are now mostly distributed across the entire audio.

My speech_len is 676 feature frames, representing 6760 ms of audio, and my lpz_len is 169, since I have a subsampling factor of 4. The architecture I used is a Transformer with 2 convolutional layers at the beginning, which compress the audio by a factor of 4. The audio features are 80-channel filterbank features extracted with a 25 ms window size and a 10 ms shift. The sample rate of my audio is 16 kHz. The strange thing is that I now obtain the following times:

0.04-1.56
1.56-2.28
2.28-3.0

These values, if multiplied by 2, are mostly in sync with the audio (only the last time of the utterance ends about 100 ms early, as you mentioned before). So my question is: why do I need to multiply by 2? Have you ever experienced something similar? Thank you again

EDIT: I should add that this multiplication factor of 2 lets me align correctly all the other sentences of the same audio, and also of other audios, so it is not an isolated case.

lumaku commented 2 years ago

It is hard to answer your question without specific details of the architecture and preprocessing. My intuition is that there might be additional downsampling in the preprocessing.
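
As a rough sanity check with the numbers reported above (a sketch, assuming 16 kHz audio, a 10 ms feature shift, and that the convolutional frontend is the only subsampling after feature extraction):

fs = 16000                                        # sample rate in Hz
speech_len = 676 * 160                            # 676 feature frames * 10 ms shift = 6.76 s -> 108160 samples
lpz_len = 169                                     # 676 / 4 with the assumed subsampling factor of 4
index_duration = (speech_len / lpz_len) / fs      # 640 / 16000 = 0.04 s per CTC frame
# The alignments only line up when multiplied by 2, i.e. an effective index_duration of 0.08 s,
# which is 1280 samples or 8 feature frames per CTC output frame. That would be consistent with
# an additional factor-of-2 downsampling somewhere in the preprocessing, as suggested above.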