Closed — sarapapi closed this issue 2 years ago
It looks like almost one second is missing from your timing. Sometimes ASR models apply padding to input sequences, which can have a larger impact on short audio sequences, but that usually adds no more than 400 ms to the beginning or end of the audio. Your estimation of the timing may not be accurate enough. Please check:
-> `speech_len`
-> `lpz_len`
-> `fs`

Then calculate `index_duration` as follows:

samples_to_frames_ratio = speech_len / lpz_len
index_duration = samples_to_frames_ratio / fs
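The two lines above can be sketched as a small helper. This is only an illustration, assuming `speech_len` counts raw audio samples and `lpz_len` the number of CTC output frames; the function name and the example numbers are mine:

```python
def compute_index_duration(speech_len: int, lpz_len: int, fs: int) -> float:
    """Duration in seconds of one CTC output frame.

    speech_len: number of raw audio samples (assumed, not feature frames)
    lpz_len:    number of CTC output frames
    fs:         sample rate in Hz
    """
    samples_to_frames_ratio = speech_len / lpz_len
    return samples_to_frames_ratio / fs

# Illustrative numbers: 6.76 s of 16 kHz audio (108160 samples)
# and 169 CTC frames give 0.04 s per frame.
print(compute_index_duration(108160, 169, 16000))  # -> 0.04
```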
I noticed that my model tended to generate a lot of blanks, so I discouraged their production at inference time. I had another look at the token probability distribution for the sentence I reported to you, and the tokens are now distributed across the entire audio.
My `speech_len` is 676 feature frames, representing 6760 ms of audio, and my `lpz_len` is 169, since I have a subsampling factor of 4.
The architecture I used is a Transformer with 2 initial convolutional layers that compress the audio by a factor of 4. The audio features are 80-channel filterbank features extracted with a 25 ms window and a 10 ms shift.
The sample rate of my audio is 16kHz.
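To make these figures concrete, here is a small sanity check recomputing the per-frame duration from the numbers quoted above (the variable names are mine):

```python
# Sanity check of index_duration from the figures above.
feature_frames = 676    # 10 ms filterbank frames (6760 ms of audio)
lpz_len = 169           # CTC output frames after 4x subsampling
frame_shift_s = 0.010   # feature hop size in seconds
fs = 16000              # sample rate in Hz

# One CTC frame covers (feature_frames / lpz_len) feature frames:
index_duration = feature_frames / lpz_len * frame_shift_s
print(index_duration)   # -> 0.04

# Equivalent computation from raw samples (6760 ms at 16 kHz):
samples = 6760 * fs // 1000     # 108160 samples
print(samples / lpz_len / fs)   # -> 0.04
```

Both routes give the 0.04 s per CTC frame that was used as `index_duration`.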
The strange thing is that I obtain the following times now:
0.04-1.56 1.56-2.28 2.28-3.0
These values, when multiplied by 2, are mostly in sync with the audio (only the last timestamp of the utterance is 100 ms early, as you mentioned before). So my question is: why do I need to multiply by 2? Have you ever experienced something similar?
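Just to make the factor-of-2 observation concrete, a quick numeric check (the boundary times are copied from above; multiplying the timestamps by 2 is equivalent to doubling `index_duration`, since the returned timings scale linearly with it):

```python
# Boundary times reported above, with index_duration = 0.04.
raw_times = [0.04, 1.56, 2.28, 3.0]

# Doubling every timestamp, as described in the question:
doubled = [round(2 * t, 2) for t in raw_times]
print(doubled)  # -> [0.08, 3.12, 4.56, 6.0]

# The last doubled boundary (6.0 s) is close to the audio length
# (6.76 s), which would be consistent with an effective
# index_duration of 0.08 s, i.e. one extra factor-of-2 of
# downsampling somewhere in the pipeline.
```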
Thank you again
EDIT: I should add that this multiplication factor of 2 lets me correctly align all the other sentences of the same audio, and of other audios as well, so it is not an isolated case.
It is hard to answer your question without specific details of the architecture and preprocessing. My intuition is that there might be additional downsampling in the preprocessing.
Dear authors, I tried to use your library to align a true-cased text containing punctuation, but I have a problem with the obtained timings: they all seem squeezed towards the beginning. I set `index_duration` to 0.04 since I extract features every 10 ms and have a subsampling factor of 4 at the beginning. My tokenized textual predictions look like the following:

▁But ▁if ▁you ▁could ▁take ▁a ▁pill ▁ <eol> ▁or ▁a ▁vaccine , ▁ <eob> ▁and ▁just ▁like ▁getting ▁over ▁a ▁cold , ▁ <eob> ▁you ▁could ▁heal ▁your ▁wind ▁faster ? ▁ <eob>
where `<eob>` and `<eol>` are treated as special characters in my vocabulary. I select `<eob>` as the split token, i.e., a sentence is split when we find `<eob>` in the text. The timings obtained are:

0.04-1.52 1.52-2.24 2.24-5.44
The first thing that is not correct is the total duration: the segment is 6.75 s long, but the last value in the timings obtained from the CTC segmentation is 5.44 s. The other thing is that, if I compute the intervals between the timings returned by your library, I get: 1.48 s, 0.32 s, 2.20 s, but if I listen to the audio and measure them, they are approximately: 1.9 s, 2 s, 1.7 s.
Also, if I look at other examples I can observe the same phenomenon: all the timings seem squeezed towards the beginning of the sentence. I have used both `prepare_text` and `prepare_token_list`, but that is not the cause of the problem. Do you have any hint on where the problem might be? Thank you in advance.