jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper

Suppressing timestamps in silent regions - is the premise correct? #48

Closed ryanheise closed 1 year ago

ryanheise commented 1 year ago

If I understand the implementation correctly, it suppresses any timestamps that fall within a silent region.

But now consider the following scenario where # indicates speech, and | indicates a candidate timestamp for the start of the first word in the segment:

1  2              3     4
|  |              |     |
                   ######## ### ###### #############                  

The most accurate timestamp candidate is number 3, because it is the closest to the boundary between silence and speech, but it happens to fall on the silent side of that boundary by a very small amount, so the best candidate will unfortunately be filtered out.

Now consider timestamps that occur in the middle of the segment, and consider a scenario where the most accurate candidates happen to fall in these places:

                           |   |      | 
                   ######## ### ###### #############                  

If you filter these timestamps out because they just land on the silent side of the boundary, you will actually get less accurate timestamps. And rather than just switch off suppress_middle, I think these silent gaps should be treated as useful signposts as to where word boundaries are likely to be according to the speech signal.

So I am thinking the premise should be flipped on its head. I would think that the boundaries of these silent gaps should act as attractors for good timestamp candidates. And I would go so far as to say that nearby timestamp candidates should be snapped to the boundaries of these silent regions if they are close enough. Let's say, the larger the silent gap and the closer a timestamp is to the boundary, the stronger should be the attraction of a timestamp to that nearby boundary.
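
To make this concrete, here is a very rough sketch of the kind of snapping I have in mind (illustrative only; the `words`/`gaps` layout and the tolerance rule are made up, not anything stable-ts currently does):

```python
# Rough sketch of "snap to silence boundary" (not stable-ts code).
# words: dicts with start/end in seconds; gaps: (left, right) silent regions
# from the silence mask or a VAD.

def snap_words_to_gaps(words, gaps, base_tolerance=0.1):
    """Pull word boundaries toward the edges of nearby silent gaps.

    The attraction radius grows with the gap duration, so a long pause pulls
    timestamps from farther away than a tiny inter-word gap does.
    """
    for word in words:
        for left, right in gaps:
            tolerance = base_tolerance * (1.0 + (right - left))
            if abs(word["start"] - right) <= tolerance:
                word["start"] = right  # gap's right edge attracts the next word's start
            if abs(word["end"] - left) <= tolerance:
                word["end"] = left     # gap's left edge attracts the previous word's end
    return words

words = [{"word": "eat.", "start": 3.92, "end": 4.10},
         {"word": "So",   "start": 4.35, "end": 4.60}]
gaps = [(4.05, 4.45)]  # silent region detected between the two sentences
print(snap_words_to_gaps(words, gaps))
# "eat." now ends at 4.05 and "So" starts at 4.45
```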

Now, there are some words in various languages where you have a glottal stop in the middle of a word, where silence doesn't actually indicate a word boundary, but in general, the larger that gap is, the more likely it is to indicate a word boundary. That's true of even the very large gap at the start of the segment.

A related consideration here is that you don't want multiple words snapping to the same signpost from the same side. Even with the current implementation, there may be a similar issue where the raw timestamps you get out of Whisper sometimes cause multiple words to collide in undesirable ways, so that's still an issue in its own right worth looking into. Currently I think it ends up merging those words when it really shouldn't. I have encountered examples where "eat. So" was merged into one word, probably because of inaccurate or overlapping timestamps. A full stop/period is a perfect example of where you might want to use these silent gaps as signposts to figure out the most likely timestamp for the beginning of the next sentence, rather than discarding that information just because the timestamp closest to the boundary happens to fall on the silent side of it.

A cheap solution would be to just add some padding to these speech regions, but that padding would also end up losing the signal of some of the smaller silent gaps between some words in the middle of a segment, particularly ones where there is a full stop/period in the middle of the segment.

tohe91 commented 1 year ago

I share your observations, especially on multiple words sharing the same timestamp, which occurs very frequently in my use cases. I have set suppress_silence = False and suppress_middle = False to get more reliable results, but of course that just creates other inaccuracies. I think your proposed solution could certainly help to stabilize the results and avoid tight word clusters.

jianfch commented 1 year ago

You raised interesting points. Off the top of my head, I have three solutions to address some of the issues brought up. Let's combine the two examples you brought up, and make it less favorable for the current suppression logic by shifting the 3rd candidate slightly earlier.

1  2             3      4  5   6      7
|  |             |      |  |   |      |
                   ######## ### ###### #############                  

1. Average pooling the timestamp probability distribution (a rough sketch of this step follows after the list). Then it becomes something like this, where the posts with a number on top have the higher probabilities:

1  2             3      4  5   6      7
|||||           |||    |||||| |||    |||
                   ######## ### ###### #############                  

2. Min pooling the suppression mask. This way it only suppresses up to slightly before the start of a sound/speech, which leaves room for timestamps right at the edges to be chosen instead of just ignored (without min pooling, the small gaps at 5, 6, 7 would have been suppressed).

                  3     4  5   6      7
                  |    |||||| |||    |||
                   ######## ### ###### #############                  

3. Suppressing non-gaps and min pooling the suppression mask for that as well.

                  3        5   6      7
                  |       ||| |||    |||
                   ######## ### ###### #############                  

The last method will not work as well for audio that has more than just speech. If the 4th candidate is actually a gap in speech, but there is background noise or some other loud sound that occurs during the gap, then applying this step actually makes it perform worse. I would have this disabled for word-level timestamps.
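
Going back to the first method, the average pooling could be read roughly like this (a sketch only, not the code in this repo; the kernel size and shapes are arbitrary):

```python
# Sketch: smooth the timestamp-token probabilities with a sliding average so
# candidates just outside a speech onset still end up with high probability.
import torch
import torch.nn.functional as F

probs = torch.rand(1, 1, 1501)   # probability of each timestamp token (20 ms steps)
k = 5                            # pooling window

smoothed = F.avg_pool1d(probs, kernel_size=k, stride=1, padding=k // 2)
assert smoothed.shape == probs.shape
```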

ryanheise commented 1 year ago

Regarding background noise, note that if I used this I would definitely be replacing the suppression mask with actual VAD like silero-vad. I assumed the reason you chose not to use this was that you were just showing a proof-of-concept, but with the real intention to use proper VAD, and making it easy for us to swap out that code for real VAD should we choose to.
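
For reference, this is roughly the standard silero-vad usage I have in mind (its published API; not something stable-ts does today):

```python
# Sketch: get speech regions from Silero VAD to use in place of the silence mask.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("audio.wav", sampling_rate=16000)
speech_regions = get_speech_timestamps(wav, model, sampling_rate=16000)
# e.g. [{'start': 13520, 'end': 33120}, ...] in samples; everything outside
# these regions would be the silent gaps discussed above
```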

Regarding your ideas, I don't exactly understand how to interpret your average pooling and min pooling. Just to be precise here, what are the actual pools/batches that you take the average or min of? In the min pooling case, it sounds like you would have a left-leaning bias, in that it would only help to find the start timestamps. I would like to see some effort to also detect the end timestamps by using the suppression mask or proper VAD. In my approach, each gap (by which I mean a silent, or non-speech region), has TWO boundaries, a left and a right one. The left boundary of a gap is an attractor for the end timestamp of the preceding word, while the right boundary of a gap is an attractor for the start timestamp of the next word. It is sort of like that feature in diagramming tools where you can "snap to grid", where here we are snapping to both the left and right boundaries.

As for "Suppressing non-gaps", I am also not clear on how to interpret "non-gap". If what you mean is that you want to suppress background noise that is not speech, then it is the same point as using a proper VAD.

As for average pooling the timestamp probability distribution, I am not really sure how this would work out in practice. It could be that "snap to vad" may still improve accuracy further on top of that.

Finally, regarding the problem where multiple words are merged together especially across sentence boundaries, I also want to suggest that in this situation the Whisper timestamps should probably not be respected at all and can be thrown out. The timestamps could be completely wrong, and yet it would still be safe for us to assume that a "." or a "。" should snap to the left boundary of a gap in the VAD, and the next word should start on the right boundary of that gap.
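
In code, the rule would be something like this (just a sketch with a made-up word/gap layout, to show the intent):

```python
# Sketch: at sentence boundaries, ignore Whisper's own timestamps and use the
# nearest silent gap instead (word/gap structures are hypothetical).
SENTENCE_END = (".", "。", "?", "？", "!", "！")

def snap_sentence_boundaries(words, gaps):
    for i in range(len(words) - 1):
        if not words[i]["word"].endswith(SENTENCE_END):
            continue
        for left, right in gaps:
            # first gap lying between this word and the next
            if words[i]["start"] < left and right < words[i + 1]["end"]:
                words[i]["end"] = left         # "." snaps to the gap's left edge
                words[i + 1]["start"] = right  # next word starts at the right edge
                break
    return words
```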

P.S. An afterthought here. I wonder whether these more accurate timestamps are actually fed into the prompt for the next segment, to help the Whisper model get more on track. I don't think it does that.

ryanheise commented 1 year ago

Another thought.

Consider the following audio file with two sentences. Again, # indicates a probable speech signal inferred by VAD or other means, and | indicates a candidate timestamp for the start of the second sentence.

                      1 2             34          5
                      | |             ||          |
               ########   #   #                    ################
                  A       B   C                            D

               I am bob                            Nice to meet you

B and C are false positives in the speech detection. A and D are the correct designations for the first and second sentences.

The following shows how stable-ts can get this wrong:

                      1
                      |
               ########                            ################
                  A                                        B

               I am bob                            Nice to meet you

stable-ts inference:

               ########|###########################################
               I am bob  N  i  c  e    t o    m  e  e  t    y  o  u

For some reason, all the words in the second sentence are spread out. The same thing occurs in Whisper when no measures are taken to mask the large silence that may exist before the first word in a sentence.

Now it is obvious to us that right after candidate timestamp 1 there is mostly silence for a very long time, making it actually a poor candidate. The second sentence has enough speech in it that it needs a sizable amount of speech signal after its start to be properly placed in that region. If we just look at the start timestamp alone, though, we can't see that.

We can estimate the average token duration by taking the total duration of all speech signals and dividing that by the total number of tokens in the audio file as inferred by Whisper. From that we can estimate the expected duration of sentence 2. And using that, we can figure out that candidate 5 is the only start timestamp that has a sufficient amount of voice signal after it to be able to fit the number of inferred tokens for that sentence.
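
As a sketch (with made-up numbers, and reading "a sufficient amount of voice signal after it" as "enough speech inside the window the sentence would occupy"):

```python
# Sketch: pick the start-timestamp candidate whose following window contains
# enough detected speech to fit the sentence's tokens (inputs are hypothetical).

def speech_within(start, end, speech_regions):
    """Seconds of detected speech inside [start, end]."""
    return sum(max(0.0, min(r_end, end) - max(r_start, start))
               for r_start, r_end in speech_regions)

def pick_start_candidate(candidates, sentence_tokens, speech_regions, total_tokens):
    total_speech = sum(end - start for start, end in speech_regions)
    avg_token_dur = total_speech / total_tokens   # estimated seconds per token
    needed = sentence_tokens * avg_token_dur      # expected sentence duration
    for t in sorted(candidates):
        if speech_within(t, t + needed, speech_regions) >= 0.9 * needed:
            return t
    return max(candidates)  # fall back to the latest candidate

# From the diagram: "I am bob", two short false positives, then "Nice to meet you".
speech_regions = [(0.0, 2.0), (2.8, 3.0), (3.6, 3.8), (8.0, 12.0)]
candidates = [2.1, 2.3, 5.0, 5.1, 8.0]
print(pick_start_candidate(candidates, sentence_tokens=4,
                           speech_regions=speech_regions, total_tokens=7))
# only the last candidate (8.0) has ~4 s of speech right after it
```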

jianfch commented 1 year ago

_ is where the timestamp tokens will be suppressed. # is non-silence. The silence suppression works like this right now:

                      #######    ##### #### ###     ####
______________________       ____     _    _   _____    

After min pooling it will look something like this:

                      #######    ##### #### ###     ####
_____________________         __                ___     
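
In code, one way to read that min pooling is as eroding the suppression mask with a sliding window, roughly like this (a sketch, not the actual implementation; the kernel size is arbitrary):

```python
# Sketch: erode the suppression mask so a frame stays suppressed only if its
# whole neighborhood is silent; frames at the edges of speech become usable.
import torch
import torch.nn.functional as F

suppress = torch.tensor([[[1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0]]],
                        dtype=torch.float32)   # 1 = silent frame to suppress
k = 3
eroded = -F.max_pool1d(-suppress, kernel_size=k, stride=1, padding=k // 2)
print(eroded.int().squeeze().tolist())
# [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
```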

I had considered using VAD in place of the silence mask, but adding another network to the pipeline adds more complexity, and the failures of both models will stack. You can never be certain that VAD gives you correct results, but you can always determine with certainty whether a part is silent or not.

ryanheise commented 1 year ago

Thanks for the example. The end result looks similar to simple padding (which also has the downside I brought up earlier), but I still don't understand the calculation, because I don't understand precisely which batches/pools are being min'ed.

But that still doesn't leverage much of the valuable timing information that's available in the VAD or silence mask described above.

Regarding VAD, the problem is that most of the audio I deal with has a longish intro with music, and plain silence detection never handles that correctly. There can also be different sections of the whole audio that each have their own intro music. Silero-vad is quite reliable and overall improves the accuracy, which is the main metric. Either way, the timestamps will still be off sometimes of course, but with Silero-vad they are comparatively far more accurate, so it's an improvement that doesn't lose any parts of the transcript but just results in more accurate timestamps overall.

jianfch commented 1 year ago

To deal with music, you can use a music source separation model to preprocess the audio (e.g. https://github.com/facebookresearch/demucs).
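
For example (a sketch; the exact output path depends on the Demucs version and model you run):

```python
# Sketch: separate vocals with Demucs, then transcribe only the vocal stem.
import subprocess

subprocess.run(["demucs", "--two-stems=vocals", "audio.mp3"], check=True)
# the vocals end up somewhere like separated/htdemucs/audio/vocals.wav;
# feed that file to stable-ts / Whisper instead of the original mix
```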