jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Two large issues with .align() and .transcribe() #268

Closed torrinworx closed 10 months ago

torrinworx commented 10 months ago

Hi there, just learning about stable-ts for a project of mine, and I've noticed two issues with the transcribe and align functions.

When using align on an mp3 song file, I noticed that the timestamps listed in result.ori_dict["segments"][0]["words"] are out of sync if the audio has gaps of silence in it:

        result = self.model.align(
            audio=file_path,
            text=cleaned_lyrics,
            language='en',
            # verbose=False,
            # regroup=False,  # Use the default regrouping algorithm
            # suppress_silence=True,  # Enable timestamp adjustments based on detected silence
            # suppress_word_ts=True,  # Adjust word timestamps based on detected silence
            # min_word_dur=0.1,  # Minimum word duration
            # q_levels=20,  # Quantization levels for silence detection
            # k_size=5,  # Kernel size for pooling waveform
            vad=True,  # Enable voice activity detection for improved silence/speech discrimination
            # vad_threshold=0.10,  # Threshold for detecting speech with VAD
            # remove_instant_words=False,  # Keep words with very short durations
            # token_step=100,  # Max number of tokens to align each pass
            # original_split=False,  # Do not preserve original segment groupings
            # word_dur_factor=2.0,  # Factor for local max word duration
            # max_word_dur=3.0,  # Global max word duration
            # fast_mode=False,  # Disable fast mode for more accurate alignment
            demucs=True, # Isolate vocals with Demucs; it is also effective at isolating vocals even if there is no music
            min_word_dur=0
        )
        print("Alignment completed. Saving result...")

        words = result.ori_dict["segments"][0]["words"]
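The words are then formatted into .lrc lines with a small helper of my own (to_lrc below is a sketch of that formatting code, not part of stable-ts):

```python
def to_lrc(words):
    """Format word dicts ({"word", "start", ...}) as [mm:ss.xx] lines."""
    lines = []
    for w in words:
        minutes, seconds = divmod(w["start"], 60)
        lines.append(f"[{int(minutes):02d}:{seconds:05.2f}] {w['word'].strip()}")
    return "\n".join(lines)
```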

.lrc sample output:

[00:02.92] Nobody   <---- In the audio this is said around the 4 second mark
[00:05.87] likes
[00:06.98] you
[00:07.70] Everyone
[00:08.82] left
[00:10.32] you
[00:11.08] Theyre
[00:11.67] all
[00:12.06] out
[00:12.82] without
[00:13.51] you
[00:14.48] Having
[00:15.67] fun             <---- Large gap in singing here for around 20 seconds
[00:17.07] Where        <---- Said around the 40 second mark.
[00:18.76] have
[00:21.42] all
[00:21.42] the
[00:21.42] bastards
[00:21.42] gone

I've tried various settings, including the ones recommended in the documentation, but the timestamps remained the same no matter what I did.

And the transcribe method seems to have a different issue: the timestamp in the "end" key correctly marks when the word actually begins, but the "start" key is meaningless and falls well before the word is said/sung. I don't really know what's going on here either:

Code used:

    def _transcribe(self, file_path):
        print("Lyrics not found. Starting transcription without alignment...")
        result = self.model.transcribe(file_path)
        print("Transcription completed. Saving result...")

        words = result.ori_dict["segments"][0]["words"]

Sample from audio.json output from result.save_as_json():

...
                {
                    "word": " Nobody",
                    "start": 1.0,
                    "end": 5.82,                                        <---- Correct start time when "Nobody" is sung
                    "probability": 0.04231414198875427,
                    "tokens": [
                        9297
                    ],
                    "segment_id": 0,
                    "id": 0
                },
...

Maybe this is just an issue with ori_dict? Or some option I haven't set? It feels like I've done something obviously wrong; I'd really appreciate another set of eyes on this. Love the library!

torrinworx commented 10 months ago

Ok, so I tried using the words found in the "segments" key rather than ori_dict, and that solved the initial problem of start times being off by a few seconds. However, the main issue remains: breaks in the singing/words in the audio still aren't accounted for, even with settings like VAD and Demucs enabled.
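For anyone following along, the switch amounts to reading the adjusted words from the top-level "segments" instead of "ori_dict". A sketch against the saved JSON, assuming the schema shown in this thread:

```python
import json

def load_adjusted_words(json_path):
    """Read word dicts from the post-processed segments of a saved result.
    result["ori_dict"] holds the raw, unadjusted timestamps instead."""
    with open(json_path) as f:
        result = json.load(f)
    return [w for seg in result["segments"] for w in seg["words"]]
```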

torrinworx commented 10 months ago

My goal here is to transcribe any music file given input lyrics retrieved from an online database for each song. Those lyrics are typically returned in .lrc format; my script strips the timestamps and leaves just the segmented lyrics with line breaks.

That string is fed as text into the align method. However, whenever there is an instrumental break in the song without singing, the align method just assumes the words after the break happen immediately after the words before the break are sung:

        {
            "start": 4.88,
            "end": 17.46,
            "text": " Nobody likes you Everyone left you They're all out without you Having fun  Where",
            "seek": null,
            "tokens": [
                9297,
                5902,
                291,
                5198,
                1411,
                291,
                814,
                434,
                439,
                484,
                1553,
                291,
                10222,
                1019,
                220,
                2305
            ],
            "temperature": null,
            "avg_logprob": null,
            "compression_ratio": null,
            "no_speech_prob": null,
            "words": [
                {
                    "word": " Nobody",
                    "start": 4.88,
                    "end": 5.88,
                    "probability": 0.5418283939361572,
                    "tokens": [
                        9297
                    ],
                    "segment_id": 0,
                    "id": 0
                },
                {
                    "word": " likes",
                    "start": 5.88,
                    "end": 6.96,
                    "probability": 0.9691749215126038,
                    "tokens": [
                        5902
                    ],
                    "segment_id": 0,
                    "id": 1
                },
                {
                    "word": " you",
                    "start": 6.96,
                    "end": 7.7,
                    "probability": 0.9978371262550354,
                    "tokens": [
                        291
                    ],
                    "segment_id": 0,
                    "id": 2
                },
                {
                    "word": " Everyone",
                    "start": 7.7,
                    "end": 8.82,
                    "probability": 0.3230188488960266,
                    "tokens": [
                        5198
                    ],
                    "segment_id": 0,
                    "id": 3
                },
                {
                    "word": " left",
                    "start": 8.86,
                    "end": 10.34,
                    "probability": 0.8787604570388794,
                    "tokens": [
                        1411
                    ],
                    "segment_id": 0,
                    "id": 4
                },
                {
                    "word": " you",
                    "start": 10.34,
                    "end": 11.08,
                    "probability": 0.9931628704071045,
                    "tokens": [
                        291
                    ],
                    "segment_id": 0,
                    "id": 5
                },
                {
                    "word": " They're",
                    "start": 11.08,
                    "end": 11.68,
                    "probability": 0.9571583271026611,
                    "tokens": [
                        814,
                        434
                    ],
                    "segment_id": 0,
                    "id": 6
                },
                {
                    "word": " all",
                    "start": 11.68,
                    "end": 12.02,
                    "probability": 0.9958831071853638,
                    "tokens": [
                        439
                    ],
                    "segment_id": 0,
                    "id": 7
                },
                {
                    "word": " out",
                    "start": 12.02,
                    "end": 12.8,
                    "probability": 0.9923328161239624,
                    "tokens": [
                        484
                    ],
                    "segment_id": 0,
                    "id": 8
                },
                {
                    "word": " without",
                    "start": 12.8,
                    "end": 13.52,
                    "probability": 0.9911909699440002,
                    "tokens": [
                        1553
                    ],
                    "segment_id": 0,
                    "id": 9
                },
                {
                    "word": " you",
                    "start": 13.52,
                    "end": 14.48,
                    "probability": 0.9982377290725708,
                    "tokens": [
                        291
                    ],
                    "segment_id": 0,
                    "id": 10
                },
                {
                    "word": " Having",
                    "start": 14.48,
                    "end": 15.7,
                    "probability": 0.8765347003936768,
                    "tokens": [
                        10222
                    ],
                    "segment_id": 0,
                    "id": 11
                },
                {
                    "word": " fun",
                    "start": 15.7,
                    "end": 17.08,
                    "probability": 0.9730941653251648,
                    "tokens": [
                        1019
                    ],
                    "segment_id": 0,
                    "id": 12
                },
                {
                    "word": "  Where",
                    "start": 17.08,
                    "end": 17.46,
                    "probability": 0.00011141821192950374,
                    "tokens": [
                        220,
                        2305
                    ],
                    "segment_id": 0,
                    "id": 13
                }
            ],
            "id": 0
        },
torrinworx commented 10 months ago

The .transcribe() method appears to be more accurate in terms of word timestamps, even across instrumental breaks or stretches without singing/talking.

However, its transcription of the audio isn't accurate enough to line up with the song's actual lyrics.

Align is just not accurate at all in terms of timestamps, for some reason. Is there a method I'm missing that could resolve this issue?

jianfch commented 10 months ago

The updated non-speech suppression in 191674beefdddbce026732d5fd93026f85c26772 should help. Try updating stable-ts to 2.14.0+. See https://github.com/jianfch/stable-ts?#silence-suppression.

Another option that can help is to increase the shifts for Demucs with demucs_options=dict(shifts=5) (this will increase processing time). https://github.com/facebookresearch/demucs/blob/e976d93ecc3865e5757426930257e200846a520a/demucs/apply.py#L158-L161

You might also want to make the result deterministic when comparing different runs with demucs=True by setting the same seed each time you transcribe or align.

import random
random.seed(0)
torrinworx commented 10 months ago

Thank you for the response! Ok, so I've given that a try without much luck; some words that come after a long pause in speech are still being grouped with the "before pause" speech:

[00:04.99] [00:05.66] Nobody
[00:05.92] [00:06.96] likes
[00:07.49] [00:07.70] you
[00:07.70] [00:08.82] Everyone
[00:08.82] [00:10.33] left
[00:10.33] [00:11.08] you
[00:11.08] [00:11.67] They're
[00:11.67] [00:12.03] all
[00:12.03] [00:12.80] out
[00:12.80] [00:13.51] without
[00:13.51] [00:14.48] you
[00:14.48] [00:15.72] Having
[00:15.72] [00:17.07] fun                              <----- Long pause after this is spoken
[00:17.07] [00:17.91] Where                         <----- Should start at 45.058
[00:18.64] [00:19.85] have
[00:19.85] [00:19.85] all
[00:19.85] [00:19.85] the
[00:19.85] [00:19.85] bastards
[00:19.85] [00:19.85] gone?

The thing is, I am seeing the nonspeech_sections array in the audio.json file after updating the library. However, the word timestamps just aren't being adjusted with these values after the first word:

audio.json
    "segments": [
        {
            "start": 4.994, "end": 7.7,
            "text": " Nobody likes you",
            "seek": null, "tokens": [9297, 5902, 291],
            "temperature": null, "avg_logprob": null, "compression_ratio": null, "no_speech_prob": null,
            "words": [
                {"word": " Nobody", "start": 4.994, "end": 5.662, "probability": 0.5390022993087769, "tokens": [9297], "segment_id": 0, "id": 0},
                {"word": " likes", "start": 5.92, "end": 6.96, "probability": 0.9664590954780579, "tokens": [5902], "segment_id": 0, "id": 1},
                {"word": " you", "start": 7.49, "end": 7.7, "probability": 0.997747004032135, "tokens": [291], "segment_id": 0, "id": 2}
            ],
            "id": 0
        },
        {
            "start": 7.7, "end": 11.08,
            "text": " Everyone left you",
            "seek": null, "tokens": [5198, 1411, 291],
            "temperature": null, "avg_logprob": null, "compression_ratio": null, "no_speech_prob": null,
            "words": [
                {"word": " Everyone", "start": 7.7, "end": 8.82, "probability": 0.3145016133785248, "tokens": [5198], "segment_id": 1, "id": 0},
                {"word": " left", "start": 8.82, "end": 10.34, "probability": 0.8860872387886047, "tokens": [1411], "segment_id": 1, "id": 1},
                {"word": " you", "start": 10.34, "end": 11.08, "probability": 0.9943570494651794, "tokens": [291], "segment_id": 1, "id": 2}
            ],
            "id": 1
        },
        {
            "start": 11.08, "end": 14.48,
            "text": " They're all out without you",
            "seek": null, "tokens": [814, 434, 439, 484, 1553, 291],
            "temperature": null, "avg_logprob": null, "compression_ratio": null, "no_speech_prob": null,
            "words": [
                {"word": " They're", "start": 11.08, "end": 11.68, "probability": 0.9578579366207123, "tokens": [814, 434], "segment_id": 2, "id": 0},
                {"word": " all", "start": 11.68, "end": 12.04, "probability": 0.996057391166687, "tokens": [439], "segment_id": 2, "id": 1},
                {"word": " out", "start": 12.04, "end": 12.8, "probability": 0.9925025701522827, "tokens": [484], "segment_id": 2, "id": 2},
                {"word": " without", "start": 12.8, "end": 13.52, "probability": 0.9914660453796387, "tokens": [1553], "segment_id": 2, "id": 3},
                {"word": " you", "start": 13.52, "end": 14.48, "probability": 0.9983866214752197, "tokens": [291], "segment_id": 2, "id": 4}
            ],
            "id": 2
        },
        {
            "start": 14.48, "end": 17.08,
            "text": " Having fun",
            "seek": null, "tokens": [10222, 1019],
            "temperature": null, "avg_logprob": null, "compression_ratio": null, "no_speech_prob": null,
            "words": [
                {"word": " Having", "start": 14.48, "end": 15.72, "probability": 0.8722978830337524, "tokens": [10222], "segment_id": 3, "id": 0},
                {"word": " fun", "start": 15.72, "end": 17.08, "probability": 0.973724901676178, "tokens": [1019], "segment_id": 3, "id": 1}
            ],
            "id": 3
        },
        {
            "start": 17.08, "end": 17.918,
            "text": " Where",
            "seek": null, "tokens": [220, 2305],
            "temperature": null, "avg_logprob": null, "compression_ratio": null, "no_speech_prob": null,
            "words": [
                {"word": " Where", "start": 17.08, "end": 17.918, "probability": 0.00011648094644556295, "tokens": [220, 2305], "segment_id": 4, "id": 0}
            ],
            "id": 4
        },
        {
            "start": 18.64, "end": 19.86,
            "text": " have all the bastards gone?",
            "seek": null, "tokens": [362, 439, 264, 49346, 2780, 30],
            "temperature": null, "avg_logprob": null, "compression_ratio": null, "no_speech_prob": null,
            "words": [
                {"word": " have", "start": 18.64, "end": 19.86, "probability": 0.025606125593185425, "tokens": [362], "segment_id": 5, "id": 0},
                {"word": " all", "start": 19.86, "end": 19.86, "probability": 0.011579000391066074, "tokens": [439], "segment_id": 5, "id": 1},
                {"word": " the", "start": 19.86, "end": 19.86, "probability": 0.39320108294487, "tokens": [264], "segment_id": 5, "id": 2},
                {"word": " bastards", "start": 19.86, "end": 19.86, "probability": 0.00022167911811266094, "tokens": [49346], "segment_id": 5, "id": 3},
                {"word": " gone?", "start": 19.86, "end": 19.86, "probability": 0.8716632425785065, "tokens": [2780, 30], "segment_id": 5, "id": 4}
            ],
            "id": 5
        },
        ...
    ],
    "nonspeech_sections": [
        {"start": 0.0, "end": 4.994},
        {"start": 5.662, "end": 7.49},
        {"start": 17.918, "end": 34.53},
        {"start": 34.814, "end": 45.058},
        {"start": 47.39, "end": 47.458},
        ...

Note that the first nonspeech break of 4.994 is applied to the first segment, but after that it stops for some reason.

The above output is a result of the following setup:

    ...
    import random
    random.seed(0)
    ...

    lyrics = """
    Nobody likes you
    Everyone left you
    They're all out without you
    Having fun

    Where have all the bastards gone?
    """

    result = self.model.align(
        audio=file_path,
        text=lyrics,
        language='en',
        vad=True,
        demucs=True,
        demucs_options=dict(shifts=5),
        original_split=True,
        regroup=True,
        suppress_silence=True,
        suppress_word_ts=True,
        nonspeech_error=0.3
    )

The full implementation can be found here if you're interested: https://github.com/torrinworx/sound-snuggler/blob/a96e7b3bb156ea2ae268cf75eca307ead5cec9b9/scripts/transcription_handler.py#L81
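In the meantime, since nonspeech_sections is present in the JSON, one workaround I'm considering is pushing word starts out of detected silence after the fact. This is my own post-processing sketch, not a stable-ts feature:

```python
def snap_out_of_silence(words, nonspeech_sections):
    """If a word's start falls inside a detected non-speech section,
    move it to that section's end (a rough post-hoc correction)."""
    fixed = []
    for w in words:
        w = dict(w)  # avoid mutating the caller's dicts
        for ns in nonspeech_sections:
            if ns["start"] <= w["start"] < ns["end"]:
                w["start"] = ns["end"]
                w["end"] = max(w["end"], w["start"])
                break
        fixed.append(w)
    return fixed
```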

jianfch commented 10 months ago

nonspeech_skip, added in 738fd98490584c492cf2f7873bdddaf7a0ec9d40, can help. It skips non-speech sections longer than the specified amount; the default is 3 seconds. But keep in mind that if nonspeech_skip is set too low, it will try to align many small sections, which performs worse than disabling nonspeech_skip.

The default use_word_position=True (also added in 738fd98490584c492cf2f7873bdddaf7a0ec9d40) works better if you keep the lines of lyrics separated by line breaks and use original_split=True so that it has word positions to work with.

The change that will likely help the most is to use the base model instead of large-v3. From my limited testing, the larger models hallucinate more than the smaller ones during alignment.

You can also use result.clamp_max() as a final step to clean up the starting timestamps of the segments.
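Putting those suggestions together, the call would look roughly like this (a usage sketch assuming stable-ts 2.14+ and the lyrics string from earlier in the thread; some defaults are written out explicitly for clarity):

```python
import stable_whisper

model = stable_whisper.load_model('base')  # smaller models hallucinate less for alignment

result = model.align(
    'audio.mp3',
    lyrics,                  # lyric lines separated by line breaks
    language='en',
    original_split=True,     # keep the line breaks as segment boundaries
    nonspeech_skip=3.0,      # skip non-speech sections longer than 3 s (the default)
    use_word_position=True,  # the default in 2.14+
    vad=True,
    demucs=True,
)
result.clamp_max()           # clean up the segments' starting timestamps
```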

torrinworx commented 10 months ago

Aw dude, perfect! I switched to the base model and updated the package. You're awesome, thank you so much for the help! Everything is working now.