jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Alignment problem for 30+ seconds silence #228

Trsa993 closed this issue 10 months ago

Trsa993 commented 1 year ago

Alignment breaks when the audio contains more than 30 seconds of silence, i.e. when a chunk contains no text.

For example:

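import stable_whisper

# `model` and `text` are assumed to be defined already (not shown in this report).
result = stable_whisper.alignment.align(model,
                                        "/notebooks/The Unforgiven II.mp3",
                                        text,
                                        regroup=False,
                                        language="en",
                                        demucs=True,
                                        original_split=True).to_dict()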

I get this:

'segments': [{'text': " Lay beside me and tell me what they've done",
   'start': 0.0,
   'end': 18.18,
   'words': [{'text': ' Lay', 'start': 0.0, 'end': 17.86},
    {'text': ' beside', 'start': 17.86, 'end': 18.18},
    {'text': ' me', 'start': 18.18, 'end': 18.18},
    {'text': ' and', 'start': 18.18, 'end': 18.18},
    {'text': ' tell', 'start': 18.18, 'end': 18.18},
    {'text': ' me', 'start': 18.18, 'end': 18.18},
    {'text': ' what', 'start': 18.18, 'end': 18.18},
    {'text': " they've", 'start': 18.18, 'end': 18.18},
    {'text': ' done', 'start': 18.18, 'end': 18.18}]},
  {'text': ' And speak the words I wanna hear, to make my demons run',
   'start': 18.18,
   'end': 18.3,
   'words': [{'text': ' And', 'start': 18.18, 'end': 18.2},
    {'text': ' speak', 'start': 18.2, 'end': 18.22},
    {'text': ' the', 'start': 18.22, 'end': 18.24},
    {'text': ' words', 'start': 18.24, 'end': 18.3},
    {'text': ' I', 'start': 18.3, 'end': 18.3},
    {'text': ' wanna', 'start': 18.3, 'end': 18.3},
    {'text': ' hear,', 'start': 18.3, 'end': 18.3},
    {'text': ' to', 'start': 18.3, 'end': 18.3},
    {'text': ' make', 'start': 18.3, 'end': 18.3},
    {'text': ' my', 'start': 18.3, 'end': 18.3},
    {'text': ' demons', 'start': 18.3, 'end': 18.3},
    {'text': ' run', 'start': 18.3, 'end': 18.3}]}

but the first line should start after about 1 minute.

Note that the problem occurs whenever there are 30 seconds of silence anywhere in the audio, not just at the beginning.
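To make the symptom easier to spot in longer results, here is a minimal sketch that lists words whose timestamps have collapsed to (near) zero duration. It only assumes the `to_dict()` structure shown in the output above (`segments`, each with `words` carrying `text`, `start`, `end`). The helper name `find_collapsed_words` and the `min_duration` threshold are hypothetical and not part of stable-ts.

```python
def find_collapsed_words(result, min_duration=0.01):
    """Return (segment_index, word_text, start, end) for every word
    whose duration is shorter than min_duration seconds."""
    collapsed = []
    for seg_idx, segment in enumerate(result.get("segments", [])):
        for word in segment.get("words", []):
            duration = word["end"] - word["start"]
            if duration < min_duration:
                collapsed.append(
                    (seg_idx, word["text"], word["start"], word["end"])
                )
    return collapsed


# Small sample taken from the output in this report.
sample = {
    "segments": [
        {
            "text": " Lay beside me",
            "start": 0.0,
            "end": 18.18,
            "words": [
                {"text": " Lay", "start": 0.0, "end": 17.86},
                {"text": " beside", "start": 17.86, "end": 18.18},
                {"text": " me", "start": 18.18, "end": 18.18},
            ],
        }
    ]
}

print(find_collapsed_words(sample))
```

A long run of flagged words ending at the same timestamp, as in the output above, suggests the words were squeezed against a chunk boundary rather than aligned to actual speech.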

Is it possible to set the chunk length for alignment so that it covers the whole audio (the whole mel spectrogram)?

jianfch commented 1 year ago

This is a limitation of the current alignment algorithm, which is still being worked on, as discussed in https://github.com/jianfch/stable-ts/issues/222#issuecomment-1769845522.