Alignment problem for 30+ seconds silence

Alignment goes wrong when the audio contains more than 30 seconds of silence, i.e. when a chunk contains no text.

For example:

result = stable_whisper.alignment.align(model,
                                        "/notebooks/The Unforgiven II.mp3",
                                        text,
                                        regroup=False,
                                        language="en",
                                        demucs=True,
                                        original_split=True).to_dict()

I get this:

'segments': [{'text': " Lay beside me and tell me what they've done",
   'start': 0.0,
   'end': 18.18,
   'words': [{'text': ' Lay', 'start': 0.0, 'end': 17.86},
    {'text': ' beside', 'start': 17.86, 'end': 18.18},
    {'text': ' me', 'start': 18.18, 'end': 18.18},
    {'text': ' and', 'start': 18.18, 'end': 18.18},
    {'text': ' tell', 'start': 18.18, 'end': 18.18},
    {'text': ' me', 'start': 18.18, 'end': 18.18},
    {'text': ' what', 'start': 18.18, 'end': 18.18},
    {'text': " they've", 'start': 18.18, 'end': 18.18},
    {'text': ' done', 'start': 18.18, 'end': 18.18}]},
  {'text': ' And speak the words I wanna hear, to make my demons run',
   'start': 18.18,
   'end': 18.3,
   'words': [{'text': ' And', 'start': 18.18, 'end': 18.2},
    {'text': ' speak', 'start': 18.2, 'end': 18.22},
    {'text': ' the', 'start': 18.22, 'end': 18.24},
    {'text': ' words', 'start': 18.24, 'end': 18.3},
    {'text': ' I', 'start': 18.3, 'end': 18.3},
    {'text': ' wanna', 'start': 18.3, 'end': 18.3},
    {'text': ' hear,', 'start': 18.3, 'end': 18.3},
    {'text': ' to', 'start': 18.3, 'end': 18.3},
    {'text': ' make', 'start': 18.3, 'end': 18.3},
    {'text': ' my', 'start': 18.3, 'end': 18.3},
    {'text': ' demons', 'start': 18.3, 'end': 18.3},
    {'text': ' run', 'start': 18.3, 'end': 18.3}]}

but the first words should only start after about 1 minute.
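To make the symptom easier to spot, here is a minimal sketch that lists words whose start and end are identical. It assumes only the dictionary layout shown in the output above; `find_collapsed_words` is a hypothetical helper of mine, not part of stable-ts.

```python
def find_collapsed_words(result, min_duration=0.0):
    """Return (segment_index, text, start, end) for every word whose
    duration is less than or equal to min_duration (in seconds)."""
    collapsed = []
    for seg_idx, segment in enumerate(result.get("segments", [])):
        for word in segment.get("words", []):
            if word["end"] - word["start"] <= min_duration:
                collapsed.append(
                    (seg_idx, word["text"], word["start"], word["end"])
                )
    return collapsed


# Shortened copy of the first segment from the output above.
sample = {
    "segments": [
        {
            "text": " Lay beside me",
            "start": 0.0,
            "end": 18.18,
            "words": [
                {"text": " Lay", "start": 0.0, "end": 17.86},
                {"text": " beside", "start": 17.86, "end": 18.18},
                {"text": " me", "start": 18.18, "end": 18.18},
            ],
        }
    ]
}

for seg_idx, text, start, end in find_collapsed_words(sample):
    print(f"segment {seg_idx}: {text!r} collapsed at {start}-{end}")
```

On the full result, nearly every word after the first two of each segment would be reported, which matches what I see.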

Note that the problem occurs wherever there is a silence of 30 seconds or more, not just at the beginning.
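To illustrate which parts of a track would be affected, here is a rough sketch that finds 30-second windows containing no speech at all. It assumes fixed, non-overlapping 30-second windows (the Whisper input length); the actual chunking inside stable-ts may differ. The speech intervals are made-up values for illustration, and `empty_windows` is a hypothetical helper.

```python
CHUNK_SECONDS = 30.0  # Whisper's input window length


def empty_windows(speech_intervals, total_duration, chunk=CHUNK_SECONDS):
    """Return (start, end) for each fixed-size window that overlaps
    none of the given (start, end) speech intervals."""
    empty = []
    start = 0.0
    while start < total_duration:
        end = min(start + chunk, total_duration)
        has_speech = any(s < end and e > start for s, e in speech_intervals)
        if not has_speech:
            empty.append((start, end))
        start += chunk
    return empty


# Hypothetical speech intervals in seconds: a long intro and a silent
# stretch in the middle of the track.
speech = [(65.0, 110.0), (150.0, 200.0)]
print(empty_windows(speech, total_duration=200.0))
```

With these made-up intervals, both the intro and the gap in the middle produce windows with no text to align.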

Is it possible to set the chunk length to cover the whole audio for alignment (i.e. use the whole mel spectrogram)?

jianfch / stable-ts

Alignment problem for 30+ seconds silence #228