Closed Trsa993 closed 10 months ago
Alignment breaks when the audio contains more than 30 seconds of silence, i.e. when a chunk contains no text.
For example:
result = stable_whisper.alignment.align(model, "/notebooks/The Unforgiven II.mp3", text, regroup=False, language="en", demucs=True, original_split=True).to_dict()
I get this:
'segments': [{'text': " Lay beside me and tell me what they've done", 'start': 0.0, 'end': 18.18, 'words': [{'text': ' Lay', 'start': 0.0, 'end': 17.86}, {'text': ' beside', 'start': 17.86, 'end': 18.18}, {'text': ' me', 'start': 18.18, 'end': 18.18}, {'text': ' and', 'start': 18.18, 'end': 18.18}, {'text': ' tell', 'start': 18.18, 'end': 18.18}, {'text': ' me', 'start': 18.18, 'end': 18.18}, {'text': ' what', 'start': 18.18, 'end': 18.18}, {'text': " they've", 'start': 18.18, 'end': 18.18}, {'text': ' done', 'start': 18.18, 'end': 18.18}]}, {'text': ' And speak the words I wanna hear, to make my demons run', 'start': 18.18, 'end': 18.3, 'words': [{'text': ' And', 'start': 18.18, 'end': 18.2}, {'text': ' speak', 'start': 18.2, 'end': 18.22}, {'text': ' the', 'start': 18.22, 'end': 18.24}, {'text': ' words', 'start': 18.24, 'end': 18.3}, {'text': ' I', 'start': 18.3, 'end': 18.3}, {'text': ' wanna', 'start': 18.3, 'end': 18.3}, {'text': ' hear,', 'start': 18.3, 'end': 18.3}, {'text': ' to', 'start': 18.3, 'end': 18.3}, {'text': ' make', 'start': 18.3, 'end': 18.3}, {'text': ' my', 'start': 18.3, 'end': 18.3}, {'text': ' demons', 'start': 18.3, 'end': 18.3}, {'text': ' run', 'start': 18.3, 'end': 18.3}]}
The lyrics should actually start about one minute in.
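To make the symptom easier to spot, here is a minimal sketch that flags words whose timestamps have collapsed to zero duration. It only assumes the dict layout shown in the output above (`segments` → `words` → `text`/`start`/`end`); `find_collapsed_words` and `min_duration` are hypothetical names for illustration, not part of stable-ts.

```python
def find_collapsed_words(result, min_duration=0.0):
    """Return (segment_index, word_text, start, end) for every word whose
    duration (end - start) is at or below min_duration seconds."""
    collapsed = []
    for seg_idx, segment in enumerate(result.get("segments", [])):
        for word in segment.get("words", []):
            if word["end"] - word["start"] <= min_duration:
                collapsed.append((seg_idx, word["text"], word["start"], word["end"]))
    return collapsed


# Small excerpt shaped like the output above (hand-trimmed for illustration).
sample = {
    "segments": [
        {"text": " Lay beside me", "start": 0.0, "end": 18.18, "words": [
            {"text": " Lay", "start": 0.0, "end": 17.86},
            {"text": " beside", "start": 17.86, "end": 18.18},
            {"text": " me", "start": 18.18, "end": 18.18},
        ]},
    ]
}

print(find_collapsed_words(sample))
```

Running this on the full result should list nearly every word of the first two lines, which is a quick sign that they were squeezed against the chunk boundary rather than aligned.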
Note that the problem occurs whenever there is a silence of 30 seconds or more, not just at the beginning.
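To check in advance where this might happen, a rough sketch like the following could locate long silent stretches. It assumes the vocals have already been isolated and loaded as a sequence of float samples in [-1, 1]. The amplitude `threshold` and the function name `find_long_silences` are my own assumptions, and this is not how stable-ts itself detects silence.

```python
def find_long_silences(samples, sample_rate, min_silence=30.0, threshold=1e-3):
    """Return (start_sec, end_sec) for each run of samples with
    |amplitude| < threshold that lasts at least min_silence seconds."""
    silences = []
    run_start = None  # index where the current quiet run began, if any
    for i, s in enumerate(samples):
        if abs(s) < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # The quiet run ended at sample i; keep it if it was long enough.
            if (i - run_start) / sample_rate >= min_silence:
                silences.append((run_start / sample_rate, i / sample_rate))
            run_start = None
    # Handle a quiet run that extends to the end of the audio.
    if run_start is not None and (len(samples) - run_start) / sample_rate >= min_silence:
        silences.append((run_start / sample_rate, len(samples) / sample_rate))
    return silences


# Toy signal at 10 Hz: 40 s quiet, 5 s loud, 10 s quiet.
toy_rate = 10
toy = [0.0] * 400 + [0.5] * 50 + [0.0] * 100
print(find_long_silences(toy, toy_rate))
```

Only the 40-second stretch is reported here; the trailing 10 seconds is below the 30-second cutoff.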
Is it possible to set the chunk length to cover the whole audio (the whole mel spectrogram) for alignment?
This is a limitation of the current alignment algorithm, which is still being worked on, as discussed in https://github.com/jianfch/stable-ts/issues/222#issuecomment-1769845522.