Ok, so I tried using the words found in the "segments" key rather than ori_dict, and that seemed to solve the initial problem with start times being off by a few seconds. However, the main issue, where breaks in the singing/words in the audio aren't accounted for, persists even when using settings like VAD and Demucs.
My goal is to align any music file against lyrics retrieved from an online database for each song. Those lyrics are typically returned in LRC format; my script strips the timestamps and leaves just the lyric lines separated by line breaks.
That string is fed as text into the align method. However, whenever there is an instrumental break in the song without singing, align just assumes that the words after the break are sung immediately after the words before the break:
{
"start": 4.88,
"end": 17.46,
"text": " Nobody likes you Everyone left you They're all out without you Having fun Where",
"seek": null,
"tokens": [
9297,
5902,
291,
5198,
1411,
291,
814,
434,
439,
484,
1553,
291,
10222,
1019,
220,
2305
],
"temperature": null,
"avg_logprob": null,
"compression_ratio": null,
"no_speech_prob": null,
"words": [
{
"word": " Nobody",
"start": 4.88,
"end": 5.88,
"probability": 0.5418283939361572,
"tokens": [
9297
],
"segment_id": 0,
"id": 0
},
{
"word": " likes",
"start": 5.88,
"end": 6.96,
"probability": 0.9691749215126038,
"tokens": [
5902
],
"segment_id": 0,
"id": 1
},
{
"word": " you",
"start": 6.96,
"end": 7.7,
"probability": 0.9978371262550354,
"tokens": [
291
],
"segment_id": 0,
"id": 2
},
{
"word": " Everyone",
"start": 7.7,
"end": 8.82,
"probability": 0.3230188488960266,
"tokens": [
5198
],
"segment_id": 0,
"id": 3
},
{
"word": " left",
"start": 8.86,
"end": 10.34,
"probability": 0.8787604570388794,
"tokens": [
1411
],
"segment_id": 0,
"id": 4
},
{
"word": " you",
"start": 10.34,
"end": 11.08,
"probability": 0.9931628704071045,
"tokens": [
291
],
"segment_id": 0,
"id": 5
},
{
"word": " They're",
"start": 11.08,
"end": 11.68,
"probability": 0.9571583271026611,
"tokens": [
814,
434
],
"segment_id": 0,
"id": 6
},
{
"word": " all",
"start": 11.68,
"end": 12.02,
"probability": 0.9958831071853638,
"tokens": [
439
],
"segment_id": 0,
"id": 7
},
{
"word": " out",
"start": 12.02,
"end": 12.8,
"probability": 0.9923328161239624,
"tokens": [
484
],
"segment_id": 0,
"id": 8
},
{
"word": " without",
"start": 12.8,
"end": 13.52,
"probability": 0.9911909699440002,
"tokens": [
1553
],
"segment_id": 0,
"id": 9
},
{
"word": " you",
"start": 13.52,
"end": 14.48,
"probability": 0.9982377290725708,
"tokens": [
291
],
"segment_id": 0,
"id": 10
},
{
"word": " Having",
"start": 14.48,
"end": 15.7,
"probability": 0.8765347003936768,
"tokens": [
10222
],
"segment_id": 0,
"id": 11
},
{
"word": " fun",
"start": 15.7,
"end": 17.08,
"probability": 0.9730941653251648,
"tokens": [
1019
],
"segment_id": 0,
"id": 12
},
{
"word": " Where",
"start": 17.08,
"end": 17.46,
"probability": 0.00011141821192950374,
"tokens": [
220,
2305
],
"segment_id": 0,
"id": 13
}
],
"id": 0
},
The .transcribe() method appears to be more accurate with the word timestamps, even across instrumental breaks or stretches without singing/talking.
However, its transcription of the audio doesn't line up well with the song's actual lyrics.
Align, for some reason, is just not accurate at all with its timestamps. Is there a method that actually works that I'm missing here and that could resolve this issue?
The updated non-speech suppression in 191674beefdddbce026732d5fd93026f85c26772 should help. Try updating stable-ts to 2.14.0+. See https://github.com/jianfch/stable-ts?#silence-suppression.
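If you want to confirm which version you ended up with after upgrading, a quick check (assuming the package was installed from PyPI under the name stable-ts):

from importlib.metadata import version

# Should print 2.14.0 or later for the updated non-speech suppression.
print(version("stable-ts"))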
Another option that can help is to increase the shifts for Demucs with demucs_options=dict(shifts=5) (this will increase processing time).
https://github.com/facebookresearch/demucs/blob/e976d93ecc3865e5757426930257e200846a520a/demucs/apply.py#L158-L161
You might also want to make the result deterministic when comparing different runs with demucs=True by setting the same seed each time you transcribe or align:
import random
random.seed(0)
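For example, a minimal sketch of how the seeding and the Demucs shifts fit into an align call (the file path and lyric text below are placeholders):

import random

import stable_whisper

random.seed(0)  # same seed each run so Demucs' random shifts are reproducible

model = stable_whisper.load_model('large-v3')
result = model.align(
    'song.mp3',                             # placeholder audio path
    'Nobody likes you\nEveryone left you',  # placeholder lyric text
    language='en',
    demucs=True,
    demucs_options=dict(shifts=5),          # more shifts = more stable separation, but slower
)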
Thank you for the response! Ok, so I've given that a try and had not much luck; I'm still getting the same issue where some words that come after a long pause in the speech are grouped with the "before pause" speech:
[00:04.99] [00:05.66] Nobody
[00:05.92] [00:06.96] likes
[00:07.49] [00:07.70] you
[00:07.70] [00:08.82] Everyone
[00:08.82] [00:10.33] left
[00:10.33] [00:11.08] you
[00:11.08] [00:11.67] They're
[00:11.67] [00:12.03] all
[00:12.03] [00:12.80] out
[00:12.80] [00:13.51] without
[00:13.51] [00:14.48] you
[00:14.48] [00:15.72] Having
[00:15.72] [00:17.07] fun <----- Long pause after this is spoken
[00:17.07] [00:17.91] Where <----- Should start at 45.058
[00:18.64] [00:19.85] have
[00:19.85] [00:19.85] all
[00:19.85] [00:19.85] the
[00:19.85] [00:19.85] bastards
[00:19.85] [00:19.85] gone?
The thing is that I am seeing the nonspeech_sections array in the audio.json file after updating the library. However, the word timestamps just aren't being adjusted with these values after the first word.
The above output is a result of the following setup:
...
import random
random.seed(0)
...
lyrics = """
Nobody likes you
Everyone left you
They're all out without you
Having fun
Where have all the bastards gone?
"""
result = self.model.align(
    audio=file_path,
    text=lyrics,
    language='en',
    vad=True,
    demucs=True,
    demucs_options=dict(shifts=5),
    original_split=True,
    regroup=True,
    suppress_silence=True,
    suppress_word_ts=True,
    nonspeech_error=0.3
)
The full implementation can be found here if you're interested: https://github.com/torrinworx/sound-snuggler/blob/a96e7b3bb156ea2ae268cf75eca307ead5cec9b9/scripts/transcription_handler.py#L81
nonspeech_skip, added in 738fd98490584c492cf2f7873bdddaf7a0ec9d40, can help. It skips non-speech sections longer than the specified amount; the default is 3 seconds. But keep in mind that if nonspeech_skip is set too low, it will try to align a bunch of small sections, which performs worse than disabling nonspeech_skip.
The default use_word_position=True (also added in 738fd98490584c492cf2f7873bdddaf7a0ec9d40) will work better if you keep the lyric lines separated by line breaks and use original_split=True so that it has word positions to work with.
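A sketch of how those two options might be combined, assuming the lyrics string keeps one lyric line per text line (the nonspeech_skip value here is only an illustration):

import stable_whisper

model = stable_whisper.load_model('large-v3')  # or whichever model you are using

# One lyric line per text line so original_split=True and the default
# use_word_position=True have line boundaries to work with.
lyrics = """
Nobody likes you
Everyone left you
They're all out without you
Having fun
Where have all the bastards gone?
"""

result = model.align(
    'song.mp3',            # placeholder audio path
    lyrics,
    language='en',
    original_split=True,   # split segments on the lyric line breaks
    nonspeech_skip=5,      # skip non-speech sections longer than 5 seconds (default is 3)
)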
The change that will likely help the most is to use the base model instead of large-v3. From my limited testing, the larger models hallucinate more than the smaller ones for alignment.
You can also use result.clamp_max() as a final step to clean up the starting timestamps of the segments.
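Putting those last two suggestions together, the end of the pipeline might look roughly like this (a sketch, reusing the lyrics string from above; the paths are placeholders):

import stable_whisper

model = stable_whisper.load_model('base')  # smaller model; hallucinates less during alignment
result = model.align('song.mp3', lyrics, language='en', original_split=True)
result.clamp_max()                 # final cleanup of the segments' starting timestamps
result.save_as_json('audio.json')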
Aw dude, perfect! I switched to the base model and updated the package. Man, you're awesome, thank you so much for the help! Everything is working now.
Hi there, just learning about stable-ts for a project of mine, and I've noticed two issues with the transcribe and align functions.
When using align on an mp3 song file, I noticed that the timestamps listed in result.ori_dict["segments"][0]["words"] are out of sync if the audio has gaps of silence in it. .lrc sample output:
I've tried various settings, including the recommended ones in the documentation, but the timestamps seemed to remain the same no matter what I did.
The transcribe method seems to have another issue, where the timestamps in the "end" key are set correctly for the start of each word, but the "start" key of the word is meaningless and falls way before the word is actually said/sung. I don't really know what's going on here either:
Code used:
Sample from audio.json output from result.save_as_json():
Maybe this is just an issue with ori_dict, or some option I haven't set? It feels like I've done something obviously wrong; I'd really appreciate another set of eyes on this! Love the library!