Open Oleg-A-LLIto opened 1 year ago
Hey man, did you ever find a forced alignment solution that works? I'm also using 11labs and trying to do some forced alignment to fix the subtitles I'm generating.
@jahamed Yep, pyfoal works like a charm for me. Takes some headache to set up, but it's extremely good, at least with 11labs' output. Glad to save you all that time it took me to get there lmao
Thanks @Oleg-A-LLIto! Yes, it seems like a pain to set up. I got a decent working example with Gentle Forced Aligner too, which runs more easily on a Mac. Very surprised there isn't an easier and more modern way to get this stuff working (at least in Node).
Hey man, I found another good library for this, a lot more modern and easier to use. The alignment is very good and it's working perfectly for me now, so I thought you should know: https://github.com/echogarden-project/echogarden
For word-level timestamps, you should use whisperX with aeneas. Aeneas is very good for forced alignment against a transcript, and whisperX is perfect for word timestamps.
Get the aeneas result, transform data for whisperX align model, profit.
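To make the "transform data" step a bit more concrete, here is a minimal sketch. It assumes the aeneas result was written in its JSON sync map format (a top-level "fragments" list with "begin", "end" and "lines" fields) and that the whisperX alignment step takes a list of segments with "text", "start" and "end" keys, which is my understanding of its input but worth checking against the version you have installed. The function name aeneas_to_segments is made up for this example, and the whisperX call itself is left out.

```python
import json
from typing import Dict, List


def aeneas_to_segments(syncmap_json: str) -> List[Dict]:
    """Turn an aeneas JSON sync map into a list of {text, start, end} dicts.

    Hypothetical helper: the output shape is an assumption about what the
    whisperX alignment step expects, not something taken from its docs.
    """
    fragments = json.loads(syncmap_json)["fragments"]
    segments = []
    for frag in fragments:
        # aeneas stores the fragment text as a list of lines
        text = " ".join(frag["lines"]).strip()
        if not text:
            continue  # skip empty fragments, there is nothing to align
        segments.append({
            "text": text,
            # aeneas writes times as strings such as "2.680"
            "start": float(frag["begin"]),
            "end": float(frag["end"]),
        })
    return segments


# Small hand-written sync map in the aeneas JSON layout
example = json.dumps({
    "fragments": [
        {"begin": "0.000", "end": "2.680", "children": [], "id": "f000001",
         "language": "eng", "lines": ["Hello there."]},
        {"begin": "2.680", "end": "5.120", "children": [], "id": "f000002",
         "language": "eng", "lines": ["General Kenobi."]},
    ]
})
segments = aeneas_to_segments(example)
```

The resulting list is what you would then hand to the whisperX align step together with the audio and the alignment model.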
By just how bad it is, I'm guessing this is not how Aeneas normally is, so what could be a problem causing generally bad performance? I'm not getting any errors, I'm running win11 and I process fairly small chunks of text.
You might want to use the --debug flag to investigate.
I just noticed some pretty rough results with a build that was falling back to python + subprocess for speech synthesis, but got much better results with one using the compiled cew extension. The ~good version of this looks something like:
[DEBU] 2023-10-14 21:37:58.570971 ExecuteTask: Setting synthesizer...
[DEBU] 2023-10-14 21:37:58.571019 Synthesizer: Selecting TTS engine...
[DEBU] 2023-10-14 21:37:58.571061 Synthesizer: TTS engine: eSpeak
[DEBU] 2023-10-14 21:37:58.571105 ESPEAKTTSWrapper: No tts_path specified in rconf, setting default TTS path
[DEBU] 2023-10-14 21:37:58.571130 ESPEAKTTSWrapper: TTS path is espeak
[DEBU] 2023-10-14 21:37:58.571145 ESPEAKTTSWrapper: TTS cache? False
[DEBU] 2023-10-14 21:37:58.571158 ESPEAKTTSWrapper: Has Python call? False
[DEBU] 2023-10-14 21:37:58.571170 ESPEAKTTSWrapper: Has C extension call? True
[DEBU] 2023-10-14 21:37:58.571183 ESPEAKTTSWrapper: Has subprocess call? True
[DEBU] 2023-10-14 21:37:58.571205 ESPEAKTTSWrapper: Subprocess arguments: ['espeak', '-v', 'VOICE_CODE_STRING', '-w', 'WAVE_PATH', 'TEXT_STDIN']
[DEBU] 2023-10-14 21:37:58.571227 Synthesizer: Selecting TTS engine... done
[DEBU] 2023-10-14 21:37:58.571239 ExecuteTask: Setting synthesizer... done
[DEBU] 2023-10-14 21:37:58.571366 ExecuteTask: STEP 3 BEGIN (synthesize text)
[DEBU] 2023-10-14 21:37:58.571826 Synthesizer: Synthesizing text...
[DEBU] 2023-10-14 21:37:58.572540 ESPEAKTTSWrapper: Calling TTS engine via C extension or subprocess
[DEBU] 2023-10-14 21:37:58.572600 ESPEAKTTSWrapper: C extension 'cew' enabled
[DEBU] 2023-10-14 21:37:58.691740 ESPEAKTTSWrapper: C extension 'cew' enabled and it can be loaded
[DEBU] 2023-10-14 21:37:58.691839 ESPEAKTTSWrapper: Synthesizing using C extension...
(But I haven't tried the other packages mentioned here...)
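If you'd rather not eyeball the debug output, a rough check like the one below could work. It only looks for the "Synthesizing using C extension" line that appears in the log above; the function name is made up, and I'm not asserting what the fallback path prints, so a False result just means that line wasn't found.

```python
def synthesized_with_cew(debug_log: str) -> bool:
    """Return True if the debug output contains the line showing that
    synthesis went through the compiled C extension.

    Hypothetical helper: it matches the message text seen in the log
    above and nothing else.
    """
    marker = "Synthesizing using C extension"
    return any(marker in line for line in debug_log.splitlines())


# Two lines copied from the log above
sample_log = (
    "[DEBU] 2023-10-14 21:37:58.691740 ESPEAKTTSWrapper: C extension 'cew' enabled and it can be loaded\n"
    "[DEBU] 2023-10-14 21:37:58.691839 ESPEAKTTSWrapper: Synthesizing using C extension...\n"
)
cew_used = synthesized_with_cew(sample_log)
```

You would feed it whatever the --debug run printed, for example captured from the command's output into a string.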
So, I'm using this to align the text I get from a TTS engine, a pretty good one, too (eleven labs). To me that sounds like a perfect task: no mic noise, no background sounds, English language, and the volume is really stable. Still, not sure what I'm doing wrong here, but it works extremely poorly. To the point, the result is pretty much unusable. Half the words (by the way, yes, I'm aligning per word) are crushed into a 0-second long interval and the others are just overly long periods of time spaced around randomly. I feel like I would get a much better result by just approximating the mapping with character/vowel count. By just how bad it is, I'm guessing this is not how Aeneas normally is, so what could be a problem causing generally bad performance? I'm not getting any errors, I'm running win11 and I process fairly small chunks of text.
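For reference, the character-count approximation mentioned above could look roughly like this. It's a naive baseline sketch, not a forced aligner: it assumes the clip duration is known, ignores pauses and punctuation, and simply gives each word a share of the time proportional to its length. The function name is made up for the example.

```python
from typing import List, Tuple


def approximate_word_times(text: str, total_duration: float) -> List[Tuple[str, float, float]]:
    """Spread total_duration across the words of text in proportion to
    each word's character count. Returns (word, start, end) tuples.

    Naive baseline only: no audio is inspected at all.
    """
    words = text.split()
    total_chars = sum(len(w) for w in words)
    if total_chars == 0:
        return []  # nothing to time
    timings = []
    cursor = 0.0
    for word in words:
        span = total_duration * len(word) / total_chars
        timings.append((word, cursor, cursor + span))
        cursor += span
    return timings


timings = approximate_word_times("hi there", 7.0)
```

Anything a real aligner produces should at least beat this, so it's a handy sanity check when the output looks off.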