Open eschmidbauer opened 1 month ago
anyone think e2-stt is worth exploring? seems obvious and either we just wait for the paper or go ahead and try it
In fact, with some thought, probably possible to do with one unified architecture
I believe this architecture is very much worth further exploration because it no longer requires an additional alignment module and expands the paradigm of non-autoregressive speech synthesis. A potential future direction could be how to achieve training with less data, especially considering that the setup in the paper is beyond what I, as a graduate student, can afford. Do you have any better suggestions for further research? Also, I have been using my own pipeline with your network structure but have not been able to replicate the results. While the test outputs sound like speech, they are actually just gibberish, which is quite frustrating. I suspect the issue might be with the phoneme dictionary I'm using.
on the original e2tts code with no modifications or other papers, i was able to get coherent output after ~4 days of training on 8h100s, with a half size model and just with the globe dataset. performance was only OK for seen speakers and no good for unseen speakers. word error rate was awful. it's hard to say if the original impl has a bug in it, or if it just takes a ton of data and time to train.
i think the new model is promising - @lucasnewman do you have a checkpoint you could share? i want to see if i can get it to train faster by making the model bigger and organizing the training data to bin by duration so we get better gpu utilization... ridiculous amount of compute needed
indeed, was (and still am) waiting for a big catch to all this (maybe breaks down at longer prompts?), as it is too good to be true
one can even start incorporating 3+ sequences without any alignment engineering, given sufficient data and compute. it is a pretty big discovery, but let's wait for more independent verification before getting too excited
why not retrain with the new multi stream transformer approach?
ah yeah, i've been working towards trying the new setup too, just concerned about utilizing my compute well - also, one thing i was concerned about is whether the new approach might hurt speaker similarity or long-term wer - wdyt? i guess the only way to find out is to train it huh?
From the WER graphs, we observe that the Voicebox models demonstrated a good WER even at the 10% training point, owing to the use of frame-wise phoneme alignment. On the other hand, E2 TTS required significantly more training to converge. Interestingly, E2 TTS achieved a better WER at the end of the training. We speculate this is because the E2 TTS model learned a more effective grapheme-to-phoneme mapping based on the large training data, compared to what was used for Voicebox.
The arch has changed slightly since my latest checkpoint so it wouldn't be usable as-is, but if you open a discussion topic on training efficiency I'm happy to share all the tips since I've put some effort into making it work. If you want to be really aggressive about it there are definitely some higher effort optimizations you can try.
FWIW I was able to get voice cloning with arbitrary speech working fine on LibriTTS-R with a single H100, so you should have more than enough compute to train the full model. Reading between the lines, I would guess the authors used an 8xA100 rig to train the original model.
It does take a while — from my napkin math they used 3200 secs of audio per step for 800k steps which is ~700k hours of audio seen during training which would be equivalent to 1300 epochs of your 535 hour dataset, so just keep that in mind.
This is intuitive to an extent because alignment is a hard problem and there isn't any auxiliary conditioning or loss that's trying to specifically help learn it, which (as Phil was alluding to) is why this is a pretty interesting paper — the simplicity is beautiful!
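The napkin math above can be checked directly (these figures are the commenter's estimates from the paper, not official numbers):

```python
# Rough training-compute estimate, using the figures from the comment above.
secs_per_step = 3200        # audio seen per optimizer step (estimated)
steps = 800_000             # total training steps (estimated)
dataset_hours = 535         # the smaller dataset being compared against

total_hours = secs_per_step * steps / 3600
epochs = total_hours / dataset_hours

print(f"{total_hours:,.0f} hours of audio seen")               # ~711,111 hours
print(f"~{epochs:,.0f} epochs of a {dataset_hours}h dataset")  # ~1,329 epochs
```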
I don't really see why it would hurt per se — if anything you would be paying for the extra parameters for no additional gain. But worst case you can just make the text transformer tiny and approximate the arch described in the paper.
awesome, i opened a thread - that's great RE arbitrary speech, do you know if it's any good for unseen speakers?
i struggle to fit much data on the gpu with the full-size model even with mixed precision, so i'm curious why you hypothesized they used 8 A100s - my batch size needs to be super small; to get to the number of hours in the paper it would take a month or more lol. which is maybe what they did..
@lucasnewman have you see something generating long sequence >30s My observations are similar to yours, but I couldn't make the long sentences work
use an adaptive batch sampler, e.g. leverage duration buckets, and enable flash attention
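A minimal sketch of the duration-bucketing idea (the helper name and budget parameter are illustrative, not an existing library API): sort clips by length and pack batches under a padded-seconds budget, so similar-length clips land together and less compute is wasted on padding.

```python
import random

def bucket_by_duration(durations, max_batch_secs=320.0, shuffle=True):
    """Pack clip indices into batches of similar duration.
    Padded cost of a batch is (longest clip) * (batch size), so we pack
    until that product would exceed the budget. Hypothetical helper."""
    order = sorted(range(len(durations)), key=durations.__getitem__)
    batches, cur = [], []
    for i in order:
        # clips arrive sorted, so durations[i] is the longest in the batch so far
        if cur and durations[i] * (len(cur) + 1) > max_batch_secs:
            batches.append(cur)
            cur = []
        cur.append(i)
    if cur:
        batches.append(cur)
    if shuffle:
        random.shuffle(batches)  # shuffle batch order, keep buckets intact
    return batches
```

Short clips end up in large batches and long clips in small ones, which keeps GPU utilization roughly constant across steps.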
I haven't tried, but I'm curious if that's genuinely useful? It seems like it would be really slow to sample, and you can just subdivide a paragraph down to sentences and generate them individually (even in parallel!) and put them back together, i.e. that's more of a systems problem than something you'd want the model to do. Maybe if you wanted some kind of long conditioning for voice matching it makes sense?
Specifically, i was actually testing with prompts from the seed-tts demo page, though that might be overreaching. The point is, when generating longer samples, i.e. >10s or >20s, repetition and stuttering become severe
If there are many cuts, there would be more or less incoherence of prosody, timbre, etc. between the cuts
In fact, i tried vanilla e2 and other dit-based versions. vanilla is slow to converge, the others are much faster, BUT the vanilla one shows more robustness as it repeats characters less (btw i train with char/full pinyin). dit-based models repeat more. although all failed to generate long samples >20s or >30s, the vanilla one failed totally as it just gives gibberish, while the dits could get a few words right and some order to a small extent
maybe a larger dataset would help, or it is an inherent trade-off between alignment efficiency and stability. what do you think?
The reason it cannot generate long audio might be that it has not encountered such cases during training. Increasing the amount of training data with long sentences might help improve this.
my training set includes samples up to 25s. the models see the same train set but behave differently. surely including more long samples would help directly though
@SWivid is this the old or the multistream architecture? and remind me, you aren't using this repo specifically, right? sharing some of the results you are hearing would help. you should do something more rigorous and define at which length it starts to break down (share the audio files)
i'll propose an idea once it is clearer from multiple people if / where the degradation starts
@lucidrains i share some inference results here for discussion, each text with 3 random-seed results
prompt audio is from the wenet4tts demo page, the first unseen-speaker sample
<4 seconds> ref text: 而在新闻领域的奇葩说目前仍空缺。(And the Qi-Pa-Shuo in journalism is still missing.)
and i generated for
<7 seconds> gen text: 这时,朱警官等人才发现小男孩腿脚也异常,根本走不了路。(At this time, Officer Zhu and the others found that the little boy's legs and feet were abnormal, and he could not walk at all.)
<24 seconds> gen text: 这时,朱警官等人才发现小男孩腿脚也异常,根本走不了路。朱警官立刻意识到情况的严重性,他迅速蹲下身,轻轻地检查小男孩的腿部。小男孩的脸上露出痛苦的表情,显然是腿部受伤了。(At this time, Officer Zhu and the others found that the little boy's legs and feet were abnormal, and he could not walk at all. Officer Zhu immediately realized the seriousness of the situation. He quickly squatted down and gently examined the boy's legs. There was a pained expression on the little boy's face; it was obvious that he had hurt his leg.)
[test_unseen3_spk_450k.zip](https://github.com/user-attachments/files/16850500/test_unseen3_spk_450k.zip)
vanilla e2 (unet-transformer structure) is more stable, with less repeating of words/phrases, but fails on long sentences. dit (adding extra layers on the context before it goes into the dit blocks), which has more separate model space for text and masked cond audio, handles alignment better but is unstable with much repetition (an ablation of dit vs unet-transformer wasn't tested, but i think it would come down to the extra layers)
vanilla took 200k steps of training for me to hear something intelligible, dit took 150k. also, as we discussed before, i tried mmdit, which took 50~100k steps to get some alignment but inherits more instability, and after 400k the timbre collapses on zero-shot tests
what i see is: DiT handles alignment better, so it can handle sentences with durations longer than those in the training set (here we do inference with 7s + 24s, while the longest in the train set is 20s). the vanilla structure fails to do this, but is more stable with less repetition
maybe a larger dataset would help, or it is an inherent trade-off between alignment efficiency and stability
we can just add more long samples into our train set, but isn't the problem with the model itself still there?
@SWivid thank you Yushen
would you like to give the multistream transformer a try? in the meanwhile, this is my proposal, which i can implement tomorrow morning. the idea is to simply take the text, give it a bit of absolute positional embedding, and then 1d interpolate it to the same length as the audio.
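A minimal PyTorch sketch of that interpolation idea (the function name is illustrative, and this is a sketch of the proposal rather than the repo's actual implementation): add positional information to the text embeddings, then linearly stretch them along the sequence axis to the audio length.

```python
import torch
import torch.nn.functional as F

def interpolate_text_to_audio(text_emb: torch.Tensor, audio_len: int) -> torch.Tensor:
    """Stretch text embeddings (batch, text_len, dim) to (batch, audio_len, dim)
    with 1d linear interpolation, so every audio frame gets a soft text position.
    Absolute positional embedding should be added to text_emb beforehand."""
    x = text_emb.transpose(1, 2)  # (b, dim, text_len), as F.interpolate expects
    x = F.interpolate(x, size=audio_len, mode='linear', align_corners=True)
    return x.transpose(1, 2)      # (b, audio_len, dim)

emb = torch.randn(2, 12, 64)                     # 12 text tokens
stretched = interpolate_text_to_audio(emb, 300)  # 300 audio frames
print(stretched.shape)                           # torch.Size([2, 300, 64])
```

With `align_corners=True` the first and last text positions map exactly onto the first and last audio frames, so the stretch acts like a crude uniform alignment prior.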
what is the DiT architecture you are using? does it have cross attention to some text embedding? or is it just vanilla e2 without the unet skip connections
7 seconds doesn't sound that bad except the very end (i can understand a bit of mandarin). do you have a sample for something in between 7 and 24? say 15?
https://github.com/bfs18/e2_tts/blob/main/rfwave/input.py#L120 as bfs18 does, but i keep the other settings as in the paper, just taking the convnextv2 blocks before feeding them into the transformer
as for cross attention, mmdit is that kind of structure, from which i thought the conditioning and separate model space is too strong and will harm performance. that's why i was curious about lucas' results on long-sample inference
@SWivid got it
and yes, mmdit should automatically carry out cross attention and self attention in one block
i wonder if what we are talking about is an inherent limitation of NAR. this issue could exist even in say voicebox
@SWivid nonetheless, let me throw in the option to do the interpolated text tomorrow morning and we'll chip away at it
if it only generates 10s, that's not too bad 😄
for 7s samples: vanilla does not repeat, but dit sometimes does
@SWivid when i get around to this paper i will circle back to e2-tts and apply it to flow matching. the big idea emerging is separate noise levels per time step / bucket, so bringing semi-autoregressive to diffusion basically
added the 1d text interpolation strategy this morning here
welcome anyone to give it a test drive
I have audio data at 16kHz, so I retrained a 16kHz version of vocos, and the generated results were fine. However, to make it compatible with the original 24kHz version of vocos, I forcibly resampled the 16kHz data to 24kHz in the dataset class and ensured the data processing matched the front-end handling of vocos. But the final output sounds very muffled. Can anyone tell me what might be going wrong?
@JingRH is that a vocos-related issue?
@lucidrains I've resolved the issue. I had initially used different resampling methods (librosa and torchaudio) during training and inference. After switching to the same method, the problem was gone. I didn’t expect this to cause an issue!
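One way to avoid this class of bug is to route every conversion through a single shared resample function used by both the dataset class and inference; a sketch using scipy (the 16 kHz to 24 kHz figures come from the comment above, and the function name is illustrative):

```python
from math import gcd
import numpy as np
from scipy.signal import resample_poly

def resample(wav: np.ndarray, orig_sr: int, target_sr: int = 24000) -> np.ndarray:
    """Single resampling path shared by training and inference. Mixing
    librosa (training) and torchaudio (inference) resamplers, whose
    anti-aliasing filters differ, was the cause of the muffled output
    described above; one shared function removes the mismatch."""
    if orig_sr == target_sr:
        return wav
    g = gcd(orig_sr, target_sr)
    # polyphase resampling with integer up/down factors, e.g. 16k -> 24k is up=3, down=2
    return resample_poly(wav, target_sr // g, orig_sr // g)

wav16k = np.random.randn(16000).astype(np.float32)  # 1 second at 16 kHz
wav24k = resample(wav16k, 16000, 24000)
print(wav24k.shape)  # (24000,)
```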
@JingRH ok nice
but how about e2-tts? did you see audio aligning with the text yet?
@JingRH also, what anime is that. is that jojo?
Yeah, the initial alignment experiments have been validated (though generating long sentences remains challenging). I'm currently working on validating two additional experiments: multilingual (English and Chinese) and multi-emotion (TextrolSpeech).
Haha, yes, that's Dio, a very charismatic anti-hero from JoJo's.
@lucasnewman @darylsew let's grab dinner in october to celebrate! Daryl is also interested in joining, but provided he replicates it too 😄
@JingRH do try the text interpolation strategy if you have some time, if it is a negative result, i'll remove it and wrap up the project
@JingRH do try the text interpolation strategy if you have some time, if it is a negative result, i'll remove it and wrap up the project
I don't have any additional machines to run this experiment. Maybe someone else would be willing to give it a try.
haha no worries, it is great that you shared the positive results already
i guess this paper ended up being a success
I gave it a shot with side-by-side runs and it didn't really provide any significant lift that I saw, maybe slightly worse but it's in the noise.
@lucasnewman thank you! 🙏 ok, i'll plan on removing it then and move on to trying rolling diffusion approach at a later date!
@lucidrains I can also confirm the text embeddings are working. I was actually able to synthesize coherent sentences with a ~125mil param multi-speaker model. (I can't share any samples since it is an internal dataset) It seems like earlier on in training it can do short sentences with short reference audio and the longer you train it the better it gets with longer input. My gut tells me there is a shortcut to training this thing; something that speeds up alignment dramatically. What that is, I don't know 🤐
Just a few notes for anyone who is interested:
- If you're not getting anything that sounds correct, try with a short 1-2 sec reference audio and a short phrase as the text to synthesize, and generate a few files.
- It takes a while to train, especially at scale. I can't think of any other TTS system that takes more training compute? 🤔
- I'd suggest, if you're not training a zero-shot model or doing research and just need TTS, use a derivative of VITS or something else entirely. There is so much randomness in the output, it can give very bad results and broken sentences. (Although my model is not fully trained, so hopefully that happens less often when trained more/fully.)
- I didn't use mel spec to train, so I can confirm alternative audio features also work.
All that being said, just from the samples i've generated I can see this model giving SOTA zero-shot performance if trained on a large enough dataset with enough parameters.
What I'm still unsure of is, does E2TTS have superior prosody generation if compared to another model like for example, tortoise with clvp?
@Coice so much win
Thank you!
I have 600 hours of Mandarin data that I'm using to train a model with phonemes. Does anyone have experience with this? My batch size is 12, and each epoch takes about 23,000 steps. The peak learning rate is set to 9e-5 with 20,000 warmup steps. Could anyone provide feedback on these network parameters and hyperparameter settings? Any suggestions for improvement would be greatly appreciated.
@JingRH Try increasing your learning rate to 3e-4 or 2e-4. I only warmed up 1000 steps. I also disabled the gateloop layers.
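For reference, the schedule being suggested is just a short linear warmup to the peak rate; a minimal sketch (the function name is illustrative, and the defaults mirror the numbers in this exchange, so tune them for your setup):

```python
def lr_at_step(step: int, peak_lr: float = 3e-4, warmup_steps: int = 1000) -> float:
    """Linear warmup to peak_lr over warmup_steps, then hold constant.
    Whether (and how) to decay afterwards is a separate choice."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr
```

With a much longer warmup (e.g. the 20,000 steps mentioned above), the model spends a large fraction of early training at a tiny learning rate, which slows convergence on a compute budget this tight.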
@Coice Great, thank you for your suggestions! Do you think my network parameters (like depth or emb_dim) need any further adjustments?
@JingRH I was using a larger model, depth/heads = 16 and dim= 512. But @lucasnewman saw results with a similar configuration to what you have. (I am assuming 'emb_dim' in your config is equivalent to the transformer dimensions, aka 'dim' in this repo)
aw, you didn't see anything with the associative scan based layers? some papers claim it is helpful in an audio setting, but i can believe that attention can do everything at this point.
i'll probably remove it if nobody ends up using it
@lucidrains It was an effort to keep the parameter count down and @lucasnewman was seeing results with it off. It may improve model performance, hopefully someone else gives it a shot.
asking a few questions if you don't mind
sorry if it's too much to ask, thanks
I can try it — I wasn't trying to make a comment on the effectiveness of gateloop earlier. I was just starting with the simplest possible thing to get it working 😅
Is there inference code? I could not find any, but I read through other issues and found this.
Originally posted by @manmay-nakhashi in https://github.com/lucidrains/e2-tts-pytorch/issues/1#issuecomment-2227175532