lucidrains / e2-tts-pytorch

Implementation of E2-TTS, "Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS", in PyTorch
MIT License

inference code example #8

Open eschmidbauer opened 1 month ago

eschmidbauer commented 1 month ago

Is there inference code? I could not find any, but I read through other issues and found this.

          i'll write an inference script next so we can do some quick experiments.

Originally posted by @manmay-nakhashi in https://github.com/lucidrains/e2-tts-pytorch/issues/1#issuecomment-2227175532

lucidrains commented 1 week ago

anyone think e2-stt is worth exploring? seems obvious and either we just wait for the paper or go ahead and try it

In fact, with some thought, probably possible to do with one unified architecture

JingRH commented 1 week ago

anyone think e2-stt is worth exploring? seems obvious and either we just wait for the paper or go ahead and try it

In fact, with some thought, probably possible to do with one unified architecture

I believe this architecture is very much worth further exploration because it no longer requires an additional alignment module and expands the paradigm of non-autoregressive speech synthesis. A potential future direction could be how to achieve training with less data, especially considering that the setup in the paper is beyond what I, as a graduate student, can afford. Do you have any better suggestions for further research? Also, I have been using my own pipeline with your network structure but have not been able to replicate the results. While the test outputs sound like speech, they are actually just gibberish, which is quite frustrating. I suspect the issue might be with the phoneme dictionary I'm using.

darylsew commented 1 week ago

on the original e2tts code with no modifications or other papers, i was able to get coherent output after ~4 days of training on 8h100s, with a half size model and just with the globe dataset. performance was only OK for seen speakers and no good for unseen speakers. word error rate was awful. it's hard to say if the original impl has a bug in it, or if it just takes a ton of data and time to train.

i think the new model is promising - @lucasnewman do you have a checkpoint you could share? i want to see if i can get it to train faster by making the model bigger and organizing the training data to bin by duration so we get better gpu utilization... ridiculous amount of compute needed

lucidrains commented 1 week ago

anyone think e2-stt is worth exploring? seems obvious and either we just wait for the paper or go ahead and try it In fact, with some thought, probably possible to do with one unified architecture

I believe this architecture is very much worth further exploration because it no longer requires an additional alignment module and expands the paradigm of non-autoregressive speech synthesis. A potential future direction could be how to achieve training with less data, especially considering that the setup in the paper is beyond what I, as a graduate student, can afford. Do you have any better suggestions for further research? Also, I have been using my own pipeline with your network structure but have not been able to replicate the results. While the test outputs sound like speech, they are actually just gibberish, which is quite frustrating. I suspect the issue might be with the phoneme dictionary I'm using.

indeed, was (and still am) waiting for a big catch to all this (maybe breaks down at longer prompts?), as it is too good to be true

one can even start incorporating 3+ sequences without any alignment engineering, given sufficient data and compute. it is a pretty big discovery, but let's wait for more independent verification before getting too excited

lucidrains commented 1 week ago

on the original e2tts code with no modifications or other papers, i was able to get coherent output after ~4 days of training on 8h100s, with a half size model and just with the globe dataset. performance was only OK for seen speakers and no good for unseen speakers. word error rate was awful. it's hard to say if the original impl has a bug in it, or if it just takes a ton of data and time to train.

i think the new model is promising - @lucasnewman do you have a checkpoint you could share? i want to see if i can get it to train faster by making the model bigger and organizing the training data to bin by duration so we get better gpu utilization... ridiculous amount of compute needed

why not retrain with the new multi stream transformer approach?

darylsew commented 1 week ago

ah yeah, I've been working towards trying the new setup too, just concerned about utilizing my compute well - also, one thing I was concerned about is whether the new approach might hurt speaker similarity or long-term WER - wdyt? I guess the only way to find out is to train it, huh?

From the WER graphs, we observe that the Voicebox models demonstrated a good WER even at the 10% training point, owing to the use of frame-wise phoneme alignment. On the other hand, E2 TTS required significantly more training to converge. Interestingly, E2 TTS achieved a better WER at the end of the training. We speculate this is because the E2 TTS model learned a more effective grapheme-to-phoneme mapping based on the large training data, compared to what was used for Voicebox.

lucasnewman commented 1 week ago

i think the new model is promising - @lucasnewman do you have a checkpoint you could share? i want to see if i can get it to train faster by making the model bigger and organizing the training data to bin by duration so we get better gpu utilization... ridiculous amount of compute needed

The arch has changed slightly since my latest checkpoint so it wouldn't be usable as-is, but if you open a discussion topic on training efficiency I'm happy to share all the tips since I've put some effort into making it work. If you want to be really aggressive about it there are definitely some higher effort optimizations you can try.

FWIW I was able to get voice cloning with arbitrary speech working fine on LibriTTS-R with a single H100, so you should have more than enough compute to train the full model. Reading between the lines, I would guess the authors used an 8xA100 rig to train the original model.

It does take a while — from my napkin math they used 3200 secs of audio per step for 800k steps which is ~700k hours of audio seen during training which would be equivalent to 1300 epochs of your 535 hour dataset, so just keep that in mind.
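As a quick sanity check of that napkin math, using only the numbers above:

```python
# quick check of the estimate above: 3200 s of audio per step for 800k steps
audio_secs_per_step = 3200
total_steps = 800_000
dataset_hours = 535                # the 535-hour dataset mentioned above

total_hours = audio_secs_per_step * total_steps / 3600
print(f"audio seen during training: ~{total_hours:,.0f} hours")    # ~711,111 (≈ 700k)
print(f"equivalent epochs: ~{total_hours / dataset_hours:,.0f}")   # ~1,329 (≈ 1300)
```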

This is intuitive to an extent because alignment is a hard problem and there isn't any auxiliary conditioning or loss that's trying to specifically help learn it, which (as Phil was alluding to) is why this is a pretty interesting paper — the simplicity is beautiful!

ah yeah, I've been working towards trying the new setup too, just concerned about utilizing my compute well - also, one thing I was concerned about is whether the new approach might hurt speaker similarity or long-term WER - wdyt? I guess the only way to find out is to train it, huh?

I don't really see why it would hurt per se — if anything you would be paying for the extra parameters for no additional gain. But worst case you can just make the text transformer tiny and approximate the arch described in the paper.

darylsew commented 1 week ago

awesome, I opened a thread - that's great re: arbitrary speech, do you know if it's any good for unseen speakers?

I struggle to fit much data on the GPU with the full-size model even with mixed precision, so I'm curious how you hypothesized they used 8 A100s - my batch size needs to be super small; to get to the number of hours in the paper it would take a month or more lol, which is maybe what they did.

SWivid commented 1 week ago

@lucasnewman have you seen anything generating long sequences (>30s)? My observations are similar to yours, but I couldn't make the long sentences work.

SWivid commented 1 week ago

I struggle to fit much data on the GPU with the full-size model even with mixed precision, so I'm curious how you hypothesized they used 8 A100s - my batch size needs to be super small; to get to the number of hours in the paper it would take a month or more lol, which is maybe what they did.

Use an adaptive batch sampler, e.g. leverage duration buckets, and enable flash attention.
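A minimal sketch of the duration-bucketing part (illustrative only; `durations` is assumed to be a list with one length per dataset sample, it is not an API of this repo):

```python
import random
from torch.utils.data import Sampler

class DurationBucketSampler(Sampler):
    """Yield batches of indices whose samples have similar durations,
    so each batch wastes less compute on padding."""
    def __init__(self, durations, batch_size, shuffle=True):
        self.batch_size = batch_size
        self.shuffle = shuffle
        # sort sample indices by duration so neighbours have similar lengths
        self.sorted_indices = sorted(range(len(durations)), key=durations.__getitem__)

    def __iter__(self):
        batches = [
            self.sorted_indices[i:i + self.batch_size]
            for i in range(0, len(self.sorted_indices), self.batch_size)
        ]
        if self.shuffle:
            random.shuffle(batches)  # shuffle batch order, keep lengths grouped within a batch
        yield from batches

    def __len__(self):
        return (len(self.sorted_indices) + self.batch_size - 1) // self.batch_size

# usage: DataLoader(dataset, batch_sampler=DurationBucketSampler(durations, batch_size=32))
```

For flash attention, recent PyTorch dispatches `torch.nn.functional.scaled_dot_product_attention` to a flash kernel automatically when the inputs allow it.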

lucasnewman commented 1 week ago

@lucasnewman have you seen anything generating long sequences (>30s)? My observations are similar to yours, but I couldn't make the long sentences work.

I haven't tried, but I'm curious if that's genuinely useful? It seems like it would be really slow to sample, and you can just subdivide a paragraph down to sentences and generate them individually (even in parallel!) and put them back together, i.e. that's more of a systems problem than something you'd want the model to do. Maybe if you wanted some kind of long conditioning for voice matching it makes sense?
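A rough sketch of that chunk-and-stitch approach (the `e2tts.sample(...)` call and its arguments are hypothetical placeholders; the repo's actual sampling interface may differ):

```python
import re
import torch

def synthesize_paragraph(e2tts, ref_audio, ref_text, paragraph):
    """Split a paragraph into sentences, generate each one against the same
    reference prompt, and concatenate the audio."""
    sentences = [s for s in re.split(r'(?<=[.!?。！？])\s*', paragraph) if s.strip()]
    chunks = []
    for sentence in sentences:
        # each sentence is an independent generation, so these could also run in parallel
        audio = e2tts.sample(cond=ref_audio, text=[ref_text + ' ' + sentence])  # hypothetical signature
        chunks.append(audio)
    # naive concatenation along the last axis; adjust dim to the output layout,
    # and a short crossfade would smooth the seams between chunks
    return torch.cat(chunks, dim=-1)
```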

SWivid commented 1 week ago

@lucasnewman have you seen anything generating long sequences (>30s)? My observations are similar to yours, but I couldn't make the long sentences work.

I haven't tried, but I'm curious if that's genuinely useful? It seems like it would be really slow to sample, and you can just subdivide a paragraph down to sentences and generate them individually (even in parallel!) and put them back together, i.e. that's more of a systems problem than something you'd want the model to do. Maybe if you wanted some kind of long conditioning for voice matching it makes sense?

Specifically, I was actually testing with prompts from the Seed-TTS demo page, but that might be overreaching. The point is, when generating longer samples, i.e. >10s or >20s, repetition and stuttering become severe.

With many cuts, there would be more or less incoherence of prosody, timbre, etc. between the cuts.

In fact, I tried both vanilla E2 and other DiT-based versions. Vanilla is slow to converge and the others are much faster, BUT the vanilla one shows more robustness, as it repeats characters less (btw I train with characters / full pinyin), while the DiT-based models repeat more. That said, all of them failed to generate long samples (>20s, >30s): the vanilla one failed totally, as it just gives gibberish, while the DiTs could get a few words right and some order to a small extent.

Maybe a larger dataset would help, or it is an inherent trade-off between alignment efficiency and stability. What do you think?

JingRH commented 1 week ago

@lucasnewman have you seen anything generating long sequences (>30s)? My observations are similar to yours, but I couldn't make the long sentences work.

I haven't tried, but I'm curious if that's genuinely useful? It seems like it would be really slow to sample, and you can just subdivide a paragraph down to sentences and generate them individually (even in parallel!) and put them back together, i.e. that's more of a systems problem than something you'd want the model to do. Maybe if you wanted some kind of long conditioning for voice matching it makes sense?

Specifically, I was actually testing with prompts from the Seed-TTS demo page, but that might be overreaching. The point is, when generating longer samples, i.e. >10s or >20s, repetition and stuttering become severe.

With many cuts, there would be more or less incoherence of prosody, timbre, etc. between the cuts.

In fact, I tried both vanilla E2 and other DiT-based versions. Vanilla is slow to converge and the others are much faster, BUT the vanilla one shows more robustness, as it repeats characters less (btw I train with characters / full pinyin), while the DiT-based models repeat more. That said, all of them failed to generate long samples (>20s, >30s): the vanilla one failed totally, as it just gives gibberish, while the DiTs could get a few words right and some order to a small extent.

Maybe a larger dataset would help, or it is an inherent trade-off between alignment efficiency and stability. What do you think?

The reason it cannot generate long audio might be that it has not encountered such cases during training. Increasing the amount of training data with long sentences might help improve this.

SWivid commented 1 week ago

The reason it cannot generate long audio might be that it has not encountered such cases during training. Increasing the amount of training data with long sentences might help improve this.

My training set includes samples up to 25s. The models see the same train set but behave differently. Surely including more long samples would help directly, though.

lucidrains commented 1 week ago

@SWivid is this the old or multistream architecture? and remind me but you aren't using this repo specifically right? sharing some results you are hearing would help. you should do something more rigorous and define at which length it starts to break down (share the audio files)

lucidrains commented 1 week ago

i'll propose an idea once it is clearer from multiple people if / where the degradation starts

SWivid commented 1 week ago

@lucidrains I'm sharing some inference results here for discussion, each text with 3 random-seed results.

Prompt audio is from the wenet4tts demo page, first unseen speaker sample.

<4 seconds> ref text: 而在新闻领域的奇葩说目前仍空缺。(And the Qi-Pa-Shuo in journalism is still missing.)

and I generated for:

<7 seconds> gen text: 这时，朱警官等人才发现小男孩腿脚也异常，根本走不了路。(At this time, Officer Zhu and the others found that the boy's legs and feet were abnormal and he could not walk at all.)

<24 seconds> gen text: 这时，朱警官等人才发现小男孩腿脚也异常，根本走不了路。朱警官立刻意识到情况的严重性，他迅速蹲下身，轻轻地检查小男孩的腿部。小男孩的脸上露出痛苦的表情，显然是腿部受伤了。(At this time, Officer Zhu and the others found that the boy's legs and feet were abnormal and he could not walk at all. Officer Zhu immediately realized the seriousness of the situation. He quickly squatted down and gently examined the boy's legs. There was a pained expression on the little boy's face; it was obvious that he had hurt his leg.)

[test_unseen3_spk_450k.zip](https://github.com/user-attachments/files/16850500/test_unseen3_spk_450k.zip)

SWivid commented 1 week ago

As I mentioned above:

Vanilla E2 (UNet-transformer structure) is more stable, with less repetition of words/phrases, but fails on long sentences. The DiT variant (which adds extra layers on the context before it goes into the DiT blocks), with more separate model space for the text and the masked cond audio, handles alignment better but is unstable, with a lot of repetition (I haven't run an ablation on DiT vs. UNet-transformer, but I suspect it comes down to the extra layers).

Vanilla took 200k training steps for me to hear something intelligible; DiT took 150k. Also, as we discussed before, I tried MMDiT, which took 50~100k steps to get somewhat aligned, but it inherits more instability, and after 400k the timbre collapses on the zero-shot test.

What I see is: DiT handles alignment better, so it can handle sentences longer than those in the training set (here we do inference with 7s + 24s, while the longest in the train set is 20s). The vanilla structure fails to do this, but is more stable, with less repetition.

Maybe a larger dataset would help, or it is an inherent trade-off between alignment efficiency and stability.

We can just add more long samples into our train set, but isn't the problem with the model itself still there?

lucidrains commented 1 week ago

@SWivid thank you Yushen

would you like to give the multistream transformer a try? in the meantime, here is my proposal, which i can implement tomorrow morning: the idea is to simply take the text, give it a bit of absolute positional embedding, and then 1d interpolate it to the same length as the audio.
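A minimal sketch of that idea, with illustrative shapes and names rather than the repo's actual implementation:

```python
import torch
import torch.nn.functional as F

def interpolate_text_to_audio(text_emb: torch.Tensor, audio_len: int) -> torch.Tensor:
    """text_emb: (batch, text_len, dim) text embeddings that already include an
    absolute positional embedding. Returns (batch, audio_len, dim), stretched
    to the audio length with 1d linear interpolation."""
    x = text_emb.transpose(1, 2)                                   # (b, dim, text_len)
    x = F.interpolate(x, size=audio_len, mode='linear', align_corners=False)
    return x.transpose(1, 2)                                       # (b, audio_len, dim)

# e.g. (names illustrative):
# text_emb = token_emb(text_ids) + abs_pos_emb(torch.arange(text_ids.shape[-1]))
# cond = audio_emb + interpolate_text_to_audio(text_emb, audio_emb.shape[1])
```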

lucidrains commented 1 week ago

As I mentioned above: vanilla E2 (UNet-transformer structure) is more stable, with less repetition of words/phrases, but fails on long sentences. The DiT variant (which adds extra layers on the context before it goes into the DiT blocks), with more separate model space for the text and the masked cond audio, handles alignment better but is unstable, with a lot of repetition (I haven't run an ablation on DiT vs. UNet-transformer, but I suspect it comes down to the extra layers). Vanilla took 200k training steps for me to hear something intelligible; DiT took 150k. Also, as we discussed before, I tried MMDiT, which took 50~100k steps to get somewhat aligned, but it inherits more instability, and after 400k the timbre collapses on the zero-shot test.

What I see is: DiT handles alignment better, so it can handle sentences longer than those in the training set (here we do inference with 7s + 24s, while the longest in the train set is 20s). The vanilla structure fails to do this, but is more stable, with less repetition.

Maybe a larger dataset would help, or it is an inherent trade-off between alignment efficiency and stability.

We can just add more long samples into our train set, but isn't the problem with the model itself still there?

what is the DiT architecture you are using? does it have cross attention to some text embedding? or is it just vanilla e2 without the unet skip connections

lucidrains commented 1 week ago

7 seconds doesn't sound that bad except the very end (i can understand a bit of mandarin). do you have a sample for something in between 7 and 24? say 15?

SWivid commented 1 week ago

what is the DiT architecture you are using? does it have cross attention to some text embedding? or is it just vanilla e2 without the unet skip connections

https://github.com/bfs18/e2_tts/blob/main/rfwave/input.py#L120 (as bfs18 does), but I keep the other settings as in the paper, and just take the ConvNeXt V2 blocks before feeding them into the transformer.

As for cross attention, that is the kind of structure MMDiT has, which is why I thought the conditioning and the separate model space would be too strong and would harm performance. That's why I was curious about Lucas' results on long-sample inference.

lucidrains commented 1 week ago

@SWivid got it

and yes, mmdit should automatically carry out cross attention and self attention in one block

i wonder if what we are talking about is an inherent limitation of NAR. this issue could exist even in say voicebox

lucidrains commented 1 week ago

@SWivid nonetheless, let me throw in the option to do the interpolated text tomorrow morning and we'll chip away at it

if it only generates 10s, that's not too bad 😄

SWivid commented 1 week ago

7 seconds doesn't sound that bad except the very end (i can understand a bit of mandarin). do you have a sample for something in between 7 and 24? say 15?

For 7s samples, vanilla does not repeat, but DiT does sometimes.

test_unseen3_spk_450k_14s.zip

lucidrains commented 1 week ago

@SWivid when i get around to this paper i will circle back to e2-tts and apply it to flow matching. the big idea emerging is separate noise levels per time step / bucket, so bringing semi-autoregressive to diffusion basically

lucidrains commented 1 week ago

added the 1d text interpolation strategy this morning here

welcome anyone to give it a test drive

JingRH commented 1 week ago

I have audio data at 16kHz, so I retrained a 16kHz version of vocos, and the generated results were fine. However, to make it compatible with the original 24kHz version of vocos, I forcibly resampled the 16kHz data to 24kHz in the dataset class and ensured the data processing matched the front-end handling of vocos. But the final output sounds very muffled. Can anyone tell me what might be going wrong?

lucidrains commented 1 week ago

@JingRH that is a vocos related issue?

JingRH commented 1 week ago

@JingRH that is a vocos related issue?

@lucidrains I've resolved the issue. I had initially used different resampling methods (librosa and torchaudio) during training and inference. After switching to the same method, the problem was gone. I didn’t expect this to cause an issue!
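For anyone hitting the same thing, a sketch of pinning resampling to a single library (torchaudio here) for both training and inference; the target rate is just an example:

```python
import torchaudio

TARGET_SR = 24_000  # must match the sample rate the vocoder (e.g. vocos) was trained on

def load_audio(path: str, target_sr: int = TARGET_SR):
    wav, sr = torchaudio.load(path)
    if sr != target_sr:
        # use the exact same resampling path in training and inference;
        # mixing librosa and torchaudio resampling is what caused the muffled output above
        wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=target_sr)
    return wav
```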

lucidrains commented 1 week ago

@JingRH ok nice

but how about e2-tts? did you see audio aligning with the text yet?

lucidrains commented 1 week ago

@JingRH also, what anime is that. is that jojo?

JingRH commented 1 week ago

@JingRH ok nice

but how about e2-tts? did you see audio aligning with the text yet?

Yeah, the initial alignment experiments have been validated (though generating long sentences remains challenging). I'm currently working on validating two additional experiments: multilingual (English and Chinese) and multi-emotion (TextrolSpeech).

JingRH commented 1 week ago

@JingRH also, what anime is that. is that jojo?

Haha, yes, that's Dio, a very charismatic anti-hero from JoJo's.

lucidrains commented 1 week ago

@lucasnewman @darylsew let's grab dinner in october to celebrate! Daryl is also interested in joining, provided he replicates it too 😄

lucidrains commented 1 week ago

@JingRH do try the text interpolation strategy if you have some time, if it is a negative result, i'll remove it and wrap up the project

JingRH commented 1 week ago

@JingRH do try the text interpolation strategy if you have some time, if it is a negative result, i'll remove it and wrap up the project

I don't have any additional machines to run this experiment. Maybe someone else would be willing to give it a try.

lucidrains commented 1 week ago

@JingRH do try the text interpolation strategy if you have some time, if it is a negative result, i'll remove it and wrap up the project I don't have any additional machines to run this experiment. Maybe someone else would be willing to give it a try.

haha no worries, it is great that you shared the positive results already

i guess this paper ended up being a success

lucasnewman commented 1 week ago

@JingRH do try the text interpolation strategy if you have some time, if it is a negative result, i'll remove it and wrap up the project I don't have any additional machines to run this experiment. Maybe someone else would be willing to give it a try.

haha no worries, it is great that you shared the positive results already

i guess this paper ended up being a success

I gave it a shot with side-by-side runs and it didn't really provide any significant lift that I saw, maybe slightly worse but it's in the noise.

[Screenshot attachment: 2024-09-05 at 8:44:52 AM]

lucidrains commented 1 week ago

@lucasnewman thank you! 🙏 ok, i'll plan on removing it then and move on to trying rolling diffusion approach at a later date!

Coice commented 1 week ago

@lucidrains I can also confirm the text embeddings are working. I was actually able to synthesize coherent sentences with a ~125mil param multi-speaker model. (I can't share any samples since it is an internal dataset) It seems like earlier on in training it can do short sentences with short reference audio and the longer you train it the better it gets with longer input. My gut tells me there is a shortcut to training this thing; something that speeds up alignment dramatically. What that is, I don't know 🤐

Just a few notes for anyone who is interested:

- If you're not getting anything that sounds correct, try a short 1-2 sec reference audio and a short phrase as the text to synthesize, and generate a few files.
- It takes a while to train, especially at scale; I can't think of any other TTS system that takes more training compute? 🤔
- If you're not training a zero-shot model or doing research and just need TTS, I'd suggest using a derivative of VITS or something else entirely. There is so much randomness in the output, it can give very bad results and broken sentences. (Although my model is not fully trained, so hopefully that happens less often when trained more/fully.)
- I didn't use mel spec to train, so I can confirm alternative audio features also work.

All that being said, just from the samples i've generated I can see this model giving SOTA zero-shot performance if trained on a large enough dataset with enough parameters.

What I'm still unsure of is: does E2 TTS have superior prosody generation compared to another model, for example Tortoise with CLVP?

lucidrains commented 1 week ago

@Coice so much win

Thank you!

JingRH commented 1 week ago

[image: model configuration screenshot] I have 600 hours of Mandarin data that I'm using to train a model with phonemes. Does anyone have experience with this? My batch size is 12, and each epoch takes about 23,000 steps. The peak learning rate is set to 9e-5 with 20,000 warmup steps. Could anyone provide feedback on these network parameters and hyperparameter settings? Any suggestions for improvement would be greatly appreciated.

Coice commented 1 week ago

@JingRH Try increasing your learning rate to 3e-4 or 2e-4. I only warmed up 1000 steps. I also disabled the gateloop layers.
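As a concrete version of those settings, a sketch of a short linear warmup to the suggested peak LR (the model below is a stand-in; wire the schedule into your own training loop):

```python
import torch

model = torch.nn.Linear(8, 8)   # stand-in for the actual E2-TTS model
peak_lr = 3e-4
warmup_steps = 1000

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    # linear warmup from ~0 to peak_lr over warmup_steps, then hold constant
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),
)
# call optimizer.step() and then scheduler.step() once per training step
```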

JingRH commented 1 week ago

@Coice Great, thank you for your suggestions! Do you think my network parameters (like depth or emb_dim) need any further adjustments?

Coice commented 1 week ago

@JingRH I was using a larger model, depth/heads = 16 and dim= 512. But @lucasnewman saw results with a similar configuration to what you have. (I am assuming 'emb_dim' in your config is equivalent to the transformer dimensions, aka 'dim' in this repo)

lucidrains commented 1 week ago

@JingRH Try increasing your learning rate to 3e-4 or 2e-4. I only warmed up 1000 steps. I also disabled the gateloop layers.

aw, you didn't see anything with the associative scan based layers? some papers claim it is helpful in an audio setting, but i can believe that attention can do everything at this point.

i'll probably remove it if nobody ends up using it

Coice commented 1 week ago

@lucidrains It was an effort to keep the parameter count down and @lucasnewman was seeing results with it off. It may improve model performance, hopefully someone else gives it a shot.

acul3 commented 1 week ago

@lucidrains I can also confirm the text embeddings are working. I was actually able to synthesize coherent sentences with a ~125mil param multi-speaker model. (I can't share any samples since it is an internal dataset) It seems like earlier on in training it can do short sentences with short reference audio and the longer you train it the better it gets with longer input. My gut tells me there is a shortcut to training this thing; something that speeds up alignment dramatically. What that is, I don't know 🤐

Just a few notes for anyone who is interested:

- If you're not getting anything that sounds correct, try a short 1-2 sec reference audio and a short phrase as the text to synthesize, and generate a few files.
- It takes a while to train, especially at scale; I can't think of any other TTS system that takes more training compute? 🤔
- If you're not training a zero-shot model or doing research and just need TTS, I'd suggest using a derivative of VITS or something else entirely. There is so much randomness in the output, it can give very bad results and broken sentences. (Although my model is not fully trained, so hopefully that happens less often when trained more/fully.)
- I didn't use mel spec to train, so I can confirm alternative audio features also work.

All that being said, just from the samples i've generated I can see this model giving SOTA zero-shot performance if trained on a large enough dataset with enough parameters.

What I'm still unsure of is: does E2 TTS have superior prosody generation compared to another model, for example Tortoise with CLVP?

Asking a few questions, if you don't mind:

  1. Are you using a phonemizer, character, or BPE tokenizer?
  2. How many hours is your dataset?
  3. After how many epochs did the zero-shot generation get an "okay-ish" / reasonable result?

Sorry if it's too much to ask. Thanks!

lucasnewman commented 1 week ago

@lucidrains It was an effort to keep the parameter count down and @lucasnewman was seeing results with it off. It may improve model performance, hopefully someone else gives it a shot.

I can try it — I wasn't trying to make a comment on the effectiveness of gateloop earlier. I was just starting with the simplest possible thing to get it working 😅