@bfs18 are you restoring the ground truth conditioning at each ode step like this line of code?
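i.e., restoring the condition at every step of the sampler, something like this (just a sketch of what i mean with made-up names, not the linked line):

```python
import torch

def euler_step_with_restored_cond(x, pred_flow, dt, cond_mel, cond_mask):
    # x, cond_mel: (batch, frames, n_mels); cond_mask: (batch, frames), True where mel is given as prompt
    # take a plain euler step along the predicted flow
    x = x + dt * pred_flow
    # then re-impose the ground truth mel on the conditioned frames,
    # so they never drift away from the prompt over the ode trajectory
    return torch.where(cond_mask.unsqueeze(-1), cond_mel, x)
```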
@bfs18 also, how is the text conditioning coming along on your end? have you gotten over the hill and gotten it to follow some text?
Hi @lucidrains, I've solved the problem. I believe it was caused by the filled 0s introducing a harsh distribution shift to the Mel condition. So I fill with the mean of the conditioning Mel instead. My implementation differs from the paper in two ways: 1. I use phonemes rather than characters, and 2. I use a DiT backbone without the UNet skip connections. I obtained the results in just 3 days on the LJSpeech dataset.
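Roughly like this (a simplified sketch with made-up names, not the exact code from my project):

```python
import torch

def fill_cond_mel(mel, cond_mask):
    # mel: (batch, frames, n_mels); cond_mask: (batch, frames), True where the mel is given as prompt
    mask = cond_mask.unsqueeze(-1)
    num_cond = mask.sum(dim = 1, keepdim = True).clamp(min = 1)
    cond_mean = (mel * mask).sum(dim = 1, keepdim = True) / num_cond
    # fill the frames to be generated with the mean of the prompt frames instead of zeros
    return torch.where(mask, mel, cond_mean)
```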
https://github.com/user-attachments/assets/33a77839-f53b-43a2-875d-d3ecde5b7a42
@bfs18 congrats on replicating the results! did you see accelerated convergence or did you pay a big price in those 3 days? regardless, all big data and compute will become small over time, and i guess architectural engineering for alignment is going the way of the dodo
@bfs18 somebody here has also tried phonemes and hasn't seen anything remarkable, so does this mean you changed the text embed conditioning to be per layer, using the DiT ada-ln and ada-ln-zero?
Hi @lucidrains I used only an Nvidia 4090 for the training. There isn't a reliable metric to monitor the training progress; however, around 40% into the training, the audio starts to align with the text. The text and masked Mel spectrogram are fed exclusively into the first DiT layer in my code, as the DiT blocks are densely connected through residual connections. Previously, I devoted my efforts to RAD alignment, but it did not yield better results than e2 conditioning.
@bfs18 thank you for sharing this! i'll try to strengthen the conditioning even more in this repo with some DiT specific circuits then. or perhaps improvise an even better design (while sticking with the e2 style conditioning)
Hi @lucidrains I've separated the e2_tts code from my project, and it can generate intelligible results within one or two days of training on an Nvidia 4090 using the LJ Speech dataset. I didn't notice much difference between your implementation and mine. My codebase has been focused on TTS for a long time, so there might be some details I didn't notice but that work effectively. This is the test result after training for 500k steps.
if what i see in your config is what was actually used for training, the mini-batch size is 1 sample, thus 1 time step is sampled -> 100% of the loss comes from this single time step -> backpropagation
in e2-tts-pytorch, time steps are sampled independently for each audio sample in the mini-batch, then we just do .mean() -> backpropagation
will that make some difference? or am i getting it wrong, if you actually trained with a large batch size as well
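what i mean, as a rough sketch (written in the sd3-style convention x_t = (1 − t)·x0 + t·ε just for concreteness, with a dummy linear layer standing in for the model):

```python
import torch

batch, frames, n_mels = 4, 200, 100
net = torch.nn.Linear(n_mels, n_mels)     # stand-in for the actual model (no time conditioning here)

x0 = torch.randn(batch, frames, n_mels)   # data (mel)
eps = torch.randn_like(x0)                # noise

t = torch.rand(batch, 1, 1)               # one time step per sample in the mini-batch
# t = torch.rand(1, 1, 1)                 # a single shared time step is what a batch size of 1 amounts to

xt = (1 - t) * x0 + t * eps               # noised input
target = eps - x0                         # flow / velocity target
loss = (net(xt) - target).pow(2).mean()   # .mean() averages over samples and thus over time steps
loss.backward()
```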
Hi @SWivid My mini-batch size is not fixed at 1. I utilize dynamic bucketing, where the batch is constructed from multiple audio samples to achieve a target total length.
Thanks for the reply @bfs18 ~ what about the time step, is it the same within a mini-batch (with a target max number of frames)?
@SWivid No, each sample has its own time step.
@bfs18 that's awesome, thank you for sharing all this valuable information
i've added support for (english) phonemes as well as ada-ln-zero from the DiT paper, in case those two details are important in any manner
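the ada-ln-zero part being roughly this shape (a simplified single sub-block version, not the exact code in the repo):

```python
import torch
from torch import nn

class AdaLNZero(nn.Module):
    # DiT-style adaptive layernorm with a zero-initialized gate
    def __init__(self, dim, dim_cond):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine = False)
        self.to_gamma_beta_alpha = nn.Linear(dim_cond, dim * 3)
        nn.init.zeros_(self.to_gamma_beta_alpha.weight)   # zero init -> every block starts as the identity
        nn.init.zeros_(self.to_gamma_beta_alpha.bias)

    def forward(self, x, cond, block):
        # x: (batch, seq, dim), cond: (batch, dim_cond), block: attention or feedforward sub-block
        gamma, beta, alpha = self.to_gamma_beta_alpha(cond).chunk(3, dim = -1)
        out = block(self.norm(x) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1))
        return x + alpha.unsqueeze(1) * out
```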
@bfs18 one idea i've had, now that i'm more certain this paper is going to pan out, is to use the mmdit block (with joint attention) from Esser and Rombach (proven out in both SD3 and Flux by now). this work would then be reduced down to a non-autoregressive VALL-E, but with separate parameters for the two modalities
@lucidrains that is exactly what i'm trying to do. just one thing i'm not sure about: is sd3's rectified flow predicting ε or v? the two differ during inference
@SWivid i don't have enough experience with RF to really know which one is better yet
but i do have enough intuition with transformers to know that not overlapping the tokens is probably better. it still keeps with the "embarrassingly simple" theme. i'm also a big believer in the joint attention conditioning
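roughly the shape i have in mind (a hand-wavy sketch of mmdit-style joint attention with separate projections per modality; names are made up, not code from either repo):

```python
import torch
import torch.nn.functional as F
from torch import nn

class JointAttention(nn.Module):
    # separate projections for text and audio, but one attention over the concatenated sequence
    def __init__(self, dim, heads = 8):
        super().__init__()
        self.heads = heads
        self.to_qkv_text  = nn.Linear(dim, dim * 3, bias = False)
        self.to_qkv_audio = nn.Linear(dim, dim * 3, bias = False)
        self.to_out_text  = nn.Linear(dim, dim, bias = False)
        self.to_out_audio = nn.Linear(dim, dim, bias = False)

    def forward(self, text, audio):
        n_text = text.shape[1]

        def split_heads(qkv):
            return tuple(t.unflatten(-1, (self.heads, -1)).transpose(1, 2) for t in qkv.chunk(3, dim = -1))

        q_t, k_t, v_t = split_heads(self.to_qkv_text(text))
        q_a, k_a, v_a = split_heads(self.to_qkv_audio(audio))

        # joint attention - both modalities attend over the concatenated keys / values
        q, k, v = (torch.cat(pair, dim = 2) for pair in ((q_t, q_a), (k_t, k_a), (v_t, v_a)))

        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).flatten(2)

        return self.to_out_text(out[:, :n_text]), self.to_out_audio(out[:, n_text:])
```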
@SWivid do you think predicting noise is better? i should add that as an option to this repo, to enable more experiments
@SWivid actually, let me just try running a small experiment this morning
@lucidrains I haven't reached a general conclusion yet
As described in section 3.1 of the sd3 paper: "Intuitively, however, the resulting velocity prediction target ε − x0 is more difficult for t in the middle of [0, 1]"
I guess predicting noise could learn denoising over a wider range of time steps (in the middle), while predicting v means the tough job of grasping the skeleton of the voice at time steps very close to 0
everything is up to experiments
@SWivid it isn't wise these days to not trust Rombach and his team, as they have an extraordinary track record. but yes, let me throw it into the RF repo for others to compare
one thing is for sure, i don't think these unet skip connections are necessary, and will let this technique fade out of my memory.
Hi @lucidrains Initially, I tried using mmdit, but the results were not satisfactory. Consequently, I switched to RAD alignment. In hindsight, perhaps I should have given mmdit more time. I believe the mmdit approach can be effective, as the e2 conditioning works well. In the mmdit block, it is intuitive for the left branch to receive the text and the right branch to receive the mel-spectrogram.
@bfs18 hmm yea, it would be strange for mmdit to underperform the current method
did you use separate parameters for text vs audio modalities?
@lucidrains Positional embedding may play a role. I used RoPE on both text and mel without any special handling. The RoPE positions of the text are close to those of the first few mel frames, which negatively influences alignment.
@bfs18 yes, that could be part of the reason
there's a few things to be tried. maybe an e2-mmdit is in order
@bfs18 you did use separate parameters though or no? (feedforward and attention projections being different for text vs audio)
@lucidrains I used separate parameters. It's like this.
@bfs18 nice! ok, will accept your anecdata :pray:
@lucidrains @SWivid I didn't quite understand the sentence: "Intuitively, however, the resulting velocity prediction target ε − x0 is more difficult for t in the middle of [0, 1], since for t = 0, the optimal prediction is the mean of p1, and for t = 1 the optimal prediction is the mean of p0."
when I was reading the paper. Could you explain it? According to my understanding, the optimal velocity prediction should always be ε (Gaussian noise) − x0 (data). Why is it different at t=0 and t=1 in this sentence? Thank you in advance.
as you approach 0 or 1, you see either all noise, or all data. there's not much to predict, in other words
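spelled out in the paper's notation (z_t = (1 − t)·x0 + t·ε, with x0 ~ p0 the data and ε ~ p1 the noise), the way i read it:

```latex
z_t = (1 - t)\,x_0 + t\,\varepsilon, \qquad v = \varepsilon - x_0
% t = 0: z_0 = x_0 is fully observed, so the best guess is E[\varepsilon] - x_0, i.e. the mean of p_1 shifted by a known term
% t = 1: z_1 = \varepsilon is fully observed, so the best guess is \varepsilon - E[x_0], i.e. a known term minus the mean of p_0
% only in the middle does the model have to disentangle x_0 and \varepsilon from z_t
```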
@SWivid predicting noise is looking good (7k steps for oxford flowers)! took me a few tries; i figured out that the flow needs to be clipped in order for sampling to work well
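in rough shape (a simplified sketch assuming z_t = (1 − t)·x0 + t·ε and an arbitrary clip value; the actual code in the repo may differ):

```python
import torch

def flow_from_pred_noise(pred_noise, zt, t, clip_value = 3.):
    # zt: noised input, t: tensor broadcastable to zt, under z_t = (1 - t) x0 + t eps
    x0_hat = (zt - t * pred_noise) / (1. - t).clamp(min = 1e-3)   # data estimate from the noise prediction
    flow = pred_noise - x0_hat
    # without clipping, the derived flow can blow up near the endpoints and sampling degrades
    return flow.clamp(-clip_value, clip_value)
```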
@lucidrains great!
have you applied a weighting term to the ε-prediction loss? as in the paper:
as i go through the rectified_flow and diffusers repos for sd3, that term is not considered in the loss calculation
or put another way, is this term the evil one blocking e2-tts training? adding this term in our case (backward denoising) means multiplying by 1/t^2, which is delicate to learn, as 1/t^2 gets quite large approaching t=0
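concretely, with the convention i am using (x_t = t·x1 + (1 − t)·ε, so t = 0 is pure noise and x1 is data), this is how i read where the weight comes from:

```latex
x_t = t\,x_1 + (1 - t)\,\varepsilon
\;\Rightarrow\;
v = x_1 - \varepsilon = \frac{x_t - \varepsilon}{t}
\;\Rightarrow\;
\lVert \hat{v} - v \rVert^2 = \frac{1}{t^2}\,\lVert \hat{\varepsilon} - \varepsilon \rVert^2
```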
@SWivid no, i haven't done the proper loss weighting just yet
will get to it
going to close this issue, as it is technically resolved
@bfs18 i'm just curious, are you a researcher, phd student, independent? just so i get some context
Hi @lucidrains I work at a company, and my job involves Text-to-Speech.
@bfs18 ah got it 👍
I implemented an E2-TTS conditioning mechanism in my framework, which produces intelligible audio overall. However, the regions where ground truth Mel-spectrograms are provided result in unintelligible audio. I am curious whether this issue is due to a bug in my implementation or whether your model encounters a similar problem. In the following example, the problematic regions are specifically where the ground truth Mel-spectrograms are used.
(attachments: gt and pred mel-spectrograms, gt and pred waveforms)