@bfs18 are you restoring the ground truth conditioning at each ode step like this line of code?
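i.e., restoring the condition at every step of the sampler, something like this (just a sketch of what i mean with made-up names, not the linked line):

```python
import torch

def euler_step_with_restored_cond(x, pred_flow, dt, cond_mel, cond_mask):
    # x, cond_mel: (batch, frames, n_mels); cond_mask: (batch, frames), True where mel is given as prompt
    # take a plain euler step along the predicted flow
    x = x + dt * pred_flow
    # then re-impose the ground truth mel on the conditioned frames,
    # so they never drift away from the prompt over the ode trajectory
    return torch.where(cond_mask.unsqueeze(-1), cond_mel, x)
```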
@bfs18 also, how is the text conditioning coming along on your end? have you gotten over the hill and gotten it to follow some text?
Hi @lucidrains, I've solved the problem. I believe it was caused by the filled 0s introducing a harsh distribution shift to the Mel condition. So I fill with the mean of the conditioning Mel instead. My implementation differs from the paper in two ways: 1. I use phonemes rather than characters, and 2. I use a DiT backbone without the UNet skip connections. I obtained the results in just 3 days on the LJSpeech dataset.
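Roughly like this (a simplified sketch with made-up names, not the exact code from my project):

```python
import torch

def fill_cond_mel(mel, cond_mask):
    # mel: (batch, frames, n_mels); cond_mask: (batch, frames), True where the mel is given as prompt
    mask = cond_mask.unsqueeze(-1)
    num_cond = mask.sum(dim = 1, keepdim = True).clamp(min = 1)
    cond_mean = (mel * mask).sum(dim = 1, keepdim = True) / num_cond
    # fill the frames to be generated with the mean of the prompt frames instead of zeros
    return torch.where(mask, mel, cond_mean)
```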
https://github.com/user-attachments/assets/33a77839-f53b-43a2-875d-d3ecde5b7a42
@bfs18 congrats on replicating the results! did you see accelerated convergence or did you pay a big price in those 3 days? regardless, all big data and compute will become small over time, and i guess architectural engineering for alignment is going the way of the dodo
@bfs18 somebody here has also tried phonemes and hasn't seen anything remarkable, so does this mean you changed the text embed conditioning to be per layer, using the DiT ada-ln and ada-ln-zero?
Hi @lucidrains I used only an Nvidia 4090 for the training. There isn't a reliable metric to monitor the training progress; however, around 40% into the training, the audio starts to align with the text. The text and masked Mel spectrogram are fed exclusively into the first DiT layer in my code, as the DiT blocks are densely connected through residual connections. Previously, I devoted my efforts to RAD alignment, but it did not yield better results than e2 conditioning.
@bfs18 thank you for sharing this! i'll try to strengthen the conditioning even more in this repo with some DiT specific circuits then. or perhaps improvise an even better design (while sticking with the e2 style conditioning)
Hi @lucidrains I've separated the e2_tts code from my project, and it can generate intelligible results within one or two days of training on an Nvidia 4090 using the LJ Speech dataset. I didn't notice much difference between your implementation and mine. My codebase has been focused on TTS for a long time, so there might be some details I didn't notice but that work effectively. This is the test result after training for 500k steps.
if what i see in your config is what was actually used for training, the mini-batch size is 1 sample, thus 1 time step is sampled -> 100% of the loss comes from this single time step -> backpropagation
in e2-tts-pytorch, time steps are sampled independently for each audio sample in the mini-batch, then we just do .mean() -> backpropagation
will that make some difference? or am i getting it wrong, if you actually trained with a large batch size as well
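what i mean, as a rough sketch (written in the sd3-style convention x_t = (1 − t)·x0 + t·ε just for concreteness, with a dummy linear layer standing in for the model):

```python
import torch

batch, frames, n_mels = 4, 200, 100
net = torch.nn.Linear(n_mels, n_mels)     # stand-in for the actual model (no time conditioning here)

x0 = torch.randn(batch, frames, n_mels)   # data (mel)
eps = torch.randn_like(x0)                # noise

t = torch.rand(batch, 1, 1)               # one time step per sample in the mini-batch
# t = torch.rand(1, 1, 1)                 # a single shared time step is what a batch size of 1 amounts to

xt = (1 - t) * x0 + t * eps               # noised input
target = eps - x0                         # flow / velocity target
loss = (net(xt) - target).pow(2).mean()   # .mean() averages over samples and thus over time steps
loss.backward()
```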
Hi @SWivid My mini-batch size is not fixed at 1. I utilize dynamic bucketing, where the batch is constructed from multiple audio samples to achieve a target total length.
Thanks for the reply @bfs18 ~ what about the time step, is it the same within a mini-batch (with a target max number of frames)?
@SWivid No, each sample has its own time step.
@bfs18 that's awesome, thank you for sharing all this valuable information
i've added support for (english) phonemes as well as ada-ln-zero from the DiT paper, in case those two details are important in any manner
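the ada-ln-zero part being roughly this shape (a simplified single sub-block version, not the exact code in the repo):

```python
import torch
from torch import nn

class AdaLNZero(nn.Module):
    # DiT-style adaptive layernorm with a zero-initialized gate
    def __init__(self, dim, dim_cond):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine = False)
        self.to_gamma_beta_alpha = nn.Linear(dim_cond, dim * 3)
        nn.init.zeros_(self.to_gamma_beta_alpha.weight)   # zero init -> every block starts as the identity
        nn.init.zeros_(self.to_gamma_beta_alpha.bias)

    def forward(self, x, cond, block):
        # x: (batch, seq, dim), cond: (batch, dim_cond), block: attention or feedforward sub-block
        gamma, beta, alpha = self.to_gamma_beta_alpha(cond).chunk(3, dim = -1)
        out = block(self.norm(x) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1))
        return x + alpha.unsqueeze(1) * out
```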
@bfs18 one idea i've had, now that i'm more certain this paper is going to pan out, is to use the mmdit block (with joint attention) from Esser and Rombach (proven out in both SD3 and Flux by now). this work would then be reduced down to a non-autoregressive VALL-E, but with separate parameters for the two modalities
@lucidrains that is exactly what i'm trying to do. just one thing i'm not sure about: is sd3's rectified flow predicting ε or v? the two differ during inference
@SWivid i don't have enough experience with RF to really know which one is better yet
but i do have enough intuition with transformers to know that not overlapping the tokens is probably better. it still keeps with the "embarrassingly simple" theme. i'm also a big believer in the joint attention conditioning
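roughly the shape i have in mind (a hand-wavy sketch of mmdit-style joint attention with separate projections per modality; names are made up, not code from either repo):

```python
import torch
import torch.nn.functional as F
from torch import nn

class JointAttention(nn.Module):
    # separate projections for text and audio, but one attention over the concatenated sequence
    def __init__(self, dim, heads = 8):
        super().__init__()
        self.heads = heads
        self.to_qkv_text  = nn.Linear(dim, dim * 3, bias = False)
        self.to_qkv_audio = nn.Linear(dim, dim * 3, bias = False)
        self.to_out_text  = nn.Linear(dim, dim, bias = False)
        self.to_out_audio = nn.Linear(dim, dim, bias = False)

    def forward(self, text, audio):
        n_text = text.shape[1]

        def split_heads(qkv):
            return tuple(t.unflatten(-1, (self.heads, -1)).transpose(1, 2) for t in qkv.chunk(3, dim = -1))

        q_t, k_t, v_t = split_heads(self.to_qkv_text(text))
        q_a, k_a, v_a = split_heads(self.to_qkv_audio(audio))

        # joint attention - both modalities attend over the concatenated keys / values
        q, k, v = (torch.cat(pair, dim = 2) for pair in ((q_t, q_a), (k_t, k_a), (v_t, v_a)))

        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).flatten(2)

        return self.to_out_text(out[:, :n_text]), self.to_out_audio(out[:, n_text:])
```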
@SWivid do you think predicting noise is better? i should add that as an option to this repo, to enable more experiments
@SWivid actually, let me just try running a small experiment this morning
@lucidrains I haven't reached a general conclusion yet
As described in section 3.1 of the sd3 paper: "Intuitively, however, the resulting velocity prediction target ε − x0 is more difficult for t in the middle of [0, 1]"
I guess predicting noise could learn denoising over a wider range of time steps (in the middle), while predicting v means the tough job of grasping the skeleton of the voice at time steps very close to 0
everything is up to experiments
@SWivid it isn't wise these days to not trust Rombach and his team, as they have an extraordinary track record. but yes, let me throw it into the RF repo for others to compare
one thing is for sure, i don't think these unet skip connections are necessary, and will let this technique fade out of my memory.
Hi @lucidrains Initially, I tried using mmdit, but the results were not satisfactory. Consequently, I switched to RAD alignment. In hindsight, perhaps I should have given mmdit more time. I believe the mmdit approach can be effective, as the e2 conditioning works well. In the mmdit block, it is intuitive for the left branch to receive the text and the right branch to receive the mel-spectrogram.
@bfs18 hmm yea, it would be strange for mmdit to underperform the current method
did you use separate parameters for text vs audio modalities?
@lucidrains Positional embedding may play a role. I used RoPE on both text and mel without any special handling. The RoPE positions of the text are close to those of the first few mel frames, which negatively influences alignment.
@bfs18 yes, that could be part of the reason
there's a few things to be tried. maybe an e2-mmdit is in order
@bfs18 you did use separate parameters though or no? (feedforward and attention projections being different for text vs audio)
@lucidrains I used separate parameters. It's like this.
@bfs18 nice! ok, will accept your anecdata :pray:
@lucidrains @SWivid I didn't quite understand the sentence: "Intuitively, however, the resulting velocity prediction target ε − x0 is more difficult for t in the middle of [0, 1], since for t = 0, the optimal prediction is the mean of p1, and for t = 1 the optimal prediction is the mean of p0."
when I was reading the paper. Could you explain it? According to my understanding, the optimal velocity prediction should always be ε (Gaussian noise) − x0 (data). Why is it different at t=0 and t=1 in this sentence? Thank you in advance.
as you approach 0 or 1, you see either all noise, or all data. there's not much to predict, in other words
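spelled out in the paper's notation (z_t = (1 − t)·x0 + t·ε, with x0 ~ p0 the data and ε ~ p1 the noise), the way i read it:

```latex
z_t = (1 - t)\,x_0 + t\,\varepsilon, \qquad v = \varepsilon - x_0
% t = 0: z_0 = x_0 is fully observed, so the best guess is E[\varepsilon] - x_0, i.e. the mean of p_1 shifted by a known term
% t = 1: z_1 = \varepsilon is fully observed, so the best guess is \varepsilon - E[x_0], i.e. a known term minus the mean of p_0
% only in the middle does the model have to disentangle x_0 and \varepsilon from z_t
```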
@SWivid predicting noise is looking good (7k steps for oxford flowers)! took me a few tries; i figured out that the flow needs to be clipped in order for sampling to work well
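in rough shape (a simplified sketch assuming z_t = (1 − t)·x0 + t·ε and an arbitrary clip value; the actual code in the repo may differ):

```python
import torch

def flow_from_pred_noise(pred_noise, zt, t, clip_value = 3.):
    # zt: noised input, t: tensor broadcastable to zt, under z_t = (1 - t) x0 + t eps
    x0_hat = (zt - t * pred_noise) / (1. - t).clamp(min = 1e-3)   # data estimate from the noise prediction
    flow = pred_noise - x0_hat
    # without clipping, the derived flow can blow up near the endpoints and sampling degrades
    return flow.clamp(-clip_value, clip_value)
```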
@lucidrains great!
have you applied a weighting term to the ε-prediction loss? as in the paper:
as i go through the rectified_flow and diffusers repos for sd3, that term is not considered in the loss calculation
or put another way, is this term the evil one blocking e2-tts training? adding this term in our case (backward denoising) means multiplying by 1/t^2, which is delicate to learn, as 1/t^2 gets quite large approaching t=0
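concretely, with the convention i am using (x_t = t·x1 + (1 − t)·ε, so t = 0 is pure noise and x1 is data), this is how i read where the weight comes from:

```latex
x_t = t\,x_1 + (1 - t)\,\varepsilon
\;\Rightarrow\;
v = x_1 - \varepsilon = \frac{x_t - \varepsilon}{t}
\;\Rightarrow\;
\lVert \hat{v} - v \rVert^2 = \frac{1}{t^2}\,\lVert \hat{\varepsilon} - \varepsilon \rVert^2
```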
@SWivid no, i haven't done the proper loss weighting just yet
will get to it
going to close this issue, as it is technically resolved
@bfs18 i'm just curious, are you a researcher, phd student, independent? just so i get some context
Hi @lucidrains I work at a company, and my job involves Text-to-Speech.
@bfs18 ah got it 👍
I implemented an E2-TTS conditioning mechanism in my framework, which produces intelligible audio overall. However, the regions where ground truth Mel-spectrograms are provided result in unintelligible audio. I am curious whether this issue is due to a bug in my implementation or whether your model encounters a similar problem. In the following example, the problematic regions are specifically where the ground truth Mel-spectrograms are used.
(attachments: gt and pred mel-spectrograms, gt and pred waveforms)