shivammehta25 / Matcha-TTS

[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
https://shivammehta25.github.io/Matcha-TTS/
MIT License

Possible to manipulate text projection to elongate phonemes controllably? [Question] #39

Closed artificalaudio closed 5 months ago

artificalaudio commented 8 months ago
[Screenshot attached]

I'm reading through the paper, and I'm wondering if during inference time, could you manipulate the duration predictor, or some other part to allow controllable elongation of certain phonemes?

Could you somehow interfere with this at/before inference time to allow adaptation for singing? Text2singing, for instance. I tried writing "I'm Siiiinging in the rain" to attempt to elongate the generated sound, but it loses intelligibility.

I'm wondering if this technique could be adapted to text2singing, taking some inspiration from RVC, and potentially training on an F0 contour as well. So the input would be F0 + phones/text projection. F0 is used as an additional conditioning signal to the text. Could this work?

shivammehta25 commented 8 months ago

Hello! Thank you for your interest in 🍵 Matcha-TTS.

I'm wondering if this technique could be adapted to text2singing, taking some inspiration from RVC, and potentially training on an F0 contour as well. So the input would be F0 + phones/text projection. F0 is used as an additional conditioning signal to the text. Could this work?

It is a great idea and it should work. This amounts to additional conditioning for the flow matching process and might even further improve generation at fewer NFEs.

Could you somehow interfere with this at/before inference time to allow adaptation for singing? Text2singing, for instance. I tried writing "I'm Siiiinging in the rain" to attempt to elongate the generated sound, but it loses intelligibility.

This is because the current front-end does not take this into account and will generate multiple phones for it, while the model was not trained for something like this. I suggest either updating the text to act like this (if you want to do some annotations) or, as you mentioned, adapting and taking inspiration from the text2singing paradigm.

Hope this helps :)

artificalaudio commented 8 months ago

This is because the current front-end does not take this into account and will generate multiple phones for it, while the model was not trained for something like this. I suggest either updating the text to act like this (if you want to do some annotations) or, as you mentioned, adapting and taking inspiration from the text2singing paradigm.

Hope this helps :)

This helps a lot! Thank you very much. There isn't a (neural) text2singing paradigm (yet!); there's RVC, which is audio2audio cloning and uses HuBERT embeddings and an F0 signal as the conditioner.

I suggest either updating the text to act like this (if you want to do some annotations)

Could you elaborate on this? So this could be possible by retraining with some extra text/phone annotations to denote held sounds?

I wonder what would happen if Matcha was trained from scratch with annotated singing examples. Or fine-tuned on singing examples.

This is something I can't quite figure out with the MAS: you don't align the text to speech yourself. So would I just be able to input lyric and audio pairs? This is what I can't figure out: if I added an additional F0 contour, would I have to align the F0 to the text projection manually? (Or could the correlation be learned, for instance?)

shivammehta25 commented 8 months ago

Could you elaborate on this? So this could be possible by retraining with some extra text/phone annotations to denote held sounds?

Unfortunately no, I meant to annotate the training dataset and retrain the system. Precisely what you said after:

I wonder what would happen if Matcha was trained from scratch with annotated singing examples. Or fine-tuned on singing examples.

But that's the best part! It is not super expensive to re-train 🍵 Matcha-TTS, and it can even be done on consumer-grade GPUs.

This is what I can't figure out: if I added an additional F0 contour, would I have to align the F0 to the text projection manually? (Or could the correlation be learned, for instance?)

I think what you are referring to is already implemented in FastPitch v1.1 and can be learned.

artificalaudio commented 8 months ago

Could you elaborate on this? So this could be possible by retraining with some extra text/phone annotations to denote held sounds?

Unfortunately no, I meant to annotate the training dataset and retrain the system. Precisely what you said after:

I wonder what would happen if Matcha was trained from scratch with annotated singing examples. Or fine-tuned on singing examples.

But that's the best part! It is not super expensive to re-train 🍵 Matcha-TTS, and it can even be done on consumer-grade GPUs.

This is what I can't figure out: if I added an additional F0 contour, would I have to align the F0 to the text projection manually? (Or could the correlation be learned, for instance?)

I think what you are referring to is already implemented in FastPitch v1.1 and can be learned.

Ok, thank you very much for this information. I'm going to try this; I've found a couple of paired datasets of lyrics and vocals. I'll try the normal way first, without editing the architecture or adding in F0, and add it in there if needed.

There's another thing about the mechanics of this repo: are you actually generating mel-spectrograms? (I saw the optional vocoder part for ONNX export, so either the model outputs mel-spectrograms on its own, or you add in a neural vocoder and output real audio.)

What confuses me is the paper mentions using a 1D unet instead of a 2D unet, so are you using this model to go from text to image, but the image is a spectrogram? (Then synth from spectrogram?)

ghenter commented 8 months ago

What confuses me is the paper mentions using a 1D unet instead of a 2D unet, so are you using this model to go from text to image, but the image is a spectrogram? (Then synth from spectrogram?)

We do not treat mel-spectrograms as 2D images, but as a 1D sequence of 80-dimensional vectors. This works well and saves memory.

Let me try to explain the way I understand it in a bit more depth:

Conceptually, and ignoring the batch dimension, RGB images are represented as 3D tensors, say W x H x C, where C (the number of channels) equals 3. Convolution is applied to the W and H dimensions (so a 2D CNN), but not over the channels C. Mel-spectrograms can be represented as 2D tensors T x C, where T is the number of frames (varies with audio duration) and C is the number of "channels" in the spectrogram (nearly always 80).

Grad-TTS treats mel-spectrograms as greyscale images measuring T x C x 1 and applies a 2D CNN, with convolution along both T and C. Assuming that F different filter kernels are used, along with zero-padding, my understanding is that the output measures T x C x F, a 3D tensor.

In contrast, Matcha-TTS treats the T x C mel-spectrograms as a sequence of T vectors, each with dimension C, and ~~only runs convolution~~ assumes translation invariance along T (like a 1D CNN would). In a sense, Matcha-TTS sees mel-spectrograms as images with a height of 1 pixel but that have many channels per pixel, something like T x 1 x C. The output resulting from a 1D convolution measures T x F, which is a 2D tensor; with a Transformer on a 1D sequence, we typically have F = C, such that the output measures T x C. Such 2D tensors save memory over the 3D tensors used by Grad-TTS.

Finally, note that both the T x C x F and T x F tensors above only constitute activations inside a U-Net decoder. In all cases, the final decoder output has the same size T x C as its mel-spectrogram input, so that the system always returns a mel-spectrogram in the end.
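To make the shapes concrete, here is a minimal PyTorch sketch of the two views (purely illustrative: the plain conv layers and the specific numbers are my stand-ins, not actual Grad-TTS or Matcha-TTS code):

```python
import torch
import torch.nn as nn

T, C, F = 100, 80, 64  # frames, mel "channels", number of filters (values are illustrative)

mel = torch.randn(1, T, C)  # one mel-spectrogram: T frames of C-dimensional vectors

# "Greyscale image" view (Grad-TTS style): 1 input channel, convolution over both T and C.
img = mel.unsqueeze(1)                            # (1, 1, T, C)
out_2d = nn.Conv2d(1, F, kernel_size=3, padding=1)(img)
print(out_2d.shape)                               # torch.Size([1, 64, 100, 80]): a T x C x F activation

# "Sequence" view: translation invariance along T only. A Conv1d stands in here for
# any layer that is translation-invariant along T (Matcha-TTS actually uses a Transformer).
seq = mel.transpose(1, 2)                         # (1, C, T)
out_1d = nn.Conv1d(C, C, kernel_size=3, padding=1)(seq)
print(out_1d.shape)                               # torch.Size([1, 80, 100]): a T x C activation
```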

I hope this helps!

Edit, three days later: Corrected unclear statements (as indicated by strikeout) incorrectly suggesting that the Matcha-TTS decoder is a CNN.

ghenter commented 8 months ago

There isn't a (neural) text2singing paradigm (yet!); there's RVC, which is audio2audio cloning and uses HuBERT embeddings and an F0 signal as the conditioner.

F0 is used as an additional conditioning signal to the text.

To me, it appears that there are significant similarities between what you are proposing and the area of singing voice synthesis (sometimes abbreviated SVS), where the input is text and a music score, and the output is audio of the text being sung (as opposed to being spoken). You can find a bunch of SVS papers using this Google Scholar search.

SVS seems similar to your idea of text-to-singing, and virtually identical if F0 control is included (depending on how the F0 control input is represented). I do not really follow SVS work, so I am not aware whether or not a neural TTS system has been trained on singing audio but without a music score (so without F0 conditioning, meaning that the input indeed only is text), but it seems likely.

artificalaudio commented 8 months ago

We do not treat mel-spectrograms as 2D images, but as a 1D sequence of 80-dimensional vectors. This works well and saves memory.

Conceptually, and ignoring the batch dimension, RGB images are represented as 3D tensors, say W x H x C, where C (the number of channels) equals 3. Convolution is applied to the W and H dimensions (so a 2D CNN), but not over the channels C. Mel-spectrograms can be represented as 2D tensors T x C, where T is the number of frames (varies with audio duration) and C is the number of "channels" in the spectrogram (nearly always 80).

In contrast, Matcha-TTS treats the T x C mel-spectrograms as a sequence of T vectors, each with dimension C, and only runs convolution along T (a 1D CNN). In a sense, Matcha-TTS sees mel-spectrograms as images with a height of 1 pixel but that have many channels per pixel, something like T x 1 x C. The output resulting from the 1D convolution measures T x F, which is a 2D tensor. This saves memory over the 3D tensors used by Grad-TTS.

Finally, note that both the T x C x F and T x F tensors above only constitute activations inside the U-Net decoder. In all cases, the final decoder output has the same size T x C as its mel-spectrogram input, so that the system always returns a mel-spectrogram in the end.

I hope this helps!

This is stupidly helpful, I can't thank you enough for taking the time to explain this. Diffusion models without attention spring to mind, using the SSM/S4-type models; they're apparently really great for 1D time-series data. So would it be possible to adapt the Matcha architecture to use SSMs, i.e. selective state space models? (Not something I think I can do myself, but it's been a lingering wonder!)

And the other thought is: why spectrograms? Would this work with HuBERT embeddings? Treat them similarly to the way you interpret mel specs: just make a 1D sequence of 256/768-dimensional vectors (whatever the fixed dimensionality of the HuBERT embeddings you use). Then use a neural source filter to synthesise the sound (instead of a mel vocoder/HiFi-GAN); if you use the variation of wav2vec/HuBERT called ContentVec, then you have speaker disentanglement and can fine-tune the NSF head to any voice you want. (This is something I think I could do myself, as it's changing the pipeline slightly and tweaking one value.) Can you see any gotchas as to why this might not be feasible? This way you train on one person's voice, but can generalise to any speaker by fine-tuning the head.

ghenter commented 8 months ago

In contrast, Matcha-TTS treats the T x C mel-spectrograms as a sequence of T vectors (...) and only runs convolution along T (a 1D CNN)

I'm sorry, but I realise that some of what I wrote above came out wrong, or at the very least is likely to be misunderstood. I will try to clarify my understanding a bit better:

Matcha-TTS does run upsampling and downsampling along the T dimension (operations that perhaps can be expressed as convolutions), but it is a Transformer rather than a CNN model. As such, the decoder is a model with, rather than without, attention. The main takeaway should be that Matcha-TTS assumes translation invariance along the T dimension (translation invariance being the same assumption that powers CNNs as well), but, in contrast to Grad-TTS, does not do so along the C dimension.

shivammehta25 commented 8 months ago

This is stupidly helpful, I can't thank you enough for taking the time to explain this. Diffusion models without attention spring to mind, using the SSM/S4-type models; they're apparently really great for 1D time-series data. So would it be possible to adapt the Matcha architecture to use SSMs, i.e. selective state space models? (Not something I think I can do myself, but it's been a lingering wonder!)

Ideally, you should be able to replace the Transformer with an S4-type architecture like Mamba. But I don't think the speed gain would be that significant for mel-spectrogram synthesis, as mel-spectrograms tend to be small already. However, if one wants to build a waveform synthesis network, it might be something to try; it could benefit from the linear-time sequence modelling then.

And the other thought is: why spectrograms? Would this work with HuBERT embeddings? Treat them similarly to the way you interpret mel specs: just make a 1D sequence of 256/768-dimensional vectors (whatever the fixed dimensionality of the HuBERT embeddings you use). Then use a neural source filter to synthesise the sound (instead of a mel vocoder/HiFi-GAN); if you use the variation of wav2vec/HuBERT called ContentVec, then you have speaker disentanglement and can fine-tune the NSF head to any voice you want. (This is something I think I could do myself, as it's changing the pipeline slightly and tweaking one value.) Can you see any gotchas as to why this might not be feasible? This way you train on one person's voice, but can generalise to any speaker by fine-tuning the head.

This is a great idea, and yes! You can choose any representation as the output representation from 🍵 Matcha-TTS; you can even make it generate Encodec vectors and then use their decoder to generate waveforms (having a large pretraining might be better for generalisation to different scenarios and sounds, similar to what you've proposed).

ghenter commented 8 months ago

And the other thought is: why spectrograms? Would this work with HuBERT embeddings? Treat them similarly to the way you interpret mel specs: just make a 1D sequence of 256/768-dimensional vectors (whatever the fixed dimensionality of the HuBERT embeddings you use). Then use a neural source filter to synthesise the sound (instead of a mel vocoder/HiFi-GAN)

There's quite a bit of interest in TTS right now for replacing (mel-)spectrograms with learnt representations of audio/acoustics. Doing so seems to enable better speech synthesis, and the artefacts are also qualitatively different from those you get with the conventional approach. Myself, I have been a co-author on two published papers exploring such learnt representations of audio/acoustics, specifically for applications to spontaneous speech. You will find those two articles linked here, in case they may be of use to you:

artificalaudio commented 8 months ago

This is a great idea, and yes! You can choose any representation as the output representation from 🍵 Matcha-TTS; you can even make it generate Encodec vectors and then use their decoder to generate waveforms (having a large pretraining might be better for generalisation to different scenarios and sounds, similar to what you've proposed).

Ok you’re actually blowing my mind a bit here! You can use encodec tokens as well. Have you tried this? I’m very interested in trying this, and might help me understand the pipeline to be able to swap for Hubert. Encodec’s output is pretty simple, you’re treating the 2D codebook like you would a spectrogram. Instead of 80 channels, you’d have the depth of the codebooks?

So I'd just change this number, for instance (the number of features you're predicting)? https://github.com/shivammehta25/Matcha-TTS/blob/256adc55d3219053d2d086db3f9bd9a4bde96fb1/configs/model/matcha.yaml#L12

Convert the dataset to Encodec tokens. Or replace this part here with a get-encodec/hubert-representation function instead: https://github.com/shivammehta25/Matcha-TTS/blob/256adc55d3219053d2d086db3f9bd9a4bde96fb1/matcha/data/text_mel_datamodule.py#L172

And swap the head/decoder, instead of synthesising through HiFi-GAN? I need to track down that part, either to swap in an NSF or an Encodec token decoder. Am I right that I just need to change the features, hack the way the dataset is made, and change the decoder? It's the synthesising-at-the-output part of training I'm not entirely sure where to find.

(If I search for hifigan or vocoder, I get hits, but in app.py/cli.py.) Should I be looking in meldataset.py or text_mel_datamodule.py? Do I hack this function to allow for Encodec tokens? Bit shaky on this part: https://github.com/shivammehta25/Matcha-TTS/blob/main/matcha/models/matcha_tts.py#L74 There's the hifigan folder, and models, but I'm expecting something at the end of the training script where I can hack/swap out the decoder from mels to audio and instead decode Encodec tokens. Am I looking in the right places?

And to round this off for "elongation of phonemes": according to this SVS paradigm, you'd put a time-lag model before the duration model (https://nnsvs.github.io/overview.html). I'm getting ahead of myself there, but that would round the system off for using Matcha as the basis of a neural singing voice synth. Lag + F0, SSL features into Matcha; sounds like a fun investigation!

artificalaudio commented 8 months ago

And the other thought is: why spectrograms? Would this work with HuBERT embeddings? Treat them similarly to the way you interpret mel specs: just make a 1D sequence of 256/768-dimensional vectors (whatever the fixed dimensionality of the HuBERT embeddings you use). Then use a neural source filter to synthesise the sound (instead of a mel vocoder/HiFi-GAN)

There's quite a bit of interest in TTS right now for replacing (mel-)spectrograms with learnt representations of audio/acoustics. Doing so seems to enable better speech synthesis, and the artefacts are also qualitatively different from those you get with the conventional approach. Myself, I have been a co-author on two published papers exploring such learnt representations of audio/acoustics, specifically for applications to spontaneous speech. You will find those two articles linked here, in case they may be of use to you:

OK, this is absolutely perfect, and it's been a pleasure speaking to you both; thank you for being so gracious as to accommodate this mind-wandering! I've read both of the papers and I love them both! I really love the WavThruVec paper; I was annoyed when I first found it, dammit, they beat me to it! This is exactly what I'm interested in researching: using SSL features.

I think the NSF stuff sounds a bit smoother for voice than HiFi-GANs, so this will be the first thing I try to pull off if I can figure out how to swap the decoder/head for something else. I'm thinking about how exactly to treat F0 in conjunction with the SSL embeddings. F0 is 1D and can be time-aligned to the SSL embeddings, so in the context of this Matcha model, just add it as an extra single dimension: if the output of w2v or HuBERT is, say, 256, plus F0, you predict 257 channels; F0 is just added on top of/with the SSL features. (I don't have any intuition here; would that work, or is it a bad idea?)
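Roughly what I have in mind, as a shape-only sketch (ssl_feats and f0 below are just random placeholder tensors):

```python
import torch

n_frames = 200
ssl_feats = torch.randn(n_frames, 256)  # placeholder for frame-level w2v/HuBERT-style embeddings
f0 = torch.randn(n_frames)              # placeholder F0 contour, one value per frame, time-aligned

# Concatenate F0 as one extra channel: the acoustic model would then predict 257 features per frame.
target = torch.cat([ssl_feats, f0.unsqueeze(-1)], dim=-1)
print(target.shape)                     # torch.Size([200, 257])
```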

shivammehta25 commented 8 months ago

Ok, you're actually blowing my mind a bit here! You can use Encodec tokens as well. Have you tried this? I'm very interested in trying this, and it might help me understand the pipeline well enough to swap in HuBERT. Encodec's output is pretty simple: you're treating the 2D codebook like you would a spectrogram. Instead of 80 channels, you'd have the depth of the codebooks? So I'd just change this number, for instance (the number of features you're predicting)?

Exactly! This is precisely all that needs to be done. The 80-dim mel-spectrogram is just the representation we used; the main reason was to compare with other architectures and to use an off-the-shelf vocoder. But it is not a restriction: one can freely choose any other representation, like Encodec vectors. I know that @p0p4k has tried that.

Convert the dataset to Encodec tokens. Or replace this part here with a get-encodec/hubert-representation function instead: And swap the head/decoder, instead of synthesising through HiFi-GAN? I need to track down that part, either to swap in an NSF or an Encodec token decoder. Am I right that I just need to change the features, hack the way the dataset is made, and change the decoder? It's the synthesising-at-the-output part of training I'm not entirely sure where to find.

This is precisely how I would also approach this issue.
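For the data side, a rough sketch of what a get_ssl replacement for get_mel could look like, here using torchaudio's pretrained HuBERT bundle as an example (this is not code from the repo; the function name, the choice of layer and the preprocessing are all up to you):

```python
import torch
import torchaudio

# Pretrained SSL front-end (HuBERT base: 768-dim features at roughly 50 frames/sec).
bundle = torchaudio.pipelines.HUBERT_BASE
ssl_model = bundle.get_model().eval()

@torch.inference_mode()
def get_ssl(filepath: str) -> torch.Tensor:
    """Return frame-level SSL features shaped (feature_dim, n_frames),
    mirroring how get_mel returns (n_mels, n_frames)."""
    wav, sr = torchaudio.load(filepath)
    wav = wav.mean(dim=0, keepdim=True)  # force mono
    if sr != bundle.sample_rate:
        wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    feats, _ = ssl_model.extract_features(wav)    # list of per-layer (1, n_frames, 768) tensors
    return feats[-1].squeeze(0).transpose(0, 1)   # (768, n_frames)
```

You would then also change the feature-count value linked from matcha.yaml above to match the new representation (768 here instead of 80).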

(If I search for hifigan or vocoder, I get hits, but in app.py/cli.py.) Should I be looking in meldataset.py or text_mel_datamodule.py? Do I hack this function to allow for Encodec tokens? Bit shaky on this part: https://github.com/shivammehta25/Matcha-TTS/blob/main/matcha/models/matcha_tts.py#L74 There's the hifigan folder, and models, but I'm expecting something at the end of the training script where I can hack/swap out the decoder from mels to audio and instead decode Encodec tokens. Am I looking in the right places?

You do not need to hack this function. Matcha-TTS, being an acoustic model, only generates an acoustic representation (an 80-dim mel-spectrogram for the OG model); for your case it would be other representations like HuBERT/Encodec tokens etc. The only part you will need to hack is wherever a vocoder is used: notice it takes output['mel'] as an input. Just replace the vocoder with something that can transform your intermediate acoustic vector representations to a waveform (fine-tune a vocoder for those, like the papers mentioned above, or use the decoder of Encodec if you are using Encodec vectors).
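In other words, at synthesis time the change is only in the very last step. A hypothetical sketch (decode_to_waveform is a placeholder for whatever decoder matches your chosen representation, e.g. a fine-tuned NSF model or Encodec's decoder; the synthesise call is abbreviated):

```python
# Hypothetical sketch, not actual repo code.
output = model.synthesise(x, x_lengths, n_timesteps=10, temperature=0.667)

# Original pipeline: output["mel"] holds an 80 x T mel-spectrogram that is fed to HiFi-GAN:
#     wav = vocoder(output["mel"])
# Swapped pipeline: output["mel"] now holds your HuBERT/Encodec-style representation,
# so feed it to a decoder trained on that representation instead.
wav = decode_to_waveform(output["mel"])  # placeholder: NSF model, Encodec decoder, etc.
```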

artificalaudio commented 8 months ago

Exactly! This is precisely all that needs to be done. The 80-dim mel-spectrogram is just the representation we used; the main reason was to compare with other architectures and to use an off-the-shelf vocoder. But it is not a restriction: one can freely choose any other representation, like Encodec vectors. I know that @p0p4k has tried that.

Right, well, thank you so much. I've made all the tweaks and am actually running the generate-stats part now. This is just to note, whether this is usual or not: I've swapped in a get_ssl function instead of get_mel and run the stats, and it's estimating ~2 hrs for 2 workers on a V100 on Colab, just for calculating the mean and std.

And there's also something bugging me about the get_mel function. I've largely just copied and tweaked it to get embeddings, and I still run them through the normalisation process, like with mels: ssl = normalize(ssl, self.data_parameters["mel_mean"], self.data_parameters["mel_std"])

My question/wonder: the stats script seems to be calling this function, normally get_mel, now get_ssl. At the end of get_ssl I'm normalising with parameters that are already in the yaml file, yet the point of this process is to compute these numbers and copy them into the config file to replace the old ones? Surely they're being run through the normalisation with the old numbers? (If I just run matcha-data-stats -i ljspeech.yaml.) And ljspeech already has this defined:

data_statistics:  # Computed for ljspeech dataset
  mel_mean: -5.536622
  mel_std: 2.116101

I should just let the process finish and see what happens, but just looking at the code, it seems like I could be doing something wrong here to begin with.

You do not need to hack this function. Matcha-TTS, being an acoustic model, only generates an acoustic representation (an 80-dim mel-spectrogram for the OG model); for your case it would be other representations like HuBERT/Encodec tokens etc.

So I think this was the big click for me: in the training pipeline, I was expecting the mels to be converted to wav through a HiFi-GAN, with the loss taken by comparing actual audio with audio. We're not doing that here; we're inputting mels and outputting mels, and comparing these representations, not the actual audio itself. So the training pipe doesn't need a vocoder at all.

p0p4k commented 8 months ago

@artificalaudio we are not even comparing the mels here. That's what makes CFM different: we just compare trajectories from noise to the desired data point (the mel in this case).
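To spell that out, here is a simplified sketch of the OT-CFM objective in the spirit of the paper (not the repo's exact code; vector_field stands in for the decoder network and cond for the conditioning from the text encoder):

```python
import torch

def cfm_loss(vector_field, x1, cond, sigma_min=1e-4):
    """Simplified OT-CFM objective: regress the decoder onto straight-line
    trajectories from Gaussian noise x0 to the data point x1 (the mel)."""
    x0 = torch.randn_like(x1)                      # noise with the same shape as the target (e.g. B x 80 x T)
    t = torch.rand(x1.shape[0], 1, 1)              # one random time in [0, 1] per example
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1   # point on the noise -> data trajectory
    ut = x1 - (1 - sigma_min) * x0                 # target velocity along that trajectory
    vt = vector_field(xt, t, cond)                 # decoder's predicted velocity
    return torch.mean((vt - ut) ** 2)              # compare trajectories, not mels directly
```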

p0p4k commented 8 months ago

For the normalizing stats, you can try ignoring them and setting mean 0 and std 1.

p0p4k commented 8 months ago

One more point: directly comparing wavs is a poor idea because of the sheer number of data points (1 sec = 22k samples). Hence, even if you get a WAV prediction, it will be converted to Mel_pred for the loss calculations.

shivammehta25 commented 8 months ago

Right, well, thank you so much. I've made all the tweaks and am actually running the generate-stats part now. This is just to note, whether this is usual or not: I've swapped in a get_ssl function instead of get_mel and run the stats, and it's estimating ~2 hrs for 2 workers on a V100 on Colab, just for calculating the mean and std.

If it is too much, you can skip the normalisation part; I think it would work fine, if not better. In the matcha-data-stats script, no matter what values you already have, we reset them to 0 and 1

By first making them None in https://github.com/shivammehta25/Matcha-TTS/blob/256adc55d3219053d2d086db3f9bd9a4bde96fb1/matcha/utils/generate_data_statistics.py#L92

and then, when the dataloader receives None, it resets them to 0 mean and 1 std

https://github.com/shivammehta25/Matcha-TTS/blob/256adc55d3219053d2d086db3f9bd9a4bde96fb1/matcha/data/text_mel_datamodule.py#L149-L152
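Conceptually, the fallback amounts to this (a paraphrase of the linked lines, wrapped in a hypothetical helper for the sake of a runnable snippet, not a verbatim copy):

```python
# Paraphrase of the linked dataloader logic: if no statistics are passed in,
# fall back to an identity normalisation (mean 0, std 1).
def resolve_data_parameters(data_parameters):
    if data_parameters is not None:
        return data_parameters
    return {"mel_mean": 0, "mel_std": 1}

print(resolve_data_parameters(None))  # {'mel_mean': 0, 'mel_std': 1}
```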

I should just let the process finish and see what happens, but just looking at the code, it seems like I could be doing something wrong here to begin with.

So that is already taken care of, you should be fine.

So I think this was the big click for me: in the training pipeline, I was expecting the mels to be converted to wav through a HiFi-GAN, with the loss taken by comparing actual audio with audio. We're not doing that here; we're inputting mels and outputting mels, and comparing these representations, not the actual audio itself. So the training pipe doesn't need a vocoder at all.

Precisely what @p0p4k said: a waveform surely has more information, but it also has a lot of data points to work with. The GPU requirements for that would be humongous. So we prefer breaking the problem into two parts: first, generate mel-spectrograms or any other intermediate audio representation; second, a vocoder takes this intermediate audio representation to a waveform.

artificalaudio commented 8 months ago

@artificalaudio we are not even comparing the mels here. That's what makes CFM different: we just compare trajectories from noise to the desired data point (the mel in this case).

Right, well, this is blowing my mind even more. At the start of the year I made it my mission to understand flow models as the next part of my learning, and this package is perfect for trying to get a feel for what's happening.

I'm still a bit shaky on diffusion, and ML in general. That aside, I've read you have a VAE and a U-Net in Stable Diffusion. How do I think about the conceptual change here with using flows? What I know about flows is that you have an encoder and a decoder, you only need to train the encoder, and you invert it to get the decoder. There's no compression like a VAE, so it's a 1:1 mapping into z.

How exactly this fits in with diffusion/optimal transport and conditional flow matching is still beyond my understanding. I'm wondering, for instance: is the noise 80 wide, to match the mel bins and satisfy the flow's 1:1 mapping needs? You're mapping a flow/path from noise to the intended outcome; you call it trajectories from this noise to the data point. Sorry if it's a silly question, I'm just wondering how to think about the shape of this noise.

artificalaudio commented 8 months ago

Right, well, thank you so much. I've made all the tweaks and am actually running the generate-stats part now. This is just to note, whether this is usual or not: I've swapped in a get_ssl function instead of get_mel and run the stats, and it's estimating ~2 hrs for 2 workers on a V100 on Colab, just for calculating the mean and std.

If it is too much, you can skip the normalisation part; I think it would work fine, if not better. In the matcha-data-stats script, no matter what values you already have, we reset them to 0 and 1

By first making them None in

https://github.com/shivammehta25/Matcha-TTS/blob/256adc55d3219053d2d086db3f9bd9a4bde96fb1/matcha/utils/generate_data_statistics.py#L92

and then, when the dataloader receives None, it resets them to 0 mean and 1 std

https://github.com/shivammehta25/Matcha-TTS/blob/256adc55d3219053d2d086db3f9bd9a4bde96fb1/matcha/data/text_mel_datamodule.py#L149-L152

I should just let the process finish and see what happens, but just looking at the code, it seems like I could be doing something wrong here to begin with.

So that is already taken care of, you should be fine.

So I think this was the big click for me: in the training pipeline, I was expecting the mels to be converted to wav through a HiFi-GAN, with the loss taken by comparing actual audio with audio. We're not doing that here; we're inputting mels and outputting mels, and comparing these representations, not the actual audio itself. So the training pipe doesn't need a vocoder at all.

Precisely what @p0p4k said: a waveform surely has more information, but it also has a lot of data points to work with. The GPU requirements for that would be humongous. So we prefer breaking the problem into two parts: first, generate mel-spectrograms or any other intermediate audio representation; second, a vocoder takes this intermediate audio representation to a waveform.

Ok brilliant, just wanted to double check I'm not wasting time here. I'll let it finish now with an eased mind!

And perfect that makes a lot of sense, audio is huge in terms of samples!

I guess the other lingering question is: when I extract the SSL representations and squeeze, I get a tensor of shape (256, nFrames). Just double-checking I don't need to reshape; your get_mel would return, say, (80, nFrames). Is that the right format, or would I need to do (nFrames, 256)?

Everything else I think I'm golden and ready to rock!

artificalaudio commented 8 months ago

Update:

Totally works. Surprisingly well. Didn't have to train for very long either. I saw the other post with the TensorBoard results and thought I had to train to 1M steps. I trained to 17k; it possibly could have learned sooner with these representations, as I only checked once I'd replaced the head earlier today.

Early experiments show I can control the elongation of phones by manipulating the embeddings manually pre-vocoding, as they represent time slices. The limitation is that fractional control isn't currently possible. The rate parameter might come in quite handy if it could vary over time and be synced to phones.
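For anyone curious, by "manipulating the embeddings" I mean nothing fancier than repeating frames along the time axis before the decode-to-audio step; a toy sketch with made-up indices:

```python
import torch

feats = torch.randn(256, 120)              # (feature_dim, n_frames) from the acoustic model

# Elongate frames 40-60 (say, a vowel) by a factor of 3 before vocoding/decoding.
repeats = torch.ones(feats.shape[1], dtype=torch.long)
repeats[40:60] = 3
stretched = torch.repeat_interleave(feats, repeats, dim=1)
print(stretched.shape)                     # torch.Size([256, 160]); whole-frame steps only, hence no fractional control
```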

I will try training a pure one-feature F0 model now, or on Monday, as I'm getting the Champagne out! The thinking goes: can you have two separate models that you run twice to get predictions, instead of trying to predict both embeddings + F0 in the same model? Although another experiment would be to just cat the F0 onto the embeddings and see what happens, moving from 256 to 257 features. Would that become more difficult to train? It would be interesting to see what happens at 17k steps with that idea, and whether prosody can emerge that way.

For SVS, I assume that to extend this one could try increasing the vocab and training on singing data that emulates the ljspeech format but encodes things differently; phones + duration tokens/identifiers might be one way to go.

Lots of things to explore! One finer detail for anyone reproducing this (before I've even managed to write a paper!): do what Shivam suggested and pre-process the new representations beforehand; it will save ages in training.

shivammehta25 commented 5 months ago

I am happy to hear about the results you've shared with me personally :D Exciting!! I will close this issue for now, as it has been lingering around for a while. Please feel free to reopen if anyone feels more things need to be discussed. Regards, Shivam