ArthurZucker opened this issue 1 year ago
Hi @ArthurZucker, I want to try this. Please assign it to me. Thanks.
Sure! Feel free to open a PR and ping me
@Swastyy @ArthurZucker Let me know if you are looking for any support. I would also like to help with this if possible. Thanks!
Seems like he did not link a PR. Feel free to sync and ping me for any help! Even a draft is good!
Hi @ArthurZucker, can you let me know the overall changes that have to be made? I see the EnCodec model is already implemented in transformers, so to integrate it with Trainer, what are the additional requirements?
The idea is mostly to integrate the loss computation for the VQ-VAE! Trainer might not work, as the model does not use attention, but the target should be to have the same loss as the original model!
Thanks Arthur.
I read through the paper (https://arxiv.org/pdf/2210.13438.pdf) and the existing code, and here is my impression of the work breakdown. Does any of this make sense, or am I going in a totally wrong direction?
The loss function detailed in the paper (equation 4) is a combination of (1) the reconstruction loss (over frequency and time domains), (2) the discriminative loss (requires the discriminator), and (3) the VQ commitment loss (the quantizer loss).
(1) The reconstruction loss is computed using the original audio as the label: we apply time- and frequency-domain transformations (mel spectrograms at several scales) to the input/output and compute the L1/L2 distances between them.
(2) The discriminative loss requires a discriminator. As far as I can tell, this hasn't been ported/implemented yet, and we'll need to do it if we want to compute the loss as stated in the paper (msstftd.py from facebookresearch). We'll need to hook up the discriminator in the training code somewhere (is there any pretrained discriminator available?). Also, it's unclear to me whether we can train the discriminator and the model/generator at the same time (I'm assuming not, and we'll need to train one at a time).
(3) The VQ commitment loss comes from the quantizer. It looks like it sums the losses across all the residual quantization steps. Are we supposed to train the quantizer at the same time as the encoder/decoder, or should we train them at different times?
In addition to the general loss function, the paper introduces a balancer (balancer.py) that weights the reconstruction, discriminative, and commitment losses differently. We would also need to port the balancer code if we want that behavior.
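For concreteness, here is a rough PyTorch sketch of how those three terms could be computed. This is only my reading of equation 4: the mel-spectrogram scales, the hinge and feature-matching formulations, and the normalizations are assumptions based on the paper and would need to be checked against the original codebase.

```python
# Rough sketch of the loss terms described above (reconstruction, adversarial /
# feature matching, VQ commitment). Scales, normalizations and the hinge
# formulation are assumptions taken from my reading of the paper.
import torch
import torch.nn.functional as F
import torchaudio


def reconstruction_loss(x, x_hat, sample_rate=24_000):
    """(1) Time-domain L1 plus multi-scale mel-spectrogram L1 + L2."""
    l_t = F.l1_loss(x_hat, x)
    l_f = 0.0
    for i in range(5, 12):  # window sizes 2**5 .. 2**11, as described in the paper
        n_fft = 2 ** i
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=n_fft, win_length=n_fft,
            hop_length=n_fft // 4, n_mels=64,
        ).to(x.device)  # built on the fly here only for brevity
        s_x, s_x_hat = mel(x), mel(x_hat)
        l_f = l_f + F.l1_loss(s_x_hat, s_x) + F.mse_loss(s_x_hat, s_x)
    return l_t, l_f / 7  # average over the 7 scales


def adversarial_losses(logits_fake, feat_real, feat_fake):
    """(2) Hinge adversarial loss and feature-matching loss.

    `logits_fake` is a list of logits (one per sub-discriminator); `feat_*`
    are lists of lists of intermediate feature maps from each sub-discriminator.
    """
    l_g = sum(torch.relu(1.0 - lf).mean() for lf in logits_fake) / len(logits_fake)
    l_feat = 0.0
    for fr, ff in zip(feat_real, feat_fake):
        for a, b in zip(fr, ff):
            l_feat = l_feat + F.l1_loss(b, a) / a.abs().mean().clamp(min=1e-8)
    l_feat = l_feat / (len(feat_real) * len(feat_real[0]))
    return l_g, l_feat


def commitment_loss(residuals, quantized):
    """(3) VQ commitment loss, summed over the residual quantization steps.

    No gradient should flow into the codebooks, hence the detach().
    """
    return sum(F.mse_loss(r, q.detach()) for r, q in zip(residuals, quantized))
```

As I understand it, the balancer then replaces a fixed weighting of the terms that backpropagate through the reconstructed audio by rescaling each loss's gradient with respect to the model output; I haven't sketched that part here.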
Makes sense to me! I think you can focus simply on returning the loss for the modules. The order of training is not that important (when implementing the module-wise losses) since you don't need to train (just compare output losses) until you have everything!
For the discriminator, you can leave it in the training file! It should be pretty small, and that's usually how we do things 🤗! The order of training and what is frozen when should be in the paper/original codebase; I haven't looked it up!
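To make the "discriminator lives in the training file" idea concrete, here is a minimal, generic alternating update. The schedule (when each part is updated, what is frozen), the `(logits, feature_maps)` discriminator interface, and the `.audio_values` attribute on the model output are assumptions; the real schedule and weights should come from the paper / original codebase.

```python
# Sketch of an alternating discriminator / generator update, with the
# discriminator kept in the training script. The schedule is an assumption
# and should be checked against the paper / original codebase. The
# discriminator is assumed to return (logits, feature_maps) per sub-discriminator.
import torch
import torch.nn.functional as F


def training_step(model, discriminator, optim_g, optim_d, x, lambdas):
    # x: real waveform batch of shape (batch, channels, time)

    # 1) Discriminator step: hinge loss on real vs. reconstructed audio.
    with torch.no_grad():
        x_hat = model(x).audio_values      # reconstruction from the generator
    logits_real, _ = discriminator(x)
    logits_fake, _ = discriminator(x_hat)
    loss_d = sum(
        torch.relu(1.0 - lr).mean() + torch.relu(1.0 + lf).mean()
        for lr, lf in zip(logits_real, logits_fake)
    ) / len(logits_real)
    optim_d.zero_grad()
    loss_d.backward()
    optim_d.step()

    # 2) Generator step: adversarial term plus the reconstruction/commitment
    #    terms (see the loss sketch earlier in the thread for the full set).
    x_hat = model(x).audio_values
    logits_fake, _ = discriminator(x_hat)
    l_g = sum(torch.relu(1.0 - lf).mean() for lf in logits_fake) / len(logits_fake)
    l_t = F.l1_loss(x_hat, x)              # time-domain reconstruction only, for brevity
    loss_g = lambdas["g"] * l_g + lambdas["t"] * l_t
    optim_g.zero_grad()
    loss_g.backward()
    optim_g.step()
    return loss_d.item(), loss_g.item()
```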
I'll attempt to code up (3) VQ commitment loss first then. I'll reach out if I get stuck or run into any issues. Thanks!
I added an initial draft here: https://github.com/huggingface/transformers/commit/4f697be0b62c4f3b0401ccbd00d1d46aac81906d
Can you take a look and let me know what you think? Thanks
FYI I will be traveling in July, so won't be as available that month.
Sure, would you mind opening a proper PR? It would be easier to test locally and to visualize and follow the changes!
So cool! I reproduced the code and released it: https://github.com/NoFish-528/encodec-pytorch. If you have any questions, we can solve them together. @hackyon @ArthurZucker In this work I haven't added the balancer; it was difficult for me... Hope you are successful!
Hey @ArthurZucker I'm familiar with EnCodec so can take a stab at this if it still seems like a good feature to implement.
Up to you. I think @ylacombe might have some code, as they trained the model for Parler, no? 🤗
Hey @jonflynng and @ArthurZucker, I haven't actually trained an EnCodec model yet. Feel free to take a stab at it @jonflynng!
@hackyon had a PR opened for this (#24593) but closed it since. Feel free to start from scratch or to take inspiration from it. Either way, it'd be great if you could take a look at it, since @sanchit-gandhi made a few relevant observations!
@ylacombe yep, I'll start taking a look this week
@ArthurZucker @ylacombe
I've got a rough first draft and will take a more detailed look over the next week. I was wondering what the ideal API design is for a model like this with multiple moving parts (and models). Currently I've put a few of the loss calculations within the training loop itself, but I imagine that's not what we want? https://github.com/huggingface/transformers/pull/33956/files#diff-5d5859fd55a63733e8f3a00ea4be76b7c770bfa5e254d13408ab1185fc565503R186-R318
Hey! I think the best API is to compute the loss outside the modeling. The loss is specific to the model, but maybe more specific to training. IMO we should have a separate file, like a loss_encodec.py in the model folder for now, properly separated from the modeling!
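Just to illustrate the suggestion, such a file could collect pure loss functions that both the modeling code and a training script can import (the file name follows the suggestion above; the function names are hypothetical):

```python
# src/transformers/models/encodec/loss_encodec.py  (hypothetical layout)
# Pure functions only: no modules, no optimizers, so they can be imported from
# both the modeling file and a training script.
import torch
import torch.nn.functional as F


def time_domain_loss(x, x_hat):
    return F.l1_loss(x_hat, x)


def commitment_loss(residuals, quantized):
    return sum(F.mse_loss(r, q.detach()) for r, q in zip(residuals, quantized))


def hinge_generator_loss(logits_fake):
    return sum(torch.relu(1.0 - lf).mean() for lf in logits_fake) / len(logits_fake)
```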
Ok, I've created that file. In terms of testing, are a few integration tests usually enough, or are small-scale experiments wanted before putting this into production? I've run a couple of fine-tuning runs, but because EnCodec is already well pre-trained, I'm not getting any performance increases when fine-tuning on some audio and speech datasets. The generator loss is stagnant throughout.
We recently added the entire API for this (individual loss files) in #34191! Small-scale experiments (that take at most ~4 minutes to run) are nice once we know for sure that it "works" and trains properly!
Ah nice, ok. What are the benefits of this design, by the way? In terms of EnCodec, I can make the commitment and reconstruction losses, which are computed inside the EnCodecModel class, use this new design. But the other losses (feature matching, discriminator, adversarial) are computed outside of the model in the training loop, so would they stay as they are in that separate file and just be imported when needed?
Yep, that's how it would work best IMO!
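Concretely, the split discussed here might then look roughly like this; the `return_loss` flag, the `loss` attribute on the output, the `lambda_g` default, and the `loss_encodec` module path are all hypothetical placeholders for whatever the PR ends up exposing:

```python
# Hypothetical wiring of the split discussed above: the model returns its
# internal (reconstruction + commitment) loss, while the GAN-side losses are
# imported from the separate loss file into the training loop.
from transformers.models.encodec.loss_encodec import hinge_generator_loss  # hypothetical module path


def generator_loss(model, discriminator, input_values, lambda_g=1.0):
    # `return_loss` and `outputs.loss` are placeholders for the model-internal losses.
    outputs = model(input_values, return_loss=True)
    logits_fake, _ = discriminator(outputs.audio_values)
    # Weight for the adversarial term to be taken from the paper / balancer.
    return outputs.loss + lambda_g * hinge_generator_loss(logits_fake)
```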
Ok, I've done that. I'll have another look at the small-scale experiment; I was running it in a Colab instance. What exactly would you like to see? Just some wandb logs of the experiment and maybe the code?
Feature request
Would be cool to add training support for the EnCodec model. Not entirely sure if we can easily make it compatible with Trainer, so this can be a good second issue I think.
Motivation
…
Your contribution
…