teticio / audio-diffusion

Apply diffusion models using the new Hugging Face diffusers package to synthesize music instead of images.
GNU General Public License v3.0
707 stars 69 forks

Enables non-square mel spectrograms #10

Closed AI-Guru closed 1 year ago

AI-Guru commented 2 years ago

I have implemented the option to have non-square images.

You can do this: --resolution 64,128
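As a rough illustration of how such a value could be handled, here is a minimal sketch. The helper names `parse_resolution` and `to_sample_size` are hypothetical and not taken from the repository; the width,height order follows the convention described later in this thread, and the fallback to a square image for a single value is an assumption.

```python
from typing import Tuple


def parse_resolution(value: str) -> Tuple[int, int]:
    """Parse "64" or "64,128" into a (width, height) tuple.

    Hypothetical helper: a single number is assumed to mean a square image.
    """
    parts = [int(p) for p in value.split(",")]
    if len(parts) == 1:
        return parts[0], parts[0]
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"expected 'size' or 'width,height', got {value!r}")


def to_sample_size(width: int, height: int) -> Tuple[int, int]:
    """Convert (width, height) to the array convention (height, width)."""
    return height, width


width, height = parse_resolution("64,128")
sample_size = to_sample_size(width, height)
```

With `--resolution 64,128` this gives a width of 64 and a height of 128, so the array-style size is (128, 64).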

teticio commented 1 year ago

Thanks for this!

I have changed a few things:

Note that with arrays the convention is (height, width), so I changed sample_size accordingly. You may find that the sample_size stored with your previously trained model has these the wrong way round, but this is easy to fix by editing the unet/config.json file.

However, the resolution is still specified as width,height, as I think this is more natural.

Note that the PIL frombytes function expects the size as (width, height), so I had to change this in mel.py (in fact I just used the simpler fromarray function, which for some reason I didn't use here). Very confusing!
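The two conventions can be seen side by side in a small sketch. This is not the code from mel.py; it only assumes NumPy and Pillow are installed and uses an all-zero array with one white row as stand-in data.

```python
import numpy as np
from PIL import Image

width, height = 64, 128

# NumPy shape is (rows, columns), i.e. (height, width).
data = np.zeros((height, width), dtype=np.uint8)
data[0, :] = 255  # paint the top row white

# fromarray reads the shape from the array, so no size argument is needed.
img_a = Image.fromarray(data)

# frombytes needs the size spelled out, and in (width, height) order.
img_b = Image.frombytes("L", (width, height), data.tobytes())

print(data.shape)  # array convention: (height, width)
print(img_a.size)  # PIL convention: (width, height)
print(img_b.size)
```

Both images hold the same pixels. Only the order in which the two dimensions are reported differs between the array shape and the PIL size.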

Have a look and let me know if this works for you. It would be great if you could also test the changes. Thanks again!

teticio commented 1 year ago

Actually, I changed the pipeline to infer the resolution from the mel object, as it doesn't make sense to have inconsistent resolutions here. Instead, sample_size is read in the AudioDiffusion convenience wrapper class. It might be nice to store the hop_length, sample_rate and other parameters used in the mel during training here too.
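One way to picture this is the sketch below. `MelParams` is a hypothetical stand-in, not the repository's mel class, and the field names, default values and the JSON layout are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass
class MelParams:
    """Hypothetical stand-in for the parameters held by the mel object."""
    x_res: int = 64           # image width (assumed: time frames)
    y_res: int = 128          # image height (assumed: mel bins)
    sample_rate: int = 22050  # illustrative value
    hop_length: int = 512     # illustrative value


def sample_size_from_mel(mel: MelParams) -> Tuple[int, int]:
    """Derive the array-style (height, width) size from the mel parameters."""
    return mel.y_res, mel.x_res


mel = MelParams()

# Store the derived size together with the mel parameters in one place,
# so the two cannot drift apart.
config = {"sample_size": list(sample_size_from_mel(mel)), "mel": asdict(mel)}
serialized = json.dumps(config, indent=2)

# Reading the parameters back reproduces the same resolution.
restored = MelParams(**json.loads(serialized)["mel"])
```

Deriving the size from a single source avoids the situation where the stored sample_size and the spectrogram resolution disagree.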