Enables non square mel spectrograms

AI-Guru commented 2 years ago

I have implemented the option to have non-square images.

You can do this: --resolution 64,128
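A minimal sketch of how such a value might be parsed, assuming the order is width,height (as stated later in this thread) and that a single number means a square image. The helper name `parse_resolution` is hypothetical and not taken from the repository.

```python
from typing import Tuple


def parse_resolution(value: str) -> Tuple[int, int]:
    """Parse "256" or "64,128" into (width, height). Hypothetical helper."""
    parts = [int(part) for part in value.split(",")]
    if len(parts) == 1:
        # A single number is treated as a square resolution (assumption).
        return parts[0], parts[0]
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"expected 'size' or 'width,height', got {value!r}")


width, height = parse_resolution("64,128")
```

Here `width` is 64 and `height` is 128; the corresponding array shape would be (height, width).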

teticio commented 1 year ago

Thanks for this!

I have changed a few things:

I think it makes more sense to specify the image size in the audio_to_image.py dataset creation script and to infer this resolution in train_unconditional.py from the dataset. Therefore I have removed the resolution parameter entirely from the downstream training scripts. A future enhancement might be to allow for the UNET to have a different resolution and resize correspondingly, but we should also resize on the way out of the UNET (previously there was a resize on the way in, which I have removed). I am not convinced of the use case though.
I have made the corresponding changes to fix the ImageLogger in the train_vae.py script

I also changed

        pixels_per_second = (mel.get_sample_rate() *
                             mel.y_res / mel.hop_length /
                             mel.x_res)

to

        pixels_per_second = (mel.get_sample_rate() *
                             self.unet.sample_size[1] / mel.hop_length /
                             mel.x_res)

in audiodiffusion/__init__.py.
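To make the arithmetic concrete, here is a standalone sketch of the same expression. The function and the numbers (22050 Hz, hop length 512, width 256) are illustrative assumptions, and `sample_width` stands in for `self.unet.sample_size[1]`, taken to be the width under the (height, width) convention described in this thread.

```python
def pixels_per_second(sample_rate: int, hop_length: int,
                      x_res: int, sample_width: int) -> float:
    """Spectrogram columns per second of audio, as seen by the model."""
    # Each spectrogram column advances hop_length audio samples, so the mel
    # image has sample_rate / hop_length columns per second; the factor
    # sample_width / x_res rescales that to the model's width.
    return sample_rate * sample_width / hop_length / x_res


# Illustrative values only. When sample_width equals x_res, the result
# reduces to sample_rate / hop_length.
pps = pixels_per_second(sample_rate=22050, hop_length=512,
                        x_res=256, sample_width=256)
```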

Lastly, I added some code in AudioDiffusionPipeline to accept unet.sample_size as an int for backwards compatibility with previously trained models.
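The idea can be sketched roughly as follows. This is not the code from the pipeline; `as_height_width` is a hypothetical helper that shows one way to normalise either form into a (height, width) pair.

```python
from typing import Sequence, Tuple, Union


def as_height_width(sample_size: Union[int, Sequence[int]]) -> Tuple[int, int]:
    """Return (height, width) for an int or a two-element sequence."""
    if isinstance(sample_size, int):
        # Older models stored one int, meaning a square image.
        return sample_size, sample_size
    height, width = sample_size  # raises ValueError unless exactly two items
    return int(height), int(width)


legacy = as_height_width(256)
current = as_height_width([128, 64])
```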

Note that with arrays, the convention is (height, width), so I changed sample_size accordingly. You may find that the sample_size stored with your previously trained model has these the wrong way round, but this is easy to change by editing the unet/config.json file. However, when setting the resolution, it is still specified as width,height, as I think this is more natural. Note that the PIL library's frombytes function expects the size as (width, height), so I had to change this in mel.py (in fact, I just switched to the simpler fromarray function, which for some reason I hadn't used here before). Very confusing!
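The two conventions can be seen side by side in a small sketch, assuming NumPy and Pillow are installed. The array and its dimensions are made up for illustration and are not taken from mel.py.

```python
import numpy as np
from PIL import Image

height, width = 128, 64

# NumPy arrays are indexed (row, column), so the shape is (height, width).
spectrogram = np.zeros((height, width), dtype=np.uint8)
spectrogram[0, width - 1] = 255  # mark the top-right pixel

# fromarray infers the image size from the array shape.
from_array = Image.fromarray(spectrogram)

# frombytes needs the size spelled out, and PIL wants (width, height).
from_bytes = Image.frombytes("L", (width, height), spectrogram.tobytes())

# Image.size is also (width, height); getpixel takes (x, y).
same_size = from_array.size == from_bytes.size == (width, height)
top_right = from_array.getpixel((width - 1, 0))
```

Passing (height, width) to frombytes instead would only go unnoticed for square images, which may be why the issue surfaced with this change.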

Have a look and let me know if this works for you. It would be great if you could also test the changes. Thanks again!

teticio commented 1 year ago

Actually I changed the pipeline to infer the resolution from the mel object - it doesn't make sense to have inconsistent resolutions here. Instead, sample_size is read in the AudioDiffusion convenience wrapper class. Might be nice to store the hop_length, sample_rate and other parameters used in the mel during training here too.

teticio / audio-diffusion

Enables non square mel spectrograms #10