Closed AI-Guru closed 1 year ago
Thanks for this!
I have changed a few things:
- The resolution is now set in the `audio_to_image.py` dataset creation script and inferred from the dataset in `train_unconditional.py`. Therefore I have removed the `resolution` parameter entirely from the downstream training scripts. A future enhancement might be to allow the UNET to have a different resolution and resize correspondingly, but then we should also resize on the way out of the UNET (previously there was a resize on the way in, which I have removed). I am not convinced of the use case though.
- `train_vae.py` script
- Changed `pixels_per_second = (mel.get_sample_rate() * mel.y_res / mel.hop_length / mel.x_res)` to `pixels_per_second = (mel.get_sample_rate() * self.unet.sample_size[1] / mel.hop_length / mel.x_res)` in `audiodiffusion.__init__.py`
- Changed `AudioDiffusionPipeline` to accept `unet.sample_size` as an int for backwards compatibility with previously trained models.

Note that with arrays, the convention is height, width so I changed `sample_size`
accordingly. You may find that the `sample_size` stored with your previously trained model has these the wrong way round, but this is easy to change by editing the `unet/config.json` file. However, the resolution is still specified as width,height when you set it, as I think this is more natural. Note that the PIL library's `frombytes` function expects the shape as (width, height), so I had to change this in `mel.py` (in fact I just used the simpler `fromarray` function, which for some reason I hadn't used here). Very confusing!
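
To make the (height, width) convention concrete, here is a minimal sketch of how a `sample_size` read from `unet/config.json` could be normalized. The helper names `normalize_sample_size` and `swap_sample_size` are hypothetical and not part of the repository. The sketch assumes only that `sample_size` is stored either as an int (older square models) or as a two-element list.

```python
import json


def normalize_sample_size(sample_size):
    """Return sample_size as a (height, width) tuple.

    An int is treated as a square image, which is the assumed format
    of previously trained models.
    """
    if isinstance(sample_size, int):
        return (sample_size, sample_size)
    height, width = sample_size
    return (int(height), int(width))


def swap_sample_size(config_text):
    """Swap the two entries of sample_size in a config given as JSON text.

    Hypothetical helper for a config that stores (width, height)
    instead of (height, width). An int sample_size is left unchanged.
    """
    config = json.loads(config_text)
    size = config["sample_size"]
    if not isinstance(size, int):
        config["sample_size"] = [size[1], size[0]]
    return json.dumps(config)
```

For a single model it is probably simpler to edit the two numbers in `unet/config.json` by hand, as suggested above.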
Have a look and let me know if this works for you. It would be great if you could also test the changes. Thanks again!
Actually I changed the pipeline to infer the resolution from the mel object - it doesn't make sense to have inconsistent resolutions here. Instead, `sample_size` is read in the AudioDiffusion convenience wrapper class. Might be nice to store the `hop_length`, `sample_rate` and other parameters used in the mel during training here too.
I have implemented the option to have non-square images.
You can do this: `--resolution 64,128`
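
As an illustration of how a `width,height` value like this could be parsed, here is a small sketch using `argparse`. It is not the code from this PR: `parse_resolution` is a hypothetical helper, and treating a single number as a square resolution is an assumption.

```python
import argparse


def parse_resolution(value):
    """Parse "64,128" into (width, height); a single number means square."""
    parts = [int(part) for part in value.split(",")]
    if len(parts) == 1:
        return (parts[0], parts[0])
    if len(parts) == 2:
        return (parts[0], parts[1])
    raise argparse.ArgumentTypeError(
        f"expected WIDTH or WIDTH,HEIGHT, got {value!r}")


parser = argparse.ArgumentParser()
parser.add_argument("--resolution", type=parse_resolution)

# Parse the example from the comment above.
args = parser.parse_args(["--resolution", "64,128"])
width, height = args.resolution

# Arrays (and sample_size) use the opposite order: (height, width).
array_shape = (height, width)
```

Keeping the command-line order as width,height and converting to (height, width) in one place avoids mixing the two conventions further down.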