teticio / audio-diffusion

Apply diffusion models using the new Hugging Face diffusers package to synthesize music instead of images.
GNU General Public License v3.0

Dataset constraints #30

Closed deepak-newzera closed 1 year ago

deepak-newzera commented 1 year ago

@teticio Thanks for making this nice music generation model available. It helped me a lot in my project. I played around with the pre-trained models and the results are very sensible. I would like to train my own model using my library of music recordings, but I have a few doubts about that. Please help me clarify these:

How many audio files were used to train the teticio/audio-diffusion-256 model? How long is each audio file? Can the music recordings be in mp3 or wav format?

teticio commented 1 year ago

Glad you liked it. Around 400 files were used. (You can load the dataset into a pandas dataframe and do a "unique" on the filename.) If you count the number of rows (I think there were around 20,000), this gives you the total length: 5 s * 20,000 = 100,000 s, or roughly 28 hours, which works out to about 4 minutes per track on average.
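For illustration, that count can be done along these lines; a minimal sketch, assuming the metadata column written by audio_to_images.py is called audio_file:

```python
from datasets import load_dataset

# Load the spectrogram dataset from the Hugging Face Hub.
ds = load_dataset("teticio/audio-diffusion-256", split="train")
df = ds.to_pandas()  # may be slow, since the image column is included

num_tracks = df["audio_file"].nunique()  # distinct source recordings (assumed column name)
num_slices = len(df)                     # one row per 5-second slice
total_hours = num_slices * 5 / 3600      # total audio length in hours

print(num_tracks, num_slices, round(total_hours, 1))
```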

deepak-newzera commented 1 year ago

I did it the following way; please check it and comment on it:

I have some mp3 music recordings. I made around 5000 clips out of those recordings by splitting each recording into 5-second pieces. Then I used the command python scripts/audio_to_images.py --resolution 256,256 --hop_length 1024 --input_dir Splitted_mp3s --output_dir spectrogram_data-splitted-mp3-256 to get the spectrogram data.

Then I executed the command accelerate launch scripts/train_unet.py --dataset_name spectrogram_data-splitted-mp3-256 --hop_length 1024 --output_dir models/audio-diffusion-splitted-mp3-256/ --train_batch_size 2 --num_epochs 100 --gradient_accumulation_steps 8 --save_images_epochs 100 --save_model_epochs 1 --scheduler ddim --learning_rate 1e-4 --lr_warmup_steps 500 --mixed_precision no to train the model with my dataset. The training is in progress.

Is this the correct way to train the model? Please let me know.

teticio commented 1 year ago

Best not to split the mp3s yourself, as your splits won't be exactly 5 seconds; the audio_to_images script will do the slicing for you if you just provide a folder of regular mp3s. It should still work OK, though. What you have done looks correct otherwise.

deepak-newzera commented 1 year ago

I initially did the training without splitting, but it gave clumsy, noisy outputs. Now I have completed training with splitting as well, yet the outputs are still bad. I am testing the trained model with the snippet below.
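For readability, here is that test snippet with the imports it needs: a sketch assuming the AudioDiffusion class from this repo's audiodiffusion package and an IPython/Jupyter environment providing display and Audio.

```python
from IPython.display import Audio, display
from audiodiffusion import AudioDiffusion

# Load the locally trained model and generate one spectrogram/audio pair.
audio_diffusion = AudioDiffusion(
    '/home/deepak/mansion/AD/audio-diffusion/models/audio-diffusion-splitted-mp3-256')
image, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio()
display(image)
display(Audio(audio, rate=sample_rate))
```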

Please give me some suggestions for getting clean outputs

deepak-newzera commented 1 year ago

Also, is there a way to evaluate this model with some metrics (like checking how close the generated music is to the training data)?

teticio commented 1 year ago

It's a bit hard to say without being able to see your model. You could consider pushing it (with the tensorboard logs, which should be included by default) to the Hugging Face Hub; then I could take a look at it. One thing you can do is use the test_mel.ipynb notebook to load an example from your test dataset (make sure you set the Mel parameters to match those used in generation, i.e., hop_length 1024) and listen to how the recreated mp3 sounds. It is also possible that you don't have enough data, but I can't say for sure, as I didn't try with fewer than 20,000 samples.
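The round-trip check the notebook performs looks roughly like the sketch below. This is only an outline under some assumptions: that the dataset was saved to disk by audio_to_images.py, and that the Mel helper in audiodiffusion.mel takes these constructor arguments and exposes image_to_audio and get_sample_rate; test_mel.ipynb has the exact API.

```python
from datasets import load_from_disk
from IPython.display import Audio
from audiodiffusion.mel import Mel

# These parameters must match the ones used when the spectrograms were created.
mel = Mel(x_res=256, y_res=256, hop_length=1024)

ds = load_from_disk("spectrogram_data-splitted-mp3-256")["train"]
image = ds[0]["image"]               # greyscale mel-spectrogram image
audio = mel.image_to_audio(image)    # invert the spectrogram back to a waveform

# In a notebook cell, this renders an audio player for the reconstructed clip.
Audio(audio, rate=mel.get_sample_rate())
```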

Regarding your second question about metrics: you can run tensorboard --logdir=. and see the loss curves and generated samples per epoch as training progresses. The losses measure how well the model reconstructs an audio sample after noising and denoising; they don't measure the quality of samples generated by denoising pure noise (which is the generative process).

deepak-newzera commented 1 year ago

Yeah, test_mel.ipynb is also not recreating the mp3s accurately. What might be the problem? Also, for your dataset, the iterations per epoch go up to 20000/20000, right?

teticio commented 1 year ago

So I would not recommend hop_length=1024: use the default (leave it blank or put 512). The higher hop_length was for low-resolution cases; I can't remember the details, but you can see my tensorboard here: https://huggingface.co/teticio/audio-diffusion-256/tensorboard. I did 100 epochs. Before you do any training, make sure you can get a decent-quality reconstruction of an audio sample from a mel image. Again, if you push your dataset to HF, I can download it and try it out, but try to solve it yourself first. Good luck and let me know how you get on.

teticio commented 1 year ago

PS: note that the first epochs have very quiet audio samples in the tensorboard, because I was not normalizing them at first.

deepak-newzera commented 1 year ago

That's a really supportive reply. I will keep trying. If possible, please also try it out yourself. This is the link to my data directory containing the mp3 files: https://drive.google.com/file/d/1lRYkvEzfpsiCc5byTBBl9nFbmeNnAnJg/view?usp=share_link

deepak-newzera commented 1 year ago

@teticio Also, please let me know how to generate longer samples from the pre-trained model.

deepak-newzera commented 1 year ago

@teticio I would like to reproduce your model by training on your dataset. Could you please provide it? I could see it at https://huggingface.co/datasets/teticio/audio-diffusion-256/tree/main/data, but it is in parquet format. How can I get mp3 files from it?

deepak-newzera commented 1 year ago

@teticio I pushed my dataset to the HF and it can be found at https://huggingface.co/datasets/deepak-newzera/spectrogram_data_max_music_dataset-1

teticio commented 1 year ago

The dataset looks good. I checked how it sounds with test_mel.ipynb and it sounds OK to me. If you want to use my dataset, you can load it with ds = load_dataset('teticio/audio-diffusion-256'), or yours with ds = load_dataset('deepak-newzera/spectrogram_data_max_music_dataset-1'), but you won't be able to access the original mp3s. Also, training on my dataset is just a question of setting --dataset_name teticio/audio-diffusion-256.
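To see what the parquet-backed dataset actually contains (spectrogram images and metadata rather than the original audio), a quick inspection along these lines should work; the exact column names are an assumption:

```python
from datasets import load_dataset

ds = load_dataset("teticio/audio-diffusion-256", split="train")
print(ds.column_names)  # expected something like ['image', 'audio_file', 'slice']
print(ds[0]["image"])   # a PIL image of one mel spectrogram; no mp3s are stored
```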

deepak-newzera commented 1 year ago

The dataset at deepak-newzera/spectrogram_data_max_music_dataset-1 is a newly created dataset. I have 180 music recordings, and from each recording I took overlapping 8-second clips (0s to 8s, 1s to 9s, and so on; see the sketch below). This way I expanded the dataset to around 15,000 8-second clips and trained the model on it. Now I can hear somewhat better music outputs from this trained model.
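The overlapping slicing described above could be done roughly like this; a sketch only, using pydub as one possible tool, with placeholder directory names:

```python
from pathlib import Path
from pydub import AudioSegment  # needs ffmpeg installed for mp3 support

clip_ms, step_ms = 8_000, 1_000   # 8-second clips, starting every 1 second
out_dir = Path("clips")
out_dir.mkdir(exist_ok=True)

for mp3 in Path("recordings").glob("*.mp3"):
    audio = AudioSegment.from_mp3(str(mp3))
    for i, start in enumerate(range(0, len(audio) - clip_ms + 1, step_ms)):
        clip = audio[start:start + clip_ms]  # pydub slices are in milliseconds
        clip.export(str(out_dir / f"{mp3.stem}_{i:05d}.mp3"), format="mp3")
```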

But I have a doubt: when producing output music (running model inference), the progress bar goes from 0 to 1000 with your model, but with my model it only goes from 0 to 50. What does this signify, and does it affect the quality of the output?

teticio commented 1 year ago

This will be because you trained a DDIM model (with --scheduler ddim). My experience has been that the results are not as good with DDIM. But not to worry: I think it is pretty much equivalent to take a model trained with DDIM and change the scheduler to DDPM, or to pass the inference an eta of 1 and num_inference_steps of 1000.
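In code, those two workarounds might look like the sketch below. This is only a guess at the API: it assumes generate_spectrogram_and_audio accepts steps and eta arguments and that the wrapper exposes the underlying diffusers pipeline as pipe; argument names may differ between versions of the audiodiffusion package.

```python
from diffusers import DDPMScheduler
from audiodiffusion import AudioDiffusion

audio_diffusion = AudioDiffusion("models/audio-diffusion-splitted-mp3-256")

# Option 1: keep the DDIM scheduler but run all 1000 steps with eta=1,
# which makes the sampling behave much like DDPM (assumed argument names).
image, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio(
    steps=1000, eta=1)

# Option 2: swap the scheduler of the underlying pipeline for DDPM
# (assumes the pipeline is exposed as .pipe).
audio_diffusion.pipe.scheduler = DDPMScheduler.from_config(
    audio_diffusion.pipe.scheduler.config)
image, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio()
```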
