slp-rl / aero

This repo contains the official PyTorch implementation of "Audio Super Resolution in the Spectral Domain" (ICASSP 2023)
MIT License

Training metrics #7

Closed fmac2000 closed 11 months ago

fmac2000 commented 1 year ago

Hi Authors,

I can't find any papers that beat your implementation - what a milestone! I am considering training the model for 16 kHz - 48 kHz and 22 kHz - 48 kHz.

Could I ask what hardware you used and how many steps the provided checkpoints were trained for? Any additional information about the training parameters would be extremely helpful. I can provide the checkpoints to this repo for open-source use once they're trained.

Again, fantastic stuff here! Thank you 👍

m-mandel commented 1 year ago

Hi there,

Thank you for the kind words!

For the less demanding configurations (e.g. 4-8 kHz), we ran the model on a single A5000 (24 GB of memory) with a batch size of around 16. For more demanding configurations (e.g. 12-48 kHz), we ran the model on 2 A5000s in parallel.

For all configurations that used the VCTK dataset, we ran for 125 epochs. You can download an example training log from here, where we trained for 125 epochs with 4012 steps per epoch, altogether around 500,000 steps.
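As a quick sanity check on the numbers above (assuming one optimizer step per batch, so total steps = epochs × steps per epoch):

```python
epochs = 125
steps_per_epoch = 4012

total_steps = epochs * steps_per_epoch
print(total_steps)  # 501500, i.e. roughly the 500,000 steps quoted
```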

The exact training parameters for the 4-16 kHz configuration are detailed in the YAML files in the conf folder.

It would be great if you could share your findings, thank you!

fmac2000 commented 1 year ago

Hi Mandel,

As promised, here is the 16-48 kHz checkpoint (NFFT 512 & HS 64). I'm curious, could this model be used to upsample vocals/music?

Thanks, I've included the training log if you're curious. 2x V100s for 500k steps, around a week of training. Worth the wait! We removed p315 and s5 from the VCTK dataset before preprocessing; apart from that, everything went smoothly!
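For context on the checkpoint name: NFFT 512 and HS 64 read like the STFT window and hop size (the mapping of "HS" to hop size is my assumption). A quick sketch of what those values imply for a 48 kHz signal in the spectral domain:

```python
# What NFFT=512 and HS=64 imply at 48 kHz.
# Treating NFFT as the STFT window length and HS as the hop size is an
# assumption based on the checkpoint description, not the repo's code.
sr = 48_000      # sample rate in Hz
n_fft = 512      # STFT window length (NFFT)
hop = 64         # hop size (HS)

freq_bins = n_fft // 2 + 1   # bins in the one-sided spectrum
freq_res = sr / n_fft        # Hz covered by each frequency bin
frames_per_sec = sr / hop    # spectral frames per second of audio

print(freq_bins, freq_res, frames_per_sec)  # 257 93.75 750.0
```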

m-mandel commented 1 year ago

Thank you! The model can be used to upsample vocals/music. As I demonstrate in the paper, I use the model to upsample the MUSDB dataset. Typically, the speakers removed from the VCTK dataset are p315 and p280. Any reason you chose differently?

According to your config file, you trained with a batch size of 1. Why is this?

fmac2000 commented 1 year ago

Hi Mandel, dang it, I uploaded the version where the batch size was incorrect. We used a batch size of 4 for training.

We chose p315 just because it has the smallest file size in the dataset, and s5 due to its lack of labeling in speaker_info.txt and because its name did not follow the dataset's naming format. Some dubious logic, I know 😝
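Dropping speakers before preprocessing can be done by deleting their per-speaker directories. A minimal sketch, assuming a VCTK-style layout with one directory per speaker under the wav root (the layout and function name are hypothetical; adjust paths to your copy of the dataset):

```python
import shutil
from pathlib import Path

def exclude_speakers(wav_root, speakers=("p315", "s5")):
    """Delete per-speaker directories from a VCTK-style dataset layout.

    Assumes wav_root contains one subdirectory per speaker (e.g. wav_root/p315).
    Returns the list of speakers that were actually found and removed.
    """
    root = Path(wav_root)
    removed = []
    for spk in speakers:
        spk_dir = root / spk
        if spk_dir.is_dir():
            shutil.rmtree(spk_dir)  # remove the speaker's audio tree
            removed.append(spk)
    return removed
```

Run this once on a copy of the dataset before invoking the repo's preprocessing scripts, so the excluded speakers never enter the train/test splits.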

Thanks for the information regarding only removing p315 & p280, I did not know this was common practice!

That's fantastic that this can also apply to MUSDB. I may someday train a universal model using your work. Please do let me know if you have anything in the pipeline so I may train on that in place of aero.

Thanks again Mandel - you are more than welcome to upload this checkpoint for the public

m-mandel commented 11 months ago

Thank you fmac2000! I uploaded the checkpoint to the Google Drive.

Good luck!