keunwoochoi / torchaudio-contrib

A test bed for updates and new features | pytorch/audio

Signal resampling, sinc interpolation #40

Closed ksanjeevan closed 5 years ago

ksanjeevan commented 5 years ago

Using band-limited sinc interpolation for audio signal resampling works well. Would it be worth maybe implementing mode='sinc' for torch.nn.functional.interpolate? Are 'nearest', 'linear' good enough?

keunwoochoi commented 5 years ago

nearest wouldn't give us what we want. linear is okay but not ideal; sinc would work better. I don't recall the details, but it sounds promising/useful.
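A minimal sketch of the band-limited approach being discussed: integer-factor downsampling with a Hann-windowed sinc low-pass applied as a strided `conv1d`. The function name, kernel size, and cutoff choice are my own illustration, not an existing torchaudio API.

```python
import torch
import torch.nn.functional as F

def sinc_downsample(x, factor, kernel_size=65):
    """Downsample a (batch, time) signal by an integer factor using a
    Hann-windowed sinc low-pass (cutoff at the new Nyquist).
    Illustrative sketch only, not a production resampler."""
    half = kernel_size // 2
    t = torch.arange(-half, half + 1, dtype=x.dtype)
    kernel = torch.sinc(t / factor) * torch.hann_window(
        kernel_size, periodic=False, dtype=x.dtype)
    kernel = kernel / kernel.sum()          # unit DC gain
    kernel = kernel.view(1, 1, -1)          # (out_ch, in_ch, taps)
    y = F.conv1d(x.unsqueeze(1), kernel, stride=factor, padding=half)
    return y.squeeze(1)

x = torch.randn(1, 16000)                   # one second at 16 kHz
y = sinc_downsample(x, factor=2)            # roughly an 8 kHz signal
print(y.shape)                              # torch.Size([1, 8000])
```

A real resampler would also handle rational (non-integer) ratios, e.g. via polyphase decomposition, which is where the "tough business" mentioned below comes in.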

faroit commented 5 years ago

a sinc kernel is indeed useful and is the basis of a good audio resampler. But be aware that quality audio resamplers are a tough business; if we do this, we should do it right.

I would personally vote for putting this aside for now so that resampling wouldn't block #29 and #31

keunwoochoi commented 5 years ago

resampling wouldn't block #29 and #31

would it though? i think we can develop it independently and see how we could merge it.

be aware of that quality audio resamplers are a tough business

yeah, that's true, but we probably wouldn't target SOTA quality. I don't really know how to measure resampling quality (SNR?), but there might be a sweet spot where the quality is "good enough" for training while the operation is (much) faster.
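One common proxy for the quality question raised here is the SNR of a processed signal against a known-good reference at the same length and rate (for instance, after an up-then-down round trip compared to the original). A hedged sketch; the helper name is hypothetical:

```python
import torch

def resample_snr_db(reference, test):
    """Signal-to-noise ratio in dB of a processed signal against a
    same-length reference; one simple proxy for resampler quality.
    (Hypothetical helper for illustration, not an existing API.)"""
    noise = reference - test
    return 10.0 * torch.log10(reference.pow(2).sum() / noise.pow(2).sum())
```

Higher is better; a high-quality resampler round trip would typically score far above simple linear interpolation on band-limited test signals.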

ATriantafyllopoulos commented 5 years ago

I have to mention that on-the-fly resampling has proven a tough nut for me to crack. I have tried using resampy and scipy for resampling, but both turned out to be huge bottlenecks for training. sox and ffmpeg were also tremendously slow.

If anyone is interested in this, I could try to provide a more comprehensive benchmark, but the qualitative overview is the following:

a) Consistent 90%+ GPU utilization when resampling offline, across a number of architectures.

b) 0-20% GPU utilization with on-the-fly resampling. Training time increased tremendously.

I was lucky enough to work with relatively small data sets that had a uniform sampling rate, so I could afford to do the resampling once offline and store the temp files. Still, I think resampling should be a core component of a deep learning audio package.
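The resample-once-offline workflow described above can be sketched as a small cache layer around the dataset. Here I'm assuming `.npy` waveforms on disk and SciPy's `resample_poly` purely for illustration; the function and cache layout are not from any existing package:

```python
import os
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def cached_resample(path, sr_in, sr_out, cache_dir="resampled_cache"):
    """Resample a clip once and reuse the cached copy on later epochs.
    Illustrative sketch: assumes the file at 'path' is a .npy waveform."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, os.path.basename(path))
    if os.path.exists(cache_path):
        return np.load(cache_path)        # cache hit: skip resampling
    audio = np.load(path)
    g = gcd(sr_out, sr_in)                # reduce to polyphase up/down factors
    audio = resample_poly(audio, sr_out // g, sr_in // g)
    np.save(cache_path, audio)
    return audio
```

After the first epoch, every worker only pays the cost of a disk read, which is roughly the scenario (a) above with high GPU utilization.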

keunwoochoi commented 5 years ago

Hey, yes, agreed with the motivation.

b) 0-20% gpu

Could you elaborate? Do you mean you’ve tried to resample on GPU?

ATriantafyllopoulos commented 5 years ago

No, the resampling was done on the CPU. This drop in GPU utilization can (probably) be attributed to inefficient data loading.

a) In the case of offline resampling, data loading consists of loading a wav file using audiofile and applying some spectrogram transforms. This is done efficiently on the CPU, and the GPU never waits for new data, so its utilization stays at high levels (it is only used for network training).

b) In the case of online resampling, there is an extra resampling step between audio loading and spectrogram computation. Now the GPU has to wait for the CPU to complete this step, and it takes so long that utilization drops to 0-20% (here the GPU is again only used for network training, but it has to wait until the CPU has prepared the data).

f0k commented 5 years ago

In case of online resampling, there is an extra step of resampling between audio loading and spectrogram computation.

Do you use mel spectrograms? Did you, by any chance, try changing the STFT window and hop size and the mel filterbank dynamically depending on the sample rate, instead of explicitly resampling in the time domain?
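The idea in this comment can be sketched as scaling the analysis parameters with each file's sample rate so that every frame covers the same time span, with no time-domain resampling at all. The reference values below (16 kHz, 512-sample window, 160-sample hop, i.e. 32 ms / 10 ms) are illustrative assumptions:

```python
def stft_params_for(sr, ref_sr=16000, ref_n_fft=512, ref_hop=160):
    """Scale the STFT window and hop with the file's sample rate so
    frames span the same duration regardless of rate. Reference
    parameters are assumed values for illustration."""
    scale = sr / ref_sr
    return int(round(ref_n_fft * scale)), int(round(ref_hop * scale))

# a 44.1 kHz file gets a ~32 ms window without any time-domain resampling
print(stft_params_for(44100))   # (1411, 441)
```

The mel filterbank would then be built with each file's own sample rate so the mel bin center frequencies stay fixed in Hz, giving spectrograms of the same shape and meaning across mixed-rate datasets.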

cpuhrsch commented 5 years ago

cc @jamarshon who is currently working on a resampler based on kaldi.

Jason, I suggest you open an issue and we move this over into pytorch/audio and then add support for this to torchaudio.