keunwoochoi / torchaudio-contrib

A test bed for updates and new features | pytorch/audio

Which layers/operations to implement? #3

Closed keunwoochoi closed 5 years ago

keunwoochoi commented 5 years ago
ksanjeevan commented 5 years ago

For augmentation, could these be implemented as transforms that are passed and composed into a Dataset?

faroit commented 5 years ago

@ksanjeevan I think we should stick to augmentations that can be done on the GPU. Pitch shifting and time stretching are possible, but pretty difficult to do on the GPU.

faroit commented 5 years ago

another idea:

why don't we actually implement the STFT and other TF transforms as dataset transform operations, similar to torchvision.transforms?
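As a rough sketch of this idea (the `STFT` class name and its parameters here are hypothetical; only `torch.stft` itself is real, used with its modern `return_complex` signature):

```python
import torch

class STFT:
    """Hypothetical torchvision-style transform: waveform -> complex spectrogram."""
    def __init__(self, n_fft=2048, hop_length=512):
        self.n_fft = n_fft
        self.hop_length = hop_length

    def __call__(self, waveform):
        # torch.stft returns a complex tensor of shape (freq, frames) for 1-D input
        return torch.stft(waveform,
                          n_fft=self.n_fft,
                          hop_length=self.hop_length,
                          window=torch.hann_window(self.n_fft),
                          return_complex=True)

# composes like any torchvision transform
stft = STFT()
spec = stft(torch.randn(16000))
```

Such an object could be dropped straight into a `transforms.Compose`-style pipeline and passed to a Dataset.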

keunwoochoi commented 5 years ago

STFT in transform

Is it any better? That said, I don't know whether having them as nn layers would actually be better either; it just feels more natural to me to use them that way.
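For context, the nn-layer version could be sketched roughly like this (the `Spectrogram` class here is a hypothetical illustration, not the project's actual layer):

```python
import torch
import torch.nn as nn

class Spectrogram(nn.Module):
    """Hypothetical sketch: STFT wrapped as an nn.Module, so it composes
    with other layers and moves to the GPU via .to(device) / .cuda()."""
    def __init__(self, n_fft=1024, hop_length=256):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        # buffers travel with the module across devices
        self.register_buffer('window', torch.hann_window(n_fft))

    def forward(self, waveform):  # waveform: (batch, samples)
        spec = torch.stft(waveform, self.n_fft, self.hop_length,
                          window=self.window, return_complex=True)
        return spec.abs() ** 2  # power spectrogram: (batch, freq, frames)

layer = Spectrogram()          # drops into nn.Sequential like any layer
out = layer(torch.randn(2, 8000))
```

The appeal is that the spectrogram then lives in the model itself, so it runs wherever the model runs.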

faroit commented 5 years ago

Is it any better?

forget about it. I just learned that the transforms run on CPU only for now. We don't want that :-D

keunwoochoi commented 5 years ago

:) Good to rule out an option with confidence!

ksanjeevan commented 5 years ago

Hey, not to harp on time stretching again, but if we implemented a phase vocoder function and made it an option of the spectrogram layer, it could be applied after the STFT and run on the GPU, right? Just trying to think of augmentations that could be useful!

Also @keunwoochoi, when you say "FFT-based fast convolution filtering" do you mean like what you have here?

keunwoochoi commented 5 years ago

No, I meant overlap-add filtering in PyTorch. Tbh I don't think it's going to be widely used; I'd say we keep it in mind and see.
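For reference, overlap-add FFT filtering could be sketched as below (a minimal, hypothetical 1-D version; the function name and block size are illustrative, and it uses only the standard `torch.fft` API):

```python
import torch

def fft_ola_filter(x, h, block=1024):
    """Hypothetical overlap-add sketch: filter 1-D signal x with FIR kernel h.
    Equivalent to full linear convolution, computed block by block in the
    frequency domain."""
    m = h.numel()
    n_fft = block + m - 1                    # each block's linear conv length
    H = torch.fft.rfft(h, n_fft)             # kernel spectrum, computed once
    y = torch.zeros(x.numel() + m - 1)
    for start in range(0, x.numel(), block):
        seg = x[start:start + block]
        Y = torch.fft.rfft(seg, n_fft) * H   # circular conv of zero-padded block
        out = torch.fft.irfft(Y, n_fft)
        n = min(n_fft, y.numel() - start)
        y[start:start + n] += out[:n]        # overlap-add the block's tail
    return y
```

On the GPU this replaces one long convolution with batched FFTs, which is the usual motivation for the overlap-add trick.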

faroit commented 5 years ago

@ksanjeevan a phase vocoder would be nice, but I don't think it could really be implemented efficiently on GPUs.

keunwoochoi commented 5 years ago

To wrap up a bit..


Would this be our first goal then?

faroit commented 5 years ago

:+1:

Remove CQT, though, since it's not that simple, and we should make this public very soon (and invite the others).

keunwoochoi commented 5 years ago

Ok, let me edit it a bit; by CQT there I only meant the CQT filterbanks.

ksanjeevan commented 5 years ago

@ksanjeevan a phase vocoder would be nice, but I don't think it could really be implemented efficiently on GPUs.

Hey, so I gave it a shot yesterday anyway and came up with a GPU implementation of the phase vocoder. I've done some time benchmarking using torch.autograd.profiler.profile, and it runs 5x faster than on CPU (and way faster than the librosa implementation).

I've also compared it against torch.stft to see how much cost it would add: it's 1.4x slower, which is not ideal, but I didn't know whether that's a dealbreaker, so I wanted to ask you guys.

FYI, I've been using something like this to benchmark; hope it's the right way:

```python
# only collect CUDA timings when we're actually on the GPU
profile_cuda = (device != 'cpu')
spec_stre = SpectrogramStretch(rate, hop_length, fft_size, device)
with torch.autograd.profiler.profile(use_cuda=profile_cuda) as prof:
    for _ in range(1000):
        ret = spec_stre(spectrogram)
ev = prof.total_average()
total_time = ev.cpu_time_total + ev.cuda_time_total
```
keunwoochoi commented 5 years ago

I don't know much about the vocoder, but it sounds promising. About comparing it against the STFT: what does that mean? Why compare a phase vocoder with an STFT?

ksanjeevan commented 5 years ago

Just comparing the speed of the two operations (stft vs. phase_vocoder): I wanted to make sure that augmenting the spectrogram didn't cost much more than computing the actual spectrogram itself.

We could then stretch in time without changing pitch: phase_vocoder
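The core of a librosa-style phase vocoder can be sketched in plain torch roughly as follows (a hypothetical minimal port for illustration, not the implementation discussed above; the magnitude is linearly interpolated between frames while the phase is accumulated from wrapped inter-frame phase differences):

```python
import math
import torch

def phase_vocoder(spec, rate, hop_length):
    """Hypothetical sketch of a phase vocoder.
    spec: complex STFT of shape (freq, frames); rate > 1 speeds playback up."""
    n_freq, n_frames = spec.shape
    # expected per-frame phase advance for each rfft bin
    phi_advance = torch.linspace(0, math.pi * hop_length, n_freq)
    time_steps = torch.arange(0, n_frames, rate)
    # pad one frame so interpolation at the last step is safe
    spec = torch.cat([spec, torch.zeros(n_freq, 1, dtype=spec.dtype)], dim=1)
    phase_acc = spec[:, 0].angle()
    out = torch.empty(n_freq, len(time_steps), dtype=spec.dtype)
    for i, t in enumerate(time_steps.tolist()):
        j = int(t)
        frac = t - j
        a, b = spec[:, j], spec[:, j + 1]
        # interpolate magnitude, keep the accumulated phase
        mag = (1 - frac) * a.abs() + frac * b.abs()
        out[:, i] = torch.polar(mag, phase_acc)
        # phase difference, with the expected advance removed and wrapped to [-pi, pi]
        dphi = b.angle() - a.angle() - phi_advance
        dphi -= 2 * math.pi * torch.round(dphi / (2 * math.pi))
        phase_acc = phase_acc + phi_advance + dphi
    return out
```

The per-step loop is over output frames only, and every line inside it is a vectorized tensor op over the frequency axis, which is what makes a GPU version plausible.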

keunwoochoi commented 5 years ago

Got it. Great!