For augmentation, could these be implemented as transforms to pass and compose into a Dataset?
@ksanjeevan I think we should stick to augmentations that can be done on the GPU. Pitch shift and time stretch are possible but pretty difficult on GPU.
Another idea: why don't we actually implement STFT and other time-frequency transforms as dataset transform operations, similar to torchvision.transforms?
About STFT in a transform: is it any better? That said, I also don't know that having them as nn.layers would be better; it just feels more natural to me to use them that way.
Forget about it. I just learned the torchvision transforms run on CPU only for now. We don't want that :-D
:) Good to rule out an option with confidence!
Hey, so not to harp on the time stretching again, but if we did a phase vocoder function and had it as an option of the spectrogram layer, it could be applied after the STFT and run on GPU, right? Just trying to think of augmentations we could have that would be useful!
Also @keunwoochoi, when you say "FFT-based fast convolution filtering" do you mean like what you have here?
No, I meant overlap-add filtering in PyTorch. Tbh I don't think it's gonna be widely used; I'd say we can keep it in mind and see.
@ksanjeevan a phase vocoder would be nice, but I don't think it could really be implemented efficiently on GPUs.
To wrap up a bit...
Would this be our first goal then?
:+1:
Remove CQT, though; it's not that simple, and we should make it public very soon (and invite the others).
Ok let me edit a bit, but by CQT there I only meant the CQT filterbanks.
Hey, so I gave it a shot yesterday anyway and came up with a GPU implementation of the phase vocoder. I've done some time benchmarking using `torch.autograd.profiler.profile`, and it runs 5x faster than on CPU (and waaay faster than the librosa implementation). I've also compared it against `torch.stft` to see how much cost it would add, and it is 1.4x slower, which is not ideal, but I didn't know if that's a dealbreaker, so I wanted to ask you guys.
FYI, I've been using something like this to benchmark; hope it's the right way:

```python
import torch

# profile CUDA kernels only when we're not on the CPU
profile = device != 'cpu'
spec_stre = SpectrogramStretch(rate, hop_length, fft_size, device)

with torch.autograd.profiler.profile(use_cuda=profile) as prof:
    for _ in range(1000):
        ret = spec_stre(spectrogram)

ev = prof.total_average()
total_time = ev.cpu_time_total + ev.cuda_time_total
```
I don't know much about the vocoder, but it sounds promising. On comparing it against stft, what does that mean? Why compare a phase vocoder with stft?
Just comparing the speed of the two operations (stft vs. phase_vocoder), I wanted to make sure that doing augmentation of the spectrogram didn't cost way more than computing the actual spectrogram.
We could then stretch in time without changing the pitch:
Got it. Great!