ksanjeevan opened this issue 5 years ago
All approaches look interesting to me.
> But if we are committed to having the Spectrogram as a layer in the model, then augmentations will come after...
+1
> Do we want the masks for every spectrogram in the batch to be independent?
If it doesn't add too much computational complexity (which I doubt it does), then yes, that seems to make more sense.
> Maybe we could do a simpler approach like in the paper before and scale them linearly (in time and pitch)?
Hmm, can't we do it with the provisional vocoder implementation? And what do you mean by "various problems since it means it wouldn't be a self contained transform"?
> If it doesn't add too much computational complexity (which I doubt it does), then yes, that seems to make more sense.
Ok, I'll do some timing to make sure; we could also have both versions.
> Hmm, can't we do it with the provisional vocoder implementation?
So since we need to resample the signal, pitch shifting with the phase vocoder would look something like this:
```python
nn.Sequential(
    Resample(),
    STFT(),
    StretchSpecTime()
)
```
So then it's not a module we can apply directly to a spectrogram (i.e. we can't just add pitch shifting to our model as `nn.Sequential(Spectrogram(), SpecPitchShift())`, like we would for time stretching with `nn.Sequential(Spectrogram(), StretchSpecTime())`). Another problem, for example: what if we want both time stretching and pitch shifting? We probably don't want to call the phase vocoder twice with different speed-up rates, and would rather make just one call with the final rate.
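To make the one-call idea concrete, here is a minimal sketch of how the two factors could fold into a single phase-vocoder rate, assuming pitch shifting is implemented as resample-then-stretch (the helper name is hypothetical, not a torchaudio API):

```python
def combined_vocoder_rate(time_stretch_rate: float, pitch_shift_factor: float) -> float:
    """Pitch shifting by `pitch_shift_factor` amounts to resampling by that
    factor and then time-stretching by the same factor to restore duration.
    Composing that with an explicit time stretch, the phase vocoder only
    needs to run once, at the product of the two rates."""
    return time_stretch_rate * pitch_shift_factor
```

The resampling would still happen once up front; only the vocoder call itself is merged.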
Going with this implementation can be done; I just think it breaks away from a clean "each layer adds a transformation" design. Maybe something like this could be interesting? But yeah, pitch shifting will probably need more discussion. It seems the masking + white noise + time stretching could be our initial `augmentation.py`.

On another note, I'm not super into the `StretchSpecTime` name. We've moved away from calling the complex STFT "spec", so we should probably change this.
> So then it's not a module we can apply directly to a spectrogram
> what if we want both time stretching and pitch shifting? Probably don't want to call the phase vocoder twice
Good points. If we could put the both-are-changing scenario aside for the moment, wouldn't the name be without `Spec` anyway? Not because it's complex numbers, but because the name is about what it does (time stretch, pitch shift, etc.) rather than which domain it operates in.
On the name, let me open another issue, which will be #46
(continued)
So, if we agree with removing `Spec` from the name, it'd be `TimeStretch` and `PitchShift`? Later, doing both could be, um, `TimeStretchPitchShift` or `PitchShiftTimeStretch`... oh my god.
> wouldn't the name be without `Spec` anyway? Not because it's complex numbers, but because the name is about what it does (time stretch, pitch shift, etc.) rather than which domain it operates in.
+1. I think I initially put `Spec` in there since librosa's `time_stretch` takes in a waveform while ours takes the STFT output. If the argument names are clear, I like `TimeStretch` and `PitchShift` more though.
> Later, doing both could be, um, `TimeStretchPitchShift` or `PitchShiftTimeStretch`... oh my god.
Oh, so you're saying we'd have another layer in case both are used? I just don't quite like the `STFT` being called as part of the `PitchShift`...
> we'd have another layer in case both are used
I thought that's what you meant by:

> Probably don't want to call the phase vocoder twice with the different speed up rates, and would rather just one call with the final one.

Wasn't it?
Maybe we also need to decide how to structure the files for all the layers and functionals. I'm also planning to add harmonic-percussive separation (#25) and was thinking of adding a new file like `beta_hpss.py` to make it clear that it's still under development. On the ultimate structure, #47 coming soon!

EDIT --> #47
> I thought that's what you meant by
Yes. Just to clarify, that's how it would have to be if we used the phase vocoder for pitch shifting. I was just lamenting that it makes for an unintuitive flow :).
> Maybe we also need to decide how to structure the files for all the layers and functionals.
:+1: :+1:
@keunwoochoi yes, spec_augment is implemented here: https://github.com/zcaceres/spec_augment

There are also PyTorch implementations of many of the augmentations discussed in this thread:

- Basic: https://github.com/zcaceres/fastai-audio/blob/master/DataAugmentation.ipynb
- GPU: https://github.com/zcaceres/fastai-audio/blob/master/course3B/08zb_DataAugmentation.ipynb
- Shift Transformation: https://github.com/zcaceres/fastai-audio/blob/master/Shifting.ipynb
> If we do, I gave it a try in this gist where masks for a batch look like this.
So I was thinking one case where it would be interesting to mask examples in the batch differently is when working with padded sequences. Then the module could take the actual lengths as input and set the masking boundaries accordingly (i.e. cap the mask at the sequence length, not the spectrogram's number of time bins)?
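A minimal sketch of that idea, assuming a `(batch, channel, freq, time)` layout with zero padding along time (the function name and signature are hypothetical, not the gist's code):

```python
import torch


def mask_time_per_example(spec: torch.Tensor, lengths: torch.Tensor,
                          max_width: int, mask_value: float = 0.0) -> torch.Tensor:
    # spec: (batch, channel, freq, time), zero-padded along time.
    # Each example gets its own time mask, drawn inside [0, length_i)
    # so the padding region is never counted as maskable content.
    out = spec.clone()
    for i, length in enumerate(lengths.tolist()):
        t = int(torch.randint(0, min(max_width, length) + 1, (1,)))
        t0 = int(torch.randint(0, length - t + 1, (1,)))
        out[i, ..., t0:t0 + t] = mask_value
    return out
```

The per-example loop is what the batched slicing version avoids, which is where the timing question below comes from.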
I did some timing, and clearly `spect[:, :, f_0:f_0 + f] = mask_value`, `spect[:, :, :, t_0:t_0 + t] = mask_value` is faster than the per-example approach in the gist, so we can discuss whether it's necessary.
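For reference, the faster whole-batch variant could look like this minimal sketch (same assumed `(batch, channel, freq, time)` layout; the function name is hypothetical):

```python
import torch


def mask_batch(spec: torch.Tensor, freq_width: int, time_width: int,
               mask_value: float = 0.0) -> torch.Tensor:
    # spec: (batch, channel, freq, time). One frequency mask and one
    # time mask, drawn once and shared by every example in the batch.
    out = spec.clone()
    n_freq, n_time = out.shape[-2], out.shape[-1]
    f = int(torch.randint(0, freq_width + 1, (1,)))
    f0 = int(torch.randint(0, n_freq - f + 1, (1,)))
    t = int(torch.randint(0, time_width + 1, (1,)))
    t0 = int(torch.randint(0, n_time - t + 1, (1,)))
    out[..., f0:f0 + f, :] = mask_value
    out[..., :, t0:t0 + t] = mask_value
    return out
```

Two plain slice assignments, which is why it beats any per-example loop.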
> Another problem, for example: what if we want both time stretching and pitch shifting?
What I did in the ISMIR 2015 paper you linked was to apply a single affine transformation to the STFT result that stretches it both in time and in frequency: https://github.com/f0k/ismir2015/blob/master/experiments/augment.py#L54-L90

With cuDNN, this could be done using the spatial transformer API.
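In PyTorch terms, a rough equivalent of that single affine resampling could be sketched with the spatial-transformer ops `affine_grid`/`grid_sample` (the function name and the `(batch, channel, freq, time)` layout are assumptions here, not the linked code):

```python
import torch
import torch.nn.functional as F


def affine_stretch(spec: torch.Tensor, time_scale: float, freq_scale: float) -> torch.Tensor:
    # spec: (batch, channel, freq, time). Scales > 1 stretch (zoom in),
    # scales < 1 compress; the output keeps the input shape.
    b = spec.shape[0]
    # affine_grid maps output coordinates to input coordinates, so the
    # sampling matrix uses the inverse scales.
    theta = torch.tensor(
        [[1.0 / time_scale, 0.0, 0.0],   # x axis = width = time
         [0.0, 1.0 / freq_scale, 0.0]],  # y axis = height = frequency
        dtype=spec.dtype,
    ).repeat(b, 1, 1)
    grid = F.affine_grid(theta, list(spec.shape), align_corners=False)
    return F.grid_sample(spec, grid, align_corners=False)
```

Both axes are resampled in one pass, so combined time stretching and pitch shifting doesn't need two separate calls.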
> Later, doing both could be, um, `TimeStretchPitchShift` or `PitchShiftTimeStretch`... oh my god.

`SpectStretchShift`? It may be clear from the context what we stretch (the time) and shift (the pitch).
I thought we could start a discussion on what we'd like to see as far as spectrogram augmentation in the project.
We already had some design discussion about this in #29. Having augmentation done in the network means that, depending on the transform, it would also have to modify the labels. But if we are committed to having the Spectrogram as a layer in the model, then augmentations will come after... We can probably have a more thorough design discussion in a separate issue, but any thoughts are welcome here too!
As far as implementing, transforms that sound good:

- `StretchSpecTime`: done! Although it still has the modifying-the-label issue, depending on the task at hand.
- `FrequencyMasking` and `TimeMasking`: from the SpecAugment article and issue #38. Do we want the masks for every spectrogram in the batch to be independent? If we do, I gave it a try in this gist, where masks for a batch look like this. @zcaceres you had implemented the paper, right?
- `GaussianNoise`: also in the gist. From this paper by @f0k, we could also have loudness.
- `PitchShifting`: So I've been reading around, and it seems a common approach is to resample the audio signal and then time-stretch to match the original shape. This presents various problems, since it means it wouldn't be a self-contained transform; there are also other problems with the resampling (#40). Maybe we could do a simpler approach like in the paper before and scale linearly (in time and pitch)?
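The simpler linear-scaling idea could look something like this minimal sketch, assuming a magnitude spectrogram with a `(batch, channel, freq, time)` layout (the function name and the crop/pad policy are assumptions, not a settled design):

```python
import torch
import torch.nn.functional as F


def linear_pitch_shift(spec: torch.Tensor, factor: float) -> torch.Tensor:
    # spec: (batch, channel, freq, time) magnitude spectrogram.
    # factor > 1 scales frequencies up (higher pitch), < 1 scales down.
    n_freq = spec.shape[-2]
    # Rescale the frequency axis only, keeping the time axis fixed.
    stretched = F.interpolate(spec, scale_factor=(factor, 1.0),
                              mode="bilinear", align_corners=False)
    if stretched.shape[-2] >= n_freq:
        # Scaled up: drop the bins that moved past the top of the range.
        return stretched[..., :n_freq, :]
    # Scaled down: zero-pad the now-empty top bins.
    return F.pad(stretched, (0, 0, 0, n_freq - stretched.shape[-2]))
```

This stays a self-contained spectrogram-to-spectrogram module, at the cost of being only an approximation of true pitch shifting.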