Spectrogram Augmentation

ksanjeevan commented 5 years ago

I thought we could start a discussion on what/how we'd like to see as far as spectrogram augmentation in the project.

We had already some design discussion about this in #29. Having augmentation done in the network means that depending on the transform, it would also have to modify the labels. But if we are committed to having the Spectrogram as a layer in the model, then augmentations will come after... We can probably have a more thorough design discussion in a separate issue, but any thoughts welcome here too!

As far as implementing, transforms that sound good:

StretchSpecTime: done! although still has the modifying the label issue depending on the task at hand.
FrequencyMasking and TimeMasking: from the SpecAugment article and issue #38. Do we want the masks for every spectrogram in the batch to be independent? If we do, I gave it a try in this gist where masks for a batch look like this. @zcaceres you had implemented the paper right?
GaussianNoise: also in the gist. From this paper by @f0k, we could also have loudness.
PitchShifting: So I've been reading around and seems like a common approach is to resample the audio signal and then time stretch to match the original shape. This presents various problems since it means it wouldn't be a self contained transform. Also other problems with the resampling (#40). Maybe we could do a simpler approach like in the paper before and scale them linearly (in time and pitch)?

keunwoochoi commented 5 years ago

All approaches look interesting for me.

But if we are committed to having the Spectrogram as a layer in the model, then augmentations will come after...

+1

Do we want the masks for every spectrogram in the batch to be independent?

If it doesn't add computation complexity too much (which I doubt), yes, it seems making more sense.

Maybe we could do a simpler approach like in the paper before and scale them linearly (in time and pitch)?

Hmm, can't we do it with the provisional vocoder implementation? And what do you mean by

various problems since it means it wouldn't be a self contained transform.

?

ksanjeevan commented 5 years ago

If it doesn't add computation complexity too much (which I doubt), yes, it seems making more sense.

Ok I'll do some timing to make sure, could also have both versions.

Hmm, can't we do it with the provisional vocoder implementation? And what do you mean by

So since we need to resample the signal, pitch shifting with the phase vocoder would look something like this:

nn.Sequential(
    Resample(), 
    STFT(), 
    StretchSpecTime()
)

So then it's not a module we can apply directly to a spectrogram (i.e. can't just add pitch shifting to our model as nn.Sequential(Spectrogram(), SpecPitchShift()), like we would do for time stretching nn.Sequential(Spectrogram(), StretchSpecTime())). Another problem for example, what if we want both time stretching and pitch shifting? Probably don't want to call the phase vocoder twice with the different speed up rates, and would rather just one call with the final one.

Going with this implementation can be done I just think it breaks away from a clean "each layer adds a transformation". Maybe something like this could be interesting? But yeah pitch shifting will probably need more discussion. Seems the masking + white noise + time stretching could be our initial augmentation.py.

On another note, I'm not super into the "StretchSpecTime" name. We've moved away from calling the complex stft by "spec" so probably should change this.

keunwoochoi commented 5 years ago

So then it's not a module we can apply directly to a spectrogram

what if we want both time stretching and pitch shifting? Probably don't want to call the phase vocoder twice

Good points. If we could put the both-are-changing scenario aside at the moment, wouldn't the name be without Spec anyway? Not because it's complex numbers but because the name is about what it does (time stretch, pitch shift, etc) rather than on which domain it does.

On name, let me open another issue which will be #46

keunwoochoi commented 5 years ago

(continued) So, if we agree with removing Spec from the name, it'd be TimeStretch and PitchShift? Later, doing both could be, um, TimeStretchPitchShift or PitchShiftTimeStretch oh my god.

ksanjeevan commented 5 years ago

wouldn't the name be without Spec anyway? Not because it's complex numbers but because the name is about what it does (time stretch, pitch shift, etc) rather than on which domain it does.

+1, I think initially I put Spec in there since librosa's time_stretch takes in a waveform and ours the stft output. If the argument names are clear I like TimeStretch and PitchShift more though.

Later, doing both could be, um, TimeStretchPitchShift or PitchShiftTimeStretch oh my god.

Oh so you're saying we'd have another layer in case both are used? I just don't quite like the STFT being called as part of the PitchShift...

keunwoochoi commented 5 years ago

we'd have another layer in case both are used

I thought that's what you mean by

'Probably don't want to call the phase vocoder twice with the different speed up rates, and would rather just one call with the final one.'

Wasn't it?

Maybe we also need to decide how to structure the files for all the layers and functionals. I'm also planning to add harmonic-percussive separation (#25) and was thinking to add a new file like beta_hpss.py to make it clear that it's still under dev. On the ultimate structure, #47 coming soon!

EDIT --> #47

ksanjeevan commented 5 years ago

I thought that's what you mean by

Yes. Just to clarify, that's how it would have to be if we used the phase vocoder for pitch shifting. I was just lamenting it makes for an unintuitive flow :).

Maybe we also need to decide how to structure the files for all the layers and functionals.

:+1: :+1:

zcaceres commented 5 years ago

@keunwoochoi yes spec_augment is implemented here: https://github.com/zcaceres/spec_augment

There are also Pytorch implementations of many of the augmentations discussed in this thread here: Basic: https://github.com/zcaceres/fastai-audio/blob/master/DataAugmentation.ipynb GPU: https://github.com/zcaceres/fastai-audio/blob/master/course3B/08zb_DataAugmentation.ipynb Shift Transformation: https://github.com/zcaceres/fastai-audio/blob/master/Shifting.ipynb

ksanjeevan commented 5 years ago

If we do, I gave it a try in this gist where masks for a batch look like this.

So I was thinking a possible interest in having examples in the batch be masked differently would be if working with padded sequences. Then the module could take as input the actual lengths and provide masking boundaries accordingly (i.e. set the max to seq length, not spect num bins)?

I did some timing and clearly spect[:,:,f_0: f_0 + f] = mask_value, spect[:,:,:,t_0: t_0 + t] = mask_value is faster than the per example approach in the gist, so we can discuss if it's necessary.

f0k commented 5 years ago

Another problem for example, what if we want both time stretching and pitch shifting?

What I did in the ISMIR 2015 paper you linked was to apply a single affine transformation to the STFT result that stretches both in time and in frequency: https://github.com/f0k/ismir2015/blob/master/experiments/augment.py#L54-L90 With cuDNN, this could be done using the spatial transformer API.

Later, doing both could be, um, TimeStretchPitchShift or PitchShiftTimeStretch oh my god.

SpectStretchShift? It may be clear from the context what we stretch (the time) and shift (the pitch).

keunwoochoi / torchaudio-contrib

Spectrogram Augmentation #44