Open iver56 opened 2 years ago
Isn't this already implemented?
On that note, I would recommend using `pyrubberband` instead of librosa's `time_stretch`, since the former preserves the transients.
From librosa's documentation of phase_vocoder:
> This is a simplified implementation, intended primarily for reference and pedagogical purposes. It makes no attempt to handle transients, and is likely to produce many audible artifacts. For a higher quality implementation, we recommend the RubberBand library [2](https://librosa.org/doc/0.10.0/generated/librosa.phase_vocoder.html#id4) and its Python wrapper pyrubberband.
This idea is not implemented in audiomentations yet. The idea in this issue is to add "anchors" and allow the output to have the same length as the input. Different parts of the waveform are sped up or slowed down. This idea is inspired by
I'll try to illustrate:
* and another repo where they added anchors not randomly but at the start/end of phones or words. I don't remember the name of the repo right now, but I can look for it.
I went looking for the repo, but couldn't find it. But I found this paper that mentions an idea like it: https://ieeexplore.ieee.org/document/9003741
And add a parameter called `leave_length_unchanged` that, when set to True, makes sure that the output length equals the input length. In that case, some of the audio will have to be sped up and some of it has to be slowed down. Maybe it can have two modes: `time_stretch` and `speed`? Or just make two different classes...

This is based on the spectrogram time stretching idea in the popular SpecAugment paper, but here we apply it directly to the waveform. I also saw a GitHub repo recently where they stretched individual phones or words, and it helped improve their metrics.
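To make the length-preserving idea a bit more concrete, here is a rough sketch of how it might work on a mono waveform. Everything named here (`anchored_time_warp`, `num_anchors`, `max_shift`) is hypothetical and not part of the audiomentations API. It assumes a 1-D float array and uses plain linear interpolation, so it resamples the signal and shifts the pitch locally. That makes it closer to the `speed` mode than to a pitch-preserving `time_stretch` mode.

```python
import numpy as np


def anchored_time_warp(samples, num_anchors=3, max_shift=0.1, rng=None):
    """Hypothetical helper: locally speed up / slow down a 1-D waveform
    while keeping the output length equal to the input length."""
    if not 0.0 <= max_shift < 0.5:
        # Shifts below half the anchor spacing keep the anchors in order.
        raise ValueError("max_shift must be in [0, 0.5)")
    rng = np.random.default_rng() if rng is None else rng
    samples = np.asarray(samples)
    n = len(samples)

    # Evenly spaced anchors on the output time axis; the first and last
    # anchors are pinned to the start and end of the signal.
    out_anchors = np.linspace(0, n - 1, num_anchors + 2)
    spacing = out_anchors[1] - out_anchors[0]

    # Randomly move the interior anchors on the input time axis.
    in_anchors = out_anchors.copy()
    in_anchors[1:-1] += rng.uniform(-max_shift, max_shift, num_anchors) * spacing

    # For each output sample, the (fractional) input position to read from.
    read_positions = np.interp(np.arange(n), out_anchors, in_anchors)

    # Resample with linear interpolation (this changes pitch locally).
    warped = np.interp(read_positions, np.arange(n), samples)
    return warped.astype(samples.dtype)
```

Because the first and last anchors stay fixed, segments that get compressed are balanced by segments that get stretched, which is what keeps the length unchanged. A pitch-preserving variant would presumably have to time-stretch each segment between anchors separately instead of resampling.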