Closed jongwook closed 4 years ago
Thanks for pointing this out :) Do we have a comparison with kaldi for this graph?
I can still reproduce this PR after #292.
This would be breaking backward compatibility of course, so we'd need strong demand to change this.
To answer the first, I was able to convert from one to the other by adding a linear function of Mel frequency, with minimal numerical differences.
I haven't used Kaldi, but I agree that breaking backward compatibility is a bad idea. I think it'd still be helpful to the community if we could make the output exactly match librosa's, as librosa is very popular for music processing in Python. For instance, after NVIDIA's Tacotron 2 implementation adopted the same filterbank, many TTS systems started to use librosa's Mel spectrograms as a common mid-level representation for different vocoder implementations.
I'd be interested in working on a PR for this within the next ~4 weeks. If that sounds okay, please feel free to assign this to me!
> To answer the first, I was able to convert from one to the other by adding a linear function of Mel frequency, with minimal numerical differences.
If this is a general transformation that can be used elsewhere, it could be a solution. Otherwise, offering a toggle/switch is more appropriate.
> I haven't used Kaldi, but I agree that breaking backward compatibility is a bad idea. I think it'd still be helpful to the community if we could make the output exactly match librosa's, as librosa is very popular for music processing in Python. For instance, after NVIDIA's Tacotron 2 implementation adopted the same filterbank, many TTS systems started to use librosa's Mel spectrograms as a common mid-level representation for different vocoder implementations.
> I'd be interested in working on a PR for this within the next ~4 weeks. If that sounds okay, please feel free to assign this to me!
Thanks! That would be great :)
+1 for this!
> To answer the first, I was able to convert from one to the other by adding a linear function of Mel frequency, with minimal numerical differences.
@jongwook do you have a gist for this?
@delip I did, but unfortunately I no longer have access to it.
I'm feeling bad about not getting my hands on this over the break, but I should be able to soon. I will keep you posted!
https://gist.github.com/jongwook/d6fdf57a6fd06d56a2bfd6b505e7804c
The notebook above demonstrates that the torchaudio mel spectrogram and the librosa mel spectrogram differ by a linear function of Mel frequency when htk=True is specified in librosa.
I'm thinking of adding a keyword argument normalized=False to MelScale and MelSpectrogram, and using the librosa-style filterbank when normalized=True is given.
Regarding htk=True, since the librosa default Slaney filterbank is also quite widely used (ultimately from the MATLAB Auditory Toolbox), I'd also be keen to add another optional kwarg, filterbank="htk" or "slaney".
Ultimately, it'd be nice to have a module torchaudio.compliance.librosa, but first things first!
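For readers without access to the notebook, here is a minimal pure-Python sketch of the two normalization styles discussed in this thread. It is not torchaudio's or librosa's code: the function name mel_filterbank and the area_normalized flag are hypothetical, the HTK mel formula is assumed, and the 2 / (hi - lo) scaling is my understanding of librosa's Slaney-style area normalization.

```python
import math

def hz_to_mel(f):
    # HTK-style mel scale (assumed here; the Slaney scale differs).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_freqs, sample_rate, area_normalized=False):
    """Return (bank, edges): triangular filters and their band edges in Hz.

    Hypothetical helper for illustration only. bank[m][k] is the weight of
    linear-frequency bin k in mel band m.
    """
    f_max = sample_rate / 2.0
    freqs = [f_max * k / (n_freqs - 1) for k in range(n_freqs)]
    m_max = hz_to_mel(f_max)
    # n_mels + 2 edges, equally spaced on the mel axis.
    edges = [mel_to_hz(m_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bank = []
    for m in range(n_mels):
        lo, center, hi = edges[m], edges[m + 1], edges[m + 2]
        # Height-normalized triangle: the peak value is 1 at the center.
        row = [max(0.0, min((f - lo) / (center - lo), (hi - f) / (hi - center)))
               for f in freqs]
        if area_normalized:
            # Scale so that each triangle has unit area over frequency in Hz.
            scale = 2.0 / (hi - lo)
            row = [w * scale for w in row]
        bank.append(row)
    return bank, edges

height, edges = mel_filterbank(40, 2049, 16000)
area, _ = mel_filterbank(40, 2049, 16000, area_normalized=True)
```

In this sketch the two filterbanks differ only by one scale factor per mel band, 2 / (edges[m + 2] - edges[m]), so either one can be converted to the other after the fact.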
Hi, I am using the MelSpectrogram module of torchaudio v0.4 to define a spectral loss function for training my model. Although it converges properly and the output signal approaches the target one, there are subtle tonal artifacts that cannot be removed with further training optimization. Could this normalization technique be a solution for that? (i.e. to use area normalization as in librosa instead of height normalization).
> https://gist.github.com/jongwook/d6fdf57a6fd06d56a2bfd6b505e7804c
@jongwook Could you please share this notebook again? The currently shared one doesn't work. Alternatively, could you provide the linear function that changes the normalization style of the torchaudio mel filterbank?
Many thanks in advance
This is implemented in #589.
🐛 Not necessarily a Bug
Not exactly a bug, but I think it makes more sense if the triangular Mel filterbanks are area-normalized, as opposed to the current behavior where they are height-normalized.
To Reproduce / Expected behavior
Steps to reproduce the behavior:
Height normalization (the current behavior) has the advantage of preserving energy in a way similar to the STFT, while area normalization keeps the overall spectral envelope unchanged.
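The trade-off can be checked on a toy example. The sketch below is only an illustration under stated assumptions: the band edges are made-up values in Hz (not a real mel scale), and triangle and project are hypothetical helper names. It applies unit-height and unit-area triangles to a flat spectrum and to a spectrum confined between the first and last filter centers.

```python
import math

def triangle(f, lo, center, hi):
    """Unit-height triangular weight at frequency f (Hz)."""
    return max(0.0, min((f - lo) / (center - lo), (hi - f) / (hi - center)))

# Hypothetical band edges in Hz that widen with frequency, as mel bands do.
edges = [0, 100, 250, 475, 800, 1300, 2000]
bands = list(zip(edges[:-2], edges[1:-1], edges[2:]))
freqs = range(0, 2001, 5)  # linear-frequency bins, 5 Hz apart

def project(spectrum, area_normalized):
    """Apply the triangular filters to a power spectrum sampled at freqs."""
    out = []
    for lo, center, hi in bands:
        scale = 2.0 / (hi - lo) if area_normalized else 1.0
        out.append(sum(scale * triangle(f, lo, center, hi) * p
                       for f, p in zip(freqs, spectrum)))
    return out

# Flat input: area normalization returns a flat envelope, while height
# normalization grows with the bandwidth of each filter.
flat = [1.0 for _ in freqs]
height_flat = project(flat, area_normalized=False)
area_flat = project(flat, area_normalized=True)

# Input confined between the first and last filter centers: adjacent
# unit-height triangles sum to 1 there, so the total energy is preserved.
shaped = [1.0 + 0.5 * math.sin(f / 200.0) if 100 <= f <= 1300 else 0.0
          for f in freqs]
stft_total = sum(shaped)
height_total = sum(project(shaped, area_normalized=False))
area_total = sum(project(shaped, area_normalized=True))
```

Here height_total matches stft_total up to rounding, whereas area_total does not; conversely, area_flat is constant across bands, whereas height_flat is not.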
If this is an intended behavior, feel free to close this issue. Thanks!
Environment
I did pip install torchaudio -f https://download.pytorch.org/whl/torch_stable.html, following README.md.
torchaudio.__version__ prints '0.3.0'.