pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

amplitude normalization in create_fb_matrix #287

Closed jongwook closed 4 years ago

jongwook commented 5 years ago

🐛 Not necessarily a Bug

Not exactly a bug, but I think it makes more sense if the triangular Mel filterbanks are area-normalized, as opposed to the current behavior where they are height-normalized.

To Reproduce / Expected behavior

Steps to reproduce the behavior:

import matplotlib.pyplot as plt
import torchaudio
import librosa

plt.figure(figsize=(16, 8))
plt.subplot(2, 1, 1)
plt.plot(torchaudio.transforms.MelScale(n_stft=1025, sample_rate=16000).fb.detach().cpu().numpy())
plt.title('torchaudio Mel filterbanks')

plt.subplot(2, 1, 2)
plt.plot(librosa.filters.mel(sr=16000, n_fft=2048).transpose())
plt.title('librosa Mel filterbanks')
plt.show()

[Figure: torchaudio Mel filterbanks (top panel) and librosa Mel filterbanks (bottom panel)]

The former (torchaudio, height-normalized) has the advantage of preserving energy in a way similar to the STFT, while the latter (librosa, area-normalized) keeps the overall spectral envelope unchanged.
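To make the trade-off concrete, here is a minimal pure-Python sketch, not the torchaudio or librosa implementation. It assumes the HTK Mel formula, a lowest frequency of 0 Hz, and an area-normalization factor of 2 / (f_hi - f_lo) per band; the function name triangular_filterbank and its arguments are hypothetical.

```python
import math

def hz_to_mel(f):
    # HTK-style Mel scale (an assumption for this sketch)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_filterbank(n_freqs, n_mels, sample_rate, area_normalize=False):
    """Hypothetical helper: returns (filters, edges_hz).

    filters has n_mels rows and n_freqs columns; edges_hz holds the
    n_mels + 2 band edges, equally spaced on the Mel scale.
    """
    f_max = sample_rate / 2.0
    freqs = [f_max * k / (n_freqs - 1) for k in range(n_freqs)]
    m_max = hz_to_mel(f_max)
    edges = [mel_to_hz(m_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    filters = []
    for i in range(n_mels):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        row = []
        for f in freqs:
            up = (f - lo) / (center - lo)      # rising slope
            down = (hi - f) / (hi - center)    # falling slope
            row.append(max(0.0, min(up, down)))
        if area_normalize:
            # Scale so every triangle has (approximately) the same area in Hz
            scale = 2.0 / (hi - lo)
            row = [w * scale for w in row]
        filters.append(row)
    return filters, edges

height_fb, edges = triangular_filterbank(257, 8, 16000)
area_fb, _ = triangular_filterbank(257, 8, 16000, area_normalize=True)

# Response of each band to a flat (white) power spectrum
flat = [1.0] * 257
height_out = [sum(w * s for w, s in zip(row, flat)) for row in height_fb]
area_out = [sum(w * s for w, s in zip(row, flat)) for row in area_fb]

print("height-normalized:", [round(v, 2) for v in height_out])
print("area-normalized:  ", [round(v, 4) for v in area_out])
```

With height normalization, neighbouring triangles sum to one between the first and last center frequencies, so the energy in that range is redistributed rather than rescaled, but wider high-frequency bands collect more of a flat spectrum. With area normalization, a flat spectrum stays roughly flat across bands.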

If this is an intended behavior, feel free to close this issue. Thanks!

Environment

I did pip install torchaudio -f https://download.pytorch.org/whl/torch_stable.html, following README.md

'0.3.0'

Please copy and paste the output from our environment collection script.

Collecting environment information...
PyTorch version: 1.2.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.11.1

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
GPU 4: Tesla V100-SXM2-16GB
GPU 5: Tesla V100-SXM2-16GB
GPU 6: Tesla V100-SXM2-16GB
GPU 7: Tesla V100-SXM2-16GB

Nvidia driver version: 410.79
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.2

Versions of relevant libraries:
[pip3] numpy==1.16.5
[pip3] pytorch-memlab==0.0.3
[pip3] torch==1.2.0
[pip3] torch-bst==0.0.0
[pip3] torch-data==0.0.1
[pip3] torchaudio==0.3.0
[pip3] torchvision==0.4.0
[conda] blas                      1.0                         mkl
[conda] faiss-gpu                 1.5.3            py37h1a5d453_0    pytorch
[conda] mkl                       2019.4                      243
[conda] mkl-service               2.3.0            py37he904b0f_0
[conda] mkl_fft                   1.0.14           py37ha843d7b_0
[conda] mkl_random                1.0.2            py37hd81dba3_0
[conda] pytorch-memlab            0.0.3                    pypi_0    pypi
[conda] torch                     1.2.0                    pypi_0    pypi
[conda] torch-bst                 0.0.0                     dev_0    <develop>
[conda] torch-data                0.0.1                     dev_0    <develop>
[conda] torchaudio                0.3.0                    pypi_0    pypi
[conda] torchvision               0.4.0                    pypi_0    pypi

vincentqb commented 4 years ago

Thanks for pointing this out :) Do we have a comparison with kaldi for this graph?

vincentqb commented 4 years ago

I can still reproduce this after #292.

vincentqb commented 4 years ago

This would be breaking backward compatibility of course, so we'd need strong demand to change this.

jongwook commented 4 years ago

To answer the first, I was able to convert from one to the other by adding a linear function of Mel frequency, with minimal numerical differences.

I haven't used Kaldi, but I agree that breaking backward compatibility is a bad idea. I think it'd still be helpful to the community if we could make the output exactly match librosa's, since librosa is very popular for music processing in Python. Also, after NVIDIA's Tacotron 2 implementation adopted the same filterbank, many TTS systems started using librosa's Mel spectrograms as a common mid-level representation across different vocoder implementations.

I'd be interested in working on a PR for this over the next ~4 weeks. If that's okay, please feel free to assign this to me!

vincentqb commented 4 years ago

> To answer the first, I was able to convert from one to the other by adding a linear function of Mel frequency, with minimal numerical differences.

If this is a general transformation that can be used elsewhere, it could be a solution. Otherwise, offering a toggle/switch is more appropriate.

> I haven't used Kaldi, but I agree that breaking backward compatibility is a bad idea. I think it'd still be helpful to the community if we could make the output exactly match librosa's, since librosa is very popular for music processing in Python. Also, after NVIDIA's Tacotron 2 implementation adopted the same filterbank, many TTS systems started using librosa's Mel spectrograms as a common mid-level representation across different vocoder implementations.

> I'd be interested in working on a PR for this over the next ~4 weeks. If that's okay, please feel free to assign this to me!

Thanks! That would be great :)

keunwoochoi commented 4 years ago

+1 for this!

delip commented 4 years ago

> To answer the first, I was able to convert from one to the other by adding a linear function of Mel frequency, with minimal numerical differences.

@jongwook do you have a gist for this?

jongwook commented 4 years ago

@delip I did, but unfortunately I no longer have access to it.

I feel bad about not getting to this over the break, but I should be able to soon. I will keep you posted!

jongwook commented 4 years ago

https://gist.github.com/jongwook/d6fdf57a6fd06d56a2bfd6b505e7804c

The notebook above demonstrates that the torchaudio Mel spectrogram and the librosa Mel spectrogram differ by a linear function of Mel frequency when htk=True is specified in librosa.
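One way to see why a linear relationship can arise (a sketch under stated assumptions, not a reproduction of the notebook): assume the HTK Mel formula, a lowest frequency of 0 Hz, and that the area-normalized filters are the height-normalized ones scaled by 2 / (f_hi - f_lo). On the HTK scale the band width in Hz grows exponentially with Mel frequency, so the logarithm of that scale factor is linear in Mel frequency, and log-Mel spectrograms would then differ by an additive linear term. All variable names below are hypothetical.

```python
import math

def hz_to_mel_htk(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz_htk(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

n_mels, f_max = 40, 8000.0
step = hz_to_mel_htk(f_max) / (n_mels + 1)   # Mel spacing between band edges
edges_hz = [mel_to_hz_htk(step * i) for i in range(n_mels + 2)]

# Log of the per-band factor that turns height-normalized filters
# into area-normalized ones (assumed to be 2 / (f_hi - f_lo))
log_scale = [math.log(2.0 / (edges_hz[i + 2] - edges_hz[i]))
             for i in range(n_mels)]

# Consecutive differences are constant, i.e. log_scale is linear in the
# band index, and therefore in Mel frequency (bands are equally spaced in Mel)
diffs = [log_scale[i + 1] - log_scale[i] for i in range(n_mels - 1)]
expected_slope = -step * math.log(10.0) / 2595.0

print(max(abs(d - expected_slope) for d in diffs))  # numerically close to zero
```

The same argument does not carry over directly to the Slaney scale, which is linear in Hz below 1 kHz, which is consistent with the htk=True condition mentioned above.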

I'm thinking of adding a keyword argument normalized=False to MelScale and MelSpectrogram, and using the librosa-style filterbank when normalized=True is given.

Regarding htk=True: since librosa's default Slaney filterbank is also quite widely used (it ultimately comes from the MATLAB Auditory Toolbox), I'd also be keen to add another optional keyword argument, filterbank="htk" or "slaney".
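For reference, the two Mel scales behind such a switch could be sketched as follows. This is an illustrative pure-Python sketch, not the torchaudio API: the function names and the filterbank argument are hypothetical and simply mirror the keyword proposed above. The Slaney constants (linear spacing of 200/3 Hz per Mel below 1 kHz, logarithmic above with a step of log(6.4)/27) follow the convention librosa uses.

```python
import math

def hz_to_mel(freq, filterbank="htk"):
    """Convert Hz to Mel for the chosen scale (hypothetical helper)."""
    if filterbank == "htk":
        return 2595.0 * math.log10(1.0 + freq / 700.0)
    if filterbank == "slaney":
        f_sp = 200.0 / 3.0                 # Hz per Mel in the linear region
        min_log_hz = 1000.0                # start of the logarithmic region
        min_log_mel = min_log_hz / f_sp    # Mel value at 1 kHz
        logstep = math.log(6.4) / 27.0
        if freq >= min_log_hz:
            return min_log_mel + math.log(freq / min_log_hz) / logstep
        return freq / f_sp
    raise ValueError("filterbank must be 'htk' or 'slaney'")

def mel_to_hz(mel, filterbank="htk"):
    """Inverse of hz_to_mel for the chosen scale (hypothetical helper)."""
    if filterbank == "htk":
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    if filterbank == "slaney":
        f_sp = 200.0 / 3.0
        min_log_hz = 1000.0
        min_log_mel = min_log_hz / f_sp
        logstep = math.log(6.4) / 27.0
        if mel >= min_log_mel:
            return min_log_hz * math.exp(logstep * (mel - min_log_mel))
        return mel * f_sp
    raise ValueError("filterbank must be 'htk' or 'slaney'")

for scale in ("htk", "slaney"):
    print(scale, [round(hz_to_mel(f, scale), 2) for f in (440.0, 1000.0, 4000.0)])
```

The two scales assign very different Mel values to the same frequency, so band edges, and therefore the filterbank matrices, differ even before any normalization is applied.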

Ultimately, it'd be nice to have a torchaudio.compliance.librosa module, but first things first!

ahmed-fau commented 4 years ago

Hi, I am using the MelSpectrogram module of torchaudio v0.4 to define a spectral loss function for training my model. Although it converges properly and the output signal approaches the target one, there are subtle tonal artifacts that cannot be removed with further training optimization. Could this normalization technique be a solution for that? (i.e. to use area normalization as in librosa instead of height normalization).

ahmed-fau commented 4 years ago

> https://gist.github.com/jongwook/d6fdf57a6fd06d56a2bfd6b505e7804c

@jongwook Could you please share this notebook again? The currently shared one doesn't work. Alternatively, could you provide the linear function that changes the normalization style of the torchaudio Mel filterbank?

Many thanks in advance

vincentqb commented 4 years ago

This is implemented in #589.