ISTFT custom padding error

PetrochukM commented 4 years ago

🐛 Bug

In a project that I am working on, I need to keep the spectrogram and the signal aligned. I provided a small script below.

The issues that I am running into: 1) At the moment, the length argument of istft has no constraints; therefore, the output can extend into the padding, which has the potential for silent errors. 2) The istft function raises an error when custom padding is used.

To Reproduce

import torch
import torchaudio
import numpy.testing

n_fft = 16
hop_length = 4
win_length = n_fft
num_frames = 32
signal = torch.randn(hop_length * num_frames)

# Create a spectrogram that aligns with the signal
spectrogram = torch.stft(
    torch.nn.functional.pad(signal, ((n_fft - hop_length) // 2, (n_fft - hop_length) // 2)),
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=n_fft,
    window=torch.hann_window(n_fft),
    center=False)
assert spectrogram.shape[1] == num_frames

# Reconstruct the signal and ensure it matches the original signal
reconstructed_signal = torchaudio.functional.istft(
    spectrogram,
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=n_fft,
    window=torch.hann_window(n_fft),
    center=False)
assert reconstructed_signal.shape[0] == num_frames * hop_length
numpy.testing.assert_almost_equal(reconstructed_signal.numpy(), signal.numpy(), decimal=6)

Traceback (most recent call last):
  File "boo.py", line 25, in <module>
    center=False)
  File "/Users/michaelp/Code/Text-to-Speech/venv/lib/python3.7/site-packages/torchaudio/functional.py", line 189, in istft
    assert window_envelop_lowest > 1e-11, "window overlap add min: %f" % (window_envelop_lowest)
AssertionError: window overlap add min: 0.000000

Potential Solution

I can get this script to work if I change half_n_fft = n_fft // 2 to half_n_fft = (n_fft - hop_length) // 2. This suggests to me that it might help for istft to accept a parameter indicating how much padding was applied.

Is that something you'd be interested in?

vincentqb commented 4 years ago

Have you tried specifying the length of the signal? See the documentation:

"The n_frame, hop_length, win_length are all the same which prevents the calculation of right padding. These additional values could be zeros or a reflection of the signal so providing length could be useful. If length is None then padding will be aggressively removed (some loss of signal)."

PetrochukM commented 4 years ago

@vincentqb Yes. I have read the documentation and code in detail.

The issue is that the length argument is left-aligned (which is not documented). The padding was initially applied to both sides equally; therefore, I'd need the padding to be stripped from both sides equally.
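To make the alignment problem concrete, here is a minimal sketch in plain Python. The helpers trim_left_aligned and trim_symmetric are hypothetical names used only for illustration; they are not torchaudio functions, and the left-aligned behaviour is assumed from the description above rather than taken from the library source.

```python
def trim_left_aligned(samples, length):
    # Keep the first `length` samples (how `length` is described above).
    return samples[:length]


def trim_symmetric(samples, pad):
    # Drop `pad` samples from each end, mirroring how the padding was applied.
    return samples[pad:len(samples) - pad]


pad = 6  # (n_fft - hop_length) // 2 with n_fft=16, hop_length=4
signal = list(range(100, 228))  # 128 distinct, non-zero samples
padded = [0] * pad + signal + [0] * pad

left = trim_left_aligned(padded, len(signal))
both = trim_symmetric(padded, pad)

print(left == signal)  # left-aligned trimming keeps the leading padding
print(both == signal)  # symmetric trimming recovers the original samples
```

With left-aligned trimming the result is shifted by pad samples, so it has the right length but the wrong content.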

vincentqb commented 4 years ago

I'm assuming you also experimented with pad_mode and onesided, right? Can you confirm your case is not covered by our unit tests?

PetrochukM commented 4 years ago

Yes.

The pad_mode option is with regard to the type of padding while I am changing how much padding is applied.
The onesided option doesn't have anything to do with padding, from my understanding. It instead controls the number of frequency bins to output.

The unit tests do not cover the above use case. The critical line is torch.nn.functional.pad(signal, ((n_fft - hop_length) // 2, (n_fft - hop_length) // 2)) where I apply my own padding instead of using the center parameter. The padding I apply is slightly different than the center parameter to ensure that the spectrogram and signal align.

vincentqb commented 4 years ago

Do you mean changing the definition of half_n_fft on this line? Doing so breaks compatibility with librosa though. Have you been able to do the same in librosa or scipy?

The assertion that is failing comes from the fact that your modified signal torch.nn.functional.pad(signal, ((n_fft - hop_length) // 2, (n_fft - hop_length) // 2)) fails the NOLA condition for istft. Suppressing the assertion gives a signal of a different shape.

In [2]: reconstructed_signal.shape
Out[2]: torch.Size([134])

In [3]: signal.shape
Out[3]: torch.Size([128])
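For reference, the NOLA failure can be reproduced without torch. This is a rough sketch, assuming the envelope is the overlap-added squared window and that the window is the periodic Hann window that torch.hann_window returns by default. The variable names are illustrative only.

```python
import math

n_fft, hop_length, num_frames = 16, 4, 32
pad = (n_fft - hop_length) // 2

# Periodic Hann window (assumed to match torch.hann_window's default).
window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]

# Overlap-add the squared window at each frame position (center=False).
expected_len = n_fft + hop_length * (num_frames - 1)
envelope = [0.0] * expected_len
for frame in range(num_frames):
    start = frame * hop_length
    for n in range(n_fft):
        envelope[start + n] += window[n] ** 2

full_min = min(envelope)
trimmed_min = min(envelope[pad:expected_len - pad])
print("min over full output:", full_min)
print("min after trimming pad from both ends:", trimmed_min)
```

Only the first frame covers sample 0, and the Hann window is zero there, so the minimum over the full output is zero. Once the pad samples at each end are excluded, the envelope stays well above zero.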

PetrochukM commented 4 years ago

The implementation for half_n_fft assumes that there was n_fft // 2 padding applied to both ends. In the example above, I only applied (n_fft - hop_length) // 2 padding to both ends. For that reason, I needed to change half_n_fft to get the code above to run.

The reason I applied the padding in that way is to make assert spectrogram.shape[1] == num_frames work.
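The frame-count arithmetic behind that choice can be sketched as follows. The function num_frames_no_center is a hypothetical helper for illustration; it assumes the usual frame count for a non-centered STFT, 1 + (padded_len - n_fft) // hop_length.

```python
def num_frames_no_center(signal_len, n_fft, hop_length, pad):
    # Frames produced by a non-centered STFT after padding both ends by `pad`.
    padded_len = signal_len + 2 * pad
    return 1 + (padded_len - n_fft) // hop_length


n_fft, hop_length = 16, 4
signal_len = hop_length * 32  # 128 samples

custom = num_frames_no_center(signal_len, n_fft, hop_length, (n_fft - hop_length) // 2)
centered = num_frames_no_center(signal_len, n_fft, hop_length, n_fft // 2)

print(custom)    # 32: exactly one frame per hop_length samples
print(centered)  # 33: padding by n_fft // 2 yields one extra frame
```

Padding by n_fft // 2, which is what center=True does, gives one frame more than signal_len // hop_length, which is why the spectrogram and signal no longer line up one-to-one.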

With librosa, this is not an issue because it doesn't have the NOLA assertion; therefore, I can remove the padding with slicing after the result is returned. I don't have to rely on the default slicing.

vincentqb commented 4 years ago

Suppressing the NOLA assertion in istft like so:

diff --git a/torchaudio/functional.py b/torchaudio/functional.py
index 8073e3a..d453192 100644
--- a/torchaudio/functional.py
+++ b/torchaudio/functional.py
@@ -207,9 +207,9 @@ def istft(

     # check NOLA non-zero overlap condition
     window_envelop_lowest = window_envelop.abs().min()
-    assert window_envelop_lowest > 1e-11, "window overlap add min: %f" % (
-        window_envelop_lowest
-    )
+    # assert window_envelop_lowest > 1e-11, "window overlap add min: %f" % (
+    #     window_envelop_lowest
+    # )

     y = (y / window_envelop).squeeze(1)  # size (channel, expected_signal_len)

yields:

import torch
import torchaudio
import numpy.testing

n_fft = 16
hop_length = 4
win_length = n_fft
num_frames = 32
signal = torch.randn(hop_length * num_frames)

# Create a spectrogram that aligns with the signal
spectrogram = torch.stft(
    torch.nn.functional.pad(signal, ((n_fft - hop_length) // 2, (n_fft - hop_length) // 2)),
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=n_fft,
    window=torch.hann_window(n_fft),
    center=False)
assert spectrogram.shape[1] == num_frames

# Reconstruct the signal and ensure it matches the original signal
reconstructed_signal = torchaudio.functional.istft(
    spectrogram,
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=n_fft,
    window=torch.hann_window(n_fft),
    center=False)

reconstructed_signal.shape
# torch.Size([132])

signal.shape
# torch.Size([128])

Can you provide the code that works with librosa?

klyukinds commented 4 years ago

Hi, everyone! @vincentqb @PetrochukM I've also run into this kind of problem and decided to check librosa's output. The code works fine.

import librosa
import numpy as np

n_fft = 16
hop_length = 4
win_length = n_fft
num_frames = 32

signal_np = np.random.randn(hop_length * num_frames)

# Create a spectrogram that aligns with the signal
spectrogram_np = librosa.core.stft(
    np.pad(signal_np, ((n_fft - hop_length) // 2, (n_fft - hop_length) // 2), 'constant', constant_values=(0,0)),
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=n_fft,
    window='hann',
    center=False)
assert spectrogram_np.shape[1] == num_frames, f"{spectrogram_np.shape[1]}"

# Reconstruct the signal and ensure it matches the original signal
reconstructed_signal_np = librosa.core.istft(
    spectrogram_np,
    hop_length=hop_length,
    win_length=n_fft,
    window='hann',
    center=False)

reconstructed_signal_np.shape
# (140,)

signal_np.shape
# (128,)

np.isclose(reconstructed_signal_np[(n_fft - hop_length) // 2:-(n_fft - hop_length) // 2], signal_np).all()
# True

It seems to me that there is some difference in istft operation between torchaudio and librosa.
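One possible explanation, sketched below in plain Python, is that the reconstruction only divides by the window envelope where it is non-negligible and leaves trimming to the caller. This is an assumption about the general approach, not a copy of either library's implementation. The FFT and inverse FFT are left out because together they are the identity on each windowed frame, and the threshold tiny is an illustrative value.

```python
import math
import random

n_fft, hop_length, num_frames = 16, 4, 32
pad = (n_fft - hop_length) // 2
window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]

random.seed(0)
signal = [random.gauss(0.0, 1.0) for _ in range(hop_length * num_frames)]
padded = [0.0] * pad + signal + [0.0] * pad

# Analysis (center=False): window each frame. The FFT step is omitted.
frames = [
    [padded[k * hop_length + n] * window[n] for n in range(n_fft)]
    for k in range(num_frames)
]

# Synthesis: window again, overlap-add, and accumulate the squared-window envelope.
overlap_added = [0.0] * len(padded)
envelope = [0.0] * len(padded)
for k, frame in enumerate(frames):
    for n in range(n_fft):
        overlap_added[k * hop_length + n] += frame[n] * window[n]
        envelope[k * hop_length + n] += window[n] ** 2

# Guarded division: skip samples where the envelope is (near) zero.
tiny = 1e-11
reconstructed = [o / e if e > tiny else o for o, e in zip(overlap_added, envelope)]

# The caller removes the padding symmetrically afterwards.
trimmed = reconstructed[pad:len(reconstructed) - pad]
max_err = max(abs(a - b) for a, b in zip(trimmed, signal))
print(len(reconstructed), len(trimmed), max_err)
```

The untrimmed output has 140 samples, matching the librosa result above, and the trimmed output agrees with the original signal up to floating-point error.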

LXP-Never commented 7 months ago

import torch

# center=False
# y: waveform tensor of shape [1, 98473], loaded elsewhere
NFFT = 512
hop_size = 256
S = torch.stft(y, n_fft=NFFT, hop_length=hop_size, win_length=NFFT,
               center=False,
               window=torch.hann_window(window_length=NFFT), return_complex=True)
print("y", y.shape) # [1, 98473]
centerFalse_wav = torch.istft(S, n_fft=NFFT, hop_length=hop_size, win_length=NFFT,
                              center=False,
                              window=torch.hann_window(window_length=NFFT),
                              length=y.shape[-1])

print("y", y.shape)
print("centerFalse_wav", centerFalse_wav.shape)

RuntimeError: istft(CPUComplexFloatType[1, 257, 383], n_fft=512, hop_length=256, win_length=512, window=torch.FloatTensor{[512]}, center=0, normalized=0, onesided=None, length=98473, return_complex=0) window overlap add min: 1 [ CPUBoolType{} ]
