Open PetrochukM opened 4 years ago
Have you tried specifying the length of the signal? See the documentation
"The n_frame, hop_length, win_length are all the same which prevents the calculation of right padding. These additional values could be zeros or a reflection of the signal so providing length could be useful. If length is None then padding will be aggressively removed (some loss of signal)."
@vincentqb Yes. I have read the documentation and code in detail.
The issue is that the length argument is left-aligned (which is not documented). The padding was initially applied to both sides equally; therefore, I'd need the padding to be stripped from both sides equally.
I'm assuming you also experimented with pad_mode and onesided, right? Can you confirm your case is not covered by our unit tests?
Yes. The pad_mode option concerns the type of padding, whereas I am changing how much padding is applied. The onesided option doesn't have anything to do with padding, from my understanding. It instead controls the number of frequency bins to output. The unit tests do not cover the above use case. The critical line is torch.nn.functional.pad(signal, ((n_fft - hop_length) // 2, (n_fft - hop_length) // 2)), where I apply my own padding instead of using the center parameter. The padding I apply is slightly different from what the center parameter applies, to ensure that the spectrogram and signal align.
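The alignment argument can be checked with plain arithmetic. The sketch below assumes the usual frame-count formula for an uncentered STFT, 1 + (padded_length - n_fft) // hop_length. The helper name num_stft_frames is hypothetical and is not part of torch or torchaudio.

```python
def num_stft_frames(signal_len, n_fft, hop_length, pad_each_side):
    """Frame count of an uncentered STFT over a signal padded equally on both sides.

    Assumes the common formula 1 + (padded_len - n_fft) // hop_length.
    """
    padded_len = signal_len + 2 * pad_each_side
    if padded_len < n_fft:
        return 0
    return 1 + (padded_len - n_fft) // hop_length


n_fft, hop_length, num_frames = 16, 4, 32
signal_len = hop_length * num_frames  # 128 samples, as in the script

# Custom padding: exactly one frame per hop of the original signal.
custom = num_stft_frames(signal_len, n_fft, hop_length, (n_fft - hop_length) // 2)
# Padding of n_fft // 2 on each side (what center=True applies): one extra frame.
centered = num_stft_frames(signal_len, n_fft, hop_length, n_fft // 2)
print(custom, centered)  # 32 33
```

Under this formula, the custom padding gives one frame per hop_length samples of the original signal, while n_fft // 2 padding gives one frame more.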
Do you mean changing the definition of half_n_fft on this line? Doing so breaks compatibility with librosa though. Have you been able to do the same in librosa or scipy?
The assertion that is failing comes from the fact that your modified signal torch.nn.functional.pad(signal, ((n_fft - hop_length) // 2, (n_fft - hop_length) // 2)) fails the NOLA condition for istft. Suppressing the assertion gives a signal of a different shape.
In [2]: reconstructed_signal.shape
Out[2]: torch.Size([134])
In [3]: signal.shape
Out[3]: torch.Size([128])
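To see where a NOLA check can fail with these parameters, here is a minimal pure-Python sketch. It assumes the envelope is the overlap-add of the squared window and that the window is a periodic Hann window (the default for torch.hann_window). The function names are hypothetical, and this is a simplified model rather than torchaudio's actual code; the 1e-11 threshold is the one used in the assertion.

```python
import math


def hann_periodic(n_fft):
    """Periodic Hann window; its first sample is exactly zero."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]


def window_envelope(n_fft, hop_length, num_frames):
    """Overlap-add the squared window once per frame (assumed envelope model)."""
    window_sq = [w * w for w in hann_periodic(n_fft)]
    total_len = n_fft + hop_length * (num_frames - 1)
    envelope = [0.0] * total_len
    for frame in range(num_frames):
        start = frame * hop_length
        for n, w2 in enumerate(window_sq):
            envelope[start + n] += w2
    return envelope


n_fft, hop_length, num_frames = 16, 4, 32
envelope = window_envelope(n_fft, hop_length, num_frames)
pad = (n_fft - hop_length) // 2

# Untrimmed (as with center=False): sample 0 is covered only by the first
# frame, where the window is zero, so the minimum is zero.
print(min(envelope))  # 0.0
# After trimming the custom padding from both ends, the minimum is positive.
print(min(envelope[pad:len(envelope) - pad]) > 1e-11)  # True
```

In this model, the zero comes from the very first output sample, which only the first frame covers and where the window itself is zero.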
The implementation for half_n_fft assumes that there was n_fft // 2 padding applied to both ends. In the example above, I only applied (n_fft - hop_length) // 2 padding to both ends. For that reason, I needed to change half_n_fft to get the code above to run.
The reason I applied the padding in that way is to make assert spectrogram.shape[1] == num_frames work.
With librosa, this is not an issue because it doesn't have the NOLA assertion; therefore, I can remove the padding with slicing after the result is returned. I don't have to rely on the default slicing.
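Removing the padding by slicing can be sketched as follows. The helper name trim_symmetric is hypothetical, and the list is a stand-in for a reconstructed signal; the same slicing should also work on a 1-D NumPy array or tensor.

```python
def trim_symmetric(samples, pad_each_side):
    """Drop pad_each_side samples from both ends of a 1-D sequence."""
    if pad_each_side < 0:
        raise ValueError("pad_each_side must be non-negative")
    # Use an explicit end index: samples[pad:-pad] would be empty when pad == 0.
    return samples[pad_each_side:len(samples) - pad_each_side]


n_fft, hop_length = 16, 4
pad = (n_fft - hop_length) // 2            # 6 samples per side
reconstructed = list(range(140))           # stand-in for a 140-sample istft output
trimmed = trim_symmetric(reconstructed, pad)
print(len(trimmed), trimmed[0], trimmed[-1])  # 128 6 133
```

Computing the end index explicitly avoids the empty result that a negative-zero slice would give when no padding was applied.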
Suppressing the NOLA assertion in istft as follows:
diff --git a/torchaudio/functional.py b/torchaudio/functional.py
index 8073e3a..d453192 100644
--- a/torchaudio/functional.py
+++ b/torchaudio/functional.py
@@ -207,9 +207,9 @@ def istft(
     # check NOLA non-zero overlap condition
     window_envelop_lowest = window_envelop.abs().min()
-    assert window_envelop_lowest > 1e-11, "window overlap add min: %f" % (
-        window_envelop_lowest
-    )
+    # assert window_envelop_lowest > 1e-11, "window overlap add min: %f" % (
+    #     window_envelop_lowest
+    # )
     y = (y / window_envelop).squeeze(1)  # size (channel, expected_signal_len)
yields:
import torch
import torchaudio
import numpy.testing
n_fft = 16
hop_length = 4
win_length = n_fft
num_frames = 32
signal = torch.randn(hop_length * num_frames)
# Create a spectrogram that aligns with the signal
spectrogram = torch.stft(
    torch.nn.functional.pad(signal, ((n_fft - hop_length) // 2, (n_fft - hop_length) // 2)),
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=n_fft,
    window=torch.hann_window(n_fft),
    center=False)
assert spectrogram.shape[1] == num_frames
# Reconstruct the signal and ensure it matches the original signal
reconstructed_signal = torchaudio.functional.istft(
    spectrogram,
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=n_fft,
    window=torch.hann_window(n_fft),
    center=False)
reconstructed_signal.shape
# torch.Size([132])
signal.shape
# torch.Size([128])
Can you provide the code that works with librosa?
Hi, everyone! @vincentqb @PetrochukM I've encountered the same kind of problem and decided to check librosa's output. The code works fine.
import librosa
import numpy as np
n_fft = 16
hop_length = 4
win_length = n_fft
num_frames = 32
signal_np = np.random.randn(hop_length * num_frames)
# Create a spectrogram that aligns with the signal
spectrogram_np = librosa.core.stft(
    np.pad(signal_np, ((n_fft - hop_length) // 2, (n_fft - hop_length) // 2), 'constant', constant_values=(0, 0)),
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=n_fft,
    window='hann',
    center=False)
assert spectrogram_np.shape[1] == num_frames, f"{spectrogram_np.shape[1]}"
# Reconstruct the signal and ensure it matches the original signal
reconstructed_signal_np = librosa.core.istft(
    spectrogram_np,
    hop_length=hop_length,
    win_length=n_fft,
    window='hann',
    center=False)
reconstructed_signal_np.shape
# (140,)
signal_np.shape
# (128,)
np.isclose(reconstructed_signal_np[(n_fft - hop_length) // 2:-(n_fft - hop_length) // 2], signal_np).all()
# True
It seems to me that there is some difference in istft operation between torchaudio and librosa.
# center=False
NFFT = 512
hop_size = 256
S = torch.stft(y, n_fft=NFFT, hop_length=hop_size, win_length=NFFT,
               center=False,
               window=torch.hann_window(window_length=NFFT), return_complex=True)
print("y", y.shape) # [1, 98473]
centerFalse_wav = torch.istft(S, n_fft=NFFT, hop_length=hop_size, win_length=NFFT,
                              center=False,
                              window=torch.hann_window(window_length=NFFT),
                              length=y.shape[-1])
print("y", y.shape)
print("centerFalse_wav", centerFalse_wav.shape)
RuntimeError: istft(CPUComplexFloatType[1, 257, 383], n_fft=512, hop_length=256, win_length=512, window=torch.FloatTensor{[512]}, center=0, normalized=0, onesided=None, length=98473, return_complex=0) window overlap add min: 1 [ CPUBoolType{} ]
🐛 Bug
In a project that I am working on, I need to keep the spectrogram and the signal aligned. I provided a small script below.
The issues that I am running into:
1) At the moment, the length variable in istft has no constraints; therefore, it can extend into the padding. This has the potential for silent errors.
2) The istft function errors with custom padding.
To Reproduce
Potential Solution
I can get this script to work if I change half_n_fft = n_fft // 2 to half_n_fft = (n_fft - hop_length) // 2. This signals to me that it might help for istft to accept a parameter indicating how much padding was applied.
Is that something you'd be interested in?
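As a rough illustration of what such a parameter would change, the sketch below models only the trimming arithmetic. The function istft_trim_bounds and its padding argument are hypothetical and do not exist in torchaudio; the model assumes the overlap-added output spans n_fft + hop_length * (num_frames - 1) samples before trimming.

```python
def istft_trim_bounds(num_frames, n_fft, hop_length, padding):
    """Return (start, end) sample indices kept after removing `padding` per side.

    Hypothetical model: the overlap-added output is assumed to span
    n_fft + hop_length * (num_frames - 1) samples before trimming.
    """
    full_len = n_fft + hop_length * (num_frames - 1)
    if 2 * padding > full_len:
        raise ValueError("padding exceeds the overlap-added length")
    return padding, full_len - padding


n_fft, hop_length, num_frames = 16, 4, 32
for padding in (n_fft // 2, (n_fft - hop_length) // 2):
    start, end = istft_trim_bounds(num_frames, n_fft, hop_length, padding)
    print(padding, end - start)
# prints "8 124" then "6 128"
```

Under this model, trimming n_fft // 2 per side leaves fewer samples than the 128-sample signal, while trimming (n_fft - hop_length) // 2 per side leaves exactly 128.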