pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch

https://pytorch.org/audio

BSD 2-Clause "Simplified" License

2.5k stars 644 forks source link

Why does `transforms.TimeStretch` return of type `complex64`? #3688

Closed kuraga closed 11 months ago

kuraga commented 11 months ago

🐛 Describe the bug

Good day!

https://pytorch.org/audio/2.1.0/generated/torchaudio.transforms.TimeStretch.html#torchaudio.transforms.TimeStretch.forward:

Stretched spectrogram. The resulting tensor is of the same dtype as the input spectrogram, but the number of frames is changed to ceil(num_frame / rate).

But:

s = torchaudio.transforms.Spectrogram()(x)
s.dtype  # => torch.float32

t = torchaudio.transforms.TimeStretch(fixed_rate=0.9)(s)
t.dtype  # =>  torch.complex64

Should I collect a bug report or don't I understand time stretching?

(previously posted at the forum)

Versions

torchaudio 2.1.1 from Google Colab

mthrok commented 11 months ago

TimeStretch (or underlying phase_vocoder) expects input to be raw spectrogram (the one with power=None) because it manipulates the input signal in complex plane based on the phase information. It alters both phase and magnitude, then returns the complex spectrogram.

https://github.com/pytorch/audio/blob/c5b69336e951beac842aae09bca5cc62c114e018/src/torchaudio/functional/functional.py#L735

torchaudio.transforms.Spectrogram has argument power with default value of 2, which produces real-valued power spectrogram. It discards phase information. In this case, TimeStretch interprets the input signal as having zero phase everywhere.

I feel like it is more user friendly to warn or reject real-valued spectrogram input in TimeStretch.

kuraga commented 11 months ago

@mthrok , thanks!

https://pytorch.org/audio/2.1.0/generated/torchaudio.transforms.TimeStretch.html:

Seems like we need to show the way of getting the picture in the Example.
And fix the statement:

Stretched spectrogram. The resulting tensor is of the same dtype as the input spectrogram, but the number of frames is changed to ceil(num_frame / rate).

Also:

hop_length (int) or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)

But there is no win_length argument.

Your idea about the warning.

mthrok commented 11 months ago

@kuraga

3694 will fix the documentation and #3695 will add the warning if real-valued tensor is passed.

Seems like we need to show the way of getting the picture in the Example.

It is found in Feature Extraction tutorial found in the same documentation, so I will defer to it.

kuraga commented 11 months ago

@mthrok

3694 will fix the documentation and #3695 will add the warning if real-valued tensor is passed.

Wow-wow, thanks!!

Seems like we need to show the way of getting the picture in the Example.

It is found in Feature Extraction tutorial found in the same documentation, so I will defer to it.

I meant librosa.amplitude_to_db call (or .abs().pow(2) etc. call) isn't reflected. But now I see visualisation details are not reflected at methods' documentation.