keunwoochoi / torchaudio-contrib

A test bed for updates and new features | pytorch/audio

ISTFT #27

Open keunwoochoi opened 5 years ago

keunwoochoi commented 5 years ago

https://gist.github.com/keunwoochoi/2f349e72cc941f6f10d4adf9b0d3f37e

It works, but I'm probably not going to make a PR to PyTorch (https://github.com/pytorch/pytorch/issues/3775) because it's written in Python (torch.stft is written in C). Just for future reference/usage.
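
For reference, the general shape of the approach (a minimal sketch written against the newer torch.fft.irfft API, not the gist itself; centering/padding is ignored here):

import torch

def istft_overlap_add(spec, n_fft, hop_length, window):
    """Overlap-add ISTFT sketch.

    spec: complex tensor of shape (n_fft // 2 + 1, n_frames), as produced by
    torch.stft(signal, n_fft, hop_length, window=window, center=False,
               return_complex=True).
    """
    n_frames = spec.shape[-1]
    out_len = n_fft + hop_length * (n_frames - 1)
    signal = torch.zeros(out_len)
    norm = torch.zeros(out_len)
    for t in range(n_frames):
        frame = torch.fft.irfft(spec[:, t], n=n_fft)    # one frame back to the time domain
        start = t * hop_length
        signal[start:start + n_fft] += frame * window   # windowed overlap-add
        norm[start:start + n_fft] += window ** 2        # squared-window normalization
    return signal / norm.clamp(min=1e-8)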

faroit commented 5 years ago

Great. Did you test the reconstruction error?

seungwonpark commented 5 years ago

I think lines 33-40 can be replaced with a transposed convolution. (I've done that in my private repository, but also didn't make a PR to PyTorch since it's not written in C like torch.stft.) I could probably open-source my private implementation of inverse STFT after some code refactoring; it's also based on librosa's istft implementation. Would that be helpful for us?
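
Roughly, the idea is something like the following (just a sketch of the formulation, not the actual implementation): the real/imaginary bins become input channels, the inverse real-DFT basis times the window becomes a fixed kernel, and conv_transpose1d with stride equal to the hop does per-frame synthesis plus overlap-add in one call. The squared-window normalization still has to be applied afterwards.

import math
import torch
import torch.nn.functional as F

def conv_istft(spec, n_fft, hop_length, window):
    """ISTFT as a single transposed convolution (un-normalized overlap-add).

    spec: complex tensor (batch, n_fft // 2 + 1, n_frames) from torch.stft.
    """
    n_bins = n_fft // 2 + 1
    k = torch.arange(n_bins, dtype=torch.float32).unsqueeze(1)   # (n_bins, 1)
    n = torch.arange(n_fft, dtype=torch.float32).unsqueeze(0)    # (1, n_fft)
    angle = 2 * math.pi * k * n / n_fft
    scale = torch.full((n_bins, 1), 2.0 / n_fft)
    scale[0] = 1.0 / n_fft                                       # DC bin is not doubled
    if n_fft % 2 == 0:
        scale[-1] = 1.0 / n_fft                                  # neither is the Nyquist bin
    real_basis = torch.cos(angle) * scale * window               # (n_bins, n_fft)
    imag_basis = -torch.sin(angle) * scale * window
    # Fixed kernel of shape (in_channels=2 * n_bins, out_channels=1, kernel_size=n_fft).
    weight = torch.cat([real_basis, imag_basis], dim=0).unsqueeze(1)
    frames = torch.cat([spec.real, spec.imag], dim=1)            # (batch, 2 * n_bins, n_frames)
    # stride=hop_length makes the transposed conv do synthesis + overlap-add at once.
    return F.conv_transpose1d(frames, weight, stride=hop_length).squeeze(1)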

keunwoochoi commented 5 years ago

reconstruction error

@faroit Could you possibly test it yourself? I did, and the error was 0. Not sure if it's for real..

transpose convolution

@seungwonpark Yes. There are pros and cons, though. torch.rfft is O(n log n) in computation while a transposed convolution is O(n ** 2). But the transposed conv layer might still be faster thanks to better parallelization -- although the current approach can also be framed first (with reshaping) to remove the loop, so torch.rfft would probably only be executed a couple of times.

That said, we can simply time them and compare :) So yes, please share it! Even better if you can measure both.
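
Something like this would cover both (a rough sketch, assuming the istft_overlap_add and conv_istft sketches above; the conv output is left un-normalized, so the accuracy check uses the loop version):

import time
import torch

x = torch.randn(1, 16000)
n_fft, hop = 1024, 256
window = torch.hann_window(n_fft)
spec = torch.stft(x, n_fft, hop_length=hop, window=window,
                  center=False, return_complex=True)

start = time.perf_counter()
y_conv = conv_istft(spec, n_fft, hop, window)
t_conv = time.perf_counter() - start

start = time.perf_counter()
y_loop = istft_overlap_add(spec[0], n_fft, hop, window)
t_loop = time.perf_counter() - start
print(f"conv: {t_conv:.4f}s  loop: {t_loop:.4f}s")

# Crude reconstruction check on the normalized (loop) version,
# ignoring the very edges where the window sum vanishes.
inner = slice(n_fft, x.shape[-1] - n_fft)
rmse = (x[0, inner] - y_loop[inner]).pow(2).mean().sqrt()
print("interior RMSE:", rmse.item())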

seungwonpark commented 5 years ago

@keunwoochoi Thanks for pointing out the difference! I'll let you know when I'm ready to share.

seungwonpark commented 5 years ago

Here it is. I've also measured the time performance, and it looks like your implementation is much faster! https://github.com/seungwonpark/istft-pytorch

By the way, I can't figure out what causes the actual difference between them. The major difference is that mine uses ifft and yours uses irfft, so I tried using irfft instead of ifft in my implementation, but found no significant difference.

faroit commented 5 years ago

@keunwoochoi Cool, I tested yours in a small experiment where I ran the STFT of one library (librosa, scipy, tf, torch) against the ISTFT of another and measured the reconstruction RMSE.

STFT --> ISTFT                   RMSE
=================================
test_torch --> test_torch 0.21650636
test_torch --> test_scipy 1.08139886e-07
test_torch --> test_librosa 1.1001805855710549e-07
test_torch --> test_tf 1.9412305e-06
test_scipy --> test_torch 0.21650635
test_scipy --> test_scipy 7.398381e-08
test_scipy --> test_librosa 7.647041059862848e-08
test_scipy --> test_tf 1.9329398e-06
test_librosa --> test_torch 0.21650635
test_librosa --> test_scipy 2.0918270288343244e-16
test_librosa --> test_librosa 2.294981695912998e-16
test_librosa --> test_tf 1.9328936e-06
test_tf --> test_torch 0.21650644
test_tf --> test_scipy 4.216027e-06
test_tf --> test_librosa 4.220019682546531e-06
test_tf --> test_tf 4.694827e-06

I think this is because of the combination of the half overlap (you used //4), the Hann window, and the missing normalization. Anyway, this is easy to fix...
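
For anyone reproducing this: each cell boils down to pairing one library's analysis with another's synthesis, e.g. the torch --> librosa cell roughly looks like this (a sketch with illustrative parameters, not the exact setup above):

import numpy as np
import torch
import librosa

x = np.random.randn(16384).astype(np.float32)
n_fft, hop = 1024, 256
window = torch.hann_window(n_fft)
# Analysis with torch, synthesis with librosa (both use centered frames here).
spec = torch.stft(torch.from_numpy(x), n_fft, hop_length=hop,
                  window=window, return_complex=True).numpy()
y = librosa.istft(spec, hop_length=hop, win_length=n_fft, window='hann')
n = min(len(x), len(y))
print("RMSE:", np.sqrt(np.mean((x[:n] - y[:n]) ** 2)))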

faroit commented 5 years ago

@f0k Btw, do you know a good way to write a unit test for STFT/ISTFT errors due to windowing when masking is applied before the ISTFT, e.g. normalizing using the window instead of the squared window? I've always wondered how to catch these...
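
(One cheap check I can think of, which catches at least the window-vs-squared-window mistake, though not the masking case directly: a constant signal should come back as a constant in the interior, whereas dividing by the summed window instead of the summed squared window biases it by roughly sum(w^2)/sum(w). A sketch, reusing the istft_overlap_add sketch from above:)

import torch

x = torch.ones(1, 8192)
n_fft, hop = 1024, 256
window = torch.hann_window(n_fft)
spec = torch.stft(x, n_fft, hop_length=hop, window=window,
                  center=False, return_complex=True)
y = istft_overlap_add(spec[0], n_fft, hop, window)
inner = slice(n_fft, y.shape[-1] - n_fft)       # skip the boundary frames
assert torch.allclose(y[inner], torch.ones_like(y[inner]), atol=1e-5)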

keunwoochoi commented 5 years ago

It's probably due to the boundary. I don't pre-pad; if you ignore the first and last half-windows in the evaluation, it should be much better for the moment.

nkundiushuti commented 5 years ago

It would be interesting to compare with https://github.com/wslihgt/pyfasst/tree/master/pyfasst/tftransforms. I have used JL Durrieu's STFT implementation, which uses numpy, as a baseline. I think madmom (the JKU Linz audio framework) also uses something similar. Oh, btw, you can find a tentative implementation of CQT using NSGT there, but it's not reliable, and I don't know how you could port it to PyTorch.

faroit commented 5 years ago

it would be interesting to compare with https://github.com/wslihgt/pyfasst/tree/master/pyfasst/tftransforms

Cool. There's also https://github.com/nils-werner/stft, which I used for quite a while. I think we should stick to scipy.signal and librosa since they are heavily tested. However, I'm super interested to spot the differences and discuss which one has the best reconstruction performance... but maybe that's out of scope for this repo ;-)

seungwonpark commented 5 years ago

Hi, we recently found out that the time-profiling code in my repo for comparing Choi's implementation (rfft) and mine (deconvolution with parallel ops) was wrong, and fixed it.

And the new conclusion is: deconvolution is much faster! Please check https://github.com/seungwonpark/istft-pytorch.

faroit commented 5 years ago

I think we should close this then and move this initiative over to https://github.com/pytorch/pytorch/issues/3775

@seungwonpark are you willing to help out over there?

keunwoochoi commented 5 years ago

Yeah, we could close it, and maybe re-open once we want to work on a wrapper for it.

keunwoochoi commented 5 years ago

@faroit Hey, so after closing it I think we sort of lost track. I don't think it's actually necessary to wait for the PyTorch istft. There are also some implementation details to settle (e.g., deconv like @seungwonpark's vs. rfft as I've done), but maybe we could have our own tentative implementation with a less-tentative API? It seems the critical issues were resolved (https://github.com/faroit/stft-istft-experiments/issues/1), and we can follow the API of STFT: https://github.com/keunwoochoi/torchaudio-contrib/blob/master/torchaudio_contrib/layers.py#L35.

faroit commented 5 years ago

Okay, how about we open up a PR with an API proposal to address ISTFT in https://github.com/pytorch/pytorch/issues/3775? Then we could

keunwoochoi commented 5 years ago

opening up a PR

You mean a PR here? (Yes, let's do it.) Or in PyTorch? (Would they care about accepting an API without an implementation? Also, we can easily change it later, at least in this repo.)

faroit commented 5 years ago

On pytorch.

Yes, we could just add this ISTFT, but then it's rather hacky (using conv layers), which means it won't be that easy to make it clean and readable. And once we reached that point, someone would have to start from scratch using libtorch, so it wouldn't be particularly sustainable ;-)

keunwoochoi commented 5 years ago

OK, I'm not really sure if they'd care about an API only. Given the history, they've been adopting what had been implemented outside of PyTorch quite actively (in other words, somewhat blindly if it seems good), at least in the case of audio processing. So, how about we do it here, maybe

def istft(fft, hop, etc etc, mode='conv'):
    """wrapper for istft"""

def conv_istft(fft, hop, etc etc):
    """one of the options"""

?

It wouldn't block follow-up development while we refine it.
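
To be concrete, the wrapper could look something like this (all names, parameters, and mode strings are placeholders, reusing the two sketches from earlier in the thread):

import torch

def istft(spec, n_fft, hop_length, window=None, mode='conv'):
    """Tentative wrapper that dispatches to one of the candidate implementations."""
    if window is None:
        window = torch.hann_window(n_fft)
    # (batched vs. unbatched shape handling between the two sketches is glossed over here)
    if mode == 'conv':
        return conv_istft(spec, n_fft, hop_length, window)          # deconv-based sketch
    if mode == 'rfft':
        return istft_overlap_add(spec, n_fft, hop_length, window)   # loop-based sketch
    raise ValueError(f"unknown mode: {mode!r}")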

keunwoochoi commented 5 years ago

(continued)

..Or, BOTH?! Since we have control and flexibility here -- and the API for ISTFT would be pretty much the same as that of STFT -- let me add a tentative one (or persuade @seungwonpark to add one here :) while asking PyTorch. @faroit Could you open a PR for that?

drscotthawley commented 5 years ago

Minor clarification question: if the STFT coefficients are allowed to evolve & train via autograd (I assume they are, yes?), then is the idea for this ISTFT module to be either

(a) sort of a 'trainable' ISTFT which may not remain a perfect inverse to the trainable STFT, i.e. it is independent from the STFT routine (@seungwonpark's deconv version looks that way), or
(b) a routine that is somehow mathematically tied to the (trainable) STFT and always guaranteed to invert it 'perfectly'?

I'm guessing (a). Yes? Mainly because I have no idea how you'd do (b), I can think of plenty of uses for (a), and proper training could enforce arbitrary levels of inversion quality if you wanted it.

keunwoochoi commented 5 years ago

I also guess it'd be (a) in this case, i.e., the provided ISTFT would be a static, non-trained module.
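
E.g., as a module it could look like this (a sketch with illustrative names): the window lives in a buffer rather than a Parameter, so nothing gets trained, but gradients still flow through to the input spectrogram.

import torch
import torch.nn as nn

class ISTFT(nn.Module):
    def __init__(self, n_fft, hop_length, window):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.register_buffer('window', window)   # fixed tensor, never updated by the optimizer

    def forward(self, spec):
        # Delegate to one of the functional sketches above; gradients w.r.t. `spec`
        # still flow, while the synthesis itself stays non-trainable.
        return conv_istft(spec, self.n_fft, self.hop_length, self.window)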

drscotthawley commented 5 years ago

Ohhh, static & non-trained. Interesting. Thanks for the clarification.

keunwoochoi commented 5 years ago

ISTFT is being developed in the main repo https://github.com/pytorch/audio/pull/130