pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

InverseMelScale does not work in inference mode #1902

Closed mthrok closed 1 year ago

mthrok commented 3 years ago

InverseMelScale uses SGD internally, so it does not work when the global context is no_grad or inference_mode; even an input with requires_grad=False makes it fail. This gives a bad UX for inference.

There are a couple of possible workarounds:

  1. Clone the Tensor with requires_grad=True inside of InverseMelScale (a rough sketch follows this list)
  2. Use another method such as SVD (there was some discussion in https://github.com/pytorch/audio/pull/366)
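
For option 1, it could look roughly like the following (an illustrative sketch only, not the actual torchaudio code; fb stands for a (n_mels, n_stft) mel filterbank):

    import torch

    def _estimate_spec(melspec: torch.Tensor, fb: torch.Tensor, n_iter: int = 1000) -> torch.Tensor:
        # Run the internal optimization on a fresh leaf tensor under enable_grad,
        # so the caller's requires_grad / no_grad state does not matter.
        with torch.enable_grad():
            spec = torch.rand(fb.size(1), melspec.size(1), requires_grad=True, device=melspec.device)
            optim = torch.optim.SGD([spec], lr=0.3, momentum=0.9)
            for _ in range(n_iter):
                optim.zero_grad()
                loss = torch.nn.functional.mse_loss(fb.matmul(spec.clamp(min=0)), melspec.detach())
                loss.backward()
                optim.step()
        return spec.clamp(min=0).detach()

Note this still would not cover inference_mode, which prevents recording for backward entirely. Repro of the current behavior: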
import torch
import torchaudio

trans = torchaudio.transforms.InverseMelScale(n_stft=10)

spec = torch.randn(128, 256, requires_grad=True)
trans(spec)  # pass
with torch.inference_mode():
    trans(spec)  # fail

spec = torch.randn(128, 256, requires_grad=False)
trans(spec)  # fail
with torch.inference_mode():
    trans(spec)  # fail
Traceback (most recent call last):
  File "foo.py", line 10, in <module>
    trans(spec)
  File "/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Development/torchaudio/torchaudio/transforms.py", line 472, in forward
    new_loss.backward()
  File "/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
nateanl commented 3 years ago

According to the discussion, the SVD approach is not stable, so we might choose option 1 as the workaround. I can work on it, as it looks straightforward. Do we need to add a test for inference mode?

nateanl commented 3 years ago

I found another solution in librosa: librosa.util.nnls uses np.linalg.lstsq, which can be replaced by torch.lstsq, so we don't need to re-implement the SGD manually.
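
Something along these lines, using torch.linalg.lstsq (just a sketch of the idea, not tested; fb is the (n_mels, n_stft) filterbank, and the non-negativity that nnls enforces is only approximated here by clamping):

    import torch

    def inverse_mel_lstsq(melspec: torch.Tensor, fb: torch.Tensor) -> torch.Tensor:
        # Solve fb @ spec ≈ melspec in the least-squares sense; no autograd is needed,
        # so this works under no_grad and inference_mode alike.
        spec = torch.linalg.lstsq(fb, melspec).solution  # (n_stft, time)
        return spec.clamp(min=0)  # crude stand-in for the non-negativity constraint of nnls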

spyroot commented 2 years ago

Hi @nateanl, did you port the nnls method?

nateanl commented 2 years ago

Hi @spyroot , not yet. I'll work on it after the 0.12 release. Stay tuned!

spyroot commented 2 years ago

Ratio of relative diff smaller than 1.000000e-01 is 7.614522473886609e-05
Ratio of relative diff smaller than 1.000000e-03 is 0.0
Ratio of relative diff smaller than 1.000000e-05 is 0.0
Ratio of relative diff smaller than 1.000000e-10 is 0.0
Ratio of relative diff padded than 1.000000e-01 is 7.145936251617968e-05
Ratio of relative diff padded than 1.000000e-03 is 0.0
Ratio of relative diff padded than 1.000000e-05 is 0.0
Ratio of relative diff padded than 1.000000e-10 is 0.0

Ratio of relative diff librosa than 1.000000e-01 is 2.3429299744748278e-06  <--- was my target
Ratio of relative diff librosa than 1.000000e-03 is 0.0
Ratio of relative diff librosa than 1.000000e-05 is 0.0
Ratio of relative diff librosa than 1.000000e-10 is 0.0

Ratio of relative diff my impl 1.000000e-01 is 0.0  <--- :) solved
Ratio of relative diff my impl 1.000000e-03 is 0.0
Ratio of relative diff my impl 1.000000e-05 is 0.0
Ratio of relative diff my impl 1.000000e-10 is 0.0

nateanl commented 2 years ago

Hi @spyroot, do you mean you already implemented the nnls method? Would you like to open a pull request for it? We can help review after the PR is created. Thanks!

spyroot commented 2 years ago

Yes, I used LBFGS and tested it on GPU: it is crazy fast, with superb absolute error against the original source. I'll open a pull request next week, but I have fixes for two bugs in LBFGS which I need to commit together.
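
Roughly the same structure as the SGD sketch in the issue description, but with torch.optim.LBFGS and a closure (illustrative only, not the exact code I will put in the PR; fb again is a (n_mels, n_stft) filterbank):

    import torch

    def inverse_mel_lbfgs(melspec: torch.Tensor, fb: torch.Tensor) -> torch.Tensor:
        with torch.enable_grad():
            spec = torch.zeros(fb.size(1), melspec.size(1), requires_grad=True, device=melspec.device)
            optim = torch.optim.LBFGS([spec], max_iter=100, line_search_fn="strong_wolfe")

            def closure():
                # LBFGS re-evaluates the objective several times per step via this closure
                optim.zero_grad()
                loss = torch.nn.functional.mse_loss(fb.matmul(spec.clamp(min=0)), melspec.detach())
                loss.backward()
                return loss

            optim.step(closure)
        return spec.clamp(min=0).detach()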

I have one question about the current implementation. Torch does a backward pass inside InverseMelScale. So if you do that in the training loop, it is a second backward pass.

So how is it intended to be used if you want to compute the inverse in the training loop?

nateanl commented 2 years ago

So if you do that in the training loop, it is a second backward pass.

That's right. The issue is that if the module runs in inference mode, which means it can't use gradients at all, then the optimization inside will fail. Thus we want to find an alternative solution that makes the module work in both training and inference mode.

So how is it intended to be used if you want to compute the inverse in the training loop?

It could be used that way. For example, I may have a GAN that predicts the mel-spectrogram and passes it to InverseMelScale and GriffinLim to get a waveform as the final output. Then we should make sure the gradients go through all modules with no failure. Does that make sense?
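
Schematically, something like this (the transform parameters are only illustrative, and whether gradients actually reach the generator depends on how InverseMelScale is implemented):

    import torch
    import torchaudio

    # stand-in for the generator output: a mel spectrogram that requires grad
    mel = torch.randn(128, 256, requires_grad=True)

    inverse_mel = torchaudio.transforms.InverseMelScale(n_stft=201, n_mels=128)
    griffin_lim = torchaudio.transforms.GriffinLim(n_fft=400)

    waveform = griffin_lim(inverse_mel(mel))
    loss = waveform.abs().mean()  # placeholder waveform-domain loss
    loss.backward()               # the requirement: this should populate mel.grad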

spyroot commented 2 years ago

@nateanl thank you very much. The reason I asked relates to the second case. Imagine we are working in a non-GAN setting: if you call backward() inside a loss function, that breaks things a bit.

Imagine a training pass (pseudocode-ish):

    for batch in batches:
        loss = compute_loss(batch)  # problematic if compute_loss does the inverse and calls backward() internally
        self.scaler.scale(loss).backward()
        self.scaler.step(optimizer)
        self.scaler.update()

It is problematic.

Yes, we probably can formulate it as a GAN, i.e. do backward() on the main optimizer and then compute the inverse.

Thinking a bit deeper: the inverse computes a solution to Ax = b, and we know that a solution exists. (In this case, by the way, if you find a solution you don't need to deal with complex values.)

Why? Because the solution is float-valued, i.e. you can find a solution even when the inputs are complex and the output is float. So you can technically compute the inverse inside the loss, but why would you want to? Presumably you need that inverse for the loss computation, but if you already have an exact solution, there is nothing to minimize: if the inverse appears as a term in the optimization formulation, that term already has a solution.

Do you see my point? For example, in my case I have to do this inside compute_loss(), and I am still trying to figure out the most efficient way to do it while avoiding backward():

    with torch.no_grad():
        stfs = self.dts_inverse(mel)

dts_inverse is my implementation.

nateanl commented 2 years ago

@spyroot I see. So in your implementation, dts_inverse doesn't require gradient-based optimization. This helps solve the issue that InverseMelScale can't work in inference mode.

Regarding the usage in training, although we don't intend to optimize InverseMelScale itself, the differentiability of the module is important. Take the speech enhancement task as an example: some methods optimize the model based on waveforms instead of spectrograms, and they indeed achieve some performance gain by doing so. In that case, we don't want the module hard-coded with torch.no_grad(), because that breaks the chain of gradients.
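
A tiny illustration of why a hard-coded no_grad is a problem (dts_like_inverse is just a placeholder for such a module):

    import torch

    mel = torch.randn(128, 256, requires_grad=True)  # e.g. the output of a model

    def dts_like_inverse(m):
        with torch.no_grad():   # hard-coded no_grad inside the module
            return m * 2.0      # stand-in for the actual inverse computation

    out = dts_like_inverse(mel)
    print(out.requires_grad)  # False: the output is detached from the input,
                              # so a waveform-domain loss can never reach the model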

stonelazy commented 2 years ago

I found another solution in librosa: librosa.util.nnls uses np.linalg.lstsq, which can be replaced by torch.lstsq, so we don't need to re-implement the SGD manually.

Just wanted to know whether this has been implemented, by any chance.

747929791 commented 2 years ago

Can this issue be solved by temporarily setting torch.enable_grad at the call site or inside the function?

nateanl commented 2 years ago

Can this issue be solved by temporarily setting torch.enable_grad at the call site or inside the function?

torch.enable_grad works inside torch.no_grad, but torch.inference_mode is stricter: it doesn't record computation in the backward graph at all, so the optimization inside InverseMelScale can't be run.
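
For example, roughly (the exact error you get under inference mode may depend on the PyTorch version):

    import torch

    w = torch.randn(4, requires_grad=True)  # created outside any context manager

    # no_grad only toggles grad mode, so enable_grad can locally re-enable it
    with torch.no_grad():
        with torch.enable_grad():
            (w ** 2).sum().backward()  # works, w.grad is populated

    # inference_mode additionally skips recording for backward altogether,
    # and enable_grad cannot undo that
    with torch.inference_mode():
        with torch.enable_grad():
            (w ** 2).sum().backward()  # fails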

mthrok commented 1 year ago

Addressed via #3280

y10ab1 commented 10 months ago

Hi, I recently encountered the same issue. Has this problem been resolved? I am currently using torchaudio==2.0.2.