Open xvdp opened 2 years ago
hi @xvdp, thanks for sharing. The improvement looks good to me. Regarding the in-place operation, @mthrok do you think grad is a potential issue? I didn't find a gradient check unit test for resample.
@nateanl I did open a PR but didn't add any tests. a. I figured that any existing checks would cover this. b. Locally I compared both versions of the resample to ensure that under various dtypes, devices, and from/to sampling rates the outputs satisfy torch.allclose(). I'm not sure how to add those tests here. One difference from the sketch code above is that in the PR I didn't set torch.no_grad(), because that seems like something that should be handled equally for all operators.
As I also mention, building the resampling kernel can be woefully inefficient if, for instance, one is resampling to a prime-numbered rate. For most augmentation cases exactness is meaningless, so in my local projects I instead use the closest fraction to the rate change, capping the maximum kernel size. Alternatively, for your purposes, porting this code to C may be helpful; Python doesn't play nicely with this kind of operation. But for now, this works.
@xvdp thanks for the suggestions and PR, broadcasting sounds like a good idea! I did try running your code here and on your PR, though, and did not get the same results as the existing implementation. It seems CircleCI was broken when you created your PR, so let me leave some comments there as well.
For CUDA devices, prior benchmarking experiments have shown improvements using CUDA for certain resampling rates, and moving a tensor from CPU to CUDA takes time as well, so we would need more comprehensive numbers before adding that change.
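One rough way to gather such numbers is a timing sketch like the following. This is illustrative only, not a rigorous benchmark: `build_kernel` here is a toy stand-in for the real kernel construction, and the function names are hypothetical.

```python
import math
import time
from typing import Optional

import torch


def build_kernel(orig_freq: int, new_freq: int, width: int, device: str) -> torch.Tensor:
    # Toy stand-in for sinc kernel construction via broadcasting:
    # one row of sample times per output phase, then a sinc over them.
    idx = torch.arange(-width, width + orig_freq, device=device, dtype=torch.float64)
    phase = torch.arange(new_freq, device=device, dtype=torch.float64)[:, None]
    t = (-phase / new_freq + idx / orig_freq) * math.pi
    return torch.where(t == 0, torch.ones_like(t), torch.sin(t) / t)


def bench(device: str, repeats: int = 10) -> Optional[float]:
    # Average wall-clock time per kernel build on the given device.
    if device == "cuda" and not torch.cuda.is_available():
        return None
    build_kernel(441, 480, 100, device)  # warm-up (CUDA context, caches)
    start = time.perf_counter()
    for _ in range(repeats):
        build_kernel(441, 480, 1000, device)
        if device == "cuda":
            torch.cuda.synchronize()  # CUDA launches are async; wait before timing
    return (time.perf_counter() - start) / repeats
```

Comparing `bench("cpu")` against `bench("cuda")` over a grid of rates, widths, and dtypes (plus the host-to-device copy cost for a CPU-built kernel) would give the comprehensive numbers mentioned above.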
@nateanl
I didn't find gradient check unit test for resample
gradcheck for resample is in the transforms autograd tests (not functional), and it passes both with and without torch.no_grad added.
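For reference, the point being made can be sketched without torchaudio at all: build a constant kernel under torch.no_grad() and let gradcheck exercise only the differentiable forward convolution. The names here (sinc_kernel, filtered) are illustrative, not torchaudio API.

```python
import math
import torch


def sinc_kernel(zeros: int = 2) -> torch.Tensor:
    # Constant sinc-style kernel, built under no_grad: nothing should
    # backpropagate into kernel creation.
    with torch.no_grad():
        t = torch.linspace(-zeros, zeros, 4 * zeros + 1, dtype=torch.float64)
        k = torch.where(t == 0, torch.ones_like(t),
                        torch.sin(math.pi * t) / (math.pi * t))
        return k.view(1, 1, -1)


def filtered(x: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    # Differentiable forward pass; the gradient flows through x only.
    return torch.nn.functional.conv1d(x.view(1, 1, -1), kernel).view(-1)


kernel = sinc_kernel()
x = torch.randn(16, dtype=torch.float64, requires_grad=True)
# gradcheck needs float64 inputs for numerical stability; returns True on success.
ok = torch.autograd.gradcheck(lambda t: filtered(t, kernel), (x,))
```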
cc @adefossez
Autograd should be fine as long as the existing tests pass. I agree with what @xvdp described about the kernel not being differentiable:
c. in place operations are faster, if no grad is required, but there is no reason why grad should be applied to these kernels
Thanks for the improvements :) Regarding grad computation, I personally use resampling on neural network outputs, and in that case you must backpropagate the gradient to the network. For the issue with slowness when the gcd is small I don't see a fix using this algorithm. The resampy one is much better in that case, but also harder to parallelize.
@adefossez larger kernels are an issue if you need exact resampling, but if you want to randomize it then one doesn't need the exact gcd. There's more than one way to compute an approximate GCD with kernels smaller than some reasonable max. For example:
```python
import math
import torch

def approx_kernel(from_freq, to_freq, max_kernel=320, verbose_error=True):
    """ (from_freq, to_freq) -> approx(from_freq, to_freq) """
    _min, _max = sorted((from_freq, to_freq))
    _upres = to_freq > from_freq
    ratio = _min / _max
    denominator = torch.arange(2, max_kernel + 1)
    numerator = denominator * ratio
    closest = torch.fmod(numerator, 1)
    argmin = (closest - torch.round(closest)).abs().argmin()
    _min = torch.round(numerator[argmin]).long().item()
    _max = denominator[argmin].long().item()
    if verbose_error:
        _ratio = _min / _max
        _gcd = math.gcd(from_freq, to_freq)
        _ex_from = from_freq // _gcd
        _ex_to = to_freq // _gcd
        _ap_from, _ap_to = [(_max, _min), (_min, _max)][_upres]
        print(f"exact ratio {ratio}; exact kernel {_ex_from, _ex_to}")
        print(f"approx ratio {_ratio}; approx kernel {_ap_from, _ap_to}")
        print(f"error {abs(ratio - _ratio)}")
    return [(_max, _min), (_min, _max)][_upres]

k_from, k_to = approx_kernel(96000, 44102)
# exact ratio 0.45939583333333334; exact kernel (48000, 22051)
# approx ratio 0.4594594594594595; approx kernel (37, 17)
# error 6.362612612614837e-05
```
Transforms like pitch shift, and others that may require resampling, are not transparent to kernel size and do not always require an exact shift. If you step through +6 to -6 semitones some kernels are ridiculous, but if you initialize them just once it's no problem. So in my augmentation set, every transform that requires resampling can be passed a superclass Resampling that caches all the kernels you'll probably need for a project. Incidentally, I use a more primitive version of the approximate gcd there.
```python
from typing import Optional
import math
import torch
from torch import nn
from torchaudio.transforms import Resample

class Resampling(nn.Module):
    """ caches Resample transforms to avoid repeat init
    implicitly assigns device on forward
    Args
        max_kernel  int [320]; default rationale: 320 == 96000 // gcd(96000, 44100)
                    None: exact  # can be very costly if gcd == 1
    Example: caching approximate kernels of max size 30
        R = Resampling(max_kernel=30)
        x, sr = torchaudio.load(<fname>)
        y = R(x, sr, sr//3)
    """
    def __init__(self, max_kernel: Optional[int] = 320) -> None:
        super().__init__()
        self.res = {}
        self.max_kernel = max_kernel

    def forward(self, x: torch.Tensor, fromsr: int, tosr: int, **kwargs) -> torch.Tensor:
        """ returns resampled tensor, caches Resample kernel
        Args:
            x       Tensor (N,C,L) | (C,L) | (L,)
            fromsr  int source sample rate
            tosr    int destination sample rate
        kwargs:
            max_kernel  int | None; if passed, temporarily overrides self.max_kernel
            kwargs forwarded to torchaudio.transforms.Resample:
                lowpass_filter_width, rolloff, beta, resampling_method
        """
        max_kernel = kwargs.get('max_kernel', self.max_kernel)
        fromsr, tosr = self._get_gcd(fromsr, tosr, max_kernel)
        kwargs = {key: kwargs[key] for key in
                  ['lowpass_filter_width', 'rolloff', 'beta', 'resampling_method']
                  if key in kwargs}
        if (fromsr, tosr) not in self.res:
            self.res[(fromsr, tosr)] = Resample(fromsr, tosr, **kwargs)
        return self.res[(fromsr, tosr)].to(device=x.device)(x)

    def _get_gcd(self, a: int, b: int, max_kernel: Optional[int] = 320, _step: int = 1) -> tuple:
        """ approximate gcd to max size
        Args:
            a           int
            b           int
            max_kernel  int | None
        """
        _gcd = math.gcd(a, b)
        a = int(a // _gcd)
        b = int(b // _gcd)
        if max_kernel is not None:
            # nudge parity of the larger-than-cap pair, re-reduce, repeat
            while max(a, b) > max_kernel:
                if b % 2:
                    b = b + 2 * (_step % 2) - 1
                elif a % 2:
                    a = a + 2 * (_step % 2) - 1
                else:
                    b = b + 2 * (_step % 2) - 1
                return self._get_gcd(a, b, max_kernel=max_kernel, _step=_step + 1)
        return a, b
```
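As a point of comparison for the approximate gcd above, the standard library can do the same capping with fractions.Fraction.limit_denominator. This is a sketch of the idea, not what the class above uses:

```python
from fractions import Fraction

# Pitch shifting by n semitones needs a resampling ratio of 2**(n/12),
# which is irrational, so an "exact" integer pair from high-precision
# rounding is enormous. limit_denominator caps the pair size, playing
# the same role as max_kernel above.
for n in (1, 6):
    ratio = 2 ** (n / 12)
    approx = Fraction(ratio).limit_denominator(320)
    print(n, approx, abs(float(approx) - ratio))
```

For a tritone (n = 6, ratio sqrt(2)) the capped pair stays small while the error remains on the order of 1e-5, comparable to the approx_kernel example earlier in the thread.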
I'm not sure what you mean about the gradients. The current resample is differentiable only insofar as its forward() is; kernel creation is not. I suppose one could write a project to learn kernels, but that's a different animal.
Also, sure, I understand that the version using arrays instead of a loop to create the kernel is NOT the solution for creating lots of huge kernels. But it is 9x faster than what's in torchaudio now. Broadcasting is the bulk of the speed improvement, not the in-place operations.
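To illustrate that the broadcast rewrite is a pure restructuring, here is a small sketch (not the actual torchaudio code) building the per-phase time grid both ways, loop and broadcast, with identical results:

```python
import torch


def phase_times_loop(orig_freq: int, new_freq: int, width: int) -> torch.Tensor:
    # One row of sample times per output phase, built in a Python loop
    # (the shape of what the current implementation does).
    idx = torch.arange(-width, width + orig_freq)
    rows = [(-i / new_freq + idx / orig_freq) for i in range(new_freq)]
    return torch.stack(rows)


def phase_times_broadcast(orig_freq: int, new_freq: int, width: int) -> torch.Tensor:
    # The same grid in one shot: a (new_freq, 1) column broadcast against
    # a (1, kernel_width) row, with no Python-level loop.
    i = torch.arange(new_freq).unsqueeze(1)
    idx = torch.arange(-width, width + orig_freq).unsqueeze(0)
    return -i / new_freq + idx / orig_freq


a = phase_times_loop(3, 2, 4)
b = phase_times_broadcast(3, 2, 4)
assert torch.allclose(a, b)
```

The loop version materializes new_freq separate tensors and stacks them; the broadcast version issues a handful of vectorized ops, which is where the bulk of the 9x comes from.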
@adefossez you do have a point, quality does matter, but both audio resampling and pitch shifting degrade in quality quite quickly with range. I will look at float64, but I don't know if it has any effect there; it seemed to me that the result gets quantized anyway. In a wider context: other than band-limited interpolation and low-pass filtering, what do the various mastering packages do? e.g. http://src.infinitewave.ca/ It would be cool to have mastering-quality transforms.
Hey @xvdp sorry for the confusion with the gradient, I hadn't looked at the PR in detail; indeed no gradient is needed for the kernel creation.
Regarding the GCD approximation, in some cases it could indeed be really nice to do it for the end user, but it is also a bit dangerous and not backward compatible. It would also require a more fine-grained analysis of what levels of rounding are acceptable across all possible sample rates.
Another potential improvement is one I noted in a comment I left: https://github.com/pytorch/audio/blob/main/torchaudio/functional/functional.py#L1443 . One way to limit this is to use multiple convolutions with different kernel sizes, but that requires testing to decide at what point splitting the kernel into multiple convolutions becomes beneficial. Even then it would not be as efficient as rounding the sample rates.
🐛 Describe the bug
torchaudio/functional/functional.py def _get_sinc_resample_kernel()
is slower than it should be because it uses loops instead of taking advantage of torch broadcasting. A simple rewrite makes it 9x faster. There are other issues with resampling kernels that still cause incredible slowness if the gcd of the sample rates does not reduce them; this issue does not address those.
Below is some sample code for the fix, and a paraphrase of the existing code. Other than removing the loops, a few observations: a. using torch.float64 does nothing; b. cuda is slower than cpu in both cases, so if a cuda kernel is required it should be cast at the end of kernel building; c. in-place operations are faster if no grad is required, and there is no reason why grad should be applied to these kernels.
faster
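The sample code referenced above is not reproduced in this extract. As a hedged sketch of what a broadcast rewrite of _get_sinc_resample_kernel can look like (following the general shape of torchaudio's implementation; the 0.99 rolloff and Hann-squared window are illustrative choices, not a claim about the exact code):

```python
import math
import torch


def sinc_resample_kernel(orig_freq: int, new_freq: int,
                         lowpass_filter_width: int = 6) -> torch.Tensor:
    # Build all new_freq output phases at once via broadcasting,
    # instead of a Python loop over phases.
    gcd = math.gcd(orig_freq, new_freq)
    orig_freq //= gcd
    new_freq //= gcd
    base_freq = min(orig_freq, new_freq) * 0.99  # rolloff to reduce aliasing
    width = math.ceil(lowpass_filter_width * orig_freq / base_freq)
    # (1, kernel_width) sample positions against (new_freq, 1) phase offsets
    idx = torch.arange(-width, width + orig_freq, dtype=torch.float64)[None] / orig_freq
    phase = torch.arange(new_freq, dtype=torch.float64)[:, None] / new_freq
    t = (-phase + idx) * base_freq
    t = t.clamp_(-lowpass_filter_width, lowpass_filter_width)
    window = torch.cos(t * math.pi / lowpass_filter_width / 2) ** 2
    t *= math.pi
    # sinc(t) with the t == 0 singularity handled explicitly
    kernel = torch.where(t == 0, torch.tensor(1.0, dtype=torch.float64),
                         torch.sin(t) / t)
    kernel.mul_(window)  # in-place: no grad is needed for kernel creation
    return kernel * base_freq / orig_freq
```

Each of the new_freq rows is one polyphase filter; the whole (new_freq, kernel_width) block is produced by a handful of vectorized ops, which is the source of the speedup the issue describes.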