pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License
2.43k stars, 635 forks

CQT, iCQT, and VQT implementations and testing #3804

Open d-dawg78 opened 1 week ago

d-dawg78 commented 1 week ago

Hey everyone,

I am happy to propose the addition of the CQT, iCQT, and VQT. The first two were requested in issue 588. Since the CQT is a VQT with gamma=0, I figured the VQT should be added to the package too. The VQT is also widely used in the research community, including as a time-frequency representation for neural networks. Here are a few important details.
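To make the gamma=0 relationship concrete, here is a minimal sketch of the textbook VQT bandwidth rule, B_k = alpha * f_k + gamma. The function name `vqt_q_factors` and the choice alpha = 2^(1/bins_per_octave) - 1 are assumptions for illustration; the exact constants in librosa and in the proposed code may differ.

```python
import math


def vqt_q_factors(fmin, n_bins, bins_per_octave, gamma):
    """Return the Q factor (f_k / B_k) of each bin, with B_k = alpha * f_k + gamma.

    Hypothetical helper for illustration only, not part of the proposed code.
    """
    # Assumed relative bandwidth for equal-tempered bin spacing.
    alpha = 2.0 ** (1.0 / bins_per_octave) - 1.0
    q_factors = []
    for k in range(n_bins):
        f_k = fmin * 2.0 ** (k / bins_per_octave)  # centre frequency of bin k
        bandwidth = alpha * f_k + gamma            # gamma widens the low bins most
        q_factors.append(f_k / bandwidth)
    return q_factors


cqt_q = vqt_q_factors(32.703, 24, 12, gamma=0.0)   # constant Q: this is the CQT
vqt_q = vqt_q_factors(32.703, 24, 12, gamma=20.0)  # Q grows with frequency
```

With gamma=0 every bin has the same Q, which is the defining property of the CQT. With gamma>0 the low-frequency bins get a lower Q, meaning shorter filters and better time resolution there.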

General

The proposed transforms follow, and are tested against, the librosa implementations. Note that the algorithms rely on recursive sub-sampling and the two libraries use different resampling algorithms, so the results gradually diverge from librosa's as the number of resampling iterations increases. The thresholds in the librosa comparison tests are adjusted accordingly. The librosa call being matched is the following:

librosa_vqt = vqt(
    y=<Y>,
    sr=<SAMPLE_RATE>,
    hop_length=<HOP_LENGTH>,
    fmin=<F_MIN>,
    n_bins=<N_BINS>,
    intervals="equal",
    gamma=<GAMMA>,
    bins_per_octave=<BINS_PER_OCTAVE>,
    tuning=0.,
    filter_scale=1,
    norm=1,
    sparsity=0.,
    window=<WINDOW>,
    scale=False,
    pad_mode="constant",
    res_type=<RES_TYPE>,
    dtype=<DTYPE>,
)

The <ARGUMENTS> (similar across all three transforms) are the ones that can be controlled in the proposed code. The others are hard-coded. In my opinion, they should stay that way to avoid unnecessary complexity. Future iterations of the transforms could expose some of these arguments, however, if the community requests them!

Tests

I was unable to make the transforms TorchScript-compatible; maybe this should be the focus of a future PR. Otherwise, I was able to run the tests on CPU but not on GPU, for installation reasons. Feel free to let me know if any tests are lacking.

Speed

On the audio snippet from here, over 100 iterations, with dtype=torch.float64:

VQT - torchaudio: 15.208; librosa: 50.121 (seconds)
CQT - torchaudio: 15.188; librosa: 47.686 (seconds)
iCQT - torchaudio: 7.029; librosa: 200.069 (seconds)
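For a quick read of these numbers, the speedup of torchaudio over librosa can be computed directly from the timings reported above. The variable names below are illustrative only.

```python
# (torchaudio seconds, librosa seconds), copied from the timings above.
timings = {
    "VQT": (15.208, 50.121),
    "CQT": (15.188, 47.686),
    "iCQT": (7.029, 200.069),
}

# Speedup = librosa time divided by torchaudio time.
speedups = {name: lib / ta for name, (ta, lib) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster than librosa")
```

That works out to roughly 3.3x for the VQT, 3.1x for the CQT, and 28.5x for the iCQT on this particular snippet and machine.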

Sanity Check

Here's an image of the CQT-gram generated using the following parameters:

SAMPLE_RATE = 44100
HOP_LENGTH = 512
F_MIN = 32.703
N_BINS = 108
BINS_PER_OCTAVE = 12

[image: cqts]
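As a side check on these parameters, the bin centre frequencies for equal-tempered spacing follow f_k = F_MIN * 2^(k / BINS_PER_OCTAVE). The sketch below uses a hypothetical helper, `cqt_frequencies`, that is not part of the PR, to confirm that the top bin stays below the Nyquist frequency.

```python
def cqt_frequencies(fmin, n_bins, bins_per_octave):
    """Centre frequencies of geometrically spaced bins (illustrative helper)."""
    return [fmin * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


SAMPLE_RATE = 44100
F_MIN = 32.703
N_BINS = 108
BINS_PER_OCTAVE = 12

freqs = cqt_frequencies(F_MIN, N_BINS, BINS_PER_OCTAVE)
n_octaves = N_BINS / BINS_PER_OCTAVE   # number of octaves spanned by the bins
nyquist = SAMPLE_RATE / 2              # highest representable frequency

print(f"{n_octaves:.0f} octaves, top bin at {freqs[-1]:.1f} Hz, Nyquist {nyquist:.0f} Hz")
```

These settings span 9 octaves starting at C1, and the highest bin sits well under the 22050 Hz Nyquist limit.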

The results are pretty much identical! Feel free to request changes or ask me any questions on this PR. I'll be happy to answer, and am excited to get these transforms to the package 🫡

zaptrem commented 3 days ago

Awesome contribution! A bunch of torch.ones tensors are created on the CPU regardless of the input tensor's device. Also, it would be nice to have an inverse VQT function as well. Also, do you know a set of parameters that would result in a perfect or nearly perfect reconstruction? I had to fiddle with the filter-length code to get something that was even close, and there is still a high-frequency buzzing sound and increased loudness at the start and end. I also noticed that my 262,144-sample input to a CQT with hop_size 512 had an output size of 513 instead of 512 unless I set hop_size to 513, but that may be a result of the aforementioned fiddling.
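For illustration, the usual way to avoid the device mismatch is to create helper tensors with the input's device and dtype. This is a minimal sketch with a hypothetical helper name, `ones_matching`; it is not the code from the PR.

```python
import torch


def ones_matching(n, reference):
    """Create a length-n tensor of ones on the same device and dtype as `reference`.

    Hypothetical helper for illustration; plain torch.ones(n) would always
    allocate on the default device (normally the CPU) with the default dtype.
    """
    return torch.ones(n, device=reference.device, dtype=reference.dtype)


x = torch.zeros(4, dtype=torch.float64)  # stand-in for the input waveform
w = ones_matching(8, x)
```

`reference.new_ones(n)` is an equivalent shorthand that inherits the device and dtype in the same way.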

d-dawg78 commented 3 days ago

Hey, here's to addressing the feedback ☝️

  1. Good catch on the torch.ones front - the most recent commit should address this issue.

  2. We are following the librosa VQT, CQT, and iCQT algorithms, and they opted not to implement the inverse VQT for good reason. I think we should do the same, at least for now.

  3. Here are parameters that led to decent waveform reconstruction on my end:

    SAMPLE_RATE = 16000
    HOP_LENGTH = 256
    F_MIN = 32.703
    N_BINS = 672
    BINS_PER_OCTAVE = 96

    Increasing N_BINS and BINS_PER_OCTAVE together increases the CQT resolution, which in turn makes the reconstruction much better 🙂

  4. I don't really have a good answer for this. Probably the result of the set of parameters you're using..?
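One possible explanation for the 513-versus-512 observation, assuming the transform uses centre-padded framing as librosa does (this is not confirmed for the proposed code): the number of output frames is then 1 + n_samples // hop_length. The helper name below is hypothetical.

```python
def centered_frame_count(n_samples, hop_length):
    """Frame count under centre-padded framing (an assumption, see above)."""
    return 1 + n_samples // hop_length


frames_hop_512 = centered_frame_count(262144, 512)  # 262144 // 512 = 512, plus 1
frames_hop_513 = centered_frame_count(262144, 513)  # 262144 // 513 = 511, plus 1

print(frames_hop_512, frames_hop_513)
```

Under that assumption a 262,144-sample input gives 513 frames at hop 512 and 512 frames at hop 513, which matches the reported sizes.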