maxrmorrison / torchcrepe

PyTorch implementation of the CREPE pitch tracker
MIT License

Thoughts and suggestions on improving precision and smoothing the results #18

Closed yqzhishen closed 2 years ago

yqzhishen commented 2 years ago

Here are three topics related to the postprocessing methods.

1. Why use sigmoid here?

https://github.com/maxrmorrison/torchcrepe/blob/9aecc86f5f3ef908bd75656368639686480800e0/torchcrepe/decode.py#L44-L49 As mentioned in the original CREPE paper, the training targets are frequency bins Gaussian-blurred around the ground-truth f0. Since the unbiased estimate of the mean of a normal distribution is the sample average, the conversion should use ReLU rather than sigmoid, i.e., a direct weighted average over the local bins with positive values. The original TensorFlow repository also uses a local average rather than sigmoid. In my own experiments, the results produced by a direct average of the logits are much smoother than those of the current version, even without dithering and filtering.
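
For illustration, here is a rough sketch (not the actual library code; the window size and names are placeholders) of the kind of ReLU-weighted local average I mean. It returns fractional bins, which can then be converted to cents by multiplying by the 20-cent bin width and adding the cents value of bin 0:

```python
import torch

def relu_local_average_bins(logits, window=4):
    """Rough sketch: decode fractional bins by averaging around each frame's
    argmax, weighting by the logits clamped at zero ("ReLU"), so masked bins
    (-inf) contribute nothing and the average stays linear in the logits.

    logits: (batch, bins, frames); returns fractional bins of shape (batch, frames).
    """
    batch, bins, frames = logits.shape
    center = logits.argmax(dim=1)                                  # (batch, frames)
    offsets = torch.arange(-window, window + 1, device=logits.device)
    idx = (center.unsqueeze(-1) + offsets).clamp(0, bins - 1)      # (batch, frames, 2w+1)
    local = torch.gather(logits.transpose(1, 2), 2, idx)           # local logits
    weights = torch.clamp(local, min=0)                            # ReLU instead of sigmoid
    return (weights * idx).sum(-1) / weights.sum(-1).clamp(min=1e-8)
```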

2. Combine viterbi and weighted argmax.

The current viterbi decoder searches frequency bins along the best path, but only at a precision of 20 cents, since the Viterbi algorithm operates on discrete states: https://github.com/maxrmorrison/torchcrepe/blob/9aecc86f5f3ef908bd75656368639686480800e0/torchcrepe/decode.py#L76-L80 However, we can then apply what the weighted argmax decoder does on top of it, as something called weighted_viterbi, for example. In short, this means replacing the initial argmax operation in the weighted argmax decoder with Viterbi decoding. In this way we get a smoother result without quantization errors, while not depending on dithering. The original TensorFlow repository also implements this as the default behavior of its viterbi decoder.
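
As a rough sketch of what such a weighted_viterbi could look like (names, the window size, and the normalization are placeholders, and the transition matrix is assumed to be built the same way as for the current decoder):

```python
import librosa
import torch

def weighted_viterbi_bins(logits, transition, window=4):
    """Rough sketch: Viterbi gives the coarse (20-cent) path, then the local
    weighted average used by weighted_argmax refines it to fractional bins.

    logits:     (bins, frames) network output for one clip
    transition: (bins, bins) transition matrix, rows summing to one
    """
    bins, frames = logits.shape

    # 1. Discrete best path (the normalization is just one way to get
    #    valid observation probabilities for librosa)
    probs = torch.sigmoid(logits)
    probs = probs / probs.sum(dim=0, keepdim=True)
    path = librosa.sequence.viterbi(probs.numpy(), transition)     # (frames,)
    path = torch.as_tensor(path, dtype=torch.long)

    # 2. Sub-bin refinement: weighted average of bin indices around the path,
    #    weighting by the logits clamped at zero
    offsets = torch.arange(-window, window + 1)
    idx = (path.unsqueeze(-1) + offsets).clamp(0, bins - 1)        # (frames, 2w+1)
    weights = torch.clamp(torch.gather(logits.t(), 1, idx), min=0)
    return (weights * idx).sum(-1) / weights.sum(-1).clamp(min=1e-8)
```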

3. Consider disabling dithering or making it optional.

As discussed above, dithering seems to do more harm than good, especially for the weighted decoders. In my own experiments, the weighted_viterbi decoder produces quite smooth results without dithering or filtering, and those results are also more accurate without the random errors introduced by dithering.

There are two ways to solve this problem in my opinion:

  1. Disable dithering for the weighted decoders and keep it enabled for the others.
  2. Add an option to let the user choose whether to apply dithering.
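
Option 2 would only need a flag on the decoders; something like this sketch (the function and the noise scale are placeholders, not the current API):

```python
import torch

def decode(logits, dither=True):
    """Placeholder sketch of an opt-out flag: decode as usual, and only add
    the sub-bin uniform noise when the caller asks for it."""
    bins = logits.argmax(dim=1).float()              # stand-in for any decoder
    if dither:
        bins = bins + (torch.rand_like(bins) - 0.5)  # +/- half a bin of noise
    return bins
```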

What are your thoughts on these topics? I'm opening this issue because I think it would be better to have some discussion before making a pull request.

maxrmorrison commented 2 years ago

There are some good thoughts here. I've visited these ideas before, so here's what I've found:

  1. While the authors of CREPE apply a Gaussian blur, that doesn't mean that the distribution predicted by the network is normal. In practice, it's not, and has density in the harmonics. So I am not convinced by the argument in favor of ReLU.
  2. Using viterbi makes assumptions about the maximum pitch velocity of the data distribution and incurs an O(n log n) cost. You can combine them, but it will be slower and won't work for some data.
  3. I'm not currently in favor of removing the dithering. I think it's an important safeguard against people misusing the tool and getting poor results. Specifically, downstream tasks that perform a new quantization at a different width can cause banding, where the effective resolution of the pitch estimator can be significantly reduced. What experiments are you performing where dithering has a significant effect? The dithering should be at an optimal level: no more likely to increase than to decrease the error relative to the true, continuous value. If you can show me a situation where the dithering is harmful, then I would accept a pull request to make it optional--but on by default.
maxrmorrison commented 2 years ago

One thing you could consider doing is replacing the dithering with localized linear smoothing as done in weighted_argmax. That could smooth the quantization error without randomness. So that would mean no O(n log n) time cost, no dithering, no quantization error, and no domain-specific assumptions on pitch velocity. That's probably our best bet.

yqzhishen commented 2 years ago

I think there is something that I need to explain more clearly.

  1. The original implementation does not use sigmoid to calculate weights (see here). The reason I say ReLU is that some bins are masked as -inf (for example, bins below fmin and above fmax), and weights on these bins should not affect the result. What I really mean is that it would be better to directly compute a local average around a specific bin. There is density at the harmonics, but within a small range of 9 bins the distribution can be regarded as approximately normal.
  2. The viterbi decoder in the original implementation calls to_local_average_cents (see here), which means the two practices are combined by default. The weighted average computation does not cost much time, since: 1. it runs in O(n) time; 2. it is done with batched NumPy operations, while librosa's Viterbi algorithm uses Python loops and performs many indexed reads/writes on ndarrays. The advantage of bringing the weighted average into the viterbi decoder is that we get precision finer than 1 bin (20 cents), which is also why we prefer weighted_argmax to argmax.
yqzhishen commented 2 years ago

My own experimental results are shown below. Here is the original audio file: XingFuChuFa_IssueExp.zip

The original audio is a piece of singing voice synthesized with the WORLD vocoder. Note that the dashed lines in the following screenshots represent the ground truth. All results were obtained with a hop length of 5 milliseconds and a periodicity threshold of 0.2.

[Baseline] Default viterbi decoder, with dithering and without the weighted local average: [image]

Viterbi decoder without dithering (shows a stepped shape): [image]

Viterbi decoder without dithering, combined with the weighted average (the steps are gentler, but there is no significant improvement): [image]

[Proposed] Viterbi decoder without dithering, combined with the weighted average, and using ReLU weights instead of sigmoid (much smoother than the above): [image]

And here are the results of the weighted argmax decoder, without dithering and using ReLU weights: [image]

maxrmorrison commented 2 years ago

weights on these bins should not affect the result

Right, setting them to -inf is what makes sure they have zero density and don't affect the result. Otherwise, the maximum probability can be assigned to spurious noise at high or low frequencies. You see this as well in autocorrelation matrices.

it would be better to directly compute a local average around a specific bin

Why not both? I encourage you to try not using a fmin and fmax and see how problematic the density at very high and low frequencies can be for noisy audio. You want something to make sure those bins don't end up as the maximum.

The viterbi decoder in the original implementation calls to_local_average_cents (see here), which means the two practices are combined by default

Their Viterbi algorithm doesn't do what they hope: it doesn't catch octave errors. My weighted argmax is functionally the same as their decoding method, but in O(n). My viterbi actually constrains the pitch velocity, preventing octave errors. Combining viterbi + weighted argmax is entirely reasonable. But what happens when you have a piano playing an interval greater than an octave? Viterbi doesn't work there. It's honestly a bad default on my part just for that reason.

The weighted average computation does not cost much time, since: 1. it runs in O(n) time; 2. it is done with batched NumPy operations, while librosa's Viterbi algorithm uses Python loops and performs many indexed reads/writes on ndarrays

Viterbi is the thing that's O(n log n), not the weighted average. I'm not sure why librosa using for loops or arrays matters here.

using ReLU weights instead of sigmoid

I still don't understand this. Why are you applying ReLU to logits? Logits have arbitrary scale.

Your experiment tests the "smoothness" of the pitch. But is smoother better for all applications, or your application? SVS has long, held notes. I imagine smoothing probably helps SVS. I need to see, e.g., improved RPA, RCA, etc. on a sizable dataset to be convinced of a domain-agnostic improvement. I'll have a training + evaluation framework released in a couple of months that you could try this on.
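
(For reference, RPA and RCA can be computed with mir_eval; a rough sketch with placeholder arrays of frame times in seconds and frequencies in Hz, where 0 means unvoiced:)

```python
import numpy as np
import mir_eval

# Placeholder arrays: frame times in seconds, frequencies in Hz (0 = unvoiced)
ref_time, ref_freq = np.load('ref_time.npy'), np.load('ref_freq.npy')
est_time, est_freq = np.load('est_time.npy'), np.load('est_freq.npy')

# Resample both to a common time base and convert to cents + voicing flags
ref_v, ref_c, est_v, est_c = mir_eval.melody.to_cent_voicing(
    ref_time, ref_freq, est_time, est_freq)
print('RPA:', mir_eval.melody.raw_pitch_accuracy(ref_v, ref_c, est_v, est_c))
print('RCA:', mir_eval.melody.raw_chroma_accuracy(ref_v, ref_c, est_v, est_c))
```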

yqzhishen commented 2 years ago

But what happens when you have a piano playing an interval greater than an octave? Viterbi doesn't work there. It's honestly a bad default on my part just for that reason.

I'm not proposing to remove or replace any of the current decoders. Viterbi + weighted argmax can be a new option to choose, since it is an optimization of the current default viterbi decoder. It does not mean changing the behavior of the current decoders.

Why not both? I encourage you to try not using a fmin and fmax and see how problematic the density at very high and low frequencies can be for noisy audio. You want something to make sure those bins don't end up as the maximum.

When I say "directly compute a local average" I am talking about the difference between sigmoid and ReLU when converting logits to probabilities. This is about improving the existing method of computing the local average, not about whether to use a local average at all. All of these methods are reasonable, but improvements can be applied to some of them. See my explanations below.

I still don't understand this. Why are you applying ReLU to logits? Logits have arbitrary scale.

"ReLU" may be misleading here. It actually does two things: 1. sets all logits below 0 (including those masked as -inf) to 0; 2. keeps the other logits unchanged. Sigmoid also maps -inf to 0, but it squashes all values into [0, 1] with a non-linear function. When you calculate `cents = (weighted_argmax.weights * probs).sum(dim=1) / probs.sum(dim=1)` with probs computed from ReLU, you get the sample average within the local bins, because ReLU is linear on the x > 0 branch, unlike sigmoid. This may explain the difference between image 3 and image 4 above.
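
A quick toy check of this (my own numbers; the "logits" here are just taken to be the Gaussian-blurred target itself):

```python
import torch

# A Gaussian-blurred target centered at a fractional bin, decoded by a local
# weighted average over a 9-bin window with the two weightings
true_bin = 100.37
bins = torch.arange(96, 105, dtype=torch.float32)
logits = torch.exp(-0.5 * ((bins - true_bin) / 1.25) ** 2)

relu_w = torch.clamp(logits, min=0)   # linear in the logits (a no-op here, since all values are positive)
sigm_w = torch.sigmoid(logits)        # squashed into (0.5, 1)

print((relu_w * bins).sum() / relu_w.sum())   # ~100.37, recovers the center
print((sigm_w * bins).sum() / sigm_w.sum())   # ~100.05, pulled toward the window center
```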

Your experiment tests the "smoothness" of the pitch. But is smoother better for all applications, or your application?

The changes I propose affect more than just the smoothness of the pitch:

  1. I agree that the current weighted argmax decoder should not be removed, as there are situations where the pitch can vary quickly over a short time span.
  2. Besides smoothness, precision is more important. The current viterbi decoder only gets bins from the logits, but bins are discrete integers, each representing 20 cents. This means that if the weighted average is not applied, it can only achieve a precision of 20 cents (see the quick calculation after this list). As you can see in image 2 above, there are significant quantization errors in the results of the viterbi decoder. CREPE uses Gaussian blurring to preserve the precision of the pitch, but the current viterbi decoder loses it.
  3. "Smoothness" is one of my results, but not my original purpose. All my thoughts came from my discovery that the viterbi decoder have some losses in precision, causing quantization errors. With dithering this is not very observable; but without dithering we see it clearly. My proposal fixes the quantization errors, resulting in higher precision and making the results more accurate, and also, smoother. Anyway, if dithering is something that must have, it can be preserved.

In short, this is about how to improve viterbi and weighted argmax, not about whether we should use each of them. It is about improving the precision and accuracy of the existing methods, not only about smoothing the results. I'm sorry if my wording caused any misunderstanding.

maxrmorrison commented 2 years ago

I think averaging in logit space makes sense given its linearity. But masking out all values between -inf and 0 violates the normality assumption that you are advocating for.

If your method improves precision, you need to demonstrate that empirically.

I don't appreciate the bolding, and am going to close this issue. If you want to make a pull request with weighted argmax + viterbi as an option, that's fine.