There are some good thoughts here. I've visited these ideas before, so here's what I've found:
One thing you could consider doing is replacing the dithering with localized linear smoothing as done in weighted_argmax. That could smooth the quantization error without randomness. So that would mean no O(n log n) time cost, no dithering, no quantization error, and no domain-specific assumptions on pitch velocity. That's probably our best bet.
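For illustration, a minimal sketch of that kind of localized smoothing (not torchcrepe's actual weighted_argmax; the ±4-bin window and the fallback behavior are assumptions):

```python
import torch

WINDOW = 4  # half-width of the local window in bins (an assumption)

def local_weighted_average(logits):
    """Refine the argmax bin with a weighted average over neighboring bins.

    logits: (batch, n_bins) per-frame network outputs.
    Returns fractional bin indices, shape (batch,), i.e. sub-20-cent precision.
    """
    batch, n_bins = logits.shape
    center = logits.argmax(dim=1)                              # coarse peak

    # Gather a small window of bins around each peak.
    offsets = torch.arange(-WINDOW, WINDOW + 1)
    idx = (center.unsqueeze(1) + offsets).clamp(0, n_bins - 1)
    window = logits.gather(1, idx)

    # Only positive logits contribute to the average.
    weights = torch.relu(window)
    total = weights.sum(dim=1)

    # Weighted average of bin positions; fall back to the peak if all
    # windowed logits are non-positive.
    frac = (weights * idx.float()).sum(dim=1) / total.clamp(min=1e-8)
    return torch.where(total > 0, frac, center.float())
```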
I think there is something that I need to explain more clearly.

The point of using ReLU is that some bins are masked as -inf (for example, bins below fmin and above fmax), and weights on these bins should not affect the result. What I really mean is that it will be better to directly compute the local average around a specific bin. There is density in the harmonics, but within a small range of 9 bins the distribution should be regarded as normal.

The viterbi decoder in the original implementation calls to_local_average_cents (see here), which means the two practices are combined by default. Performing the weighted average does not cost much time, since: 1. it is done in O(n) time; 2. it is computed in batch with NumPy, while the viterbi algorithm in librosa uses Python loops and performs many indexed reads/writes on ndarrays. The advantage of bringing the weighted average into the viterbi decoder is that we get higher precision than 1 bin (20 cents), which is also why we prefer weighted_argmax to argmax.
My own experiment results are shown below. Here is the original audio file: XingFuChuFa_IssueExp.zip

The original audio is a piece of singing voice synthesized by the WORLD vocoder. Note that all dashed lines in the following screenshots represent the ground truth. All the results are obtained with a hop length of 5 milliseconds and a periodicity threshold of 0.2.
[Baseline] default viterbi decoder, with dithering and without weighted local average:
Viterbi decoder without dithering (the result looks like a stepped shape):
Viterbi decoder without dithering, combined with weighted average (steps are gentler, but no significant improvement):
[Proposed] viterbi decoder without dithering, combined with weighted average, and using ReLU weights instead of sigmoid (much smoother than the above):
And here are the results from the weighted argmax decoder, without dithering and using ReLU weights:
weights on these bins should not affect the result
Right, setting them to -inf is what makes sure they have zero density and don't affect the result. Otherwise, the maximum probability can be assigned to spurious noise at high or low frequencies. You see this as well in autocorrelation matrices.
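A sketch of what that masking amounts to (not the library's own code; bin_frequencies, the Hz value of each bin's center, is assumed to be given):

```python
import torch

def mask_outside_range(logits, bin_frequencies, fmin, fmax):
    """Set logits of bins outside [fmin, fmax] to -inf so that, after
    sigmoid (or relu), they carry zero weight and can never win the argmax.

    logits:          (batch, n_bins) network outputs
    bin_frequencies: (n_bins,) center frequency of each bin in Hz
    """
    invalid = (bin_frequencies < fmin) | (bin_frequencies > fmax)
    return logits.masked_fill(invalid, float('-inf'))
```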
it will be better to directly compute local average around a specific bin
Why not both? I encourage you to try not using a fmin and fmax and see how problematic the density at very high and low frequencies can be for noisy audio. You want something to make sure those bins don't end up as the maximum.
The viterbi decoder in the original implementation calls to_local_average_cents (see here), which means the two practices are combined by default
Their Viterbi algorithm doesn't do what they hope: it doesn't catch octave errors. My weighted argmax is functionally the same as their decoding method, but in O(n). My viterbi actually constrains the pitch velocity, preventing octave errors. Combining viterbi + weighted argmax is entirely reasonable. But what happens when you have a piano playing greater than an octave interval? Viterbi doesn't work there. It's honestly a bad default on my part just for that reason.
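To make the pitch-velocity constraint concrete, here is a rough sketch using librosa's sequence utilities (the transition-window width is an assumption, and this is not torchcrepe's exact decoder):

```python
import numpy as np
import librosa

def constrained_viterbi(probs, width=25):
    """Decode a bin sequence with a local transition matrix so the path can
    only move a limited number of bins per frame, which is what suppresses
    octave jumps for smoothly varying pitch.

    probs: (n_bins, n_frames) per-frame probabilities (e.g., softmax of logits).
    Returns integer bin indices, shape (n_frames,).
    """
    n_bins = probs.shape[0]

    # Triangular local transitions: moves wider than `width` bins get zero
    # probability and therefore can never appear on the decoded path.
    transition = librosa.sequence.transition_local(n_bins, width, window='triangle')

    # Normalize each frame into a valid distribution before decoding.
    probs = probs / np.clip(probs.sum(axis=0, keepdims=True), 1e-8, None)
    return librosa.sequence.viterbi(probs, transition)
```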
Performing the weighted average does not cost much time, since: 1. it is done in O(n) time; 2. it is computed in batch with NumPy, while the viterbi algorithm in librosa uses Python loops and performs many indexed reads/writes on ndarrays
Viterbi is the thing that's O(n log n), not the weighted average. I'm not sure why librosa using for loops or arrays matters here.
using ReLU weights instead of sigmoid
I still don't understand this. Why are you applying ReLU to logits? Logits have arbitrary scale.
Your experiment tests the "smoothness" of the pitch. But is smoother better for all applications, or your application? SVS has long, held-out notes. I imagine smoothing probably helps SVS. I need to see, e.g., improved RPA, RCA, etc. on a sizable dataset to be convinced of a domain-agnostic improvement. I'll have a training + evaluation framework released in a couple of months that you could try this on.
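For reference, RPA/RCA of the sort mentioned here can be computed with mir_eval (a sketch with made-up contours; the 5 ms hop matches the experiment above):

```python
import numpy as np
import mir_eval.melody as melody

# Made-up contours standing in for ground truth and decoder output:
# frame times in seconds and f0 in Hz (0 would mean unvoiced).
times = np.arange(1000) * 0.005                                  # 5 ms hop
ref_freq = 220.0 * 2 ** (np.sin(times) / 12)                     # reference
est_freq = ref_freq * 2 ** (np.random.randn(1000) * 5 / 1200)    # ~5-cent noise

ref_v, ref_c, est_v, est_c = melody.to_cent_voicing(times, ref_freq, times, est_freq)

rpa = melody.raw_pitch_accuracy(ref_v, ref_c, est_v, est_c, cent_tolerance=50)
rca = melody.raw_chroma_accuracy(ref_v, ref_c, est_v, est_c, cent_tolerance=50)
print(f'RPA: {rpa:.4f}  RCA: {rca:.4f}')
```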
But what happens when you have a piano playing greater than an octave interval? Viterbi doesn't work there. It's honestly a bad default on my part just for that reason.
I'm not proposing to remove or replace any of the current decoders. Viterbi + weighted argmax can be a new option to choose, as it is an optimization of the current default viterbi decoder; the behavior of the existing decoders would not change.
Why not both? I encourage you to try not using a fmin and fmax and see how problematic the density at very high and low frequencies can be for noisy audio. You want something to make sure those bins don't end up as the maximum.
When I say directly compute local average, I am referring to the difference between sigmoid and relu when converting logits to probs. This is about improving the existing method of computing the local average, not about whether to use the local-average method at all. All of these methods are reasonable, but improvements can be applied to some of them. See my explanations below.
I still don't understand this. Why are you applying ReLU to logits? Logits have arbitrary scale.
ReLU may be misleading here. It actually does two things: 1. it sets all logits below 0 (masked as -inf) to 0; 2. it keeps all other logits unchanged. Sigmoid also maps -inf to 0, but it maps all values to [0, 1] with a non-linear function. When you calculate cents = (weighted_argmax.weights * probs).sum(dim=1) / probs.sum(dim=1) with probs computed from ReLU, you get the sample average within the local bins, because ReLU is a linear function on the x > 0 branch, unlike sigmoid. This may indicate the difference between image 3 and image 4 above.
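A small numerical illustration of that computational difference (made-up logits for a 5-bin window, skewed to the right of the peak):

```python
import torch

# Made-up positive logits over neighboring bin indices 0..4, skewed right.
logits = torch.tensor([0.5, 2.0, 4.0, 3.0, 1.0])
bins = torch.arange(5.)

def weighted_center(probs):
    # Same form as cents = (weights * probs).sum(dim) / probs.sum(dim)
    return (bins * probs).sum() / probs.sum()

print(weighted_center(torch.relu(logits)))     # ~2.19: plain average of the logits
print(weighted_center(torch.sigmoid(logits)))  # ~2.07: saturation compresses the peak
```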
Your experiment tests the "smoothness" of the pitch. But is smoother better for all applications, or your application?
The changes that I propose not only affect the smoothness of the pitch:
In short, these are points about how to improve viterbi and weighted argmax, not about whether we should use each of them. They are about improving the precision and accuracy of the existing methods, not only about smoothing the results. I'm sorry if I caused any misunderstanding with misleading words and expressions.
I think averaging in logit space makes sense given its linearity. But masking out all values between -inf and 0 violates the normality assumption that you are advocating for.
If your method improves precision, you need to demonstrate that empirically.
I don't appreciate the bolding, and am going to close this issue. If you want to make a pull request with weighted argmax + viterbi as an option, that's fine.
Here are three topics related to the postprocessing methods.
1. Why use sigmoid here?
https://github.com/maxrmorrison/torchcrepe/blob/9aecc86f5f3ef908bd75656368639686480800e0/torchcrepe/decode.py#L44-L49

As mentioned in the original CREPE paper, the frequency bins are "Gaussian-blurred" by the ground truth f0 in the training label. As the unbiased estimate of the expectation of a normal distribution is the sample average, the converting method should be relu instead of sigmoid, i.e., computing the direct average of the local bins with positive values. Also, the original TensorFlow repository uses a local average instead of sigmoid. I did my own experiments, and the results produced by the direct average of logits are much smoother than the current version, even without dithering and filtering.
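To spell out that argument with a toy example: build a Gaussian-blurred target around a fractional bin position and check that the weight-averaged bin position recovers it (the 25-cent blur width here is an assumed value for illustration, not taken from this thread):

```python
import numpy as np

CENTS_PER_BIN = 20
SIGMA_CENTS = 25            # assumed blur width, for illustration only

# Ground-truth pitch sits at a fractional bin position.
true_bin = 180.37
bins = np.arange(360)

# Gaussian-blurred training target, as described above.
target = np.exp(-0.5 * ((bins - true_bin) * CENTS_PER_BIN / SIGMA_CENTS) ** 2)

# The weight-averaged bin position estimates the mean of the Gaussian,
# because the sample average is an unbiased estimate of its expectation.
estimate = (bins * target).sum() / target.sum()
print(true_bin, estimate)   # both ~180.37
```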
2. Combine viterbi and weighted argmax.

The current viterbi decoder searches frequency bins along the best path, but with a precision of 20 cents, since the viterbi algorithm works on discrete states: https://github.com/maxrmorrison/torchcrepe/blob/9aecc86f5f3ef908bd75656368639686480800e0/torchcrepe/decode.py#L76-L80

However, we can then apply what the weighted argmax decoder does, as something called weighted_viterbi, for example. In short, this means replacing the first line of the weighted argmax decoder (the argmax operation) with viterbi. In this way we get a smoother result without quantization errors, while not depending on dithering. The original TensorFlow repository also implements this as the default behavior of its viterbi decoder.
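A rough sketch of what such a weighted_viterbi could look like (the name, window size, and transition width are only for illustration; this is not the TensorFlow repository's code):

```python
import numpy as np
import librosa

def weighted_viterbi(probs, window=4):
    """Viterbi path first, then a local weighted average around each decoded
    bin to obtain sub-bin (better than 20-cent) precision.

    probs: (n_bins, n_frames) non-negative weights (e.g., relu of the logits).
    Returns fractional bin indices, shape (n_frames,).
    """
    n_bins, n_frames = probs.shape

    # 1. Discrete best path with locally constrained transitions.
    transition = librosa.sequence.transition_local(n_bins, 25, window='triangle')
    norm = probs / np.clip(probs.sum(axis=0, keepdims=True), 1e-8, None)
    path = librosa.sequence.viterbi(norm, transition)

    # 2. Replace each hard bin with a weighted average of its neighborhood,
    #    which is what the weighted argmax decoder does after its argmax.
    refined = np.empty(n_frames)
    for t, center in enumerate(path):
        lo, hi = max(0, center - window), min(n_bins, center + window + 1)
        weights = probs[lo:hi, t]
        idx = np.arange(lo, hi)
        total = weights.sum()
        refined[t] = (weights * idx).sum() / total if total > 0 else center
    return refined
```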
3. Consider disabling dithering or making it optional.

As discussed above, dithering seems to do more harm than good, especially to the weighted decoders. In my own experiments, the weighted_viterbi decoder produces quite smooth results without dithering and filtering, which are also more accurate without the random errors brought by dithering. There are two ways to solve this problem in my opinion: remove dithering entirely, or add an option to let the user choose whether to apply dithering.
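If the optional route is taken, the change could be as small as a keyword argument; a hypothetical sketch (not torchcrepe's current signature or its actual dithering code):

```python
import torch

def bins_to_cents(bins, dither=True, cents_per_bin=20):
    """Convert bin indices to (relative) cents, optionally adding uniform
    noise of up to half a bin to mask the 20-cent quantization steps.
    Hypothetical helper for illustration; the cents offset is omitted.
    """
    cents = bins.float() * cents_per_bin
    if dither:
        cents = cents + (torch.rand_like(cents) - 0.5) * cents_per_bin
    return cents
```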
What are your thoughts on these topics? I'm submitting this issue because I think it would be better to have some discussion before I make a pull request.