phoboslab / qoa

The “Quite OK Audio Format” for fast, lossy audio compression
MIT License

Some audio data seems to trigger a bug in the encoder/decoder #25

Closed · Kilklang26 closed this 4 months ago

Kilklang26 commented 1 year ago

Hi, I just tried this compression and it works fine on hundreds of different files. However, on one specific file the reconstructed sound has bugs in it. If I lower the volume of the source file, the bug does not appear anymore. See below for the source file (test.WAV) and the reconstructed file with the bugs (raw PCM, 16-bit signed, mono):

https://kodamo.org/share2cRGh/test.WAV
https://kodamo.org/share2cRGh/reconst.raw

Is it possible that this exact sound triggers some bug in the encode/decode algorithm?

phoboslab commented 1 year ago

I can reproduce the problem here. It looks like the predictor weights go into a feedback loop at some point and begin to oscillate wildly.

I will investigate how to fix this. Thanks for the test case!

bdwjn commented 1 year ago

It looks like the QOA format itself is broken, causing this bug.

The spec says delta = r >> 4. These residuals may be negative, and right-shifting negative numbers is implementation-defined in C. Most systems shift in sign bits; for two's complement this means 7>>4 == 0 but -7>>4 == -1, which can throw the weights out of balance.

I replaced >>= 13 with /= 8192 and >> 4 with / 16, and that removed one of the two clicks in the above file.
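For illustration, a minimal standalone demo of the mismatch (not from qoa.h):

#include <stdio.h>

int main(void) {
    /* On two's complement machines, >> on a negative value is an
       arithmetic shift and rounds toward negative infinity, while
       integer division rounds toward zero (guaranteed since C99). */
    printf("%d %d\n",  7 >> 4,  7 / 16); /* prints: 0 0  */
    printf("%d %d\n", -7 >> 4, -7 / 16); /* prints: -1 0 */
    return 0;
}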

The rest is probably due to rounding errors in the same shifts/divisions. Maybe there should be some damping on the weights to prevent them from growing too large?

phoboslab commented 1 year ago

It is my understanding that LMS (or other) filters can easily become unstable with 3 or more terms. I've seen this in my experiments that led up to QOA - e.g. when precomputing the coefficients for an entire frame with Levinson Recursion. This was also one of the reasons that WavPack does multiple passes with shorter terms (1 or 2 weights) instead.

I (falsely) assumed this wouldn't be possible if the weights are updated at each sample. Though I'm not convinced this can be blamed on rounding errors or on bit shifts (as opposed to divisions).

The given test sample sounds quite bad when encoded with QOA, even without the clicks. The prediction is doing an extremely bad job here, and it's not (only) because of the high frequency; this is not the case with pure sine waves, for instance. I haven't yet wrapped my head around why this is.

In any case, this is very unfortunate. I'm looking for ways to fix this (or work around it) in the encoder, without needing to change the format itself.

phoboslab commented 1 year ago

So here's a mildly stupid idea: in this specific case weights[0] grows quite large. This doesn't happen with any of the test samples: weights[0] mostly stays below 8192 and never ends up above 16384. In normal operation, most of the predictor's contribution comes from the more recent weights[3] and weights[2].

As an experiment: just resetting the weights at the beginning of a frame when weights[0] has grown too large "fixes" the problem with this test case and doesn't affect any of the bandcamp/, sqam/ or oculus_audio_pack/ tests in any way.

Adding this before writing the current weights to the stream here results in a clean encoding of the test case:

if (abs(qoa->lms[c].weights[0]) > 16384) {
    /* Reset to the initial weights from the spec */
    qoa->lms[c].weights[0] = 0;
    qoa->lms[c].weights[1] = 0;
    qoa->lms[c].weights[2] = -(1<<13);
    qoa->lms[c].weights[3] =  (1<<14);
}

Of course this is rather inelegant, but I think the approach itself may have some merit. Maybe there's a reliable way to detect these cases; or maybe there's a way to "normalize" the weights at the beginning of each frame?

This still relies on the assumption that it takes the feedback loop more than 5120 samples (one frame) to explode. I.e. if this whole thing could happen within just one frame, this wouldn't help...

bdwjn commented 1 year ago

I've been doing the following:

1) qoa_lms_t tmp = *qoa;, and set tmp.lms.weights to { 0, 0, -1, 2 }
2) qoa_encode_frame(..., qoa, ...) with the weights obtained from the previous frame
3) qoa_encode_frame(..., &tmp, tmp_buffer)
4) if (tmp.error < qoa->error), copy &tmp to qoa, and tmp_buffer to bytes+p
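In code, the steps look roughly like this (a sketch: the .error field on qoa_desc is an addition for this experiment, it does not exist in stock qoa.h, and tmp_buffer is a scratch buffer):

/* Sketch of the two-pass experiment, as a fragment of the per-frame
   encode loop. Assumes qoa_desc has been extended with an accumulated
   per-frame .error field; tmp_buffer must be large enough to hold one
   encoded frame. */
qoa_desc tmp = *qoa;
for (unsigned int c = 0; c < qoa->channels; c++) {
    tmp.lms[c].weights[0] = 0;
    tmp.lms[c].weights[1] = 0;
    tmp.lms[c].weights[2] = -1;
    tmp.lms[c].weights[3] =  2;
}

/* Pass 1: continue with the weights carried over from the previous frame */
unsigned int frame_size = qoa_encode_frame(sample_data, qoa, frame_len, bytes + p);

/* Pass 2: encode the same frame again, starting from the reset weights */
unsigned int tmp_size = qoa_encode_frame(sample_data, &tmp, frame_len, tmp_buffer);

/* Keep whichever starting point produced the smaller total error */
if (tmp.error < qoa->error) {
    *qoa = tmp;
    memcpy(bytes + p, tmp_buffer, tmp_size);
    frame_size = tmp_size;
}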

It's ugly doing two passes, but pSNR does go up by a good amount this way... dark_narrow_road went from 51.66 to 71.44 dB.

pSNR isn't a great way to compare audio though. I've been running the outputs through PEAQ tests and the perceptual quality only seems to be slightly better, even worse in a few frames. I can also hear clicks at the frame boundaries in the test sample from this issue.

Another idea I had was running frame N-1 with weights set to {0,0,-1,2} and using those output weights to initialize frame N. Didn't work so well...

phoboslab commented 1 year ago

Pre-computing the weights for each frame using Levinson Recursion also fixes this test case. Similarly to your approach, it slightly improves some of the test samples while making others worse. I'm not sure if I want to go down this road...

Here's an experimental branch: https://github.com/phoboslab/qoa/tree/precompute_weights

bdwjn commented 1 year ago

Whatever you do to the weights, it can only happen at the start of a frame, and the spec forces a 1/16th weight update per-sample. By the time you can adjust, you already have up to 5120 bad samples.

The only things an encoder can change within a frame are the scale factors/residuals, calculated from the original samples. Maybe there's some preprocessing we can do there, similar to noise shaping but for the weights? To bring the weights down mid-frame, generate an anti-signal from the negative weights and mix it in with the samples?

amstan commented 1 year ago

Yeah, something is certainly wonky with the LMS calculations. I've been trying to implement the decoder in pure Python as a learning experiment, and this is one place where I'm getting huge values that won't fit in the data type. I'm guessing the C code relies on modular arithmetic and still works somehow (but causes occasional pops?).

I can reply back with the sample file and the exact position where I saw it happen, if needed.

Kilklang26 commented 1 year ago

The pre-computed weights branch fixes the test case but breaks many other sounds with clicks (maybe that's still the right route but needs tweaking?). With the original code I notice the noise growing just before the click, so I also think some damping somewhere in the code, like bdwjn suggested, should help (sorry, I haven't had time to study the QOA format yet).

I made a quick program to generate test sounds in a controlled way and check how the encoder behaves. With a pure sine, no problems. However, if I add a 2nd harmonic to the sound, the issue shows up.

First test, made with the code below: two harmonics; the noise grows over time but does not explode: https://kodamo.org/share2cRGh/freq_i0.2.wav

If I increase the frequency without touching other parameters (freq = i * 0.3 instead of i * 0.2), the noise grows more quickly and clicks happen: https://kodamo.org/share2cRGh/freq_i0.3.wav

#include <math.h>
#include <stdio.h>

#define NBSAMPLES 200000

int main(void) {
    FILE *test = fopen("test.raw", "wb");
    for (unsigned i = 0; i < NBSAMPLES; i++) {
        signed short val = sin(i * 0.2f) * 20000; /* fundamental */
        val += sin(i * 0.2f * 2) * 6000;          /* 2nd harmonic */
        fwrite(&val, 2, 1, test);
    }
    fclose(test);
    return 0;
}
bdwjn commented 1 year ago

I made some plots of the weights for those files:

freq_i0.2.wav: https://i.imgur.com/L6kJu4v.png
freq_i0.3.wav: https://i.imgur.com/HGzlXb5.png
freq_i0.3.wav, zoomed in on the first click: https://i.imgur.com/hVmDbLY.png

You can see how it starts to get unstable a few wavelengths before finally exploding.

The simplest way of fixing this would be (if the spec allowed it) to adaptively lower the learning rate. Here's freq_i0.3.wav with a µ of 1/64 instead of 1/16: https://i.imgur.com/66c7eIG.png
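For reference, the LMS update in qoa.h (qoa_lms_update) is a sign-sign update where the residual is shifted right by 4, i.e. µ = 1/16. The plot above comes from a variant where that shift is a parameter; the mu_shift name is mine, not the spec's:

#define QOA_LMS_LEN 4

typedef struct {
    int history[QOA_LMS_LEN];
    int weights[QOA_LMS_LEN];
} qoa_lms_t;

/* Sign-sign LMS update as in qoa_lms_update(), but with the learning
   rate exposed: mu_shift = 4 gives the spec's fixed 1/16, mu_shift = 6
   gives the 1/64 used for the last plot above. */
static void qoa_lms_update_mu(qoa_lms_t *lms, int sample, int residual, int mu_shift) {
    int delta = residual >> mu_shift;
    for (int i = 0; i < QOA_LMS_LEN; i++) {
        lms->weights[i] += lms->history[i] < 0 ? -delta : delta;
    }
    for (int i = 0; i < QOA_LMS_LEN - 1; i++) {
        lms->history[i] = lms->history[i + 1];
    }
    lms->history[QOA_LMS_LEN - 1] = sample;
}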

phoboslab commented 1 year ago

adaptively lower the learning rate

I like the idea in general. We may even be able to increase the learning rate for some cases and improve the quality. If that's the case (and the adaptation logic is simple enough) it would be worth a spec update.

My main question is: How would that work, exactly? What does it adapt to?

Whatever you do to the weights, it can only happen at the start of a frame, and the spec forces a 1/16th weight update per-sample. By the time you can adjust, you already have up to 5120 bad samples.

True. But the question remains: is it possible to force the encoder into a bad state within fewer than 5120 samples? Maybe adjusting at the start of a frame is sufficient?

FWIW: I generated some more WAV files using Kilklang26's approach and have yet to encounter one that isn't "fixed" by resetting the weights at the beginning of a frame if weights[0] > 16384. Of course this doesn't prove anything... I should probably start collecting and/or generating a few hundred thousand WAV files and do some automated testing.

phoboslab commented 1 year ago

I'm currently running the encoder through my local music library. This will take a while to complete, but even after the first 6000 tracks it identified a few where the LMS goes haywire (though it always catches itself immediately after). With timestamps:

And some false positives, where weights[0] briefly exceeded 16384 but didn't actually cause a pop:

So the culprit is always(?) a high-pitched tone: sustained for too long, it explodes.

Kilklang26 commented 1 year ago

I used QOA on samples of instruments I recorded, which is why it exhibits the problem much more than real music: there are a lot of steady tones, and some of them are high-pitched. What do I need to change in the code to do your "weights[0] > 16384" change? I can try running it on my sample collection and report the results.

phoboslab commented 1 year ago

I used QOA on samples of instruments I recorded

That totally makes sense and it's absolutely a use case that QOA should be able to support. So far, checking for a large weights[0] and somehow mitigating it looks like a viable solution. However, resetting all weights to the initial values is like a sledgehammer and will probably introduce some audible artifacts, too.

For the tests that I'm running, I just put

if (abs(qoa->lms[c].weights[0]) > 16384) {
    printf("LMS_ERROR (%d)\n", qoa->lms[c].weights[0]);
}

at qoa.h, line 445 and look for LMS_ERROR in the output.

For the workaround/"fix" I had

if (abs(qoa->lms[c].weights[0]) > 16384) {
    qoa->lms[c].weights[0] = 0;
    qoa->lms[c].weights[1] = 0;
    qoa->lms[c].weights[2] = -(1<<13);
    qoa->lms[c].weights[3] =  (1<<14);
}

at qoa.h, line 370

phoboslab commented 1 year ago

I have just pushed a workaround to master. I found that 1) the total power of the weights is a good indicator that the LMS is about to explode, and 2) resetting the weights to 0 worked best.

This still introduces audible artifacts when the weights reset. It prevents the LMS from exploding, but is far from perfect :/
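For reference, the shape of that workaround (a sketch of the idea described above; the threshold is illustrative, not necessarily the value used in master):

/* Per channel, before starting a new frame: if the total power of the
   weights has grown past a threshold, reset them all to zero.
   The threshold below is illustrative. */
long long weights_power =
    (long long)qoa->lms[c].weights[0] * qoa->lms[c].weights[0] +
    (long long)qoa->lms[c].weights[1] * qoa->lms[c].weights[1] +
    (long long)qoa->lms[c].weights[2] * qoa->lms[c].weights[2] +
    (long long)qoa->lms[c].weights[3] * qoa->lms[c].weights[3];

if (weights_power > 0x2fffffff) { /* illustrative threshold */
    for (int i = 0; i < 4; i++) {
        qoa->lms[c].weights[i] = 0;
    }
}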

phoboslab commented 9 months ago

The commits above introduce a better method to prevent the weights from growing too large: the weights are summed after each encoded sample and added to the error "rank" for the current scale factor. In problem cases, a scale factor that keeps the weights lower is chosen.
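Roughly, inside the per-slice brute-force search over scale factors, the idea looks like this (a sketch: the shift and bias constants are illustrative, not necessarily the exact values in qoa.h, and lms is the local trial copy of the LMS state for the candidate scale factor):

/* After the per-sample LMS update while evaluating a candidate scale
   factor: add a penalty to the candidate's error rank when the weights
   are getting large, so a scale factor that keeps them smaller wins. */
long long weights_penalty = ((
    (long long)lms.weights[0] * lms.weights[0] +
    (long long)lms.weights[1] * lms.weights[1] +
    (long long)lms.weights[2] * lms.weights[2] +
    (long long)lms.weights[3] * lms.weights[3]
) >> 18) - 0x8ff; /* illustrative shift and bias */
if (weights_penalty < 0) {
    weights_penalty = 0;
}
current_rank += weights_penalty;
if (current_rank > best_rank) {
    break; /* this scale factor already ranks worse than the best so far */
}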

This still introduces noise in the problem cases, but it's more uniform and not as abrupt as resetting all weights to zero.