happycube / ld-decode

Software defined LaserDisc decoder

ld-decode tbc/efm from lds and ldf is not consistent #815

Closed homercartman closed 1 year ago

homercartman commented 1 year ago

At commit a6b8597a96170474680d7070e4afb3fb55b41c60, an lds file compressed to ldf (with ld-cut) might not produce the same tbc/efm results from ld-decode as the original lds.

Steps to reproduce:

  1. capture a PAL CLV LaserDisc to a 10-bit packed lds file with a DdD, say cap.lds
  2. run ld-decode --pal cap.lds ref --length N with a reasonably high N, like 100
  3. cut and compress with ld-cut --pal --start 0 --length N cap.lds cap_trimmed.ldf
  4. run ld-decode --pal cap_trimmed.ldf --length N test
  5. the SHA-1s of ref.tbc and test.tbc might not match; a binary diff will show errors where bytes are off by +1/-1.

Note: those errors might be confined to a specific zone of the test.tbc file, i.e. no errors before or after this zone.
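
For convenience, here is a minimal sketch of the steps above as a script (assuming ld-decode and ld-cut are on the PATH and cap.lds is in the current directory; N is a placeholder):

```python
# Hypothetical reproduction script for the steps above; assumes ld-decode and
# ld-cut are on the PATH and that cap.lds exists in the current directory.
import hashlib
import subprocess

N = "100"  # number of frames to decode

def sha1_of(path):
    """Return the SHA-1 hex digest of a file's contents."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# step 2: decode the raw capture as the reference
subprocess.run(["ld-decode", "--pal", "cap.lds", "ref", "--length", N], check=True)
# step 3: cut and compress the same region to ldf
subprocess.run(["ld-cut", "--pal", "--start", "0", "--length", N,
                "cap.lds", "cap_trimmed.ldf"], check=True)
# step 4: decode the compressed copy
subprocess.run(["ld-decode", "--pal", "cap_trimmed.ldf", "--length", N, "test"], check=True)

# step 5: compare the two .tbc outputs
ref_hash, test_hash = sha1_of("ref.tbc"), sha1_of("test.tbc")
print("match" if ref_hash == test_hash else "MISMATCH\n  ref  %s\n  test %s" % (ref_hash, test_hash))
```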

atsampson commented 1 year ago

My guess on this one is that ld-ldf-reader isn't seeking to quite the right location - we could test it by comparing its output against the Python file reading code...

homercartman commented 1 year ago

@atsampson it might be deeper than that. As I said in the first comment, the erroneous zone can appear several dozen frames after the very first frame (--start 0). How could ld-ldf-reader have seek jumps or context loss while reading the ldf file continuously?

Also, the ldf read errors are always off by -1/+1 compared to the lds: the error is there, but it's somehow contained (a quick way to verify this is sketched after the list below). A wild seek error would leave no alignment at all between the ldf and lds reads. I suspect either:

  1. something like a rounding error in ld-decode while reading the file in ldf format (though I can't explain it)
  2. or erroneous FLAC data when compressing from lds to ldf (if so, it might hide other bugs related to lds reading / ldf writing)
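
To confirm that the differences really are only ±1 and confined to one zone, a quick check could look like this (a sketch; it assumes the .tbc files are little-endian unsigned 16-bit sample streams):

```python
# Hedged check that ref.tbc and test.tbc differ only by +/-1 steps and only
# within a limited zone; assumes .tbc files are little-endian uint16 samples.
import numpy as np

ref = np.fromfile("ref.tbc", dtype="<u2")
test = np.fromfile("test.tbc", dtype="<u2")
n = min(len(ref), len(test))

diff = ref[:n].astype(np.int32) - test[:n].astype(np.int32)
bad = np.nonzero(diff)[0]
print(f"{len(bad)} differing samples out of {n}")
if len(bad):
    print("difference values seen:", sorted(set(diff[bad].tolist())))
    print("first differing sample:", bad[0], "last differing sample:", bad[-1])
```
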
atsampson commented 1 year ago

My reasoning was based on previous faults we've seen - ld-decode requests a chunk of data from ld-ldf-reader that contains the region it's after plus a margin on either side, then searches through it to find the sync pulses. So if the positioning were slightly off, it would still find the sync pulses, but with a bit of variation because it would be starting in a different place. It's also pretty unlikely that the FLAC compression itself is broken, because the ld-compress script offers a verification option which several of us use...
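
To illustrate the idea (this is not ld-decode's actual reader code, and the helper names are made up): if the decoder always fetches the requested region plus a margin and then locates sync pulses inside that window, a small positioning error only changes where the search starts, not which pulses it finds:

```python
# Illustrative sketch only, with made-up helper names: fetch the wanted region
# plus a margin on each side, then find sync pulses by thresholding, so a
# slightly-off starting position still locates the same pulses.
import numpy as np

def read_with_margin(samples, begin, length, margin=1000):
    """Return a window covering [begin, begin+length) plus a margin each side."""
    lo = max(0, begin - margin)
    hi = min(len(samples), begin + length + margin)
    return samples[lo:hi], lo

def find_sync_starts(window, sync_level):
    """Indices where the signal first drops below the sync threshold."""
    below = window < sync_level
    return np.nonzero(below[1:] & ~below[:-1])[0] + 1

# toy signal: a flat line with two "sync pulses" dropped into it
demod = np.full(50000, 30000, dtype=np.int32)
demod[12000:12300] = 5000
demod[31000:31300] = 5000
window, offset = read_with_margin(demod, begin=10000, length=30000)
print(offset + find_sync_starts(window, sync_level=10000))  # absolute pulse positions
```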

However, I've just modified ld-decode to print hashes of the blocks it reads. First, the hashes were different for blocks read from .lds vs .ldf files - but that turns out to be because unpack_data_4_40 was accidentally returning int64s rather than int16s, which I think is harmless. Fixing that makes them identical, so ld-ldf-reader is working OK.
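
For context, here's a rough sketch of the kind of fix involved (not the actual unpack_data_4_40 code, and the exact bit layout below is an assumption): when expanding 10-bit packed samples (four samples per five bytes), the result needs an explicit cast back to int16 rather than being left at a wider dtype:

```python
# Rough sketch of unpacking 10-bit packed samples (four per five bytes); the
# bit layout here is an assumption - the point is the explicit int16 cast at
# the end, matching the dtype fix described above.
import numpy as np

def unpack_10bit(raw):
    b = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 5).astype(np.uint16)
    s0 = (b[:, 0] << 2) | (b[:, 1] >> 6)
    s1 = ((b[:, 1] & 0x3F) << 4) | (b[:, 2] >> 4)
    s2 = ((b[:, 2] & 0x0F) << 6) | (b[:, 3] >> 2)
    s3 = ((b[:, 3] & 0x03) << 8) | b[:, 4]
    out = np.stack([s0, s1, s2, s3], axis=1).ravel()
    return out.astype(np.int16)  # without this, callers can end up seeing a wider dtype

print(unpack_10bit(bytes([0xFF, 0xC0, 0x10, 0x00, 0x01])))  # -> [1023    1    0    1]
```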

What's more concerning is that running the same (100-frame PAL) decode multiple times returns different output, both for .lds and .ldf input:

$ sha256sum out/tmp2-ld*.tbc
04fd70e581f23d42344aa36917a9c2117afd6dc7cd3d44def0200845ab8ad97f  out/tmp2-ldf1.tbc
ab57e7550f45fea7090f0349468afebd06f92919fdc068690a4d33b86e95eecf  out/tmp2-ldf2.tbc
31405cb34c1a03aee0f39f670333ef69008d29d95dbf991cf970b5cff02caa9f  out/tmp2-ldf3.tbc
c82118329f721074bbe59195db3de73c78c4a1207a262866e9386311fe384908  out/tmp2-ldf4.tbc
c25f27bc7e281d885a27d49069d9883495288715ddc816f508771fd79d93213e  out/tmp2-lds1.tbc
c25f27bc7e281d885a27d49069d9883495288715ddc816f508771fd79d93213e  out/tmp2-lds2.tbc
b4561a474b37c57145e1f26af1314cf3ffc7f142d53573151aa37e04f1b85238  out/tmp2-lds3.tbc
c82118329f721074bbe59195db3de73c78c4a1207a262866e9386311fe384908  out/tmp2-lds4.tbc

So there's some nondeterminism creeping in somewhere. Specifying -t1 to reduce parallelism doesn't prevent this. @happycube, any ideas?

atsampson commented 1 year ago

Here's a 1-frame PAL sample that reproduces the problem: vbos-clip.ldf.zip

happycube commented 1 year ago

Wow, thanks... added that to ld-decode-testdata. I see different decodes as well, and more than I expected.

oyvindln commented 1 year ago

We've encountered some random non-deterministic behaviour with vhs-decode too in the past; we suspected it was a threading/race condition issue but never got to the bottom of it.

Do you get a different result if you disable the use of pyfftw and just use numpy's FFT implementation, and/or change the thread code to use threads instead of processes (the latter also requires not using pyfftw)?
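
For anyone trying that experiment, a minimal sketch of the kind of backend switch meant here (this is generic, not ld-decode's actual option handling):

```python
# Generic sketch of swapping pyfftw for numpy's FFT; not ld-decode's actual
# option handling. pyfftw.interfaces.numpy_fft mirrors numpy.fft's API, so
# the calling code stays the same with either backend.
USE_PYFFTW = False  # flip this to see whether pyfftw changes the output

if USE_PYFFTW:
    import pyfftw.interfaces.numpy_fft as npfft
else:
    import numpy.fft as npfft

import numpy as np

signal = np.random.default_rng(0).standard_normal(4096)
spectrum = npfft.rfft(signal)
print(spectrum.shape)  # (2049,) with either backend
```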

I know we had to disable some of that on Windows, as it caused segfaults and crashes, so maybe there are some hidden issues there on other systems too.

Alternatively, the output could change if anything in the decoder structs is altered inside the threaded code, or by the main thread while the decode threads are active.
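
As a toy illustration of that failure mode (this is not ld-decode's code): if worker threads read a shared parameter while the main thread keeps adjusting it, whatever value happens to be current when each worker runs gets baked into its output, so results vary with scheduling:

```python
# Toy illustration (not ld-decode's code): workers read a shared parameter
# that the main thread mutates while they are active, so the per-block result
# depends on thread scheduling and can differ from run to run.
import threading

class SharedParams:
    def __init__(self):
        self.mtf = 1.0

params = SharedParams()
results = {}

def demod_block(block_id):
    # Whichever value of params.mtf is current at this instant is used.
    results[block_id] = round(params.mtf, 2)

threads = [threading.Thread(target=demod_block, args=(i,)) for i in range(8)]
for i, t in enumerate(threads):
    t.start()
    params.mtf = 1.0 - i * 0.1  # main thread adjusts the parameter mid-decode

for t in threads:
    t.join()
print(results)  # which MTF value each block saw varies between runs
```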

happycube commented 1 year ago

I think this has to do with a bug in rerunning demod after adjusting MTF - it doesn't know what to do if it gets another decoded block on top of one it has already decided it doesn't care about... I'll add some checks/counters/etc and see about blocking it out.

atsampson commented 1 year ago

Yes, I tried disabling PyFFTW earlier - it didn't make a difference. --threads 0 gets rid of the variation on my machine but I'm not sure it's settling on the "right" result. I made it log hashes of ndarrays in a few places and got as far as demod05 being different in redefine_linelocs_hsync - it seems to be timing-dependent since sometimes adding a logging call will also make the output stable.
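
For reference, logging a stable hash of an ndarray at a checkpoint only takes a couple of lines; something like this sketch (not the exact logging code used above):

```python
# Sketch of logging a short, stable hash of an ndarray at a checkpoint; not
# the exact logging calls used in the debugging described above.
import hashlib
import logging

import numpy as np

def log_array_hash(name, arr):
    """Log shape, dtype and a truncated SHA-256 of the array's raw bytes."""
    digest = hashlib.sha256(np.ascontiguousarray(arr).tobytes()).hexdigest()[:16]
    logging.info("%s: shape=%s dtype=%s sha256=%s", name, arr.shape, arr.dtype, digest)

logging.basicConfig(level=logging.INFO)
log_array_hash("demod05", np.linspace(0.0, 1.0, 1000))
```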

atsampson commented 1 year ago

After a bit more tracing I'm on the same page as happycube. With the sample I posted above, the calls to DemodCache.read are:

INFO - DemodCache.read begin=0 length=1081344 MTF=1
INFO - DemodCache.read begin=0 length=1081344 MTF=0.3557894736842105
INFO - DemodCache.read begin=983040 length=1081344 MTF=0.3557894736842105
INFO - DemodCache.read begin=983040 length=1081344 MTF=0.39157894736842114
INFO - DemodCache.read begin=1769472 length=1081344 MTF=0.39157894736842114
INFO - DemodCache.read begin=2588672 length=1081344 MTF=0.38412280701754364

read requests blocks from the workers (including some readahead), waits until it's got all the blocks it wants, then returns. But it will use a block with a different MTF from what it requested provided it's within DemodCache.MTF_tolerance, which is currently 0.05. So for all but the first call above, read returns a random mix of blocks with different MTFs, depending on how far the worker threads have got with the blocks requested before...
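
A stripped-down sketch of that acceptance logic (made-up names, not DemodCache's actual code) shows why the output becomes timing-dependent - any already-demodulated block whose MTF is merely close enough gets reused:

```python
# Stripped-down sketch (made-up names, not DemodCache's actual code) of why a
# loose MTF tolerance makes the result depend on worker timing: a block that
# was demodulated with a nearby-but-different MTF is accepted as a match.
MTF_tolerance = 0.05

def pick_block(cached_blocks, begin, requested_mtf):
    """Return a cached block at `begin` whose MTF is 'close enough', if any."""
    for block in cached_blocks.get(begin, []):
        if abs(block["mtf"] - requested_mtf) <= MTF_tolerance:
            return block  # may carry data demodulated with a stale MTF
    return None

# A worker already demodulated this block with the previous MTF...
cache = {983040: [{"mtf": 0.3557894736842105, "data": "<demodulated with old MTF>"}]}
# ...and the new request (MTF=0.39157894736842114) still accepts it,
# because the two MTFs differ by less than 0.05.
print(pick_block(cache, 983040, 0.39157894736842114))
```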

Setting MTF_tolerance to a very small value (I tried 1e-20) fixes this, but there's probably a better approach?

happycube commented 1 year ago

Cutting the tolerance to .005 should minimize the effects, but I think I'll dig into things properly this weekend so that this is all a bit better controlled and lay the groundwork for the auto-filter-adjusting work I've been meaning to do for a long time now.