lrq3000 / pyFileFixity

📂🛡️Suite of tools for file fixity (data protection for long term storage⌛) using redundant error correcting codes, hash auditing and duplications with majority vote, all in pure Python🐍
MIT License
133 stars 9 forks source link

Infinite compression #9

Closed wlwatkins closed 3 years ago

wlwatkins commented 3 years ago

Hi,

This is not an issue just a question. If I have an image, I make an ecc file from it. I then purposely remove the maximum number of bits before the file is too much corrupt. Then I make a second ecc iteration, and again I remove the maximum number of bits before it is too corrupt.

From this sophism I should be able to infinitely compress a file.

I know this is wrong, but why?

lrq3000 commented 3 years ago

Hello,

Oh I'm surprised people still use my software, I unfortunately couldn't finish migrating it to Python v3 although there's not much left to do, but missing time :-/

About your thought experiment, I think it's because you forget that you need to keep and use all the ECC files you generated along the way, and also that the Singleton Bound limits recovery to (n-k)/2, in other words 50% of the ECC length can be recovered.

Let's say you use an ECC code that with a length exactly as long as the number of characters it protects, eg, for 8 characters of the original file, the ECC generates 8 characters that can recover 4 characters. So now you can indeed delete 4 characters without loss. Then you re-encode the 4 remaining characters using the same ECC scheme, so it generates a 4 characters ECC that can recover 50% = 2 characters of the input file. You can then delete 2 characters and what remains are 2 characters.

In the end, you get the following:

With these 3 files, you can fully recover your complete original file in theory. But in total you are storing 8+4+2=14 characters, whereas your original file was 8 characters, so there is no compression at all.

Furthermore, it's not a more interesting alternative to first-level ECC code in terms of resiliency, because here despite the multi-level ECC code being shorter (14 characters vs 16 characters for initial file + first-level ECC code), here there is no resiliency against errors, if any characters of any of the 3 files of this multi-level ECC encoding gets corrupted, it's game over.