laktak / chkbit-py

Check your files for data corruption
MIT License

Feature request: Parity files so that backups can be healed #4

Open Joshfindit opened 2 years ago

Joshfindit commented 2 years ago

I got here from https://unix.stackexchange.com/questions/136947/protecting-data-against-bit-rot/533728#533728 and think you could successfully add the functionality to chkbit-py, making it more powerful.

laktak commented 2 years ago

I know of Parchive (PAR2) files, but they are not small or lightweight (which is what this tool tries to be).

Couldn't you run that in parallel to chkbit? What would be gained by integration? AFAIK you would have to create and store one (several?) PAR2 files for each file in your original folder. I don't think the current .chkbit file would be suitable for that.

Joshfindit commented 2 years ago

Parity generation is heavy, as is the verification. Healing is light enough, however.

What would be gained by integration? Great question.

Tbh, I don’t think PAR2 files are actually the right choice for integration, since they have issues with subfolders and small files. However, I think that using the same process to create parity data in Python is the winner. .chkbit could be a folder instead, and store the parity data within it.

Defaults could be:

laktak commented 2 years ago

I'm not yet sure if it would be practical from a resource point of view but it does sound interesting. Would you be willing to create a PR for this feature?

It's been a while since I used PAR2 files. Do you know how much space they usually need (for example, for a 3 MB JPEG or a 50 MB MP4)?

Joshfindit commented 2 years ago

I don’t currently program in Python, but I could take a shot at it using an existing reed-solomon library if there are any that align with the ideals of this repo.

I do not have the skills required to determine which packages are viable, but a quick Google search pointed me to reedsolo and unireedsolomon as possibilities.
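To make the "detect with a hash, heal with parity" idea concrete without committing to either library, here is a minimal stdlib-only sketch. It uses plain XOR parity, not Reed-Solomon, so it can repair only one damaged block per file, and it assumes bit rot that leaves the file length unchanged. The names `make_parity`, `heal`, and `BLOCK_SIZE` are hypothetical and not part of chkbit.

```python
import hashlib
from functools import reduce

BLOCK_SIZE = 4  # deliberately tiny for the demo; a real tool would use far larger blocks


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Split data into fixed-size blocks, zero-padding the last one."""
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    if blocks and len(blocks[-1]) < size:
        blocks[-1] = blocks[-1].ljust(size, b"\0")
    return blocks


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data: bytes) -> dict:
    """Return per-block hashes plus one XOR parity block (hypothetical layout)."""
    blocks = split_blocks(data)
    return {
        "length": len(data),
        "hashes": [hashlib.sha256(b).hexdigest() for b in blocks],
        "parity": xor_blocks(blocks),
    }


def heal(damaged: bytes, meta: dict) -> bytes:
    """Find the damaged block via its hash and rebuild it from the parity block."""
    blocks = split_blocks(damaged)
    bad = [i for i, b in enumerate(blocks)
           if hashlib.sha256(b).hexdigest() != meta["hashes"][i]]
    if len(bad) > 1:
        raise ValueError("XOR parity can only repair one damaged block")
    if bad:
        others = [b for i, b in enumerate(blocks) if i != bad[0]]
        # XOR of all intact blocks and the parity block yields the missing block.
        blocks[bad[0]] = xor_blocks(others + [meta["parity"]])
    return b"".join(blocks)[:meta["length"]]


original = b"family photos 2019"
meta = make_parity(original)
damaged = original[:5] + b"X" + original[6:]  # simulate one corrupted byte
repaired = heal(damaged, meta)
```

A Reed-Solomon library would take the place of `xor_blocks` here and allow several damaged blocks to be repaired, at the cost of more parity data and more CPU time during generation.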

Joshfindit commented 2 years ago

As to the question of size: It’s based on a percentage of the size of the data to be recovered. PAR/PAR2 files have some overhead, but that overhead is not required.

Backblaze, for example, stores raw parity data on 3 drives out of 20 and ends up with “five 9’s” of data durability. ZFS uses about 15%. Personally, I like to have about 30% of the original data in parity data, but most people are not as “wasteful”. Even 10% gives a lot of benefits.
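As rough arithmetic for the two example files asked about earlier, the sketch below computes raw parity size at a given percentage, ignoring any container overhead. The helper names are hypothetical, 1 MB is taken as 1,000,000 bytes, and "3 drives out of 20" is read as 17 data shards plus 3 parity shards.

```python
MB = 1_000_000  # assumption: decimal megabytes


def parity_bytes(data_bytes: int, percent: int) -> int:
    """Raw parity size when parity is `percent` of the original data size."""
    return data_bytes * percent // 100


def shard_overhead_percent(data_shards: int, parity_shards: int) -> float:
    """Parity as a percentage of the data it protects."""
    return 100 * parity_shards / data_shards


for label, size in [("3 MB JPEG", 3 * MB), ("50 MB MP4", 50 * MB)]:
    for percent in (10, 30):
        print(f"{label} at {percent}%: {parity_bytes(size, percent):,} bytes of parity")

# 3 parity shards protecting 17 data shards is about 17.6% relative to the data,
# or 15% of the total of 20 shards.
print(f"17+3 layout: {shard_overhead_percent(17, 3):.1f}% of the data size")
```

So at 10% the JPEG needs about 0.3 MB of parity and the MP4 about 5 MB; at 30% those figures become 0.9 MB and 15 MB.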

dia3olik commented 2 years ago

This feature would be great and it could be invoked only optionally so everyone would be happy ;-)

I also think using a subfolder to store the parity data would be a great choice.

Using a folder named .chkbit which would remain hidden as mentioned would be perfect imho.

laktak commented 1 year ago

If there is interest in this feature I will accept a PR.

Please discuss your implementation with me or accept changes to integrate it and understand that it should