lrq3000 / pyFileFixity

📂🛡️Suite of tools for file fixity (data protection for long term storage⌛) using redundant error correcting codes, hash auditing and duplications with majority vote, all in pure Python🐍
MIT License

file encoding on Linux / Windows #7

Open odavy37 opened 5 years ago

odavy37 commented 5 years ago

Dear Sir,

I am using pyFileFixity, which is a very efficient tool.

However, the output files for checksums, headers and ECC seem to be OS-dependent (charset-dependent), which can lead to errors.

For example, I computed checksums on Windows 10, then verified them on Linux (Debian buster, UTF-8 encoding), and got multiple errors. It appears that the output file was written as CP-1252 on Windows but read as UTF-8 on Linux. This was easily fixed for the checksum output file by converting it to UTF-8. But I am wondering whether header and ECC output files computed on Windows will work if I repair the files on Linux. Is there an option in pyFileFixity to set the charset (UTF-8) used for reading and writing files?
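
The manual conversion described above can be sketched as follows. This is only an illustration and not part of pyFileFixity: the function name `transcode_to_utf8` is hypothetical, and the source encoding is an assumption that has to match whatever the Windows machine actually used.

```python
from pathlib import Path


def transcode_to_utf8(src, dst, source_encoding="cp1252"):
    """Re-encode a text file to UTF-8.

    source_encoding is an assumption: it must match the code page the
    file was really written with, otherwise decoding fails or garbles text.
    """
    # Work on raw bytes so that line endings are left untouched.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

Reading and writing bytes, rather than opening the files in text mode, avoids any newline translation during the conversion.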

Best regards,

Olivier

lrq3000 commented 1 year ago

This is not normal: the encoding should be platform-independent. I must have missed specifying an encoding in a few places. My bad, I am very sorry for this oversight.

It will take me some time to find the causes and fix them because the codebase is huge, but it will be done eventually.

Note to self: in headers, also add the Python version, OS and CPU architecture to future-proof even more (alongside the pyFileFixity version).
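
A rough sketch of how that metadata could be collected with the standard library. The function name and the dictionary keys are hypothetical, and this says nothing about how the values would actually be laid out in a pyFileFixity header.

```python
import platform


def environment_metadata():
    """Collect the environment fields mentioned in the note above."""
    return {
        "python_version": platform.python_version(),  # e.g. "3.11.4"
        "os": platform.system(),                      # e.g. "Linux" or "Windows"
        "cpu_architecture": platform.machine(),       # e.g. "x86_64"
    }
```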

lrq3000 commented 1 year ago

> But I am wondering if header and ECC output files - computed on Windows, will work if I repair the file on Linux? Is there an option in pyfileFixity to set charset of reading / output files (utf-8)?

Checksum files (generated by rfigc.py, I presume) are very different from ECC files: checksum files are stored as CSV, whereas ECC files are binary. So no, there should be no encoding issue at all with ECC files, and I was particularly careful about the ECC implementation. ECC files are handled at a low level, so their formatting should be consistent across platforms. They even store their own filepath metadata to allow file scraping in the future, and the formatting of the path string was made consistent across platforms (and it was hella hard!).
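
To illustrate the general idea of a platform-independent path string (this is not the code used in pyFileFixity, and the helper name `normalize_relpath` is made up for the example), one common approach is to store relative paths with forward slashes regardless of the OS that produced them:

```python
from pathlib import PureWindowsPath


def normalize_relpath(path):
    """Return a relative path using forward slashes only.

    PureWindowsPath accepts both "\\" and "/" as separators, so this works
    for paths written on either platform. Caveat: a POSIX filename that
    really contains a backslash would be altered by this sketch.
    """
    return PureWindowsPath(path).as_posix()
```

For example, `normalize_relpath("dir\\sub\\file.txt")` and `normalize_relpath("dir/sub/file.txt")` both yield the same stored form.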

For the CSV files from rfigc.py, I was also careful, but it seems not careful enough. With Python 2, CSV files were handled as binary data streams, so it was all fine; starting with Python 3, they are treated as text streams, so an encoding now comes into play. Furthermore, line endings are inconsistent, and this inconsistency seems to come from the CSV writer itself, which sometimes just ignores the newline and lineterminator settings.
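
As a minimal sketch of what pinning these settings looks like in Python 3 (not the actual rfigc.py code; the function names and the row layout are assumptions for illustration), both the encoding and the line ending can be set explicitly instead of relying on platform defaults:

```python
import csv


def write_checksum_csv(path, rows):
    """Write rows as UTF-8 CSV with "\\n" line endings on every platform."""
    # newline="" stops the text layer from translating line endings,
    # so the csv writer's lineterminator is written out unchanged.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerows(rows)


def read_checksum_csv(path):
    """Read the file back with the same explicit encoding."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        return list(csv.reader(f))
```

Without `encoding=`, `open()` falls back to the locale's preferred encoding, which is typically a legacy code page on Windows and UTF-8 on most Linux systems.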

So I'm going to try to fix the issues you mention as best I can, but I cannot guarantee that rfigc.py will produce the exact same CSV files across platforms, because by design (RFC 4180), CSV files are made to produce different outputs per platform.