lrq3000 / pyFileFixity

📂🛡️Suite of tools for file fixity (data protection for long term storage⌛) using redundant error correcting codes, hash auditing and duplications with majority vote, all in pure Python🐍
MIT License

Non latin-1 filenames are not supported #13

Open spock opened 10 months ago

spock commented 10 months ago

Thank you for a very well thought-out tool! I am currently evaluating it for keeping my 400+GB, 50k-file archive safe(r).

While doing so, came across this exception:

Traceback (most recent call last):
  File "/home/user/.local/bin/pff", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/pff.py", line 108, in main
    return saecc_main(argv=subargs, command=fullcommand)
  File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/structural_adaptive_ecc.py", line 574, in main
    relfilepath_ecc = compute_ecc_hash_from_string(relfilepath, ecc_manager_intra, hasher_intra, max_block_size, resilience_rate_intra)
  File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/structural_adaptive_ecc.py", line 203, in compute_ecc_hash_from_string
    fpfile = BytesIO(b(string))
  File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/lib/_compat.py", line 36, in b
    return codecs.latin_1_encode(x)[0]
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 16-26: ordinal not in range(256)

Looking at the code, it seems that latin-1 is used as an internal encoding, which indeed cannot handle characters outside latin-1:

if sys.version_info < (3,):
    def b(x):
        return x
else:
    import codecs
    def b(x):
        if isinstance(x, _str):
            return codecs.latin_1_encode(x)[0]  # <-- here
        else:
            return x

Problematic filename had Ukrainian/Cyrillic characters, which I think are not a part of latin-1 encoding.
Example string: зображення.
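As a quick check of the point above, here is a minimal sketch (standalone, not taken from pyFileFixity) showing that every character of the example string has a code point above 255, which is the highest value latin-1 can encode. The variable names are illustrative only.

```python
# Standalone check: are the characters of the example filename inside latin-1?
name = "зображення"

# latin-1 can only encode code points 0..255.
code_points = [ord(ch) for ch in name]
outside_latin1 = [ch for ch in name if ord(ch) > 255]

print(len(name), "characters,", len(outside_latin1), "outside latin-1")

# Encoding with latin-1 therefore fails, as in the traceback above.
try:
    name.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print("latin-1 encodable:", latin1_ok)
```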

pyFileFixity version 3.1.4 installed with pip. I'm on Python 3.10.12.

spock commented 10 months ago

Replacing

return codecs.latin_1_encode(x)[0]
# codecs.latin_1_encode("зображення")

with

return codecs.utf_8_encode(x)[0]
# codecs.utf_8_encode("зображення")

will work, but will likely raise an exception elsewhere, where latin-1 is expected?
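To illustrate the concern, here is a small standalone sketch (not pyFileFixity code) of what changes when the encoding is switched. It assumes, hypothetically, that some other part of a program still decodes the bytes as latin-1. In that case Python would not raise an exception, because every byte value is valid latin-1, but the decoded text would silently differ from the original.

```python
import codecs

name = "зображення"

# UTF-8 can encode the string; each Cyrillic letter here takes 2 bytes.
encoded = codecs.utf_8_encode(name)[0]
print(len(name), "characters ->", len(encoded), "bytes")

# Decoding with the matching codec restores the original string.
roundtrip = encoded.decode("utf-8")

# Decoding the same bytes as latin-1 never fails, but yields one
# (wrong) character per byte instead of the original text.
mojibake = encoded.decode("latin-1")

print("utf-8 round trip ok:", roundtrip == name)
print("latin-1 decode matches:", mojibake == name)
```

So the risk is less an exception elsewhere than a silent mismatch wherever encoding and decoding do not use the same codec, or wherever code assumes one byte per character.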

lrq3000 commented 10 months ago

Thank you for your feedback!

I think your change should be fine. There is an exhaustive unit test suite: you can try running it with your change, and if it passes then you are good.

I will try to do that myself, but no guarantees, as I have a big backlog of project maintenance...

lrq3000 commented 10 months ago

Or you know what? If you can make a PR, an automated CI workflow will run the unit tests online, so you won't need to run them yourself. It will also allow me to credit you properly for this change :-)

lrq3000 commented 10 months ago

OK, so I remember why it is in latin-1: the software encodes byte by byte, and a byte can only take 256 distinct values (0 to 255), so the idea was to use latin-1 as a codec where necessary, since it maps each of those values to exactly one character. Normally, though, these should be treated as bytes.
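The byte-by-byte reasoning can be checked with a short standalone sketch (again, not pyFileFixity code): in Python, the latin-1 codec maps each of the 256 possible byte values to the character with the same code point, and back, so arbitrary bytes survive a round trip through a string.

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# latin-1 decodes byte value n to the character with code point n...
text = all_bytes.decode("latin-1")
same_code_points = [ord(ch) for ch in text] == list(range(256))

# ...and encodes it back to the identical byte, so nothing is lost.
lossless = text.encode("latin-1") == all_bytes

print("one character per byte:", len(text) == len(all_bytes))
print("code point equals byte value:", same_code_points)
print("lossless round trip:", lossless)
```

This property holds only in the bytes-to-text direction. Text that starts out as real Unicode, such as a Cyrillic filename, has code points above 255 and cannot be pushed through latin-1.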

This is old code left over from the Python 2/3 compatibility era. Now that Py2 support has been dropped everywhere, I should rewrite it to be more idiomatic Py3.

Could you please share a minimal example file that reproduces this issue? A simple text file with some random non-latin-1 characters in its name should be enough (I'll try to make some myself, but it would be good if you could provide an example file too, just in case).

lrq3000 commented 10 months ago

Ok I can reproduce the issue using the example filename you provided, thank you very much. I can't believe I never tested a non-latin-1 filename. I will work on it, hopefully it's not too complicated.