Open spock opened 10 months ago
Replacing
return codecs.latin_1_encode(x)[0]
# codecs.latin_1_encode("зображення")
with
return codecs.utf_8_encode(x)[0]
# codecs.utf_8_encode("зображення")
works, but won't it likely raise an exception elsewhere, where latin-1
is expected?
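To make the trade-off concrete, here is a minimal sketch using only the standard `codecs` module and the example string from this issue. It shows that the UTF-8 encoder accepts the string while latin-1 rejects it, and that the UTF-8 bytes are longer than the character count. Whether any other part of pyFileFixity assumes one byte per character, or decodes these bytes as latin-1 later, is not verified here; the variable names are illustrative only.

```python
import codecs

name = "зображення"  # example string from this issue

# latin-1 only covers code points 0..255, so Cyrillic letters are rejected.
try:
    codecs.latin_1_encode(name)
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can encode any valid str; each of these Cyrillic letters takes 2 bytes.
utf8_bytes = codecs.utf_8_encode(name)[0]

# Decoding UTF-8 bytes as latin-1 never raises, but it silently yields
# different text, which is the kind of mismatch that could surface elsewhere.
roundtrip_wrong = utf8_bytes.decode("latin-1")
roundtrip_right = utf8_bytes.decode("utf-8")

print(latin1_ok, len(name), len(utf8_bytes), roundtrip_right == name)
```

The point of the last two lines is that a latin-1 decode of UTF-8 data does not fail loudly, so a mismatch would show up as wrong text rather than as an exception.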
Thank you for your feedback!
I think your change should be fine. There is an exhaustive unit test suite; you can try running it with your change, and if it passes then you are good.
I will try to do that myself, but no guarantee: I have a big backlog of project maintenance...
12 Nov. 2023 19:59:34 Bogdan @.***>:
Thank you for a very thought-out tool! Currently evaluating it for keeping my 400+GB, 50k-file archive safe(r).
While doing so, came across this exception:
Traceback (most recent call last):
  File "/home/user/.local/bin/pff", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/pff.py", line 108, in main
    return saecc_main(argv=subargs, command=fullcommand)
  File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/structural_adaptive_ecc.py", line 574, in main
    relfilepath_ecc = compute_ecc_hash_from_string(relfilepath, ecc_manager_intra, hasher_intra, max_block_size, resilience_rate_intra)
  File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/structural_adaptive_ecc.py", line 203, in compute_ecc_hash_from_string
    fpfile = BytesIO(b(string))
  File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/lib/_compat.py", line 36, in b
    return codecs.latin_1_encode(x)[0]
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 16-26: ordinal not in range(256)

Looking at the code, it seems that latin-1 is used as an internal encoding - which can indeed not handle some of the non-latin-1 characters:

if sys.version_info < (3,):
    def b(x):
        return x
else:
    import codecs
    def b(x):
        if isinstance(x, _str):
            return codecs.latin_1_encode(x)[0]  # <-- here
        else:
            return x

Problematic filename had Ukrainian/Cyrillic characters, which I think are not a part of latin-1 encoding. Example string: зображення.
pyFileFixity version 3.1.4 installed with pip. I'm on Python 3.10.12.
Or you know what? If you can make a PR, an automated CI workflow will run the unit tests online, so you won't need to run them yourself. It will also allow me to credit you properly for this change :-)
Ok so I remember why it is in latin-1: the software encodes byte by byte, and a byte can only take 256 values (0 to 255), so the idea was to use latin-1 as a codec if necessary, but normally these should be treated as bytes.
This is old code left over from the Python 2/3 compatibility era. Now that Py2 support has been dropped everywhere, I should rewrite this code to be more idiomatic Py3.
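As a rough illustration of what a Python 3 only version of the helper could look like, here is a minimal sketch. It is not the fix adopted in pyFileFixity: the `encoding` parameter and the UTF-8 default are assumptions made for this example, and the name `b` is reused only to mirror the existing helper in `_compat.py`.

```python
from io import BytesIO


def b(x, encoding="utf-8"):
    """Return x as bytes: encode str with `encoding`, pass anything else through.

    Hypothetical sketch; the UTF-8 default is an assumption, not the project's choice.
    """
    if isinstance(x, str):
        return x.encode(encoding)
    return x


# The call pattern from the traceback, BytesIO(b(string)), now accepts Cyrillic.
buffer = BytesIO(b("зображення"))
data = buffer.read()
print(len(data), b(b"raw bytes") == b"raw bytes")
```

Values that are already bytes are returned unchanged, which matches the intent that this data should normally be treated as bytes. For filenames specifically, the standard library's `os.fsencode` is another option, since it uses the filesystem encoding.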
Could you please share a minimal example file that reproduces this issue? A simple text file with some random non-latin-1 characters should be enough (I'll try to make one myself, but it would help if you could provide an example file too).
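One way such an example could be generated is sketched below, assuming the trigger is the filename rather than the file contents, as the traceback suggests. The helper names, the `.txt` extension, and the file contents are all made up for illustration.

```python
import tempfile
from pathlib import Path

# Example string from this issue plus a hypothetical extension.
SAMPLE_NAME = "зображення.txt"


def make_sample(directory):
    """Create a small file whose name cannot be encoded as latin-1."""
    path = Path(directory) / SAMPLE_NAME
    path.write_text("sample content\n", encoding="utf-8")
    return path


def is_latin1_encodable(text):
    """Return True if every character of text fits in latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    sample = make_sample(tmp)
    created = sample.exists()
    print(created, is_latin1_encodable(sample.name))
```

The temporary directory is deleted when the `with` block exits, so to keep the file for testing the tool, `make_sample` would need to be pointed at a persistent directory instead.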
Ok I can reproduce the issue using the example filename you provided, thank you very much. I can't believe I never tested a non-latin-1 filename. I will work on it, hopefully it's not too complicated.