Closed nijel closed 3 months ago
I'll see what I can do. Can't promise anything.
Unfortunately, with the given content, I am unable to find anything that would help weigh in on the right direction. If you happen to know any "language" theory that would help in this case, I'll look into it.
As you are already aware, charset-normalizer handles 90+ encodings, and we absolutely want to avoid adding hardcoded logic like (latin1 > mac_latin).
But maybe we could apply this logic to tiny sequences only (e.g. to the order of the presented results).
Looking at the CI, both decoded strings (mac_latin & latin) are perfectly valid, and both actually exist if you search for these terms.
regards,
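To make the ambiguity above concrete, here is a minimal sketch using only Python's built-in codecs. The sample bytes are a hypothetical stand-in, not the actual data from the failing CI job, and `mac_latin2` is an assumption about which codec "mac_latin" refers to in this thread.

```python
# Hypothetical sample: a short name containing a single non-ASCII byte (0xFC).
raw = b"K\xfcster"

# Codec names are Python's built-in ones; "mac_latin2" is an assumption
# about which codec "mac_latin" means in this thread.
candidates = {}
for codec in ("latin_1", "mac_latin2"):
    try:
        candidates[codec] = raw.decode(codec)
    except UnicodeDecodeError:
        # Not expected here: both are single-byte codecs that accept 0xFC.
        candidates[codec] = None

# Both decodings succeed, which is why such a short input is ambiguous.
for codec, text in candidates.items():
    print(f"{codec}: {text!r}")
```

With only one distinguishing byte, nothing in the byte sequence itself rules out either candidate; any preference has to come from outside knowledge about the expected text.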
You most likely wouldn't write KŁster, but Kłster. I have no clue whether you can somehow weight such things as expected upper-casing.
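The casing idea above could be sketched roughly as follows. This is a hypothetical heuristic for illustration only; it is not part of charset-normalizer, and the function names are invented here. It counts uppercase letters that sit between another letter and a lowercase letter, which is the pattern that makes KŁster look wrong.

```python
def mid_word_uppercase_count(text: str) -> int:
    """Count uppercase letters that directly follow another letter and
    directly precede a lowercase letter (e.g. the 'Ł' in 'KŁster')."""
    count = 0
    for prev, cur, nxt in zip(text, text[1:], text[2:]):
        if prev.isalpha() and cur.isupper() and nxt.islower():
            count += 1
    return count


def rank_by_casing(decodings: dict[str, str]) -> list[str]:
    """Return the encoding names, least suspicious casing first."""
    return sorted(
        decodings,
        key=lambda name: mid_word_uppercase_count(decodings[name]),
    )


# Hypothetical candidate decodings of the same short input.
print(rank_by_casing({"mac_latin2": "KŁster", "latin_1": "Küster"}))
```

A rule like this would also penalise legitimate spellings such as "McDonald", so at best it could serve as a tie-breaker for very short inputs.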
OK, so we're most likely trapped on this one.
The fastest way to convince your upstream project is to use the main API (https://charset-normalizer.readthedocs.io/en/latest/user/advanced_search.html) and restrict the supported encodings to those of chardet:
cp_isolation=None,  # Finite list of encodings to use when searching for a match
This should fix the issue. If so, feel free to close it.
regards,
Thanks for the suggestion, I've given it a try at https://salsa.debian.org/python-debian-team/python-debian/-/merge_requests/135
Great.
cp_isolation=[f"iso-8859-{n}" for n in (1, 2, 7, 8, 9)]
Don't forget to add "ascii", "utf_8", "utf_16", ... the basic ones.
The encoding is first tried as utf-8, so that covers both ascii and utf-8 before passing to charset_normalizer. Mixing utf-16 into an ascii-like file seems unlikely to me.
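The flow described in the last few comments (try utf-8 first, then fall back to a restricted charset_normalizer search) could look roughly like the sketch below. The function name `decode_text` and the constant `ISOLATED_ENCODINGS` are hypothetical and are not taken from the python-debian merge request; the ISO-8859 subset is the one quoted in this thread, not an authoritative list of what chardet supports.

```python
from typing import Optional

# Encodings allowed in the fallback search (subset quoted in this thread).
ISOLATED_ENCODINGS = [f"iso-8859-{n}" for n in (1, 2, 7, 8, 9)]


def decode_text(raw: bytes) -> Optional[str]:
    """Decode as UTF-8 when possible, otherwise ask charset_normalizer."""
    try:
        # Valid ASCII is also valid UTF-8, so this covers both.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Imported lazily so the dependency is only needed on the fallback path.
    from charset_normalizer import from_bytes

    # cp_isolation restricts the search to the listed encodings.
    best = from_bytes(raw, cp_isolation=ISOLATED_ENCODINGS).best()
    # str() on a match yields the decoded text; None means no match.
    return str(best) if best is not None else None


print(decode_text("Küster".encode("utf-8")))
```

Because utf-8 is handled before the fallback, the isolation list only needs the legacy single-byte encodings; whether utf-16 belongs there depends on the kind of files being read.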
Notice I hereby announce that my raw input is not :
Provide the file test.zip
Verbose output
Using the CLI, run normalizer -v ./my-file.txt and paste the result in here.
Expected encoding
latin1 would probably be the best fit; chardet reports this as ISO-8859-9, which works as well.
Desktop (please complete the following information):
Additional context
Discovered when trying to port python-debian from chardet to charset_normalizer; the switch makes the test suite fail: https://salsa.debian.org/nijel/python-debian/-/jobs/5758249
I know the text is short, but it's challenging to sell migration to a different library when it breaks existing tests.