google / compact_enc_det

compact_enc_det - Compact Encoding Detection
Apache License 2.0

Failing to detect ISO-8859 encodings #8

Open donlencho opened 6 years ago

donlencho commented 6 years ago

I get good results detecting Unicode encodings and Asian codepages, but really poor results with files in common European languages saved in the ISO-8859 family. These encodings are very common, and this problem makes compact_enc_det unusable for me. For these files the encoding is always detected as ASCII (with reliable set to true). ISO-8859-6 for Arabic is OK. Am I the only one? Please let me know, so I can check whether there is a problem or just look for an alternative.

ghost commented 6 years ago

Same problem here.

JinsukKim commented 6 years ago

Unfortunately I don't have plans to improve the detection quality. Could you share the data you get poor results with? I can take a look. Hopefully there is something that can be done to work around the issue.

Thanks.

ghost commented 6 years ago

See the attached file, which is encoded in Windows-1252 but detected as GB18030. Thanks for helping!

ansi.txt

donlencho commented 6 years ago

I'm attaching test files and the results I get. As you can see, I'm satisfied with the Unicode detections (and mostly with the Asian encodings) but really disappointed by the ISO-8859 ones, particularly for western European languages (ISO-8859-1 and ISO-8859-15 for instance, which are really widespread, see: https://www.terena.org/activities/multiling/ml-docs/iso-8859.html ). Thank you in any case!

big5-hkscs.txt              →  BIG5_HKSCS, reliable: false      →  OK
big5.txt                    →  BIG5, reliable: false            →  OK
BIG5.txt                    →  BIG5, reliable: true             →  OK
euc-jp.txt                  →  GB (=GBA8030?), reliable: false  →  ~OK
euc-kr.txt                  →  KSC (=?), reliable: false        →  OK
gbk.txt                     →  GB (=GBA8030?), reliable: false  →  OK
IBM855.txt                  →  CP-1256, reliable: false         →  Not OK
ISO-8859-15-CRLF.srt        →  ASCII, reliable: true            →  Not OK
ISO-8859-15 euro.txt        →  ASCII, reliable: true            →  Not OK
ISO-8859-15 petit test.txt  →  CP1250, reliable: true           →  Not OK
ISO-8859-15.srt             →  ASCII, reliable: true            →  Not OK
ISO-8859-1.srt              →  ASCII, reliable: true            →  Not OK
ISO-8859-6.srt              →  Arabic, reliable: true           →  OK
shift_jis.txt               →  SJC, reliable: true              →  OK
UTF16BE.srt                 →  UTF16BE, reliable: false         →  OK
UTF16LE.srt                 →  UTF16LE, reliable: false         →  OK
UTF-7.txt                   →  ASCII-7 bits, reliable: true     →  OK
UTF8BOM.srt                 →  UTF8, reliable: true             →  OK
utf-8 CN.txt                →  UTF8, reliable: true             →  OK
UTF8CRLF.srt                →  UTF8, reliable: true             →  OK
UTF8CR.srt                  →  UTF8, reliable: true             →  OK
UTF8LF.srt                  →  UTF8, reliable: true             →  OK

encodings.tar.gz
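
For the rows reported as ASCII with reliable: true, one thing worth ruling out is whether the test files actually contain any bytes outside the 7-bit range. A file saved as ISO-8859-1 or ISO-8859-15 that happens to contain only unaccented text is byte-for-byte valid ASCII, while a single byte at 0x80 or above makes an ASCII verdict impossible. The following is a minimal caller-side sketch in Python, not part of compact_enc_det; the function names are hypothetical and the sample string is made up for illustration.

```python
def ascii_verdict_is_plausible(data: bytes) -> bool:
    """Return True only if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)


def high_byte_offsets(data: bytes, limit: int = 5) -> list:
    """Return offsets of the first few bytes >= 0x80, for manual inspection."""
    offsets = []
    for i, b in enumerate(data):
        if b >= 0x80:
            offsets.append(i)
            if len(offsets) == limit:
                break
    return offsets


# Illustrative sample (an assumption, not one of the attached files):
# "Où est le café ?" encoded as ISO-8859-15.
sample = "O\u00f9 est le caf\u00e9 ?".encode("iso-8859-15")

print(ascii_verdict_is_plausible(sample))
print(high_byte_offsets(sample))
```

If such a check reports high bytes for a file that a detector labelled ASCII, the label can be rejected without knowing which ISO-8859 variant the file really uses.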

Lord-Kamina commented 5 years ago

Just on the off chance: do you have any idea how this might be tackled, in case somebody else wants to take a crack at it?
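
As background for anyone picking this up: the western European single-byte encodings mentioned in this thread overlap almost entirely, so a detector has few byte values to tell them apart. ISO-8859-15 differs from ISO-8859-1 at eight byte values, and Windows-1252 differs from ISO-8859-1 only in the 0x80 to 0x9F range. The sketch below uses Python's standard codecs to list those byte values; it says nothing about how compact_enc_det scores candidates, and `differing_bytes` is a hypothetical helper name.

```python
def differing_bytes(enc_a: str, enc_b: str) -> list:
    """Return byte values 0x80-0xFF that decode differently under two codecs.

    Bytes that a codec leaves undefined decode to U+FFFD (errors="replace"),
    so they also count as differences.
    """
    out = []
    for value in range(0x80, 0x100):
        raw = bytes([value])
        a = raw.decode(enc_a, errors="replace")
        b = raw.decode(enc_b, errors="replace")
        if a != b:
            out.append(value)
    return out


latin1_vs_latin9 = differing_bytes("iso-8859-1", "iso-8859-15")
latin1_vs_cp1252 = differing_bytes("iso-8859-1", "cp1252")

print([hex(v) for v in latin1_vs_latin9])
print([hex(v) for v in latin1_vs_cp1252])
```

Text that never uses one of these byte values decodes identically under the encodings being compared, so for such input the choice between them cannot be made from the bytes alone.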