donlencho opened this issue 6 years ago (status: Open)
Same problem here.
Unfortunately I don't have plans to improve the detection quality. Could you share the data you get poor results with? I can take a look. Hopefully there will be some things to do to get around the issue.
Thanks.
See the attached file, which is encoded in Windows-1252 but detected as GB18030. Thanks for helping!
I'm attaching test files along with the results I get. As you can see, I'm satisfied with the Unicode detections (and mostly with the Asian encodings), but really disappointed by the ISO-8859 results, particularly for western European languages (ISO-8859-1 and ISO-8859-15, for instance, which are really widespread; see https://www.terena.org/activities/multiling/ml-docs/iso-8859.html). Thank you in any case!
big5-hkscs.txt → BIG5_HKSCS, reliable: false → OK
big5.txt → BIG5, reliable: false → OK
BIG5.txt → BIG5, reliable: true → OK
euc-jp.txt → GB (=GBA8030?), reliable: false → ~OK
euc-kr.txt → KSC (=?), reliable: false → OK
gbk.txt → GB (=GBA8030?), reliable: false → OK
IBM855.txt → CP-1256, reliable: false → Not OK
ISO-8859-15-CRLF.srt → ASCII, reliable: true → Not OK
ISO-8859-15 euro.txt → ASCII, reliable: true → Not OK
ISO-8859-15 petit test.txt → CP1250, reliable: true → Not OK
ISO-8859-15.srt → ASCII, reliable: true → Not OK
ISO-8859-1.srt → ASCII, reliable: true → Not OK
ISO-8859-6.srt → Arabic, reliable: true → OK
shift_jis.txt → SJC, reliable: true → OK
UTF16BE.srt → UTF16BE, reliable: false → OK
UTF16LE.srt → UTF16LE, reliable: false → OK
UTF-7.txt → ASCII-7 bits, reliable: true → OK
UTF8BOM.srt → UTF8, reliable: true → OK
utf-8 CN.txt → UTF8, reliable: true → OK
UTF8CRLF.srt → UTF8, reliable: true → OK
UTF8CR.srt → UTF8, reliable: true → OK
UTF8LF.srt → UTF8, reliable: true → OK
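One cheap way to catch the "ASCII, reliable: true" cases in the list above is to check the raw bytes yourself: ASCII only covers 0x00 to 0x7F, so any byte at or above 0x80 contradicts an ASCII verdict. The following is a minimal Python sketch of that check, not part of compact_enc_det; the function name `check_ascii_verdict` and the idea of passing the detector's result in as a plain string are assumptions made for illustration.

```python
def check_ascii_verdict(data: bytes, detected: str) -> bool:
    """Return False when the detector reported ASCII but the data is not 7-bit.

    `detected` is the encoding name reported by whatever detector is in use
    (hypothetical input; this sketch does not call compact_enc_det itself).
    """
    if detected.upper().startswith("ASCII"):
        # bytes.isascii() is True only if every byte is in the range 0x00-0x7F.
        return data.isascii()
    # Any other verdict is outside the scope of this check.
    return True


if __name__ == "__main__":
    sample = "café".encode("iso-8859-15")  # contains the byte 0xE9
    print(check_ascii_verdict(sample, "ASCII"))
    print(check_ascii_verdict(b"cafe", "ASCII"))
```

A `False` result only says the ASCII verdict cannot be right; it does not say which 8-bit encoding the file actually uses.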
Just on the off-chance... do you have any idea how this might be tackled, in case somebody else wants to take a crack at it?
I get good results detecting Unicode encodings and Asian code pages, but really poor results with files in common European languages saved in the ISO-8859 family. These encodings are very common, and this problem makes compact_enc_det unusable for me: they are always detected as ASCII, with reliable set to true. ISO-8859-6 for Arabic is OK. Am I the only one? Thanks for letting me know, so I can check whether there is a problem or just look for an alternative.
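For anyone looking for a stopgap for western European files, a byte-level fallback can be sketched in a few lines of Python. This is a rough heuristic under stated assumptions, not how compact_enc_det works: the function name `guess_western_encoding` is hypothetical, and it assumes the input is known to be western European text. It relies on two facts: non-ASCII text that decodes as strict UTF-8 is very likely UTF-8, and bytes 0x80 to 0x9F are control codes in ISO-8859-1/15 but mostly printable characters in Windows-1252.

```python
def guess_western_encoding(data: bytes) -> str:
    """Heuristic guess for western European text (hypothetical helper)."""
    if data.isascii():
        return "ascii"
    try:
        # Strict UTF-8 decoding rejects most legacy 8-bit text.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 0x80-0x9F are C1 control codes in ISO-8859-x and rarely occur in real
    # text, while Windows-1252 maps most of them to printable characters.
    if any(0x80 <= b <= 0x9F for b in data):
        return "cp1252"
    # ISO-8859-1 and ISO-8859-15 cannot be told apart by byte validity alone.
    return "iso-8859-1"


if __name__ == "__main__":
    print(guess_western_encoding("café".encode("iso-8859-1")))
    print(guess_western_encoding("€5".encode("cp1252")))
```

ISO-8859-1 and ISO-8859-15 differ in only eight code points (0xA4, for example, is the euro sign in ISO-8859-15), so separating them needs knowledge of the language or content rather than a byte-range test.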