Open mgorny opened 2 years ago
I ran into the same problem. Here's a snippet that can be used to show the differences between chardet and cchardet.
import glob

import cchardet
import chardet

# Compare the encoding each detector reports for every sample file.
for path in glob.glob('tests/illformed/chardet/*'):
    with open(path, 'rb') as f:
        data = f.read()
    enc1 = chardet.detect(data)['encoding']
    enc2 = cchardet.detect(data)['encoding']
    status = 'same' if enc1 == enc2 else 'different'
    print('%-40s %-20s %-20s %s' % (path, enc1, enc2, status))
tests/illformed/chardet/koi8r.xml        KOI8-R               KOI8-R               same
tests/illformed/chardet/windows1255.xml  windows-1255         WINDOWS-1255         different
tests/illformed/chardet/gb2312.xml       GB2312               GB18030              different
tests/illformed/chardet/big5.xml         Big5                 BIG5                 different
tests/illformed/chardet/shiftjis.xml     SHIFT_JIS            SHIFT_JIS            same
tests/illformed/chardet/eucjp.xml        EUC-JP               EUC-JP               same
tests/illformed/chardet/euckr.xml        EUC-KR               UHC                  different
tests/illformed/chardet/tis620.xml       TIS-620              TIS-620              same
When cchardet-2.1.7 and chardet-5.0.0 are both installed, the following tests fail.
FWICS, two of them fail because the encoding names differ only in letter case (the expected name is mixed-case, the detected one is uppercase), and the other two are detected as a superset of the expected encoding (EUC-KR as UHC, and GB2312 as GB18030).
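
For illustration, here is a minimal sketch of a comparison that tolerates both kinds of mismatch described above. The names `SUPERSET_OF`, `normalize` and `same_encoding` are hypothetical and belong to neither chardet nor cchardet. The mapping covers only the two superset pairs seen in the output, and treating a superset as equal to its base encoding is an assumption that may not suit every test.

```python
# Hypothetical helper, not part of chardet or cchardet.
# Maps a superset encoding to the narrower encoding it extends.
# Only the two pairs observed in the output are listed (assumption).
SUPERSET_OF = {
    "uhc": "euc-kr",
    "gb18030": "gb2312",
}


def normalize(name):
    """Lowercase an encoding name and fold known supersets to their base."""
    if name is None:  # detect() can report no encoding at all
        return None
    key = name.lower()
    return SUPERSET_OF.get(key, key)


def same_encoding(a, b):
    """Compare two detected encoding names, ignoring case and supersets."""
    return normalize(a) == normalize(b)
```

With this, `same_encoding('windows-1255', 'WINDOWS-1255')` and `same_encoding('EUC-KR', 'UHC')` both hold, while unrelated encodings such as KOI8-R and Big5 still compare as different.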