GB18030-encoded text is being detected as utf_16, big5 and cp037, and of those only big5 can decode it.
Detection as utf_16 is clearly wrong: text for that codec is expected to start with a UTF-16 BOM, so the library should be very cautious about that result. chardet has a patch to do exactly that: https://github.com/chardet/chardet/pull/109
The GB18030 BOM tends to result in detection as cp037. The BOM regularly causes problems in chardet-like libraries; cf. https://github.com/chardet/chardet/issues/178
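
To illustrate the kind of caution suggested above, here is a minimal sketch using only the Python standard library. It is not how chardet or this library implements detection; `sniff_bom` and `plausible` are hypothetical helper names, and the rule "reject utf_16 unless a UTF-16 BOM is present" is an assumption drawn from the reasoning in this report.

```python
import codecs

# GB18030 encoding of U+FEFF. The codecs module has no constant for it,
# so it is derived here; the bytes are 84 31 95 33.
BOM_GB18030 = "\ufeff".encode("gb18030")

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf_32"),
    (codecs.BOM_UTF32_BE, "utf_32"),
    (BOM_GB18030, "gb18030"),
    (codecs.BOM_UTF8, "utf_8_sig"),
    (codecs.BOM_UTF16_LE, "utf_16"),
    (codecs.BOM_UTF16_BE, "utf_16"),
)


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def plausible(data: bytes, candidate: str) -> bool:
    """Reject a detected candidate that is obviously wrong (hypothetical helper).

    Assumption: a utf_16 guess is only accepted when a UTF-16 BOM is present.
    Any candidate must also decode the payload without error.
    """
    if candidate == "utf_16" and sniff_bom(data) != "utf_16":
        return False
    try:
        data.decode(candidate)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


sample = BOM_GB18030 + "中文".encode("gb18030")
candidates = ["utf_16", "gb18030"]
accepted = [c for c in candidates if plausible(sample, c)]
```

Checking for a BOM before trusting the statistical guess is cheap, and it would also let a GB18030 BOM short-circuit to gb18030 rather than falling through to cp037.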