thombashi / mbstrdecoder

Python library for decoding multi-byte character strings.
MIT License

GB18030 #3

Open jayvdb opened 5 years ago

jayvdb commented 5 years ago

GB18030-encoded text is being detected as utf_16, big5 and cp037, and of those only big5 can decode it.
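To make "can decode it" concrete, here is a minimal sketch of checking which detected codecs survive a strict decode. The helper name `decodable_codecs` is hypothetical and is not part of mbstrdecoder or chardet.

```python
def decodable_codecs(data: bytes, candidates: list[str]) -> list[str]:
    """Return the candidate codec names that decode `data` without error.

    Hypothetical helper for illustration only.
    """
    ok = []
    for name in candidates:
        try:
            data.decode(name, errors="strict")
        except (UnicodeError, LookupError):
            # UnicodeError: the bytes are invalid for this codec.
            # LookupError: the codec name is unknown to Python.
            continue
        ok.append(name)
    return ok
```

A successful decode only shows the bytes are valid for that codec, not that the codec is the right one: latin_1, for example, accepts any byte sequence.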

Detection as utf_16 is very wrong, as text in that codec must contain the UTF-16 BOM, so the library should be very cautious about that result. chardet has a patch that does exactly that: https://github.com/chardet/chardet/pull/109

The GB18030 BOM tends to result in detection as cp037. The BOM regularly causes problems in chardet-like libraries; cf. https://github.com/chardet/chardet/issues/178
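A minimal sketch of the kind of caution suggested here, assuming a BOM check runs before a detector's guess is trusted. The names `sniff_bom` and `accept_detection` are hypothetical; this is neither mbstrdecoder's code nor the chardet patch.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf_32"),
    (codecs.BOM_UTF32_BE, "utf_32"),
    (b"\x84\x31\x95\x33", "gb18030"),  # U+FEFF encoded as GB18030
    (codecs.BOM_UTF8, "utf_8_sig"),
    (codecs.BOM_UTF16_LE, "utf_16"),
    (codecs.BOM_UTF16_BE, "utf_16"),
]


def sniff_bom(data: bytes):
    """Return a codec name implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def accept_detection(data: bytes, detected: str):
    """Prefer a BOM when present; reject a utf_16 guess that has no BOM.

    Hypothetical policy for illustration: returns a codec name or None.
    """
    bom_codec = sniff_bom(data)
    if bom_codec is not None:
        return bom_codec
    if detected == "utf_16":
        return None
    return detected
```

With this policy, data starting with the bytes `84 31 95 33` is treated as gb18030 regardless of the detector's guess, and a utf_16 guess for BOM-less data is discarded so the next candidate can be tried.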

thombashi commented 4 years ago

Thank you for your report.

For now, I will wait for the PR to be merged.