groupe-sii / ogham

Sending email, sms or whatever is a piece of cake
https://groupe-sii.github.io/ogham/
Apache License 2.0

Charset detection #81

Open aurelien-baudet opened 5 years ago

aurelien-baudet commented 5 years ago

Add real implementations for charset detection. Use charset detection every time getBytes is used. Possible libraries or tools:

Also provide a way for users to override automatic guessing for a particular file.
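
A minimal sketch of how such an override could sit in front of automatic guessing. Everything here is an assumption for illustration and not Ogham's actual API: the class name `OverridableCharsetResolver`, the idea of keying overrides by path, and the "guess" itself, which is only a strict UTF-8 validity check from the JDK followed by a configurable fallback (a real implementation would delegate to a detection library instead).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Ogham: user overrides win over guessing.
final class OverridableCharsetResolver {
    // Overrides registered by the user, keyed by resource path (assumption).
    private final Map<String, Charset> overrides = new HashMap<>();
    // Charset returned when the content is not valid UTF-8.
    private final Charset fallback;

    OverridableCharsetResolver(Charset fallback) {
        this.fallback = fallback;
    }

    // Force a charset for one particular file, bypassing any guessing.
    void override(String path, Charset charset) {
        overrides.put(path, charset);
    }

    // Return the forced charset if one is registered, otherwise guess.
    Charset resolve(String path, byte[] content) {
        Charset forced = overrides.get(path);
        if (forced != null) {
            return forced;
        }
        return isStrictUtf8(content) ? StandardCharsets.UTF_8 : fallback;
    }

    // True when the bytes decode as UTF-8 without any malformed sequence.
    static boolean isStrictUtf8(byte[] content) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(content));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

With this shape, callers would ask the resolver for a charset instead of hard-coding one, so an explicit override always takes precedence over whatever the guess would have returned.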

Development branch is features/charset/detection

As far as I know, no general-purpose library in this area suits every type of problem. For each problem you should test the existing libraries and select the one that best satisfies your constraints, but often none of them is appropriate. In these cases you can write your own encoding detector! As I have written ...

I’ve written a meta Java tool for detecting the charset encoding of HTML web pages, using IBM ICU4j and Mozilla JCharDet as the built-in components. Here you can find my tool; please read the README section before anything else. You can also find some basic concepts of this problem in my paper and in its references.

Below are some helpful comments based on what I have experienced in my work:

  • Charset detection is not a foolproof process: it is essentially based on statistical data, so what actually happens is guessing, not detecting
  • icu4j, by IBM, is the main tool in this context, imho
  • Both TikaEncodingDetector and Lucene-ICU4j use icu4j, and their accuracy did not differ meaningfully from plain icu4j in my tests (by at most 1%, as I remember)
  • icu4j is much more general than jchardet; icu4j is only slightly biased toward the IBM family of encodings, while jchardet is strongly biased toward UTF-8
  • Because UTF-8 is so widespread in the HTML world, jchardet is overall a better choice than icu4j, but it is not the best choice!
  • icu4j is great for East Asian encodings like EUC-KR, EUC-JP, SHIFT_JIS, BIG5 and the GB family of encodings
  • Both icu4j and jchardet perform very poorly on HTML pages encoded in Windows-1251 and Windows-1256. Windows-1251, aka cp1251, is widely used for Cyrillic-based languages like Russian, and Windows-1256, aka cp1256, is widely used for Arabic
  • Almost all encoding detection tools use statistical methods, so the accuracy of the output strongly depends on the size and the contents of the input
  • Some encodings are essentially the same, with only minor differences, so in some cases the guessed or detected encoding may be nominally wrong and yet still decode the text correctly, as with Windows-1252 and ISO-8859-1 (refer to the last paragraph of section 5.2 of my paper)
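
The last point can be checked with the JDK alone. Windows-1252 and ISO-8859-1 map bytes identically except in the range 0x80 to 0x9F, so a detector that picks the "wrong" one of the two still yields the same text whenever no byte falls in that range. The sketch below is only an illustration of that overlap; the class and method names are hypothetical, and it assumes the JDK in use ships the `windows-1252` charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the Windows-1252 / ISO-8859-1 overlap.
final class Latin1Ambiguity {
    // Looked up by name; assumed to be available in the running JDK.
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // True when the bytes decode to the same text under both charsets,
    // i.e. when confusing the two encodings would be harmless.
    static boolean decodesIdentically(byte[] bytes) {
        String asLatin1 = new String(bytes, StandardCharsets.ISO_8859_1);
        String asCp1252 = new String(bytes, WINDOWS_1252);
        return asLatin1.equals(asCp1252);
    }
}
```

For example, the bytes `63 61 66 E9` decode to "café" under both charsets, whereas the single byte `80` is the euro sign in Windows-1252 but a control character in ISO-8859-1, so only in the second case does the choice between the two matter.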