ram-sharma-6453 / email-mime-parser

A mime4j based simplified email mime parser for java
Apache License 2.0

How to decode text/html body contents with the right charset? #11

josdejong closed this issue 4 years ago

josdejong commented 4 years ago

First, thanks for sharing this library, it works like a charm :+1:

I sometimes receive emails whose parts declare a charset such as iso-8859-1:

...

--------------000908080602050203020305
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit

--------------000908080602050203020305
Content-Type: text/html; charset=iso-8859-1
Content-Transfer-Encoding: 8bit

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head>
...
</html>
--------------000908080602050203020305--

Using htmlEmailBody I get an Attachment with an input stream. To turn the stream into a string, I have to know which charset was used to encode the email. But how can I get this information from the email?

See also the following test of this library, where a hardcoded charset "GB18030" is used to decode. How can I determine that encoding from the email itself? https://github.com/ram-sharma-6453/email-mime-parser/blob/c304baf8948a30a124ba6e2c6ba8dfb27bf41c9f/src/test/java/tech/blueglacier/parser/ParserTest.java#L264-L270

josdejong commented 4 years ago

I figured it out:

Email email = ...
Attachment htmlBody = email.htmlEmailBody;

if (htmlBody != null) {
    // Decode the body stream using the charset parsed from the part's Content-Type header
    String html = IOUtils.toString(htmlBody.getIs(), htmlBody.getBd().getCharset());
}

Here getBd() returns a BodyDescriptor with the parsed content type, charset, etc.
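
For anyone who wants to guard against a missing or unrecognized charset name before decoding, here is a minimal sketch using only the JDK. The class name CharsetDecodeSketch and the helpers resolveCharset and decode are hypothetical and not part of this library. The US-ASCII fallback is an assumption based on the MIME default for text parts, and readAllBytes needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDecodeSketch {

    // Hypothetical helper: turn a declared charset name into a Charset,
    // returning the fallback when the name is missing, malformed, or unsupported.
    static Charset resolveCharset(String declared, Charset fallback) {
        if (declared == null || declared.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(declared.trim());
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException
            return fallback;
        }
    }

    // Hypothetical helper: read the whole stream and decode it with the resolved charset.
    // Assumption: US-ASCII is used when no usable charset is declared.
    static String decode(InputStream in, String declared) throws IOException {
        Charset charset = resolveCharset(declared, StandardCharsets.US_ASCII);
        return new String(in.readAllBytes(), charset);
    }

    public static void main(String[] args) throws IOException {
        // "caf" followed by 0xE9, which is the letter e with an acute accent in iso-8859-1
        byte[] body = {'c', 'a', 'f', (byte) 0xE9};
        System.out.println(decode(new ByteArrayInputStream(body), "iso-8859-1"));
    }
}
```

With the library, the declared name would presumably come from htmlBody.getBd().getCharset() and the stream from htmlBody.getIs(), as in the snippet above.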

ram-sharma-6453 commented 4 years ago

Hi,

I am happy that the library is proving to be a great help to you.

The solution you found is correct.

The test case you cited handles a special case that applies only to Chinese character sets. Sometimes the charset declared in the email is not sufficient to decode every character in the email text, probably because some character set implementations are buggy. In those cases the test decodes with a Chinese character set that covers all the characters likely to appear. 'GB18030' is one such character set.
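
The fallback idea can be sketched with the JDK alone: decode strictly with the declared charset first, and switch to a broader charset only if that fails. This is an illustration, not how the library or its test does it. The class FallbackDecodeSketch and its methods are hypothetical names, and the example assumes the JVM ships GB18030, which standard JDK distributions normally do.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class FallbackDecodeSketch {

    // Decode strictly: throw instead of silently substituting replacement characters.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Try the declared charset first; use the broader fallback only when strict decoding fails.
    static String decodeWithFallback(byte[] bytes, Charset declared, Charset fallback) {
        try {
            return decodeStrict(bytes, declared);
        } catch (CharacterCodingException e) {
            return new String(bytes, fallback);
        }
    }

    public static void main(String[] args) {
        // Assumption: GB18030 is available in this JVM.
        Charset gb18030 = Charset.forName("GB18030");
        byte[] body = "\u4f60\u597d".getBytes(gb18030);
        // The bytes are not valid US-ASCII, so the strict attempt fails and the fallback is used.
        System.out.println(decodeWithFallback(body, StandardCharsets.US_ASCII, gb18030));
    }
}
```

One caveat: strict decoding only catches byte sequences that are invalid in the declared charset. A single-byte charset such as iso-8859-1 accepts every byte value, so this check never triggers the fallback for it.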

josdejong commented 4 years ago

Ah, thanks for the clarification. Nice that it's possible to select your own charset if needed.