tanx / mailreader

RFC parser as an AMD module written with node API for the browser.
MIT License

Content-Transfer-Encoding:8bit #12

Open toberndo opened 9 years ago

toberndo commented 9 years ago

Does mailreader support Content-Transfer-Encoding:8bit?

I'm getting reports (https://github.com/mailvelope/mailvelope/issues/6#issuecomment-74582621) about wrong decoding of umlauts if transfer encoding 8bit was used.

Not sure if this is the right test setup, but I created a rawText:

Content-Type:text/plain; charset="UTF-8"
Content-Transfer-Encoding:8bit

äöü

and after mailreader.parse([{raw: rawText}], function(parsed) {

the result of textParts[0].content with mailreader v0.4.2 is:

���
andris9 commented 9 years ago

I think there's a gap in the docs about this: for 8-bit values you should pass either a Uint8Array/ArrayBuffer or a pseudo-binary string, not a Unicode string. For example, you can use the TextEncoder API to convert a Unicode string to a Uint8Array:

var buf = new TextEncoder('utf-8').encode('õäöü');
// [0xC3, 0xB5, 0xC3, 0xA4, 0xC3, 0xB6, 0xC3, 0xBC]
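To illustrate the difference, here is a rough sketch (not mailreader's actual code path). It assumes the parser treats each char code of a string input as one byte, which is how the thread describes it. Under that assumption, the char codes of a plain Unicode string are not valid UTF-8, which would explain the replacement characters in the report. The variable names are illustrative only.

```javascript
// Sketch only: compare real UTF-8 bytes with the char codes of a plain
// Unicode string read as if they were bytes.
const text = '\u00E4\u00F6\u00FC'; // 'äöü'

// What an 8bit UTF-8 body contains on the wire: two bytes per umlaut.
const utf8Bytes = Array.from(new TextEncoder().encode(text));
console.log(utf8Bytes); // [195, 164, 195, 182, 195, 188]

// Assumption: a string input is read one char code per byte.
// For 'äöü' that gives 0xE4, 0xF6, 0xFC, which is not valid UTF-8.
const charCodesAsBytes = Uint8Array.from(text, (ch) => ch.charCodeAt(0));

// A non-fatal UTF-8 decoder replaces each invalid byte with U+FFFD.
const misdecoded = new TextDecoder('utf-8').decode(charCodesAsBytes);
console.log(misdecoded === '\uFFFD\uFFFD\uFFFD'); // true
```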
felixhammerl commented 9 years ago

Which is something we might as well fix in here? Just put everything into a Uint8Array internally? It wouldn't make a difference for Thomas then...

andris9 commented 9 years ago

The pseudo-binary input comes from BrowserBox, so it probably doesn't make sense to convert all output from strings to typed arrays and then back again when parsing.

felixhammerl commented 9 years ago

fair enough

andris9 commented 9 years ago

Incoming data from TCPSocket to BrowserBox is an ArrayBuffer. BrowserBox converts this to a pseudo-binary string, does its work and passes it on as is. It can't convert to ASCII, because the data might include 8-bit bytes, and it can't convert to UTF-8, because the data might be in some other charset. MimeParser on the other end receives the pseudo-binary data, detects the correct charset and outputs valid Unicode strings. So all string data between the TCPSocket input and the MimeParser output (which also includes mailreader objects) is in pseudo-binary format by default, to minimize conversions from one type to another.
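The flow described above can be sketched roughly as follows. This is a minimal illustration, not the BrowserBox or MimeParser implementation: `bufferToPseudoBinary` and `decodePseudoBinary` are hypothetical helper names, and the charset is passed in by hand rather than detected from MIME headers.

```javascript
// Hypothetical helpers; these names are not part of BrowserBox or MimeParser.

// ArrayBuffer -> pseudo-binary string: one character (0x00-0xFF) per byte.
function bufferToPseudoBinary(buffer) {
  const bytes = new Uint8Array(buffer);
  let out = '';
  for (let i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

// Pseudo-binary string -> Unicode string, once the charset is known.
function decodePseudoBinary(str, charset) {
  const bytes = new Uint8Array(str.length);
  for (let i = 0; i < str.length; i++) {
    bytes[i] = str.charCodeAt(i) & 0xff; // keep only the low byte
  }
  return new TextDecoder(charset).decode(bytes);
}

// Simulated socket data: the UTF-8 bytes of 'äöü'.
const incoming = new Uint8Array([0xC3, 0xA4, 0xC3, 0xB6, 0xC3, 0xBC]).buffer;

const pseudo = bufferToPseudoBinary(incoming);
console.log(pseudo.length); // 6, one character per byte

const decodedText = decodePseudoBinary(pseudo, 'utf-8');
console.log(decodedText === '\u00E4\u00F6\u00FC'); // true
```

The string stays byte-for-byte faithful to the socket data until the charset is known, so only one real decoding step is needed at the end.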

andris9 commented 9 years ago

Just to be clear, pseudo-binary is what you get with this:

var str = unescape(encodeURIComponent('õäöü'));
// "ÃµÃ¤Ã¶Ã¼"

It looks like an 8-bit string, but it is actually a Unicode string that only uses the first 256 code points.
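A quick way to check that property is sketched below. The variable names are illustrative. Note that `escape` and `unescape` are deprecated legacy globals, although they are still available in browsers and Node.js.

```javascript
// Sketch: inspect the pseudo-binary form of a Unicode string.
const original = '\u00F5\u00E4\u00F6\u00FC'; // 'õäöü'

// Percent-encode as UTF-8, then turn each %XX back into a single character.
const pseudoBinary = unescape(encodeURIComponent(original));

// One character per UTF-8 byte: four 2-byte characters give length 8.
console.log(pseudoBinary.length); // 8

// Every character stays within the first 256 code points.
const codes = Array.from(pseudoBinary, (ch) => ch.charCodeAt(0));
console.log(codes.every((c) => c <= 0xff)); // true

// The conversion is reversible for UTF-8 data.
const restored = decodeURIComponent(escape(pseudoBinary));
console.log(restored === original); // true
```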

toberndo commented 9 years ago

Thanks for the quick response. I tried unescape(encodeURIComponent('õäöü')) with my test content, and yes, that leads to the correct result.

A little background on how we are currently using mailreader: the idea is to simply throw the output of https://github.com/openpgpjs/openpgpjs/blob/master/src/openpgp.js#L139 at mailreader.parse and get the MIME nodes as a result.

I did some more testing and the following works fine:

rawText = unescape(encodeURIComponent(rawText));
that.mailreader.parse([{raw: rawText}], function(parsed) {