quentez / talonjs

JavaScript port of the Talon email quote parsing library.
MIT License

Use content type "text/html" when preprocessing an HTML message #34

Closed khennes closed 4 years ago

khennes commented 4 years ago

This change ensures that we pass the correct content-type to preprocess when extracting a reply from HTML.

When preprocessing a plain text document, we search for the OnDateSomebodyWroteRegexp anywhere in the message body instead of matching it only at the beginning of a line. This makes false positives in the reply content more likely: any sentence of the form "On ..., ... wrote/sent ..." will match. When processing an HTML doc, we can afford to be a bit stricter and match that regexp only at the beginning of a line. (Incidentally, this is equivalent to what mailgun/talon does.)
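To illustrate the difference between the two matching modes, here is a minimal sketch. Both patterns are hypothetical, simplified stand-ins for OnDateSomebodyWroteRegexp (the real pattern in talonjs is more elaborate), and `findsSplitter` is a helper made up for this example.

```javascript
// Hypothetical, simplified stand-ins for OnDateSomebodyWroteRegexp.
// Loose: may match anywhere in the body.
const looseSplitter = /On .+?, .+? (?:wrote|sent):?/;
// Strict: same pattern, but only at the start of a line
// (the `m` flag lets ^ match after each newline).
const strictSplitter = /^[ \t]*On .+?, .+? (?:wrote|sent):?/m;

// Illustrative helper, not part of talonjs.
function findsSplitter(regexp, text) {
  return regexp.test(text);
}

// An ordinary sentence in the reply that merely looks like a splitter.
const reply = "Thanks! On Monday, Alice wrote the report, so we are good.";
// A real quote header sitting on its own line.
const quoted = "Thanks!\nOn Jan 1 2020, at 12:34 pm, user@example.com wrote:\n> Hi";

console.log(findsSplitter(looseSplitter, reply));   // false positive with the loose pattern
console.log(findsSplitter(strictSplitter, reply));  // rejected by the strict pattern
console.log(findsSplitter(strictSplitter, quoted)); // real header still found
```

The strict pattern trades recall for precision: it relies on the text conversion step putting each quote header on its own line.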

As a consequence of this change, however, a Nylas test started failing on the fixture email_15.html. This is because we previously expected to find a splitter in the middle of a line made up of two blockquote tags:

#!%!12!%!# #!%!13!%!##!%!16!%!# Some text in an inline quote#!%!14!%!##!%!15!%!# On Jan 1 2020, at 12:34 pm, user@example.com <user@example.com> wrote: #!%!17!%!##!%!222!%!# #!%!18!%!#.

Now that we only match the OnDateSomebodyWroteRegexp at the start of a line, that's no longer the case.

Instead, this PR adds <blockquote> to the list of tags that we automatically append a newline char to when converting an XML document to text. The same line is then split into two:

#!%!13!%!##!%!16!%!#
Some text in an inline quote#!%!14!%!##!%!15!%!#
On Jan 1 2020, at 12:34 pm, user@example.com <user@example.com> wrote: #!%!17!%!##!%!222!%!# #!%!18!%!#

This fixes the failing test.
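The effect of the new tag can be sketched on a toy tree. This is only an illustration under assumptions: the node shape (`{ tag, children }`), the `toText` function, and the contents of the tag sets are all hypothetical and do not reproduce the actual conversion code or its checkpoint markers.

```javascript
// Hypothetical tag lists: the second one adds blockquote, as this PR does.
const TAGS_BEFORE = new Set(["div", "p", "br"]);
const TAGS_AFTER = new Set(["div", "p", "br", "blockquote"]);

// Assumed node shape: a string (text node) or { tag, children }.
// Concatenates text depth-first and appends "\n" after listed tags.
function toText(node, newlineTags) {
  if (typeof node === "string") return node;
  const inner = node.children
    .map((child) => toText(child, newlineTags))
    .join("");
  return newlineTags.has(node.tag) ? inner + "\n" : inner;
}

// An inline quote immediately followed by a quote header.
const doc = {
  tag: "body",
  children: [
    { tag: "blockquote", children: ["Some text in an inline quote"] },
    "On Jan 1 2020, at 12:34 pm, user@example.com wrote:",
  ],
};

const before = toText(doc, TAGS_BEFORE); // header stays on the same line as the quote
const after = toText(doc, TAGS_AFTER);   // header now starts a new line
console.log(JSON.stringify(before));
console.log(JSON.stringify(after));
```

With the header on its own line, a start-of-line match can find it again.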

Alternatives considered

quentez commented 4 years ago

@khennes Does that mean we can remove the 2nd part of the preprocess function altogether? And to confirm, OnDateSomebodyWroteRegexp is still being applied as part of the splitter regexes?

khennes commented 4 years ago

> @khennes Does that mean we can remove the 2nd part of the preprocess function altogether? And to confirm, OnDateSomebodyWroteRegexp is still being applied as part of the splitter regexes?

Not quite, because preprocess is still called as part of extractFromPlain and we pass text/plain as the content type in that case. And yep, the regex is still applied as part of the splitter regexes.
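For readers unfamiliar with that code path, here is a rough sketch of a content-type-gated preprocessing step. It assumes the text/plain branch moves a mid-line splitter onto its own line, as upstream mailgun/talon appears to do. `preprocessSketch` and `splitterPattern` are hypothetical names, and the pattern is a simplified stand-in for OnDateSomebodyWroteRegexp.

```javascript
// Hypothetical simplified splitter; not the library's actual regexp.
const splitterPattern = /On .+?, .+? (?:wrote|sent):/g;

// Assumed behavior: only text/plain bodies get mid-line splitters
// pushed onto a new line; other content types pass through untouched.
function preprocessSketch(body, contentType = "text/plain") {
  if (contentType !== "text/plain") return body;
  return body.replace(splitterPattern, (match, offset) => {
    const atLineStart = offset === 0 || body[offset - 1] === "\n";
    return atLineStart ? match : "\n" + match;
  });
}

const header = "On Jan 1 2020, at 12:34 pm, user@example.com wrote:";
const body = "Sure, see below. " + header + "\n> Hi";

const plainResult = preprocessSketch(body, "text/plain"); // header moved to a new line
const htmlResult = preprocessSketch(body, "text/html");   // returned unchanged
console.log(JSON.stringify(plainResult));
console.log(JSON.stringify(htmlResult));
```

Under this reading, the plain-text path still needs the wrapping step because nothing else guarantees that a quote header starts a line there.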

quentez commented 4 years ago

Ah I see. The GitHub search was failing me... Yeah that sounds good.