purebred-mua / purebred-email

A fast email parsing library implemented in Haskell
https://hackage.haskell.org/package/purebred-email
GNU Affero General Public License v3.0
23 stars 4 forks source link

Can't parse multipart/related followed by multipart/alternative #70

Open tomjaguarpaw opened 2 years ago

tomjaguarpaw commented 2 years ago

I am experiencing a problem with version 0.5. This problem was not present in 0.4.3.

I have several emails in my archive that have the following header:

Content-Type: multipart/related;
        boundary="<boundary>";
        type="multipart/alternative"

I the following parse error:

InvalidParameterValue "type" "multipart/alternative"

When parsing multipart/related we getRequiredParam "type" which succeeds and then parsed parseContentType which fails (https://github.com/purebred-mua/purebred-email/blob/v0.5/src/Data/MIME.hs#L928-L934). parseContentType explicitly looks for a boundary(https://github.com/purebred-mua/purebred-email/blob/v0.5/src/Data/MIME.hs#L593-L603) but there is none for multipart/alternative. The boundary here belongs to multipart/related.

I'm a bit confused at this point. I'm not familiar with the spec, but perhaps we should be looking for just something of form type="<...>/<...>" rather than something with another boundary? (If so then I'm surprised that I'm the first person to run into this.)

tomjaguarpaw commented 2 years ago

This seems to be where the issue arose: https://github.com/purebred-mua/purebred-email/commit/47a9e818706dca87358d588d2fde77c7a32aa142#diff-4aeeafbfcb59f18a0a43c805646b96c69c0f72820b15ed2a24508621d2e0152dR684-R704

frasertweedale commented 2 years ago

It is a bug. Incidentally I was going to change this behaviour for the next release, but didn't quite get to it yet. In fact, this is the last thing I wanted to do before the next release (which also fixes the other issues you reported). But I got busy with other tasks.

Although I was aware the current behaviour deviates from the RFCs, I did not notice that it could fail in this way. I only thought it would accept some invalid data - not reject valid data. You have provided a nice test case :)

I shall endeavour to fix this and release v0.next as soon as possible. But I am so busy right now, it might not be until after Christmas. I apologise for the bug!

tomjaguarpaw commented 2 years ago

Thanks for the update and no need to apologise. I am enjoying using this library! If there's anything that I can help with in the meantime then please let me know. If it would be helpful, and you can give me a hint of what to fix, then I can try to fix it myself an submit a PR.

frasertweedale commented 2 years ago

Many thanks for the offer @tomjaguarpaw. And if I wasn't already halfway through the fix, I would take up your generous offer! But if you continue using this library, perhaps there is something you will contribute - besides your bug reports of course, which are themselves valuable contributions.

For the sake of transparency (and reducing bus factor, I suppose), I will sketch the problem and the approach I'm taking to fix it. RIght now we parse the "type" parameter as a full content-type, possibly with parameters, instead of just type and subtype as specified by RFC 2387. The fix is to introduce a new type for "type and subtype" - no parameters - and parse just that. The Related constructor must be updated to use the new type - an API breaking change. Then perhaps also refactor the existing ContentType data type and parser to reuse the new type and parser.

frasertweedale commented 2 years ago

I have several emails in my archive that have the following header:

Content-Type: multipart/related;
        boundary="<boundary>";
        type="multipart/alternative"

Side note: how nice to find an RFC-compliant multipart/related production :D

singpolyma commented 1 year ago

I just had to downgrade due to a related issue choking on this header: Content-Type: multipart/related;boundary=1_62BC61B6_29AF2480;Start="<smil>";Type="application/smil"

frasertweedale commented 1 year ago

@singpolyma thanks for the report. I will investigate in the next couple weeks.

frasertweedale commented 1 year ago

Never mind, I see the problem. <smil> is not a valid value for the start parameter. Per https://www.rfc-editor.org/rfc/rfc2387:

cid             := msg-id     ; c.f. [[822](https://www.rfc-editor.org/rfc/rfc2387#ref-822)]

and https://www.rfc-editor.org/rfc/rfc5322#section-3.6.4:

   msg-id          =   [CFWS] "<" id-left "@" id-right ">" [CFWS]

The value must contain an @, but does not.

@singpolyma Can you please provide more context? What program is producing this message, and how widespread are these non-conformant messages?

singpolyma commented 1 year ago

@singpolyma Can you please provide more context? What program is producing this message, and how widespread are these non-conformant messages?

I would say over half the MMS our system sees have a header with this param included and no @

frasertweedale commented 1 year ago

@singpolyma thanks. I don't know anything about the implementation of MMS. Is there a normative specification somewhere?

Which vendor(s) produce the non-conforming messages? Have you raised ticket(s) with them?

Would it be possible to detect and "massage" the non-conforming messages into conformance before processing them with purebred-email?

singpolyma commented 1 year ago

I could pre-massage but I'd basically need a header parser to do that with I guess.

I think the MMS messages are forwarded mostly-unmodified from various carriers, so convincing Verizon to change how they format their SMTP headers is probably not something I can do.

Using an older version of purebred is working for me for now.

frasertweedale commented 1 year ago

@singpolyma do you think you could send me some (real world, complete) sample messages that exhibit this non-conformance? And I'll have a think about how to approach this problem.

singpolyma commented 1 year ago

mms1.txt This is an example message, with phone numbers scrubbed.

I've found another spec violation that comes from specifically Google Voice they send like this example: mms2.txt

This Content-Type header does not have a "type" parameter at all.

It seems to me that both of these examples are syntactically valid, but violate various semantic requirements (required argument, syntax of particular argument in some contexts) and it would be nice if in these cases the message still parsed at all into parts, etc (which seems like it was more true on older versions of the library).