mnako / letters

Letters, or how to parse emails in Go
MIT License
46 stars 9 forks source link

Export `letters.ParseHeaders()` so that a message can be parsed into `letters.Headers` only #85

Closed mnako closed 3 weeks ago

mnako commented 3 months ago

In https://github.com/mnako/letters/pull/72 @rorycl makes a good point for parsing email headers only: the email bodies might be malformed or the user might be interested in headers only.

I do not think, however, that a string arg letters.ParseEmail(r, "HeadersOnly") is the best way to implement this feature. I believe that this should be at least a config struct for future maintainability, but for the time being there is a simpler solution: export the ParseHeaders() identifier so that users can use it directly.

This PR renames letters.parseHeaders() to letters.ParseHeaders(), adds a test, and updates README.md with an example.

One point that might merit discussion, is whether the now public ParseHeaders() should not accept the same parameters as ParseEmail so that the user is not responsible for reading the message using mail.ReadMessage() and passing msg.Header to ParseHeaders().

mnako commented 3 months ago

Thank you for the review, @ValleyCrisps 🙇

@rorycl raised good points in the discussion here https://github.com/mnako/letters/discussions/64, where they mentioned that

[skipping body parsing] dropped my processing time to 1/10th of the time in comparison to full email parsing for about 1700 messages

and in one more place that I cannot find now (not sure if it was a PR comment that was deleted or if I lost track of one thread), where @rorycl said that

sometimes someone might want to check the header of the email with ParseHeaders and then dive into it with ParseEmail, in which case it might make sense to store the header info in the Email struct and avoid parsing the headers again.

I think it is a very good point.

This might be a minor point, but I still have some objections against a headersOnly option, regardles of whether we implement that as a string arg, a config struct, or—I really like that idea as referenced by @rorycl—a self-referential function options.

Generally, I think that options of the form xOnly, skipY, overrideZ point out design flaws and highlight places that are not flexible enough. I think that we should offer the option to parse headers only, but in a more coherent way, rather than framing it as an exception to the default flow.

My current thinking is that we should refactor ParseHeaders and ParseBody as methods on the Email struct.

The default behaviour stays the same: the simplest use case is to parse an email completely by creating an Email instance from a reader:

email, err := letters.ParseEmail(r)
if err != nil {
    log.Fatal(err)
}

but if somebody has a more advanced use case, we can enable them to create an empty Email and parse headers and bodies separately:

var e Email

err := e.ParseHeaders(r)
if err != nil {
    // handle headers parser error
}

err = e.ParseBody(r)
if err != nil {
    // handle body parsing error
}

That way,

This is a pretty nitty discussion, but we have the luxury of taking the time to build a nice library, so let's try to get this nice and elegant.

Thoughts?

ValleyCrisps commented 2 months ago

Thank you for the discussion, I wouldn't have thought of this approach on my own, and I'm learning a lot just by reviewing and reading your comments.

I like the consistency in the method parameters and the logic encapsulation in the proposed solution.
Since we are exploring alternatives, I'd like to hear your opinion on the following points:

  1. Do we need to change the Email struct in order to store partially parsed emails in it? I'm not familiar with optional types in Go, but it feels like we may be weakening our data structure.
  2. Without looking at implementation, it isn't intuitive which fields are being parsed by ParseBody. For example, are attachments included? ContentTypeHeader is another grey area as it must be parsed even for email bodies. To be clear, my point here is about usability more than duplication. Good documentation helps a lot, but self-explanatory function names are even better.
  3. (related to 2.) Have you considered further splitting ParseBody into text (Text, EnrichedText, HTML) and attachments (InlineFiles, AttachedFiles)? In a use-case where performance is a concern, I think skipping attachments would lead to the biggest improvement.
mnako commented 3 weeks ago

Thank you for the review, @ValleyCrisps . Let me merge this first and make a release and we can explore parsing different types of bodies and parsing attachments in a separate PR.