stalwartlabs / mail-parser

Fast and robust e-mail parsing library for Rust
https://docs.rs/mail-parser/
Apache License 2.0

Retrieving headers in message order #19

Closed by digitalresistor 2 years ago

digitalresistor commented 2 years ago

I am attempting to use this library to create a new message, but I want to keep all the existing headers in the same order they were present in the original message so that things like Received headers don't move around compared to the rest of the headers.

Is this a use case that was considered for this library?

I am not sure of the best way to build it, since right now it means iterating over all of the RawHeaders and sorting them by their offsets.
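A minimal sketch of that workaround, for illustration only. The `RawHeader` struct and the `in_message_order` function are hypothetical stand-ins, not mail-parser's actual types; the sketch only assumes that each header record exposes the byte offset at which it starts in the original message.

```rust
/// Hypothetical stand-in for a header record that carries byte offsets
/// into the original message. This is NOT mail-parser's actual type.
#[derive(Debug, Clone, PartialEq)]
struct RawHeader {
    name: String,
    start: usize, // byte offset where the header value begins
    end: usize,   // byte offset one past the end of the header value
}

/// Returns the headers sorted by their starting offset, which is the
/// order in which they appeared in the original message.
fn in_message_order(mut headers: Vec<RawHeader>) -> Vec<RawHeader> {
    // Headers cannot overlap, so the start offset alone is a sufficient key.
    headers.sort_by_key(|h| h.start);
    headers
}
```

Sorting costs O(n log n) per message and has to be repeated by every caller, which is why an accessor that already yields headers in message order would be preferable.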

mdecimus commented 2 years ago

Do you need ordered access to all headers relative to each other, or just to one header type? If you are interested in the order of just one header type, such as Received, you can call Message::get_rfc_header(RfcHeader::Received) and that will return the headers in the same order as in the original message.

digitalresistor commented 2 years ago

I would like them to be relative to each other for all headers, not just some headers or a particular one. This is to allow me to re-mail the email after body modification (encryption/filtering/URL mangling) in an MTA filter. I am hoping to use this library along with mail-builder.

For example:

Return-Path: <noreply@github.com>
Delivered-To: xistence@0x58.com
Received: from butler.0x58.com
    by butler.0x58.com with LMTP
    id JQkvKynllWKn7AAABsIH3A
    (envelope-from <noreply@github.com>)
    for <xistence@0x58.com>; Tue, 31 May 2022 09:51:37 +0000
X-Butler-Approved: Yes
X-Spam-Flag: NO
X-Spam-Score: -7.201
X-Spam-Level: 
X-Spam-Status: No, score=-7.201 tagged_above=-99 required=5
    tests=[BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.647, DKIM_SIGNED=0.1,
    DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1,
    HTML_IMAGE_ONLY_20=1.546, HTML_MESSAGE=0.001, MAILING_LIST_MULTI=-1,
    RCVD_IN_DNSWL_HI=-5, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001,
    SPF_PASS=-0.001, T_KAM_HTML_FONT_INVALID=0.01,
    T_SCC_BODY_TEXT_LINE=-0.01] autolearn=ham autolearn_force=no
Authentication-Results: butler.0x58.com (amavisd-new);
    dkim=pass (1024-bit key) header.d=github.com
Received: from smtp.github.com (out-18.smtp.github.com [192.30.252.201])
    (using TLSv1.2 with cipher ECDHE-ECDSA-AES256-GCM-SHA384 (256/256 bits))
    (No client certificate requested)
    by butler.0x58.com (Postfix) with ESMTPS id 894C02D34A
    for <xistence@0x58.com>; Tue, 31 May 2022 09:51:34 +0000 (UTC)
Received: from github-lowworker-b089360.ash1-iad.github.net (github-lowworker-b089360.ash1-iad.github.net [10.56.122.71])
    by smtp.github.com (Postfix) with ESMTP id E134B34032C
    for <xistence@0x58.com>; Tue, 31 May 2022 02:51:33 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=github.com;
    s=pf2014; t=1653990693;
    bh=FHOLSQ48bVFmRcMJtzB2eCto6t0xhBla7l2C7HruWH4=;
    h=Date:From:Reply-To:To:Cc:In-Reply-To:References:Subject:List-ID:
     List-Archive:List-Post:List-Unsubscribe:From;
    b=eKF5Jz6OUj8QOGc36cO3QT9l5PV2UT+EtzYIosiOF3liJEC4+lQ0iENoYbG37ZweY
     OYmC6CFPgR2pUqh1x9dNg8zf1Bi4mJ7CB+HHiM9QHfJ+8LfeOqK5y+va9ipLwHLpV4
     eRWByBjmWs24xUnzavqjGhL82YTwhbgO5uP8qTxg=
Date: Tue, 31 May 2022 02:51:33 -0700
From: "Mauro D." <notifications@github.com>

The ordering matters here because it allows me to figure out approximately where a particular header was added. For example, the DKIM-Signature was added before it was received by github-lowworker-b089360.ash1-iad.github.net. I can see that my server (butler.0x58.com) added Authentication-Results, and that the Delivered-To/Return-Path were added by the LMTP agent. Note that this is an example of a message I could easily pull headers from.

I'm working on a project where I am re-mailing the original message after verification/validation (and additional headers), and I want the recipient to have full access to all headers in the original order so they can trace where particular headers were added.
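To make the use case concrete, here is a rough sketch that uses only the Rust standard library and is not mail-parser's implementation. It splits a raw header block into (name, value) pairs in message order, unfolding continuation lines, and then lists the headers that sit above the first Received header. The function names `ordered_headers` and `added_after_last_hop` are made up for this example, and the sketch assumes valid UTF-8 input and simple "Name: value" lines.

```rust
/// Splits a raw header block into (name, value) pairs, preserving the
/// order in which the headers appear. Folded continuation lines (lines
/// starting with a space or tab) are appended to the previous header.
fn ordered_headers(raw: &str) -> Vec<(String, String)> {
    let mut headers: Vec<(String, String)> = Vec::new();
    for line in raw.lines() {
        if line.is_empty() {
            // A blank line ends the header block; the body follows.
            break;
        }
        if line.starts_with(' ') || line.starts_with('\t') {
            // Continuation of the previous header: unfold onto one line.
            if let Some((_, value)) = headers.last_mut() {
                value.push(' ');
                value.push_str(line.trim());
            }
        } else if let Some((name, value)) = line.split_once(':') {
            headers.push((name.trim().to_string(), value.trim().to_string()));
        }
        // Lines that are neither continuations nor "Name: value" are skipped.
    }
    headers
}

/// Returns the names of all headers above the first `Received` header,
/// i.e. those that were most likely prepended after the final hop.
fn added_after_last_hop(headers: &[(String, String)]) -> Vec<&str> {
    headers
        .iter()
        .take_while(|(name, _)| !name.eq_ignore_ascii_case("Received"))
        .map(|(name, _)| name.as_str())
        .collect()
}
```

Applied to the example message above, `added_after_last_hop` would yield Return-Path and Delivered-To. This kind of inference only works if the relative order of all headers survives parsing.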

mdecimus commented 2 years ago

I'll see how I can add that. Would an iterator returning the ordered headers work for your use case? Each item would be a tuple of (HeaderName, HeaderValue).

digitalresistor commented 2 years ago

That would be fantastic!

mdecimus commented 2 years ago

@bertjwregeer I just added a get_raw_headers function that returns the headers in the same order as they appear in the message. It returns the offsets but if you need to access the parsed value of any RFC header you could call the get_header function.

Could you give it a quick test? If everything works well I'll publish a new version to crates.io.

Edit: I'll also add a function to get a string from a HeaderOffset.
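For illustration, turning an offset pair into a string could look roughly like the sketch below. The `HeaderOffset` struct here is a simplified, hypothetical stand-in with only `start` and `end` fields, and `header_text` is an invented helper name; neither is taken from the library's source.

```rust
use std::borrow::Cow;

/// Simplified, hypothetical offset pair. The library's real
/// HeaderOffset type may carry different or additional fields.
struct HeaderOffset {
    start: usize, // byte offset where the header value begins
    end: usize,   // byte offset one past the end of the header value
}

/// Returns the text between `start` and `end` in the raw message, or
/// `None` if the offsets fall outside the message or are inverted.
fn header_text<'a>(raw: &'a [u8], offset: &HeaderOffset) -> Option<Cow<'a, str>> {
    // `get` returns None instead of panicking on a bad range.
    // `from_utf8_lossy` borrows when the bytes are valid UTF-8 and only
    // allocates when invalid sequences have to be replaced.
    raw.get(offset.start..offset.end)
        .map(|bytes| String::from_utf8_lossy(bytes))
}
```

Returning a `Cow` avoids copying in the common case where the header bytes are already valid UTF-8.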

digitalresistor commented 2 years ago

Reading the code, it looks to be what I want. I don't mind getting the offsets, since I can just turn them into strings myself, but a utility function for it would be great! Are there any tests to make sure this doesn't regress in the future? I didn't see any in the commits that were added.

I haven't had time to pull in this repo and test locally, hoping to have time later tonight for that.

Now I need to figure out how to get them into mail-builder in the same order. I'll open an issue there once I get to that point. Just being able to parse the email is already a great help and will get me a step closer to where I want to be.

mdecimus commented 2 years ago

@bertjwregeer Version 0.4.7 has been published to crates.io with support for accessing raw headers in order. Instead of returning the offsets, the get_raw_headers function now returns strings. Also, there was no test case for raw headers but I've just added one.