uchicago-library / attachment-converter

Attachment Converter: tool for batch converting attachments in an email mailbox
GNU General Public License v2.0

Stabilize converted line endings #90

Closed bufordrat closed 1 month ago

bufordrat commented 7 months ago

Currently, Attachment Converter will sometimes mix newline formats. For example, a Windows (CRLF) MBOX will switch to Unix (LF) newlines for the duration of every email in it that has converted attachments. This is undesirable behavior; any data output by Attachment Converter should have a consistent newline format.

Why we even have LFs in our MBOXes

Officially, the email specification only allows emails with Windows (CRLF) newlines to be sent out. However, there are no official rules about which newline format a mail user agent should use when it stores sent and received emails locally. The standard behavior has generally been for a mail user agent to store email locally using the newline format of the operating system it is running on.

That works well enough if you're just using email, but in the archival setting, we want to convert as little of the original data as possible: only the bare minimum necessary for us to look at what we have.

Therefore, it seems that what we need is code to help Attachment Converter:

Another issue is that Mr. Mime fails with a parse error unless you give it an email that uses Windows (CRLF) newlines. We therefore have independent reason to add another bullet point to the above list, for pre- or post-processing:

The LF-to-CRLF pre-processing can be used both to deal with Mr. Mime's requirements and to deal with this newline format consistency issue.
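As a rough illustration of what the LF-to-CRLF pre-processing step could look like, here is a minimal OCaml sketch. The function name `lf_to_crlf` is hypothetical and not taken from the Attachment Converter code base; it assumes the whole email is available as a string and uses only the standard library. It leaves existing CRLF pairs alone, so it is safe to run on input that is already CRLF or that is mixed.

```ocaml
(* Hypothetical helper for illustration only; not part of Attachment Converter. *)

(* Convert every bare LF to CRLF, leaving existing CRLF pairs untouched,
   so that applying the function twice gives the same result as once. *)
let lf_to_crlf (s : string) : string =
  let buf = Buffer.create (String.length s + 16) in
  String.iteri
    (fun i c ->
      (* A bare LF is an LF that is not already preceded by a CR. *)
      if c = '\n' && (i = 0 || s.[i - 1] <> '\r') then
        Buffer.add_string buf "\r\n"
      else
        Buffer.add_char buf c)
    s;
  Buffer.contents buf
```

Because the function is idempotent, the same helper could in principle serve both purposes mentioned above: preparing input for Mr. Mime and normalizing output that ended up with mixed newlines.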

The Strategy

Here is our proposed strategy for dealing with the newline problem:

How to detect the newline format of the original

The logic we discussed for detecting the newline format of the input is:

The goal

For this issue, please implement the above logic so that whenever Attachment Converter converts an email, the output matches the newline format of the input. This should work both in single-email and in MBOX mode.
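To make the goal concrete, here is a hedged OCaml sketch of a wrapper that remembers the input's newline format and restores it on the output. Everything in it is an assumption for illustration: the names `detect_newline`, `crlf_to_lf`, and `with_stable_newlines` are hypothetical, `convert` stands in for whatever function actually performs the attachment conversion, and the detection heuristic (inspect only the first line break, default to LF when there is none) is a deliberately simple placeholder rather than the detection logic agreed on in this issue.

```ocaml
(* Illustrative sketch only; all names here are hypothetical. *)

type newline = Lf | Crlf

(* Placeholder heuristic: classify the input by its first line break.
   Input with no line break at all is treated as LF. *)
let detect_newline (s : string) : newline =
  match String.index_opt s '\n' with
  | Some i when i > 0 && s.[i - 1] = '\r' -> Crlf
  | _ -> Lf

(* Convert bare LFs to CRLF, leaving existing CRLF pairs untouched. *)
let lf_to_crlf (s : string) : string =
  let buf = Buffer.create (String.length s + 16) in
  String.iteri
    (fun i c ->
      if c = '\n' && (i = 0 || s.[i - 1] <> '\r') then
        Buffer.add_string buf "\r\n"
      else Buffer.add_char buf c)
    s;
  Buffer.contents buf

(* Convert CRLF pairs to LF; a CR not followed by LF is kept as-is. *)
let crlf_to_lf (s : string) : string =
  let n = String.length s in
  let buf = Buffer.create n in
  String.iteri
    (fun i c ->
      if c = '\r' && i + 1 < n && s.[i + 1] = '\n' then ()
      else Buffer.add_char buf c)
    s;
  Buffer.contents buf

(* Run [convert] on CRLF-normalized input (as Mr. Mime requires), then
   rewrite the result in the newline format the input originally had. *)
let with_stable_newlines (convert : string -> string) (input : string) : string =
  let original = detect_newline input in
  let output = convert (lf_to_crlf input) in
  match original with
  | Crlf -> lf_to_crlf output
  | Lf -> crlf_to_lf output
```

In MBOX mode, the same wrapper could be applied either once to the whole mailbox or once per email; which of the two is appropriate depends on the strategy described above.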