Currently, Attachment Converter will sometimes mix newline formats, resulting in (for example) a Windows (CRLF) MBOX that temporarily turns Unix (LF) every time one of the emails in it has converted attachments. This is undesirable behavior; any data output by Attachment Converter should have a consistent newline format.
Why we even have LFs in our MBOXes
Officially, the email specification only allows emails with Windows (CRLF) newlines to be sent out. However, there are no official rules about which newline format a mail user agent should use when it stores sent and received emails locally. The standard behavior has generally been for each mail user agent to store email locally using the newline format of the operating system it is running on.
That works well enough if you're just using email, but in the archival setting, we want to alter as little of the original data as possible: only the bare minimum necessary for us to look at what we have.
Therefore, it seems that what we need is code to help Attachment Converter:
detect the newline format of the input
have any new emails it adds in adopt the same newline format
Another issue is that Mr. Mime fails with a parse error unless you give it an email that uses Windows newlines. We therefore have an independent reason to add another bullet point to the above list, for pre- or post-processing:
convert Unix line breaks to Windows line breaks
The LF-to-CRLF pre-processing can be used both to deal with Mr. Mime's requirements and to deal with this newline format consistency issue.
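As a rough illustration of the conversion step, here is a minimal sketch using only the OCaml standard library. The names `lf_to_crlf` and `crlf_to_lf` are hypothetical and not part of Attachment Converter, and the sketch assumes the whole email is available as an in-memory string.

```ocaml
(* Hypothetical helpers for illustration; not Attachment Converter's API. *)

(* Turn every bare LF into CRLF, leaving existing CRLF pairs untouched. *)
let lf_to_crlf (s : string) : string =
  let buf = Buffer.create (String.length s + 16) in
  String.iteri
    (fun i c ->
      if c = '\n' && (i = 0 || s.[i - 1] <> '\r') then
        Buffer.add_string buf "\r\n"
      else Buffer.add_char buf c)
    s;
  Buffer.contents buf

(* Turn every CRLF pair into LF, leaving lone CRs untouched. *)
let crlf_to_lf (s : string) : string =
  let n = String.length s in
  let buf = Buffer.create n in
  String.iteri
    (fun i c ->
      if c = '\r' && i + 1 < n && s.[i + 1] = '\n' then ()
      else Buffer.add_char buf c)
    s;
  Buffer.contents buf
```

Skipping LFs that already follow a CR keeps the conversion safe to run on input that is already partly CRLF. The flip side is that a round trip through both functions normalizes such mixed input rather than reproducing it byte for byte.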
The Strategy
Here is our proposed strategy for dealing with the newline problem:
case 1: OCamlnet is the backend
1) detect the newline format of the input
2) serialize the new emails with converted attachments using whatever newline option matches the detected format
case 2: Mr. Mime is the backend
1) detect the newline format of the input
2) if the input is Unix, convert it to Windows
3) produce our output, as normal
4) if the input email was Unix, convert the entire email back to Unix newlines
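The Mr. Mime case might be wired together roughly as follows. This is only a sketch: `convert` is a stand-in for whatever function performs the Mr. Mime-backed conversion, and `is_unix`, `lf_to_crlf`, `crlf_to_lf`, and `convert_via_mrmime` are hypothetical names, not existing Attachment Converter functions.

```ocaml
(* Hypothetical sketch of case 2; [convert] stands in for the real
   Mr. Mime-backed conversion, which must be given CRLF input. *)

(* True when the first LF in [s] is not preceded by a CR (Unix input). *)
let is_unix (s : string) : bool =
  match String.index_opt s '\n' with
  | Some i -> i = 0 || s.[i - 1] <> '\r'
  | None -> false

(* Insert a CR before every bare LF. *)
let lf_to_crlf (s : string) : string =
  let buf = Buffer.create (String.length s + 16) in
  String.iteri
    (fun i c ->
      if c = '\n' && (i = 0 || s.[i - 1] <> '\r') then Buffer.add_char buf '\r';
      Buffer.add_char buf c)
    s;
  Buffer.contents buf

(* Drop the CR from every CRLF pair. *)
let crlf_to_lf (s : string) : string =
  let n = String.length s in
  let buf = Buffer.create n in
  String.iteri
    (fun i c ->
      if not (c = '\r' && i + 1 < n && s.[i + 1] = '\n') then
        Buffer.add_char buf c)
    s;
  Buffer.contents buf

(* Steps 1-4: detect, convert to CRLF if needed, run the conversion,
   then restore LF newlines if the input was Unix. *)
let convert_via_mrmime ~(convert : string -> string) (input : string) : string =
  if is_unix input then input |> lf_to_crlf |> convert |> crlf_to_lf
  else convert input
```

Passing the conversion in as a function keeps the newline handling independent of the backend, so the wrapper can be tested without Mr. Mime.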
How to detect the newline format of the original
The logic we discussed for detecting the newline format of the input is:
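do a Unix readline, which will stop reading at the first LF
if the line it read ends in a CR, then the input used Windows (CRLF) newlines
if the line ends in anything else, the input used Unix (LF) newlines
if it hits EOF before finding any LF, the input uses old-fashioned classic Mac OS (CR-only) line breaks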
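The detection logic above could be sketched like this, assuming the input is already in memory as a string. The `newline` type and `detect_newline` are hypothetical names. Treating input that contains no line break at all as Unix is an assumption of this sketch, not something the discussion settled.

```ocaml
(* Hypothetical names for illustration; not Attachment Converter's API. *)
type newline = Lf | Crlf | Cr

(* Look at the text up to the first LF, as a Unix readline would. *)
let detect_newline (s : string) : newline =
  match String.index_opt s '\n' with
  (* The first line ends in CR LF: Windows newlines. *)
  | Some i when i > 0 && s.[i - 1] = '\r' -> Crlf
  (* The first line ends in a bare LF: Unix newlines. *)
  | Some _ -> Lf
  (* No LF anywhere: CR-only if a CR is present; otherwise there is no
     line break at all, and we default to Lf (an assumption). *)
  | None -> if String.contains s '\r' then Cr else Lf
```

Only the first line is inspected, so an input that mixes formats is classified by whichever newline appears first.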
The goal
For this issue, please implement the above logic so that whenever Attachment Converter converts an email, the output matches the newline format of the input. This should work both in single-email and in MBOX mode.
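One way to check the goal in tests is to assert that the output never mixes newline formats. The following is a minimal sketch under that assumption. `newline_counts` and `has_consistent_newlines` are hypothetical helpers, not part of Attachment Converter.

```ocaml
(* Hypothetical test helpers; not Attachment Converter's API. *)

(* Count CRLF pairs, bare LFs, and bare CRs in [s], in that order. *)
let newline_counts (s : string) : int * int * int =
  let n = String.length s in
  let crlf = ref 0 and lf = ref 0 and cr = ref 0 in
  String.iteri
    (fun i c ->
      match c with
      | '\n' -> if i > 0 && s.[i - 1] = '\r' then incr crlf else incr lf
      | '\r' -> if not (i + 1 < n && s.[i + 1] = '\n') then incr cr
      | _ -> ())
    s;
  (!crlf, !lf, !cr)

(* True when at most one newline format occurs in [s]. *)
let has_consistent_newlines (s : string) : bool =
  let crlf, lf, cr = newline_counts s in
  List.length (List.filter (fun k -> k > 0) [ crlf; lf; cr ]) <= 1
```

Running such a check over the output in both single-email and MBOX mode would flag the case described at the top of this issue, where a CRLF MBOX switches to LF around a converted email.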