Parsing subject MIME Header (RFC2047)

jasonm23 commented 1 year ago

Given a subject line from a Hacker News Digest email:

"=?UTF-8?Q?How_to_be_a_-10x_Engineer_=E2=80=94_Creator_of_Catan,?=\r\n =?UTF-8?Q?_Klaus_Teuber,_has_died_=E2=80=94_and_Finland_becomes_the_31st?=\r\n =?UTF-8?Q?_member_of_NATO?="

I noticed that your RFC2047 word decoder will throw a .notEncoded error.

The sub-clause input.countInstances(of: "?") != 4 appears to be the culprit.
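To make the failing check concrete, here is a small Go sketch (Go because the thread already compares against Go's mime package) that counts the "?" delimiters in the folded subject and in a single encoded-word. The helper name questionMarks is made up for illustration and is not part of either library.

```go
package main

import (
	"fmt"
	"strings"
)

// subject is the folded header from the report: three encoded-words
// separated by CRLF followed by a space.
const subject = "=?UTF-8?Q?How_to_be_a_-10x_Engineer_=E2=80=94_Creator_of_Catan,?=\r\n" +
	" =?UTF-8?Q?_Klaus_Teuber,_has_died_=E2=80=94_and_Finland_becomes_the_31st?=\r\n" +
	" =?UTF-8?Q?_member_of_NATO?="

// questionMarks counts the "?" characters in s, mirroring the
// countInstances(of: "?") check quoted above (hypothetical helper).
func questionMarks(s string) int {
	return strings.Count(s, "?")
}

func main() {
	fmt.Println(questionMarks(subject))                       // whole folded header
	fmt.Println(questionMarks("=?UTF-8?Q?_member_of_NATO?=")) // one encoded-word
}
```

Each encoded-word carries exactly four "?" characters, so a header made of three of them has twelve and can never pass a "!= 4" test as a whole.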

Maybe detecting/splitting on newlines and then parsing them individually is the solution.

Would like to get your thoughts on this before submitting a PR.

igorrendulic commented 1 year ago

According to the RFC, newlines should be encoded as "=0D=0A" within the =? and ?= delimiters.

Other decoders do seem to handle this input, though, by ignoring the newlines (or probably all invalid characters), which makes the algorithm more robust.

For example, in Go:

test := "=?UTF-8?Q?How_to_be_a_-10x_Engineer_=E2=80=94_Creator_of_Catan,?=\r\n =?UTF-8?Q?_Klaus_Teuber,_has_died_=E2=80=94_and_Finland_becomes_the_31st?=\r\n =?UTF-8?Q?_member_of_NATO?="
dec := new(mime.WordDecoder)

sub, err := dec.DecodeHeader(test)
if err != nil {
  t.Fatal(err)
}
fmt.Printf("%s\n", sub)

outputs: How to be a -10x Engineer — Creator of Catan, Klaus Teuber, has died — and Finland becomes the 31st member of NATO

I'd say maybe the parser should be changed to ignore invalid characters.

jasonm23 commented 1 year ago

For a quick workaround, I just collected the individual lines, and decoded them.

But you're right, the \r \n chars would be =0D =0A if the subject line was properly encoded.
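As an illustration of the "=0D=0A" form, this Go sketch decodes one encoded-word whose text contains an encoded line break, using the standard library's mime.WordDecoder. The wrapper decodeWord and the sample text are assumptions for the example only.

```go
package main

import (
	"fmt"
	"mime"
)

// decodeWord decodes a single RFC 2047 encoded-word with Go's
// standard decoder (thin illustrative wrapper).
func decodeWord(word string) (string, error) {
	return new(mime.WordDecoder).Decode(word)
}

func main() {
	// A line break that belongs to the subject text itself is written
	// as =0D=0A inside the encoded-word.
	out, err := decodeWord("=?UTF-8?Q?first_line=0D=0Asecond_line?=")
	fmt.Printf("%q %v\n", out, err)
}
```

This only covers a line break that is part of the text; it says nothing yet about raw CRLF between separate encoded-words.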

Which characters would we ignore?

Edit: Start with newlines and then iterate if new issues arise?

igorrendulic commented 1 year ago

Hm. Looking at the Google example, I might have misinterpreted how the library works. Looking into their code, it seems they're not ignoring the characters but keeping them as they are.

Sorry about that.

Based on that, I'd just copy what Google has done in their library, in the method func qDecode(s string) ([]byte, error): https://github.com/golang/go/blob/a025277505d49fac9a5100ae9305020b063657c2/src/mime/encodedword.go#L372

So it seems they allow \n, \r and \t, plus every character from ' ' (space) up to and including '~', to be copied unchanged from the encoded word to the decoded word.

...
case (c <= '~' && c >= ' ') || c == '\n' || c == '\r' || c == '\t':
  dec[n] = c
...
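For context, a complete Q-decoding loop built around that case could look roughly like the sketch below. It is modeled on the behaviour described here, not copied from the Go source; qDecodeSketch and errInvalidWord are illustrative names.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

var errInvalidWord = errors.New("invalid encoded-word text")

// qDecodeSketch decodes Q-encoded text: "_" becomes a space, "=XX" becomes
// the byte with hex value XX, and printable ASCII plus \n, \r, \t is copied
// unchanged. Any other byte is rejected.
func qDecodeSketch(s string) ([]byte, error) {
	dec := make([]byte, 0, len(s))
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch {
		case c == '_':
			dec = append(dec, ' ')
		case c == '=':
			// Two hex digits must follow the "=".
			if i+2 >= len(s) {
				return nil, errInvalidWord
			}
			b, err := strconv.ParseUint(s[i+1:i+3], 16, 8)
			if err != nil {
				return nil, errInvalidWord
			}
			dec = append(dec, byte(b))
			i += 2
		case (c >= ' ' && c <= '~') || c == '\n' || c == '\r' || c == '\t':
			dec = append(dec, c)
		default:
			return nil, errInvalidWord
		}
	}
	return dec, nil
}

func main() {
	out, err := qDecodeSketch("has_died_=E2=80=94_and")
	fmt.Printf("%q %v\n", out, err)
}
```

Note that raw \r and \n are passed through rather than dropped, which matches the observation that the characters are kept as they are.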

jasonm23 commented 1 year ago

So something a bit like

if (c <= "~" && c >= " ") || c == "\n" || c == "\r" || c == "\t" {
    dec[n] = c
}

I need to go look at the code in your project

igorrendulic commented 1 year ago

I haven't checked the code in a while... I'd check the Go code from the above link and try to do the same. I didn't implement everything at the time, just what I needed, so I wouldn't be surprised if something is missing.

igorrendulic commented 1 year ago

After some review I've come to the conclusion that the method is working properly.

Explanation: The Go example above uses the method dec.DecodeHeader(test), which is a different function from decodeRFC2047Word in this library.

This library's implementation is a translation of the Go library located here: https://github.com/golang/go/blob/a025277505d49fac9a5100ae9305020b063657c2/src/mime/encodedword.go#L372

The equivalent class in this library: https://github.com/igorrendulic/MimeEmailParser/blob/master/Sources/MimeEmailParser/WordDecoder.swift

decodeRFC2047Word is equivalent to the Go method decode. The method DecodeHeader is missing in this library.

If someone wants to "translate" the DecodeHeader method from Go to this library, I'd welcome such a PR.

The method in question: https://github.com/golang/go/blob/a025277505d49fac9a5100ae9305020b063657c2/src/mime/encodedword.go#L239
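The difference between the two entry points can be seen by feeding the same folded subject to both methods of Go's mime.WordDecoder. This is a small sketch; decodeBoth is a hypothetical helper written only for the comparison.

```go
package main

import (
	"fmt"
	"mime"
)

// subject is the folded header from the original report.
const subject = "=?UTF-8?Q?How_to_be_a_-10x_Engineer_=E2=80=94_Creator_of_Catan,?=\r\n" +
	" =?UTF-8?Q?_Klaus_Teuber,_has_died_=E2=80=94_and_Finland_becomes_the_31st?=\r\n" +
	" =?UTF-8?Q?_member_of_NATO?="

// decodeBoth runs the same input through Decode (single encoded-word)
// and DecodeHeader (whole header value) and returns both results.
func decodeBoth(s string) (word string, wordErr error, header string, headerErr error) {
	dec := new(mime.WordDecoder)
	word, wordErr = dec.Decode(s)
	header, headerErr = dec.DecodeHeader(s)
	return
}

func main() {
	_, wordErr, header, headerErr := decodeBoth(subject)
	fmt.Println("Decode error:", wordErr)
	fmt.Println("DecodeHeader error:", headerErr)
	fmt.Println(header)
}
```

Decode rejects the input because it is not a single encoded-word, while DecodeHeader walks through all three words and drops the whitespace between them.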

jasonm23 commented 1 year ago

import Foundation

// Helpers assumed to exist elsewhere; they are not defined in this thread:
//   decode(encoding: Character, text: String) -> Data?
//   convert(buf: inout String, charset: String, content: Data) -> Error?
//   hasNonWhitespace(_ s: Substring) -> Bool
func decodeHeader(header: String) -> (String, Error?) {
    // If there is no encoded-word, return before creating a buffer.
    guard let first = header.range(of: "=?") else {
        return (header, nil)
    }

    var buf = String(header[..<first.lowerBound])
    var header = header[first.lowerBound...]
    var betweenWords = false

    while let open = header.range(of: "=?") {
        let start = open.lowerBound
        var cur = open.upperBound

        guard let q = header[cur...].range(of: "?") else { break }
        let charset = String(header[cur..<q.lowerBound])
        cur = q.upperBound

        // At least "Q??=" must remain after the charset.
        if header[cur...].count < "Q??=".count { break }

        let encoding = header[cur]
        cur = header.index(after: cur)

        if header[cur] != "?" { break }
        cur = header.index(after: cur)

        guard let close = header[cur...].range(of: "?=") else { break }
        let text = String(header[cur..<close.lowerBound])
        let end = close.upperBound

        guard let content = decode(encoding: encoding, text: text) else {
            // Not a valid encoded-word: copy everything up to and including
            // this "=?" verbatim and keep scanning after it.
            betweenWords = false
            buf += String(header[..<open.upperBound])
            header = header[open.upperBound...]
            continue
        }

        // Write characters before the encoded-word. White-space and newline
        // characters separating two encoded-words must be deleted.
        let prefix = header[..<start]
        if !prefix.isEmpty && (!betweenWords || hasNonWhitespace(prefix)) {
            buf += String(prefix)
        }

        if let err = convert(buf: &buf, charset: charset, content: content) {
            return ("", err)
        }

        header = header[end...]
        betweenWords = true
    }

    // Append whatever follows the last encoded-word.
    buf += String(header)

    return (buf, nil)
}

Recording this here as a starting point for the conversion. It will probably need testing and fixing; this is just a codepilot attempt at conversion.

I hope to have some time to look at it more closely when I work on the \(emailArchiveViewingProduct).

igorrendulic commented 1 year ago

Just for reference, here is the ChatGPT 4 conversion. I'm also tight on time lately.

In your Go code, you are decoding a header string which seems to be in the format of encoded words used in email headers (RFC 2047). It looks like this code uses some kind of buffer pool, a CharsetReader, and a few other features that aren't directly analogous in Swift, such as the defer keyword and the ability to return multiple values (i.e., a string and an error).

To convert this to Swift, you would need to create your own equivalent of these features or find Swift libraries that provide similar functionality. However, it's important to note that Swift and Go are quite different languages and certain concepts may not have a direct one-to-one mapping between them.

Here's a rough attempt at translating the function to Swift. Note that decode and convert functions are not implemented, as their Go implementation is not given. You'll need to replace those with appropriate Swift code.

import Foundation

class WordDecoder {

    func decodeHeader(header: String) -> (String, Error?) {
        var header = header
        // Local buffer, so repeated calls do not accumulate earlier output.
        var buffer = ""

        guard let i = header.range(of: "=?") else {
            return (header, nil)
        }
        buffer += String(header[..<i.lowerBound])
        header = String(header[i.lowerBound...])

        var betweenWords = false
        while true {
            guard let start = header.range(of: "=?") else {
                break
            }
            // start.upperBound is the index just past "=?".
            var cur = start.upperBound

            guard let i = header[cur...].range(of: "?") else {
                break
            }
            let charset = String(header[cur..<i.lowerBound])
            cur = i.upperBound

            // At least "Q??=" must remain after the charset.
            guard header[cur...].count >= "Q??=".count else {
                break
            }
            let encoding = header[cur]
            cur = header.index(after: cur)

            guard header[cur] == "?" else {
                break
            }
            cur = header.index(after: cur)

            guard let j = header[cur...].range(of: "?=") else {
                break
            }
            let text = String(header[cur..<j.lowerBound])
            let end = j.upperBound

            // Placeholders: you need to implement the decode function and
            // set these two values from decode(encoding, text).
            let content: Data? = nil
            let decodeError: Error? = nil

            if decodeError != nil {
                betweenWords = false
                buffer += String(header[..<start.upperBound])
                header = String(header[start.upperBound...])
                continue
            }

            // Keep the text before the encoded-word, unless it is only
            // whitespace separating two encoded-words.
            let prefix = header[..<start.lowerBound]
            if !prefix.isEmpty && (!betweenWords || prefix.contains(where: { !$0.isWhitespace })) {
                buffer += String(prefix)
            }

            // Placeholder: you need to implement the convert function, which
            // appends `content`, interpreted in `charset`, to `buffer`.
            let convertError: Error? = nil

            if convertError != nil {
                return ("", convertError)
            }

            header = String(header[end...])
            betweenWords = true
        }

        if !header.isEmpty {
            buffer += header
        }

        return (buffer, nil)
    }
}

jasonm23 commented 1 year ago

In your Go code, you are decoding a header string which seems to be in the format of encoded words used in email headers (RFC 2047). It looks like this code uses some kind of buffer pool, a CharsetReader, and a few other features that aren't directly analogous in Swift, such as the defer keyword and the ability to return multiple values (i.e., a string and an error).

To convert this to Swift, you would need to create your own equivalent of these features or find Swift libraries that provide similar functionality. However, it's important to note that Swift and Go are quite different languages and certain concepts may not have a direct one-to-one mapping between them.

Here's a rough attempt at translating the function to Swift. Note that decode and convert functions are not implemented, as their Go implementation is not given. You'll need to replace those with appropriate Swift code.

It's getting very predictable reading GPT responses...

However, it's important to note that

Where it'll try and balance.

Both implementations seem to build a buf based on the line pattern =?${encoded}?= but then don't implement the decoding. However, it's important to note that, at least, yours points this out in a comment, whereas mine just has a decode(..) call.

To be continued.

jasonm23 commented 1 year ago

I did a bit of digging in the RFC2047 / encoded words.

An 'encoded-word' may not be more than 75 characters long, including 'charset', 'encoding', 'encoded-text', and delimiters. If it is desirable to encode more text than will fit in an 'encoded-word' of 75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may be used.

While there is no limit to the length of a multiple-line header field, each line of a header field that contains one or more 'encoded-word's is limited to 76 characters.

The length restrictions are included both to ease interoperability through internetwork mail gateways, and to impose a limit on the amount of lookahead a header parser must employ (while looking for a final ?= delimiter) before it can decide whether a token is an "encoded-word" or something else.

So the CRLF (plus the following space) is just a delimiter between multiple encoded-words, and we should be splitting on it and decoding each word.

The Go implementation hides this detail for convenience and returns the decoded string (hopefully without the CRLF).

It really depends if you'd like the method to deal with multipart MIME encoded words, or leave that to the lib user to split first.
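The split-then-decode approach could be sketched like this in Go, again using the standard mime.WordDecoder for the per-word step. The names decodeFolded, tooLong and maxEncodedWordLen are illustrative, and the sketch assumes the header value consists only of encoded-words, with no plain text mixed in.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// maxEncodedWordLen is the RFC 2047 length limit for one encoded-word.
const maxEncodedWordLen = 75

// tooLong reports whether word exceeds the RFC 2047 limit.
func tooLong(word string) bool {
	return len(word) > maxEncodedWordLen
}

// decodeFolded splits a folded header value on whitespace (which covers
// the CRLF SPACE separator), decodes every piece as a single encoded-word,
// and concatenates the results. Assumption: no plain text between words.
func decodeFolded(value string) (string, error) {
	dec := new(mime.WordDecoder)
	var out strings.Builder
	for _, word := range strings.Fields(value) {
		decoded, err := dec.Decode(word)
		if err != nil {
			return "", err
		}
		out.WriteString(decoded)
	}
	return out.String(), nil
}

func main() {
	out, err := decodeFolded("=?UTF-8?Q?Hello,?=\r\n =?UTF-8?Q?_world?=")
	fmt.Printf("%q %v\n", out, err)
	fmt.Println(tooLong("=?UTF-8?Q?_member_of_NATO?="))
}
```

Splitting on whitespace is safe here because Q-encoded text represents spaces as "_", so an encoded-word never contains a literal space. A header that mixes plain text with encoded-words needs the fuller DecodeHeader-style scan instead.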

igorrendulic commented 1 year ago

I'll close the issue and let the user deal with it for now. If anyone else brings it up it might be useful to re-open.

jasonm23 commented 1 year ago

Agreed, I think within the scope of the function, that's the correct choice.

This would be the job of a function called decodeMIMEHeader.

However, since the project's concern is with Email Address parsing, as opposed to Email Header/Message parsing, it's not really in scope to add that feature.

Unless I misinterpreted the goals/anti-goals.
