Closed jasonm23 closed 1 year ago
According to the RFC, newlines should be encoded as "=0D=0A" within the =? and ?= delimiters.
But it does seem that other decoders handle this by ignoring the newlines (or probably all invalid characters), which makes the algorithm more robust.
For example, in Go:
test := "=?UTF-8?Q?How_to_be_a_-10x_Engineer_=E2=80=94_Creator_of_Catan,?=\r\n =?UTF-8?Q?_Klaus_Teuber,_has_died_=E2=80=94_and_Finland_becomes_the_31st?=\r\n =?UTF-8?Q?_member_of_NATO?="
dec := new(mime.WordDecoder)
sub, err := dec.DecodeHeader(test)
if err != nil {
t.Fatal(err)
}
fmt.Printf("%s\n", sub)
outputs: How to be a -10x Engineer — Creator of Catan, Klaus Teuber, has died — and Finland becomes the 31st member of NATO
I'd say maybe the parser should be changed to ignore invalid characters.
For a quick workaround, I just collected the individual lines, and decoded them.
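The workaround described above (collect the individual lines, decode each one) could look roughly like this in Go. This is only a sketch: decodeFolded is a hypothetical helper name, and it assumes every folded line holds exactly one encoded-word and no plain text.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// decodeFolded is a hypothetical helper, not part of any library.
// It splits a folded header value on CRLF, trims the whitespace around
// each piece, and decodes every piece as a single RFC 2047 encoded-word.
// Assumption: each line contains exactly one encoded-word and nothing else.
func decodeFolded(header string) (string, error) {
	dec := new(mime.WordDecoder)
	var out strings.Builder
	for _, line := range strings.Split(header, "\r\n") {
		word := strings.TrimSpace(line)
		if word == "" {
			continue // skip empty pieces left over from the split
		}
		decoded, err := dec.Decode(word)
		if err != nil {
			return "", err
		}
		out.WriteString(decoded)
	}
	return out.String(), nil
}

func main() {
	subject := "=?UTF-8?Q?Hello,?=\r\n =?UTF-8?Q?_world?="
	decoded, err := decodeFolded(subject)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(decoded)
}
```

A header that mixes plain text with encoded-words would make Decode fail here, so this is a stopgap rather than a general header decoder.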
But you're right, the \r\n chars would be =0D=0A if the subject line was properly encoded.
Which characters would we ignore?
Edit: Start with newlines and then iterate if new issues arise?
Hm. Looking at the Google example, I might have misinterpreted how the library works. Looking into their code, it seems they're not ignoring the characters but keeping them as they are.
Sorry about that.
Based on that, I'd just copy what Google has done in their library, in the method func qDecode(s string) ([]byte, error):
https://github.com/golang/go/blob/a025277505d49fac9a5100ae9305020b063657c2/src/mime/encodedword.go#L372
So it seems they're allowing the \n, \r, and \t characters, plus everything less than or equal to ~ and greater than or equal to ' ' (space), to be simply copied from the encoded word to the decoded word.
...
case (c <= '~' && c >= ' ') || c == '\n' || c == '\r' || c == '\t':
dec[n] = c
...
So something a bit like
if (c <= "~" && c >= " ") || c == "\n" || c == "\r" || c == "\t" {
dec[n] = c
}
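To put the quoted case in context, here is a simplified Go sketch of a Q-decoding loop built around the same character rule. It is not the Go standard library source: qDecodeSketch and fromHex are hypothetical names, and the error messages are made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// fromHex converts one hexadecimal digit to its numeric value.
func fromHex(b byte) (byte, bool) {
	switch {
	case b >= '0' && b <= '9':
		return b - '0', true
	case b >= 'A' && b <= 'F':
		return b - 'A' + 10, true
	case b >= 'a' && b <= 'f':
		return b - 'a' + 10, true
	}
	return 0, false
}

// qDecodeSketch is a hypothetical, simplified Q-decoder:
//   '_'            becomes a space
//   "=XX"          becomes the byte with hex value XX
//   ' '..'~', \n, \r, \t are copied through unchanged
//   anything else is rejected
func qDecodeSketch(s string) ([]byte, error) {
	dec := make([]byte, 0, len(s))
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch {
		case c == '_':
			dec = append(dec, ' ')
		case c == '=':
			if i+2 >= len(s) {
				return nil, errors.New("truncated =XX escape")
			}
			hi, okHi := fromHex(s[i+1])
			lo, okLo := fromHex(s[i+2])
			if !okHi || !okLo {
				return nil, errors.New("invalid hex digit in =XX escape")
			}
			dec = append(dec, hi<<4|lo)
			i += 2 // skip the two hex digits just consumed
		case (c <= '~' && c >= ' ') || c == '\n' || c == '\r' || c == '\t':
			dec = append(dec, c)
		default:
			return nil, errors.New("invalid character in Q-encoded text")
		}
	}
	return dec, nil
}

func main() {
	out, err := qDecodeSketch("has_died_=E2=80=94_and")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s\n", out)
}
```

Note that under this rule a raw CR or LF inside the encoded text is kept in the output rather than dropped, which matches the observation that the characters are copied, not ignored.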
I need to go look at the code in your project
I haven't checked the code in a while... I'd check the Go code from the above link and try to do the same. I didn't implement everything at the time, just what I needed, so I wouldn't be surprised if something is missing.
After some review I've come to the conclusion that the method is working properly.
Explanation:
The above example in Go uses the method dec.DecodeHeader(test), which is a different function from decodeRFC2047Word in this library.
This library's implementation is a translation of the Go library located here: https://github.com/golang/go/blob/a025277505d49fac9a5100ae9305020b063657c2/src/mime/encodedword.go#L372
The equivalent class in this library: https://github.com/igorrendulic/MimeEmailParser/blob/master/Sources/MimeEmailParser/WordDecoder.swift
decodeRFC2047Word is equivalent to the Go method decode.
The method DecodeHeader is missing in this library.
If someone wants to "translate" the method DecodeHeader from Go to this library, I'd welcome such a PR.
The method in question: https://github.com/golang/go/blob/a025277505d49fac9a5100ae9305020b063657c2/src/mime/encodedword.go#L239
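The difference between the two Go entry points can be seen by feeding the same folded input to both. A minimal sketch, where decodeBoth is a hypothetical helper and the input is a shortened stand-in for the subject line in this issue:

```go
package main

import (
	"fmt"
	"mime"
)

// decodeBoth is a hypothetical helper that runs the same input through
// mime.WordDecoder.Decode (single encoded-word) and
// mime.WordDecoder.DecodeHeader (whole header value).
func decodeBoth(input string) (word string, wordErr error, header string, headerErr error) {
	dec := new(mime.WordDecoder)
	word, wordErr = dec.Decode(input)
	header, headerErr = dec.DecodeHeader(input)
	return
}

func main() {
	folded := "=?UTF-8?Q?Hello,?=\r\n =?UTF-8?Q?_world?="
	_, wordErr, header, headerErr := decodeBoth(folded)
	fmt.Println("Decode error:", wordErr)
	fmt.Printf("DecodeHeader: %q, error: %v\n", header, headerErr)
}
```

Decode rejects the folded input because it is not a single encoded-word, while DecodeHeader walks the value and decodes each word in turn. That is the same split as between decodeRFC2047Word and the missing DecodeHeader equivalent here.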
func decodeHeader(header: String) -> (String, Error?) {
var i = header.startIndex
// If there is no encoded-word, returns before creating a buffer.
if let range = header.range(of: "=?") {
i = range.lowerBound
} else {
return (header, nil)
}
var buf = ""
buf += String(header[..<i])
var header = String(header[i...])
var betweenWords = false
while true {
if let start = header.range(of: "=?")?.lowerBound {
let cur = start + "=?".count
if let i = header[cur...].range(of: "?")?.lowerBound {
let charset = String(header[cur..<cur+i])
var cur = cur + i + "?".count
if header[cur..<header.endIndex].count < "Q??=".count {
break
}
let encoding = header[cur]
cur += 1
if header[cur] != "?" {
break
}
cur += 1
if let j = header[cur...].range(of: "?=")?.lowerBound {
let text = String(header[cur..<cur+j])
let end = cur + j + "?=".count
guard let content = decode(encoding: encoding, text: text) else {
betweenWords = false
buf += String(header[start..<start+"=?".count])
header = String(header[start+"=?".count...])
continue
}
// Write characters before the encoded-word. White-space and newline
// characters separating two encoded-words must be deleted.
if start > header.startIndex && (!betweenWords || hasNonWhitespace(header[header.startIndex..<start])) {
buf += String(header[header.startIndex..<start])
}
if let err = convert(buf: &buf, charset: charset, content: content) {
return ("", err)
}
header = String(header[end...])
betweenWords = true
} else {
break
}
} else {
break
}
} else {
break
}
}
if header.count > 0 {
buf += header
}
return (buf, nil)
}
Recording this here as a starting point for the conversion. It will probably need testing and fixing; this is just a codepilot attempt at conversion.
I hope to have some time to look at it more closely when I work on the \(emailArchiveViewingProduct).
Just for reference, here is the ChatGPT 4 conversion. I'm also tight on time lately.
In your Go code, you are decoding a header string which seems to be in the format of encoded words used in email headers (RFC 2047). It looks like this code uses some kind of buffer pool, a CharsetReader, and a few other features that aren't directly analogous in Swift, such as the defer keyword and the ability to return multiple values (i.e., a string and an error).
To convert this to Swift, you would need to create your own equivalent of these features or find Swift libraries that provide similar functionality. However, it's important to note that Swift and Go are quite different languages and certain concepts may not have a direct one-to-one mapping between them.
Here's a rough attempt at translating the function to Swift. Note that decode and convert functions are not implemented, as their Go implementation is not given. You'll need to replace those with appropriate Swift code.
import Foundation
class WordDecoder {
var buffer: String = ""
func decodeHeader(header: String) -> (String, Error?) {
var header = header
guard let i = header.range(of: "=?") else {
return (header, nil)
}
buffer += String(header[..<i.lowerBound])
header = String(header[i.lowerBound...])
var betweenWords = false
while true {
guard let start = header.range(of: "=?") else {
break
}
var cur = header.index(start.lowerBound, offsetBy: 2)
guard let i = header[cur...].range(of: "?") else {
break
}
let charset = String(header[cur..<i.lowerBound])
cur = header.index(i.upperBound, offsetBy: 1)
guard header.count >= cur.utf16Offset(in: header) + 3 else {
break
}
let encoding = header[cur]
cur = header.index(cur, offsetBy: 1)
guard header[cur] == "?" else {
break
}
cur = header.index(cur, offsetBy: 1)
guard let j = header[cur...].range(of: "?=") else {
break
}
let text = String(header[cur..<j.lowerBound])
let end = header.index(j.upperBound, offsetBy: 2)
let content: String? // You need to implement decode function
let err: Error? // You need to handle error from decode function
if err != nil {
betweenWords = false
buffer += String(header[..<header.index(start.lowerBound, offsetBy: 2)])
header = String(header[header.index(start.lowerBound, offsetBy: 2)...])
continue
}
if start.lowerBound.utf16Offset(in: header) > 0 {
buffer += String(header[..<start.lowerBound])
}
let err: Error? // You need to implement convert function
if err != nil {
return ("", err)
}
header = String(header[end...])
betweenWords = true
}
if !header.isEmpty {
buffer += header
}
return (buffer, nil)
}
}
It's getting very predictable reading GPT responses... "However, it's important to note that" is where it'll try and balance.
Both implementations seem to create a buf based on the line pattern of =?${encoded}?= but then don't implement decoding. However, it's important to note that, at least, yours points this out in a comment, whereas mine just has a decode(..) call.
To be continued.
I did a bit of digging in the RFC2047 / encoded words.
An 'encoded-word' may not be more than 75 characters long, including 'charset', 'encoding', 'encoded-text', and delimiters. If it is desirable to encode more text than will fit in an 'encoded-word' of 75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may be used.
While there is no limit to the length of a multiple-line header field, each line of a header field that contains one or more 'encoded-word's is limited to 76 characters.
The length restrictions are included both to ease interoperability through internetwork mail gateways, and to impose a limit on the amount of lookahead a header parser must employ (while looking for a final ?= delimiter) before it can decide whether a token is an "encoded-word" or something else.
So the CRLF is just a delimiter between multiple encoded-words. So we should be splitting them and decoding each word.
The Golang implementation is hiding this detail for convenience and returning the decoded string. (Hopefully without the crlf)
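A quick way to check the "without the crlf" part is to run two kinds of input through DecodeHeader: adjacent encoded-words separated by CRLF SPACE, and an encoded-word surrounded by plain text. This is a small sketch; decodeHeaderValue is a hypothetical wrapper and the inputs are made up for illustration.

```go
package main

import (
	"fmt"
	"mime"
)

// decodeHeaderValue is a hypothetical convenience wrapper around
// mime.WordDecoder.DecodeHeader.
func decodeHeaderValue(value string) (string, error) {
	return new(mime.WordDecoder).DecodeHeader(value)
}

func main() {
	inputs := []string{
		// Two encoded-words separated only by CRLF SPACE.
		"=?UTF-8?Q?Hello,?=\r\n =?UTF-8?Q?_world?=",
		// One encoded-word with plain text on both sides.
		"Re: =?UTF-8?Q?caf=C3=A9?= menu",
	}
	for _, in := range inputs {
		out, err := decodeHeaderValue(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q -> %q\n", in, out)
	}
}
```

The whitespace between two encoded-words should be dropped, while whitespace between an encoded-word and ordinary text should survive. That is the behaviour RFC 2047 asks of a display-side decoder.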
It really depends if you'd like the method to deal with multipart MIME encoded words, or leave that to the lib user to split first.
I'll close the issue and let the user deal with it for now. If anyone else brings it up it might be useful to re-open.
Agreed, I think within the scope of the function, that's the correct choice.
This would be the job of a function called decodeMIMEHeader.
However, since the project's concern is with Email Address parsing, as opposed to Email Header/Message parsing, it's not really in scope to add that feature.
Unless I misinterpreted the goals/anti-goals.
Given a subject line from a Hacker News Digest email:
I noticed that your RFC2047 word decoder will throw a .notEncoded error. The sub-clause input.countInstances(of: "?") != 4 appears to be the culprit. Maybe detecting/splitting on newlines and then parsing them individually is the solution.
Would like to get your thoughts on this before submitting a PR.
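To illustrate why a "exactly four question marks" check trips on a folded subject, here is a rough Go analogue of that sub-clause. The function name looksLikeSingleEncodedWord is hypothetical and this is not the library's Swift code, only the same counting idea.

```go
package main

import (
	"fmt"
	"strings"
)

// looksLikeSingleEncodedWord is a hypothetical analogue of the check
// described above: a single encoded-word has the shape
// =?charset?encoding?text?= and therefore exactly four '?' characters.
func looksLikeSingleEncodedWord(input string) bool {
	return strings.HasPrefix(input, "=?") &&
		strings.HasSuffix(input, "?=") &&
		strings.Count(input, "?") == 4
}

func main() {
	single := "=?UTF-8?Q?_member_of_NATO?="
	folded := "=?UTF-8?Q?Hello,?=\r\n =?UTF-8?Q?_world?="

	fmt.Println(strings.Count(single, "?"), looksLikeSingleEncodedWord(single))
	fmt.Println(strings.Count(folded, "?"), looksLikeSingleEncodedWord(folded))
}
```

Each additional encoded-word adds four more question marks, so a subject folded across three lines has twelve and fails the check even though every individual word is well formed.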