meh / rust-mailbox

MBOX reader.

Fails to parse some Subject headers in Google mbox #3

Open djc opened 7 years ago

djc commented 7 years ago

I get this error a number of times:

error: Err(Error { repr: Custom(Custom { kind: InvalidInput, error: StringError("invalid header") }) })

Here are some example headers from spam in the archive:

b'X-GOOMOJI-Subject: Djc Ochtman: Cash_Loan_Up_To_1000\xc2\xa3, No_Bank_Account'
b'Subject: Djc Ochtman: Cash_Loan_Up_To_1000\xc2\xa3, No_Bank_Account'
b'Subject: How to manipulate her \xe2\x80\x9chorniness\xe2\x80\x9d neuron. (The 5 habits of highly horny women).'

(This led me to wonder whether you'd checked out the https://github.com/niax/rust-email crate for parsing messages. It would be nice if all the effort spent on getting these ugly details right were shared, but maybe that crate doesn't fit your needs in some way.)

meh commented 7 years ago

The problem there is that, by spec, mail headers can only contain ASCII, and these headers don't. I could probably add some way to ignore malformed headers, or try to sanitize them somehow, but we'll see.
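For illustration, here is a minimal sketch of what the "sanitize" option could look like. It is not this crate's code: `LenientHeader` and `parse_header_lenient` are hypothetical names, and the sketch assumes the line ending has already been stripped. The field name is still required to be printable ASCII, while the value is decoded with `String::from_utf8_lossy`, which borrows the input when it is valid UTF-8 (as the example headers above are) and only allocates when it has to replace invalid bytes.

```rust
use std::borrow::Cow;

/// Hypothetical header type: the name borrows from the input line,
/// the value borrows too unless invalid bytes had to be replaced.
#[derive(Debug, PartialEq)]
pub struct LenientHeader<'a> {
    pub name: &'a str,
    pub value: Cow<'a, str>,
}

/// Parse one raw header line (line ending already removed), tolerating
/// non-ASCII bytes in the value. Returns `None` if there is no colon or
/// the field name is empty or not printable ASCII.
pub fn parse_header_lenient(line: &[u8]) -> Option<LenientHeader<'_>> {
    let colon = line.iter().position(|&b| b == b':')?;
    let (raw_name, rest) = line.split_at(colon);

    // Field names stay strict: printable ASCII only (33..=126).
    if raw_name.is_empty() || !raw_name.iter().all(|b| (33..=126).contains(b)) {
        return None;
    }
    let name = std::str::from_utf8(raw_name).ok()?;

    // Skip the colon, then any leading spaces or tabs in the value.
    let raw_value = &rest[1..];
    let start = raw_value
        .iter()
        .position(|b| !matches!(b, b' ' | b'\t'))
        .unwrap_or(raw_value.len());

    // Valid UTF-8 is borrowed as-is; invalid sequences become U+FFFD.
    let value = String::from_utf8_lossy(&raw_value[start..]);
    Some(LenientHeader { name, value })
}
```

Calling `parse_header_lenient` on the second example header above should yield the name `Subject` and a borrowed value containing `£`, so the common case stays allocation-free. Whether lossy decoding is acceptable, or whether such headers should be skipped instead, is a design choice this sketch does not settle.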

As for not using the rust-email crate: the issue is that it uses owned strings everywhere, and I need really, really fast parsing. In this crate, the same String that's read from a line in the mbox file is shared all the way up to the Header struct, so there are no useless copies or allocations.
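To make the "no useless copies" point concrete, here is a rough sketch of a header view that only borrows from the line it was parsed from. This is an assumption-laden illustration, not the crate's actual mechanism: the comment above says the `String` itself is shared upward, whereas this sketch uses plain borrowed slices, and `HeaderRef` and `split_header` are hypothetical names.

```rust
/// Hypothetical header view: both fields are slices into the original line,
/// so building one performs no allocation and no copying.
#[derive(Debug, PartialEq)]
pub struct HeaderRef<'a> {
    pub name: &'a str,
    pub value: &'a str,
}

/// Split `Name: value` at the first colon without copying anything.
/// Returns `None` if there is no colon or the name is empty.
pub fn split_header(line: &str) -> Option<HeaderRef<'_>> {
    let (name, value) = line.split_once(':')?;
    if name.is_empty() {
        return None;
    }
    // `trim` only narrows the slice; it does not allocate.
    Some(HeaderRef { name, value: value.trim() })
}
```

With this shape, `split_header(&line)` returns a `HeaderRef` whose `name` points at the same memory as the start of `line`. The trade-off is that the header cannot outlive the line buffer, which is presumably why sharing the owned `String` is attractive when headers need to travel further up.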