meh / rust-mailbox

MBOX reader.

Does not parse Google Takeout email archives #2

Open djc opened 6 years ago

djc commented 6 years ago

Google Takeout email archives use the mbox format, but their From lines don't comply with the preferred scheme laid out in RFC 4155, Appendix A. How interoperable do you intend this crate to be? In any case, here's the first line of the file I got last week:

From 1545668983435175434@xxx Fri Sep 16 22:26:51 +0000 2016
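
For illustration, the deviation can be pinned down in code. RFC 4155, Appendix A describes the separator as "From ", the sender address, a space, and a UTC timestamp in the traditional ctime layout without a timezone indicator; the line above carries an extra +0000 before the year. The Rust sketch below is not taken from rust-mailbox: Separator, parse_separator and the helper functions are hypothetical names, and the checks only look at field shapes (they do not verify that the weekday or month are real, or that the address is a valid addr-spec).

```rust
/// Shape of a recognized mbox separator line (hypothetical type, for illustration).
#[derive(Debug, PartialEq)]
enum Separator<'a> {
    /// `From <addr> <wday> <mon> <day> <hh:mm:ss> <year>`: ctime layout, no zone.
    Plain { addr: &'a str },
    /// Same layout with a numeric zone such as `+0000` before the year.
    WithZone { addr: &'a str, zone: &'a str },
}

/// True if `s` is exactly `len` ASCII digits.
fn all_digits(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| b.is_ascii_digit())
}

/// True if `s` has the shape `hh:mm:ss` (digit ranges are not checked).
fn is_time(s: &str) -> bool {
    let b = s.as_bytes();
    b.len() == 8
        && b[2] == b':'
        && b[5] == b':'
        && [0, 1, 3, 4, 6, 7].iter().all(|&i| b[i].is_ascii_digit())
}

/// True if `s` has the shape `+hhmm` or `-hhmm`.
fn is_zone(s: &str) -> bool {
    (s.starts_with('+') || s.starts_with('-')) && all_digits(&s[1..], 4)
}

/// Classify a candidate separator line, or return `None` if it matches neither shape.
fn parse_separator(line: &str) -> Option<Separator<'_>> {
    let rest = line.strip_prefix("From ")?;
    // split_whitespace also copes with the space-padded day in ctime ("Sep  6").
    let fields: Vec<&str> = rest.split_whitespace().collect();
    match fields.as_slice() {
        [addr, _, _, _, time, year] if is_time(time) && all_digits(year, 4) => {
            Some(Separator::Plain { addr: *addr })
        }
        [addr, _, _, _, time, zone, year]
            if is_time(time) && is_zone(zone) && all_digits(year, 4) =>
        {
            Some(Separator::WithZone { addr: *addr, zone: *zone })
        }
        _ => None,
    }
}

fn main() {
    let takeout = "From 1545668983435175434@xxx Fri Sep 16 22:26:51 +0000 2016";
    println!("{:?}", parse_separator(takeout));
}
```

Accepting the zone as an optional seventh field keeps the check strict about everything else, so a line is still rejected unless it ends in a time and a four-digit year.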

meh commented 6 years ago

Do you think you could provide the mbox that's failing to parse, with sensitive information removed? Ideally a reduced one containing just the parts that fail, so I can add it as a test and start fixing from there.

djc commented 6 years ago

What are you looking for, exactly? I'm pretty sure the starting line I provided in my first message will already induce failure in your parser; you could easily attach a message to it.

The mbox file I have is 7.8G so editing out sensitive info is quite a bit of work. I'm happy to run your code on it again to check for other failures if you can fix this one, and provide enough context to construct test cases from.

meh commented 6 years ago

@djc the first 3 emails would be enough. It's not just about the first line; it also has to do with the way the emails are separated and other details that could differ. For instance, I'm fairly sure the error occurs because the From separator differs from the one I'm expecting, but I can't just make the match laxer, because then it could match a "From whatever" line in an email body and wrongly split that email apart.

What I want to avoid is having to go back and forth fixing one error after the other instead of just fixing all of them at once 🐼
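
The concern about a laxer match can be shown with a small sketch. It compares a naive rule (any line starting with "From " begins a message) against a stricter one that also requires a preceding blank line and a timestamp-like tail. This is an assumption about one reasonable rule, not the logic rust-mailbox actually uses; count_naive, count_strict, looks_like_separator and the second message ID in SAMPLE are made up for illustration.

```rust
/// Two messages; the first body contains a line that starts with "From ".
/// The second message ID is invented for this example.
const SAMPLE: &str = "From 1545668983435175434@xxx Fri Sep 16 22:26:51 +0000 2016\n\
Subject: hello\n\
\n\
From whatever I can tell, this is body text.\n\
\n\
From 1545668983435175435@xxx Fri Sep 16 22:30:00 +0000 2016\n\
Subject: second\n\
\n\
bye\n";

/// Naive rule: every line starting with "From " begins a new message.
fn count_naive(mbox: &str) -> usize {
    mbox.lines().filter(|l| l.starts_with("From ")).count()
}

/// Shape check: at least six fields after "From ", one of them shaped like
/// `hh:mm:ss`, and the last one a four-digit year. A zone field is tolerated.
fn looks_like_separator(line: &str) -> bool {
    let rest = match line.strip_prefix("From ") {
        Some(rest) => rest,
        None => return false,
    };
    let fields: Vec<&str> = rest.split_whitespace().collect();
    let year_ok = fields
        .last()
        .map_or(false, |y| y.len() == 4 && y.bytes().all(|b| b.is_ascii_digit()));
    let time_ok = fields.iter().any(|f| {
        let b = f.as_bytes();
        b.len() == 8 && b[2] == b':' && b[5] == b':'
    });
    fields.len() >= 6 && year_ok && time_ok
}

/// Stricter rule: a separator must follow a blank line (or start the file)
/// and pass the shape check.
fn count_strict(mbox: &str) -> usize {
    let mut prev_blank = true; // the start of the file counts as a boundary
    let mut count = 0;
    for line in mbox.lines() {
        if prev_blank && looks_like_separator(line) {
            count += 1;
        }
        prev_blank = line.is_empty();
    }
    count
}

fn main() {
    println!("naive:  {}", count_naive(SAMPLE));
    println!("strict: {}", count_strict(SAMPLE));
}
```

On this sample the naive rule counts three messages and the stricter rule counts two. The body line follows a blank line, so the blank-line requirement alone would not have rejected it; the shape check is what does.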

djc commented 6 years ago

Okay, so https://dirkjan.ochtman.nl/files/test.mbox should give you a few messages from the head of my mbox file (that is, head -nx gmail.mbox > test.mbox). I only culled some Gmail-specific headers and part of the Received headers. I'm not sure the end of the file resembles the end of the original file.

djc commented 6 years ago

BTW, not sure how aware of this you are; your choice of 8-space formatting and use of the WTFPL are a bit anomalous for the Rust ecosystem. If you'd like your code to be picked up more broadly/get more contributors, you may want to reconsider these choices.

meh commented 6 years ago

That's perfect, I'll see if I can fix it tonight, otherwise it goes to the weekend.

Also, it's tabs and not spaces, and I'm aware of the license too, just choices 🐼

droundy commented 5 years ago

Any chance this will be fixed?

djc commented 5 years ago

@droundy you might want to try the crate I published, https://github.com/djc/mbox-reader.

droundy commented 5 years ago

@djc I looked at mbox_reader, but found the complete lack of documentation discouraging.

djc commented 5 years ago

@droundy if you open an issue in that project, I'll write you some.