python / cpython

The Python programming language
https://www.python.org

mbox parser incorrect behaviour #55937

Open 3fa1d2a2-ce47-4160-b467-43028631d81b opened 13 years ago

3fa1d2a2-ce47-4160-b467-43028631d81b commented 13 years ago
BPO 11728
Nosy @warsaw, @bitdancer, @akheron

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-bug', '3.8', 'expert-email', '3.10', 'library', '3.9']
title = 'mbox parser incorrect behaviour'
updated_at =
user = 'https://bugs.python.org/wally1980'
```

bugs.python.org fields:
```python
activity =
actor = 'iritkatriel'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)', 'email']
creation =
creator = 'wally1980'
dependencies = []
files = []
hgrepos = []
issue_num = 11728
keywords = []
message_count = 8.0
messages = ['132657', '132671', '132687', '138245', '163812', '163872', '163902', '164636']
nosy_count = 5.0
nosy_names = ['barry', 'r.david.murray', 'sdaoden', 'wally1980', 'petri.lehtinen']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue11728'
versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']
```

3fa1d2a2-ce47-4160-b467-43028631d81b commented 13 years ago

The mailbox.mbox parser splits mbox files on the "^From " pattern, which is wrong; in fact it should split the mbox on "\nFrom ". Illustration:

------
From bla-blah@localhost
Header1
Header2
body1
body2

From blah-blah2@localhost
Header1
body1
From your dear friend
body3

------

This mbox would be split into 3 messages instead of 2.
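The difference between the two splitting rules can be sketched in a few lines. This is only an illustration on the sample above: the helper names are hypothetical, and the code is not the `mailbox` module's implementation.

```python
# Hypothetical helpers for illustration; not the mailbox module's code.
SAMPLE = (
    "From bla-blah@localhost\n"
    "Header1\n"
    "Header2\n"
    "body1\n"
    "body2\n"
    "\n"
    "From blah-blah2@localhost\n"
    "Header1\n"
    "body1\n"
    "From your dear friend\n"
    "body3\n"
    "\n"
)


def starts_any_from(text):
    """Indices of all lines that start with 'From ' (the "^From " rule)."""
    lines = text.split("\n")
    return [i for i, line in enumerate(lines) if line.startswith("From ")]


def starts_blank_then_from(text):
    """Indices of 'From ' lines at the top of the file or after an empty line."""
    lines = text.split("\n")
    return [
        i for i, line in enumerate(lines)
        if line.startswith("From ") and (i == 0 or lines[i - 1] == "")
    ]


print(len(starts_any_from(SAMPLE)))         # 3
print(len(starts_blank_then_from(SAMPLE)))  # 2
```

The first rule treats the unquoted "From your dear friend" body line as a message start; the second does not, because that line is not preceded by an empty line.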

bitdancer commented 13 years ago

All the references I could find talk about triggering the match without the preceding newline. That is, it is not certain that a blank line will precede the 'From ' header, and the typical quoting rules for mbox format call for any 'From ' at the start of a line (whether preceded by a blank line or not) to be quoted. This might have something to do with the fact that otherwise you have to special-case the first line of the mbox, but I don't really know.

What tool are you using that is producing the unquoted 'From ' lines in your mbox? I know there are variants on the mbox format, so if one of them has the format you propose, this would become a feature request to support that variant mbox format.

3fa1d2a2-ce47-4160-b467-43028631d81b commented 13 years ago

On Thu, 31 Mar 2011 14:13:50 +0000, "R. David Murray" <report@bugs.python.org> wrote:

> R. David Murray <rdmurray@bitdance.com> added the comment:
>
> All the references I could find talk about triggering the match without the preceding newline. That is, it is not certain that a blank line will precede the 'From ' header, and the typical quoting rules for mbox format call for any 'From ' at the start of a line (whether preceded by a blank line or not) to be quoted. This might have something to do with the fact that otherwise you have to special-case the first line of the mbox, but I don't really know.
>
> What tool are you using that is producing the unquoted 'From ' lines in your mbox? I know there are variants on the mbox format, so if one of them has the format you propose, this would become a feature request to support that variant mbox format.
>
> ---------- nosy: +r.david.murray

Hello, David!

This is an email from the netcraft mailing list. The host which accepted it is running sendmail with some antivirus software on top (mimedefang + spamassassin, from what I know). It could be that something is broken in that chain. I spotted the error while writing a script for mailbox --> maildir conversion during the migration of this server, so I had to subclass mailbox.mbox and fix it as I needed. I'll investigate further what led to this behaviour. Nevertheless, here is a snippet from RFC 4155:

> In order to improve interoperability among messaging systems, this memo defines a "default" mbox database format, which MUST be supported by all implementations that claim to be compliant with this specification.
>
> The "default" mbox database format uses a linear sequence of Internet messages, with each message being immediately prefaced by a separator line, and being terminated by an empty line.

So I think assuming that there should be an empty line before the "From " separator line is fine (for the second email and further), and it would help to deal with all kinds of mbox mailboxes. The fix is rather trivial.

Best regards, Valery Masiutsin

5792609d-7136-4bf5-a72c-931da2480f6a commented 13 years ago

Hello Valery Masiutsin, I recently stumbled over this while searching for the link to the standard that I've stored in another issue. (Without being logged in, say.) The de-facto standard (http://qmail.org/man/man5/mbox.html) says:

> HOW A MESSAGE IS READ
>
> A reader scans through an mbox file looking for From_ lines. Any From_ line marks the beginning of a message. The reader should not attempt to take advantage of the fact that every From_ line (past the beginning of the file) is preceded by a blank line.

This is however the recent version. The "mbox" manpage of my up-to-date Mac OS X 10.6.7 does not state this, for example. It's from 2002. However, all known MBOX standards, i.e. MBOXO, MBOXRD, MBOXCL, require proper quoting of non-From_ "From " lines (by preceding them with '>'). So your example should not fail in Python. (But hey - are you sure *that* has been produced by Perl?)

You're right however that Python seems to only support the old MBOXO way of un-escaping only plain "From " to/from ">From ", which is not even mentioned anymore in the current standard - that only describes MBOXRD ("(>*From )" -> ">"+match.group(1)). (Lucky me: I own Mac OS X, otherwise I wouldn't even know.) Thus you're in trouble if the unescaping is performed before the split. This is another issue, though: "MBOX parser uses MBOXO algorithm".
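To make the MBOXO / MBOXRD distinction concrete, here is a minimal sketch of the two quoting schemes as they are commonly described. The function names are hypothetical and chosen for illustration; this is not code from the `mailbox` module.

```python
import re

# Hypothetical helper names; they illustrate the quoting schemes only
# and are not functions from the mailbox module.


def quote_mboxo(line):
    """mboxo: only a bare 'From ' at the start of a line gets a '>' prefix."""
    return ">" + line if line.startswith("From ") else line


def unquote_mboxo(line):
    """mboxo: strip one '>' from lines that start with '>From '."""
    return line[1:] if line.startswith(">From ") else line


def quote_mboxrd(line):
    """mboxrd: any line matching '>*From ' gets one more '>'."""
    return ">" + line if re.match(r">*From ", line) else line


def unquote_mboxrd(line):
    """mboxrd: strip one '>' from lines matching '>+From '."""
    return line[1:] if re.match(r">+From ", line) else line


body_line = ">From the archives"
# mboxo leaves this line alone when writing, then strips a '>' when reading.
print(unquote_mboxo(quote_mboxo(body_line)))    # From the archives
# mboxrd adds a '>' when writing and removes it again when reading.
print(unquote_mboxrd(quote_mboxrd(body_line)))  # >From the archives
```

The round trip shows why the mboxo scheme is lossy: a body line that already began with ">From " comes back changed, whereas the mboxrd scheme restores it.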

;> - Ciao, Steffen

akheron commented 12 years ago

It seems to me that "^From " is the correct way to match the start of each message. This is also what the qmail manual page (linked in the previous message) says. So closing as invalid.

3fa1d2a2-ce47-4160-b467-43028631d81b commented 12 years ago

Hello Petri

The qmail manpage does not sound like a valid reference to me. I've pointed to the relevant RFC (which dictates the correct behaviour) as a reference, and the Python mbox parser does not conform to it.

Best regards, Valery Masiutsin

On Sun, Jun 24, 2012 at 6:41 PM, Petri Lehtinen <report@bugs.python.org> wrote:

> Petri Lehtinen <petri@digip.org> added the comment:
>
> It seems to me that "^From " is the correct way to match the start of each message. This is also what the qmail manual page (linked in the previous message) says. So closing as invalid.
>
> ----------
> nosy: +petri.lehtinen
> resolution: -> invalid
> stage: test needed -> committed/rejected
> status: open -> closed




akheron commented 12 years ago

Actually, you're right. Sorry for overlooking the RFC. But that said, the RFC itself refers to the same manpage as a reference that's "mostly authoritative for those variations that are otherwise only documented in anecdotal form". So I guess it's quite a good reference after all :)

In Appendix A, RFC 4155 defines a set of rules for a "default" mbox format that maximizes interoperability between different mbox implementations.

The important things in the RFC concerning this issue are:

Because the RFC states that there must be an empty line after each message, and it aims for maximum interoperability, I think we can assume that there always is an empty line there. But looking for "\n\nFrom " is not enough for finding the starting points of messages. We should actually parse the whole separator line which consists of "From ", an email address (addr-spec in RFC 2822), a timestamp (in UNIX ctime format without timezone), and a newline character.
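A rough sketch of that stricter check could look like the following. It makes two loud assumptions: the addr-spec pattern is heavily simplified (a real RFC 2822 addr-spec is more permissive), and the names `SEPARATOR_RE` and `find_message_starts` are hypothetical, not part of the `mailbox` module.

```python
import re

# Hypothetical sketch; SEPARATOR_RE and find_message_starts are illustrative
# names. The addr-spec part is a deliberate simplification of RFC 2822.
SEPARATOR_RE = re.compile(
    r"^From "
    r"(?P<addr>[^\s@]+@[^\s@]+) "            # simplified addr-spec
    r"(?P<date>[A-Z][a-z]{2} [A-Z][a-z]{2} "  # weekday and month names
    r"[ \d]\d \d{2}:\d{2}:\d{2} \d{4})$"      # day, time, year (ctime layout)
)


def find_message_starts(lines):
    """Return indices of lines that look like full separator lines and
    are either the first line or directly preceded by an empty line."""
    starts = []
    for i, line in enumerate(lines):
        if not SEPARATOR_RE.match(line):
            continue
        if i == 0 or lines[i - 1] == "":
            starts.append(i)
    return starts


lines = [
    "From a@example.com Thu Mar 31 14:13:50 2011",
    "Subject: first",
    "",
    "body",
    "From your dear friend",
    "",
    "From b@example.com Fri Apr  1 10:00:00 2011",
    "Subject: second",
    "",
    "body",
    "",
]
print(find_message_starts(lines))  # [0, 6]
```

Here the body line "From your dear friend" is rejected because it lacks an address and a timestamp, so requiring the full separator line avoids that false split even when the line follows an empty line.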

I think this should be the default mode for reading mbox files. See bpo-13698 for adding support for other formats.

akheron commented 12 years ago

Some thoughts on doing "clever tricks" to enhance mbox parsing:

http://www.jwz.org/doc/content-length.html