micronode / mstor

Messaging store and archival tool
https://www.mstor.org
Other
11 stars 2 forks source link

Sometimes the beginning of an email is not recognised in getMessagePositions() in /fortuna/mstor/data/MboxFile.java. #27

Open jfarwer opened 2 years ago

jfarwer commented 2 years ago

Hi,

I think there is a slight issue in getMessagePositions() in /fortuna/mstor/data/MboxFile.java.

An Mbox file is split into managable chunks of DEFAULT_BUFFER_SIZE. For each chunk a search is done for the pattern for the beginning of an email: (\\A|\\n{2}|(\\r\\n){2})^From .*$).

When creating the chunks, the last seven characters (FROM__PREFIX.length() + 2) from each chunk are added to the beginning of the next chunk. This is to avoid losing a From_match due to splitting through the pattern when creating the chunks. Seven characters cater for any split within \n\nFrom_.

However when \r\n\r\nFrom_ is split between the 'm' and the blank then we are left with \n\r\nFrom_. The pattern \\A|\\n{2}|(\\r\\n){2})^From .*$ doesn't match this and the beginning of the email is skipped. Maybe adding \n\r\nFrom_ to the pattern would fix it? This would also cater for the case where an Mbox file contains mixed newline styles (and possibly \n\r\nFrom_ at the beginning of an email).

benfortuna commented 2 years ago

Thanks for the feedback. I'll review the pattern and try to create a test to reproduce this issue.

jfarwer commented 2 years ago

Thanks for the quick response. We are working with sets of a few hundred thousand emails; it would be great to have it fixed. Unfortunately, I can't give you my files for testing for privacy reasons.

benfortuna commented 2 years ago

I had a closer look and there is already support for overlapping buffers to prevent loss of the From_ header.

However there was a bug that it wasn't accounting for enough overlap bytes to include the whitespace. I've applied a fix and released as v1.0.1.

As this is the first release in quite a while, there may be some unexpected behaviour changes, but if you can try and provide feedback I'll address as needed.

jfarwer commented 2 years ago

Great, that was quick. I will give feedback to you when I have done some tests.

jfarwer commented 2 years ago

That is looking good. In mstor-1.0.1.pom I had to change the occurrences of <scope>runtime</scope> to <scope>compile</scope>, but that is just because my code uses the dependencies itself without having them in its own pom file.

For using the MStorStore class I had to change net.fortuna.mstor.MStorStore to net.fortuna.mstor.model.MStorStore

I will let you know if I experience anything unexpected.

jfarwer commented 2 years ago

Hi,

Sorry for being a pain. I think the pattern (\\A|\\n{2}|(\\r\\n){2})^From .*$ looks for either the beginning of the string or two Unix newlines or two Windows newlines. I guess the beginning of the string is supposed to cater for the beginning of the Mbox file where there are no newlines in front of the From_. However, reading is done in chunks of 8192 characters, and if a chunk starts with From_, that matches the pattern. I have a large Mbox file with a From_ in a line of email text which is taken as the start of an email. I think this is not a new issue. It didn't crop up before because a different overlap was used, meaning the beginning of the chunks was different, and it didn't happen in the Mbox file I was looking at.

stleusc commented 2 years ago

I am using the current version and I get this issue: One of the message objects has null ID, null Subject, etc. Enabled logging and found this:

[main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16633060] [main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16644222] [main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16644222] [main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16650269]

see the duplicate index?! For some reason the same index is used twice and the second time the message object is broken. The second should not exist at all... Pretty sure related to this...

benfortuna commented 2 years ago

Ok that is starting to make sense. I guess if two concurrent messages are less than 8192 bytes then you could potentially have the same message in consecutive buffers. I'll add a fix to ensure duplicate indexes aren't added.

stleusc commented 2 years ago

I just debugged it a bit.... The first time it finds the message is at the end of the buffer and the buffer ends in: \r\n\r\nFrom_ It ends with and EXACT hit. Then the next loop you add that EXACT same text to the beginning of the next buffer which results in another hit.

I believe you detect the length of msg by looking at next or prior offset from posList and take the diff. That means for one message the difference is simply 0! This should be the case for the FIRST 16644222 since the next starts at the same index the length for this object is 0.

That means, once you fix the issue of the double match the other issue cannot happen anymore. I would not even consider it a bug seeing what I see.

So, fixing the issue with match at the very end is all I would need.

stleusc commented 2 years ago

I think I might even have a possible fix...

Current code

private static final Pattern FROM__LINE_PATTERN = Pattern.compile("(\\A|\\n{2}|(\\r\\n){2})^From .*$", 8);

Proposal

private static final Pattern FROM__LINE_PATTERN = Pattern.compile("(\\A|\\n{2}|(\\r\\n){2})^From .+$", 8);

Your pattern matches \r\n\r\nFrom_ with NOTHING more. But since you add that exact pattern for another round you should not match it here. This can be fixed by requiring at LEAST 1 more character. That way, you NEVER can have a double hit and the above case would NOT match at the end of the buffer but will be caught the next round.

stleusc commented 2 years ago

With my change: [main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16633060] [main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16644222] [main] DEBUG net.fortuna.mstor.data.MboxFile - Found match at [16650269]

:-)

benfortuna commented 2 years ago

Yup I will add this change. In the meantime i pushed a crude fix to just check the last message position to avoid adding it to the index a second time. Release is just pushed should be on maven central shortly.