log2timeline / plaso

Super timeline all the things
https://plaso.readthedocs.io
Apache License 2.0

Add parser for mbox #2151

Open joachimmetz opened 5 years ago

joachimmetz commented 5 years ago

Taken from the Plaso - Roadmap and Assignment

s-martinelli commented 5 years ago

Hi, I would like to develop this parser. Do you already have an idea of how it should work? I have some doubts about how to validate an mbox file: I currently check the file extension and how the file starts ("From a ..."). Thanks in advance.

Onager commented 5 years ago

Hey @stefanomart - this approach to validating an mbox file sounds OK, but I'd recommend dropping the file extension check.

Given that mbox is a text file format, I recommend creating a subclass of the text parser, probably PyParsingMultilineTextParser.
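The start-of-file check discussed above can be sketched without looking at the file extension at all. This is only an illustration of the idea on a generic binary file-like object; the function name `looks_like_mbox` is hypothetical and is not part of Plaso, where a text parser would express the same check in its own line grammar.

```python
import io


def looks_like_mbox(file_object):
    """Return True if the stream starts with an mbox "From " separator.

    Hypothetical helper: only the first five bytes are inspected, so this
    is a cheap signature check, not a full validation of the format.
    """
    file_object.seek(0)
    return file_object.read(5) == b'From '


sample = io.BytesIO(b'From alice@example.com Thu Jan  1 00:00:00 2015\n')
is_mbox = looks_like_mbox(sample)
```

Note that the separator is "From" followed by a space; a stream starting with the header "From:" does not match.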

Onager commented 5 years ago

I also note there are several mbox-related formats: https://en.wikipedia.org/wiki/Mbox

Onager commented 5 years ago

https://docs.python.org/2/library/mailbox.html#mbox looks helpful for handling mboxo files

s-martinelli commented 5 years ago

Thanks for the reply. I'm using the mailbox module to extract the information. I noticed that an email can have more than one header field with the same name (such as Received). Using the mbox subclass, I can only get the first occurrence of a field that appears more than once in the same message. Do you have any advice? EDIT: Problem solved.
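The thread does not say how the repeated-header problem was solved. One standard way, shown here as a sketch on a made-up message, is `get_all()` from `email.message.Message`, which `mailbox.mboxMessage` inherits. The host names below are placeholders.

```python
import email

# A made-up message with two Received headers.
raw_message = (
    b'Received: from relay1.example.com\n'
    b'Received: from relay2.example.com\n'
    b'Subject: test\n'
    b'\n'
    b'body\n')

message = email.message_from_bytes(raw_message)

# Item access returns a single value even if the header is repeated.
single_value = message['Received']

# get_all() returns every occurrence, or the given default if none exist.
all_values = message.get_all('Received', [])
```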

s-martinelli commented 5 years ago

@Onager Hi, I'm using the mailbox library to parse some mbox files. The mailbox.mbox class takes the file's path as a parameter, not a file-like object (file_object in plaso). Would this be a problem? Thanks in advance.

joachimmetz commented 5 years ago

Would this be a problem?

Yes, plaso uses dfVFS to read files directly from storage media images and devices. Your parser will need to be able to work with a file-like object.

There are workarounds like copying the file to a temporary location, however this is not recommended.

joachimmetz commented 5 years ago

To be verbose, please also read https://github.com/log2timeline/plaso/wiki/Adding-a-new-dependency.

s-martinelli commented 5 years ago

Would this be a problem?

Yes, plaso uses dfVFS to read files directly from storage media images and devices. Your parser will need to be able to work with a file-like object.

There are workarounds like copying the file to a temporary location, however this is not recommended.

Would it be possible to use a copy of the mailbox library that I have partially modified so that it takes a file-like object instead of a path as a parameter?

joachimmetz commented 5 years ago

Would it be possible to use the mailbox library partially modified by me?

For your own set up this should not be a problem.

If you want to have your code integrated into plaso, I would recommend getting your changes merged upstream (into the mbox library). Unless you want to maintain a fork?

Onager commented 5 years ago

mbox is in the standard library, so getting an upstream change in might be challenging. It might be possible to just add a subclass of the standard library mbox that takes a file-like object, and use that in your parser: https://github.com/python/cpython/blob/3.6/Lib/mailbox.py

joachimmetz commented 5 years ago

Various standard library modules of Python are maintained as stand-alone projects. Not sure, but the API of this project looks very similar: https://pypi.org/project/mailbox/. Then again, the GitHub link looks very different.

s-martinelli commented 5 years ago

Thank you for the replies. I have already written the mbox parser using the mailbox library. I could add the logic of this library into plaso, eliminating all the irrelevant and potentially dangerous parts (like the writing to file). That way you don't risk adding code with potential errors, and I could continue the work I started. Moreover, the mailbox library has been in use since 1994, so we can be sure it's battle-tested and handles all the corner cases correctly.
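To illustrate what a read-only reader over a file-like object could look like, here is a minimal sketch. It is not the code from the mailbox module and not the parser written in this thread: it swaps in a simple split on lines starting with "From " and hands each message to the standard `email` package. The name `iter_mbox_messages` is hypothetical. It assumes the object offers only `seek()` and `read()`, and it reads the whole file into memory, which a real parser would likely avoid.

```python
import email
import io


def iter_mbox_messages(file_object):
    """Yield email.message.Message objects from a binary mbox stream.

    Read-only sketch: messages are split at lines starting with b'From '.
    Escaped body lines (">From ") are left as they are.
    """
    file_object.seek(0)
    current_lines = None  # None until the first separator line is seen.
    for line in file_object.read().splitlines(keepends=True):
        if line.startswith(b'From '):
            if current_lines is not None:
                yield email.message_from_bytes(b''.join(current_lines))
            current_lines = []  # The separator line itself is dropped.
        elif current_lines is not None:
            current_lines.append(line)
    if current_lines is not None:
        yield email.message_from_bytes(b''.join(current_lines))


mbox_data = (
    b'From a@example.com Thu Jan  1 00:00:00 2015\n'
    b'Subject: one\n\nfirst body\n\n'
    b'From b@example.com Fri Jan  2 00:00:00 2015\n'
    b'Subject: two\n\nsecond body\n')
subjects = [m['Subject'] for m in iter_mbox_messages(io.BytesIO(mbox_data))]
```

Because nothing is ever written back, the locking and rewrite logic of the original module has no counterpart here.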

joachimmetz commented 5 years ago

could add the logic of this library into plaso, eliminating all the irrelevant and potentially dangerous parts (like the writing to file).

Sounds like the more feasible option then ;)

Moreover, the mailbox library has been in use since 1994, so we can be sure it's battle-tested and handles all the corner cases correctly.

I'm sceptical here, about handling "all" edge (corner) cases, but likely a fair bit. It is nice that it's part of the core, which means no additional changes for deployment.

s-martinelli commented 5 years ago

I'm using part of mailbox's logic and I have a problem with the file object (dfvfs.FileIO): because this class doesn't implement a readline() method, I wrote one, but it's much slower than File's readline(). I also tried to implement the method based on the logic of text_file's readline(), but the results are not good anyway. Why doesn't dfvfs.FileIO have readline()? By the way, do you have any suggestions?

joachimmetz commented 5 years ago

Why doesn't dfvfs.FileIO have readline()?

Because it is not needed most of the time and readline is not a file IO primitive.

I also tried to implement the method based on the logic of text_file's readline(), but the results are not good anyway.

Based on this comment I assume you tried https://github.com/log2timeline/dfvfs/blob/master/dfvfs/helpers/text_file.py? Could you be more specific about in what way the results are not good?
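For context on why a hand-written readline() can be slow: reading one byte per call is expensive, while reading in larger chunks and splitting lines from a buffer needs far fewer read() calls. The following is a generic sketch of the buffered approach over any object that offers `read(size)`. It is not the dfVFS text_file implementation, and the class name `BufferedLineReader` and the default chunk size are assumptions made for illustration.

```python
import io


class BufferedLineReader:
    """Provide readline() on top of an object that only offers read(size)."""

    def __init__(self, file_object, chunk_size=64 * 1024):
        self._file_object = file_object
        self._chunk_size = chunk_size
        self._buffer = b''

    def readline(self):
        """Return the next line with its trailing newline, or b'' at EOF."""
        while True:
            newline_index = self._buffer.find(b'\n')
            if newline_index != -1:
                line = self._buffer[:newline_index + 1]
                self._buffer = self._buffer[newline_index + 1:]
                return line
            chunk = self._file_object.read(self._chunk_size)
            if not chunk:
                # End of file: hand back whatever is left, possibly b''.
                line, self._buffer = self._buffer, b''
                return line
            self._buffer += chunk


reader = BufferedLineReader(io.BytesIO(b'a\nbb\nccc'), chunk_size=2)
lines = [reader.readline() for _ in range(4)]
```

The sketch returns bytes and leaves decoding to the caller, which keeps the line reader independent of any text encoding.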

s-martinelli commented 5 years ago

Yes, I tried the logic of text_file's readline. These are some results which I obtained with some mbox files:

1 File (21KB):

2 File (444KB):

3 File (11MB):

You can see what will happen with very large files.

Onager commented 5 years ago

I wouldn't worry too much about that, @stefanomart. Plaso's run time is pretty long, and the extra time for I/O is the price we pay for having dfvfs handle all the different file systems/file storage formats.

s-martinelli commented 5 years ago

Hello again! I noticed another problem with text_file's readline: this method decodes every line as UTF-8. Some mbox files use mixed encodings; when I try to parse these mbox files I get this error: 'utf8' codec can't decode byte 0xe8 in position 65.

def __init__(self, file_object, encoding='utf-8', end_of_line='\n'):

If I remove the piece of code that decodes every line, the problem disappears.
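The error can be reproduced with plain Python, independent of dfVFS. The sketch below shows one possible alternative to removing the decoding step: decoding with `errors='replace'`, so that a byte that is not valid UTF-8 becomes U+FFFD instead of raising. The helper name `decode_line` and the sample line are made up for illustration.

```python
def decode_line(line_bytes, encoding='utf-8'):
    """Decode one line, replacing undecodable bytes with U+FFFD."""
    return line_bytes.decode(encoding, errors='replace')


# 0xe8 is "e with grave accent" in Latin-1 but not valid UTF-8 on its own.
line = b'Subject: caff\xe8\n'

try:
    line.decode('utf-8')
    strict_decoding_failed = False
except UnicodeDecodeError:
    strict_decoding_failed = True

lenient_text = decode_line(line)
latin1_text = line.decode('latin-1')
```

Keeping the lines as bytes, as described above, is the other option; the `email` package can then decode each header and body part with the charset that the message itself declares.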