rueckstiess / mtools

A collection of scripts to set up MongoDB test environments and parse and visualize MongoDB log files.
Apache License 2.0

Improve performance when parsing large log files #86

Open gianpaj opened 11 years ago

gianpaj commented 11 years ago

For example, use multi-threading:

http://docs.python.org/2/library/threading.html
http://www.tutorialspoint.com/python/python_multithreading.htm

rueckstiess commented 11 years ago

It would probably have to be multiprocessing rather than multi-threading because of Python's GIL, if I remember correctly.

So the idea would be to spread parsing of log lines over all available cores with separate processes? They can't have shared memory so they'd need to communicate the results back to the main process somehow. We need to test how that would be done efficiently.
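A minimal sketch of that idea, assuming a `multiprocessing.Pool`: the parent splits the log into chunks, worker processes parse them, and the results are pickled back to the parent. The `parse_chunk` function below is a hypothetical stand-in, not mtools' actual parser:

```python
from multiprocessing import Pool

def parse_chunk(lines):
    # Hypothetical stand-in for mtools' per-line parsing; here we
    # just pull out the first whitespace-separated token of each line.
    return [line.split(None, 1)[0] for line in lines if line.strip()]

def parse_parallel(all_lines, chunk_size=10000, processes=4):
    # Split the log into fixed-size chunks and hand them to a pool of
    # worker processes; results are pickled back to the parent.
    chunks = [all_lines[i:i + chunk_size]
              for i in range(0, len(all_lines), chunk_size)]
    with Pool(processes) as pool:
        per_chunk = pool.map(parse_chunk, chunks)
    # Flatten the per-chunk results, preserving the original line order.
    return [item for chunk in per_chunk for item in chunk]
```

Whether this wins in practice depends on how the pickling and inter-process communication overhead compares to the per-line parsing cost.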

gianpaj commented 11 years ago

I remember doing some multi-threading in Python; after getting over some syntax problems, I managed to make it work.

rueckstiess commented 11 years ago

OK, but still, multi-threading wouldn't increase performance, right? Because it would still only be able to use one core.

gianpaj commented 11 years ago

Basically, I tried this, but I think a lot of changes would be needed. I'm trying it first in mlogvis.py: since it calls LogLine on every line of the log file, I split the file into chunks of 10,000 lines and start a new process for each chunk. There are some limitations in the function I'm using (multiprocessing.Pool.apply); for example, you can't pass it a class method, so I'm going to have to use global variables rather than instance variables (self.variable).
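The limitation mentioned here is a pickling issue: on Python 2, `multiprocessing` could not pickle bound methods, so the worker entry point had to be a module-level function, with any needed state passed in explicitly (or made global). A minimal sketch of that workaround; the class and function names are illustrative, not from the linked commit:

```python
from multiprocessing import Pool

class LogVisualizer:
    def __init__(self, tag):
        self.tag = tag

    def process_line(self, line):
        return "%s:%s" % (self.tag, line.strip())

def _worker(args):
    # Module-level wrapper: Pool.apply/map need a picklable callable,
    # and bound methods like LogVisualizer.process_line were not
    # picklable on Python 2. State (here, `tag`) travels in the args.
    tag, lines = args
    vis = LogVisualizer(tag)
    return [vis.process_line(line) for line in lines]

def process_chunks(tag, chunks, processes=2):
    # One task per chunk of log lines, fanned out over the pool.
    with Pool(processes) as pool:
        return pool.map(_worker, [(tag, chunk) for chunk in chunks])
```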

I know this is horrible, but have a look if you can find anything good in this code: https://github.com/gianpaj/mtools/commit/bbd7b08d24c34c5679c55bd8aaf0c6d73a69a05b

rueckstiess commented 11 years ago

For what it's worth, I've made mlogfilter faster when filtering on dates by using a binary search rather than a linear scan. By faster, I mean practically instant: a search with --from and --to on a 400 MB log file used to take more than 10 minutes; now it takes 0.15 seconds. I think that's a nice improvement. :-)
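The idea can be illustrated in memory (mlogfilter itself works over file offsets, but the principle is the same). The timestamp format and `parse_ts` below are assumptions for the sketch, not mtools code:

```python
from datetime import datetime

def parse_ts(line):
    # Hypothetical extractor: assumes each line starts with an
    # ISO-style timestamp, e.g. "2013-08-01 12:00:00 ...".
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")

def lower_bound(lines, target):
    # Index of the first line whose timestamp is >= target; lines
    # must be sorted by time, which log files naturally are.
    lo, hi = 0, len(lines)
    while lo < hi:
        mid = (lo + hi) // 2
        if parse_ts(lines[mid]) < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

def filter_range(lines, from_ts, to_ts):
    # Two binary searches bound the slice, so only O(log n) lines
    # get parsed instead of every line; hence the near-instant result.
    return lines[lower_bound(lines, from_ts):lower_bound(lines, to_ts)]
```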

gianpaj commented 11 years ago

awesome!

rueckstiess commented 10 years ago

See also #187; that made log file parsing about 8x faster for most tasks. I'll still keep this open, though, and want to give multiprocessing another shot when I get to it.