rueckstiess / mtools

A collection of scripts to set up MongoDB test environments and parse and visualize MongoDB log files.
Apache License 2.0
1.88k stars 399 forks

mplotqueries should store references to log lines, not in-ram copies of the whole log line #271

Open devkev opened 10 years ago

devkev commented 10 years ago

mplotqueries stores the original log line along with the parsed info, so that it can output it when points are clicked. However, it would be much better to store a filename + byte offset instead (where possible, i.e. when reading from a rewindable and/or seekable file), to avoid using enormous amounts of memory on very large logfiles.
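A minimal sketch of the filename + byte offset idea, assuming a regular file on disk. The helper names `index_offsets` and `read_line_at` are made up for illustration and are not part of mtools:

```python
import os
import tempfile


def index_offsets(path):
    """Return a list with the byte offset at which each line starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)
    return offsets


def read_line_at(path, offset):
    """Seek to a stored offset and return that single line as text."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline().decode("utf-8", errors="replace").rstrip("\n")


# Small demonstration on a throwaway file.
with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as tmp:
    tmp.write(b"first line\nsecond line\nthird line\n")
    demo_path = tmp.name

offsets = index_offsets(demo_path)
second = read_line_at(demo_path, offsets[1])
os.remove(demo_path)
```

Only one integer per plotted point has to stay in memory; the line text is re-read from disk when a point is clicked.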

Alternatively, when the logfile is an actual file (i.e. not a pipe), it could be mmapped, which would potentially allow for faster reading (especially when plotting and replotting the same file over and over), and fast/easy access back to the original log lines without having to use lots of RAM.
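For illustration, the mmap variant could look roughly like this with Python's standard `mmap` module. The helper `line_from_mmap` and the hard-coded offsets are assumptions for the demo only:

```python
import mmap
import os
import tempfile


def line_from_mmap(mm, offset):
    """Return the line starting at `offset` in a memory-mapped file."""
    end = mm.find(b"\n", offset)
    if end == -1:  # last line without a trailing newline
        end = len(mm)
    return mm[offset:end].decode("utf-8", errors="replace")


# Throwaway file standing in for a real logfile.
with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as tmp:
    tmp.write(b"alpha\nbeta\ngamma")
    demo_path = tmp.name

with open(demo_path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # Offsets 0, 6 and 11 are where the three demo lines start.
        mmap_lines = [line_from_mmap(mm, off) for off in (0, 6, 11)]
    finally:
        mm.close()
os.remove(demo_path)
```

The operating system pages the file in on demand, so the process does not need to hold its own copy of every line.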

gianpaj commented 10 years ago

You're right. Storing the line number is probably more than enough. When a point is clicked, just open the file at that line number. Not sure if linecache could work here.
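A quick sketch of what a linecache lookup would look like, using a temporary file in place of a real log:

```python
import linecache
import os
import tempfile

with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("line one\nline two\nline three\n")
    demo_path = tmp.name

# linecache uses 1-based line numbers and returns "" for a missing line.
clicked = linecache.getline(demo_path, 2).rstrip("\n")
missing = linecache.getline(demo_path, 99)

linecache.clearcache()
os.remove(demo_path)
```

One thing to keep in mind: linecache reads all lines of a file into its cache on first access, so it makes the lookup convenient but does not by itself reduce memory use for very large files.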

rueckstiess commented 10 years ago

Thanks, I like the linecache idea, that sounds like the way to go.

rueckstiess commented 9 years ago

Some notes:

use namedtuples to only store the fields needed, which are:

Issue with grouping: the grouping is currently a function that takes the logevent and calculates the group dynamically. Instead, pre-calculate the group value in add_line() (it doesn't change during the lifetime of a single plot_instance), and add an additional field group to the tuple.
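A rough sketch of the pre-calculated group idea. The field list is a placeholder (the actual set of needed fields is not listed here), and `PlotPoint` and `PlotInstance` are toy names, not the mtools classes:

```python
from collections import namedtuple

# Hypothetical field set; the real list of needed fields is not given above.
PlotPoint = namedtuple("PlotPoint", ["line_no", "x", "y", "group"])


class PlotInstance:
    """Toy stand-in for a plot type, not the mtools class."""

    def __init__(self, group_by):
        self.group_by = group_by  # function: event dict -> group value
        self.points = []

    def add_line(self, line_no, event):
        # Compute the group once here instead of on every redraw.
        group = self.group_by(event)
        self.points.append(
            PlotPoint(line_no, event["ts"], event["duration"], group)
        )


plot = PlotInstance(group_by=lambda ev: ev["namespace"])
plot.add_line(1, {"ts": 10, "duration": 120, "namespace": "test.users"})
plot.add_line(2, {"ts": 11, "duration": 45, "namespace": "test.orders"})
groups = [p.group for p in plot.points]
```

Because the tuple carries the group value itself, the full logevent no longer has to be kept around just to regroup points.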

What about stdin? Need to additionally store line_str.

devkev commented 9 years ago

stdin is just a special case of a file that can't be seeked. Such a file can also be passed on the command line (e.g. using bash's "<()" construct, or using mkfifo).
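To illustrate the point, Python streams report whether they can be seeked, so pipes and regular files can be told apart without special-casing stdin. The helper name `can_seek` is made up for this sketch:

```python
import io
import os


def can_seek(stream):
    """Return True if the stream supports seek()/tell()."""
    try:
        return stream.seekable()
    except AttributeError:
        return False


# A pipe behaves like stdin fed from another process: it cannot be rewound.
read_fd, write_fd = os.pipe()
with os.fdopen(read_fd, "rb") as pipe_reader, os.fdopen(write_fd, "wb"):
    pipe_is_seekable = can_seek(pipe_reader)

# An in-memory buffer stands in for a regular file on disk.
buffer_is_seekable = can_seek(io.BytesIO(b"some log line\n"))
```

The same check would apply to a FIFO created with mkfifo or to a "<()" process substitution, since both are pipes underneath.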

I would suggest the following approach. Change the rest of the code to not store line_str, but rather the line number of the file, which is used as an indirect reference back into the file. Define an abstract "Logfile" class. This has 3 actual implementations, which are tried in turn:

The other approach to dealing with non-seekable files is to cache them into a temporary disk file somewhere, somehow. I dislike this idea, because it means that it becomes mtools's problem to find a writable location with sufficient disk space to put the temporary file(s), and to clean them up later (which isn't always possible, e.g. kill -9). I much prefer the policy that if you have output from a pipe that you want to plot, and it's "large" (as defined by the maximum CachedLogfile cache size above), then it's your job to pipe it into a file and then feed that file to mplotqueries. This pushes the decision of finding a writable location with enough space, and cleaning up the file afterwards, onto the user, but I don't mind that because the user is far better informed than mtools in these regards.
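The indirection described in this comment, an abstract Logfile that turns a line number back into text with a size-capped CachedLogfile for non-seekable input, might look roughly like the sketch below. The list of three implementations is not reproduced in this thread, so `SeekableLogfile`, `open_logfile`, the `max_bytes` parameter and the 0-based line numbers are all guesses made for illustration:

```python
import io
from abc import ABC, abstractmethod


class Logfile(ABC):
    """Maps a line number (0-based here) back to the original line."""

    @abstractmethod
    def get_line(self, line_no):
        ...


class SeekableLogfile(Logfile):
    """Hypothetical: remembers byte offsets and seeks on demand."""

    def __init__(self, stream):
        self.stream = stream
        self.offsets = []
        stream.seek(0)
        while True:
            pos = stream.tell()
            if not stream.readline():
                break
            self.offsets.append(pos)

    def get_line(self, line_no):
        self.stream.seek(self.offsets[line_no])
        return self.stream.readline().rstrip(b"\n")


class CachedLogfile(Logfile):
    """Hypothetical: keeps lines in RAM, up to max_bytes, for pipes."""

    def __init__(self, stream, max_bytes):
        self.lines = []
        used = 0
        for line in stream:
            used += len(line)
            if used > max_bytes:
                # Policy from the comment: large piped input is the user's job.
                raise ValueError("input too large to cache; write it to a file first")
            self.lines.append(line.rstrip(b"\n"))

    def get_line(self, line_no):
        return self.lines[line_no]


def open_logfile(stream, max_bytes=1024):
    """Try the cheap seekable variant first, then fall back to caching."""
    if stream.seekable():
        return SeekableLogfile(stream)
    return CachedLogfile(stream, max_bytes)


seekable = open_logfile(io.BytesIO(b"a\nbb\nccc\n"))
line_from_seekable = seekable.get_line(2)

cached = CachedLogfile(io.BytesIO(b"a\nbb\n"), max_bytes=1024)
line_from_cached = cached.get_line(1)
```

The rest of the plotting code would then only hold line numbers and ask the Logfile object for the text when a point is clicked.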