jtmoon79 / super-speedy-syslog-searcher

Speedily search and merge log messages by datetime
MIT License

improve memory usage for archived or compressed files #182

Open jtmoon79 opened 10 months ago

jtmoon79 commented 10 months ago

Summary

This is a "meta issue" linking to the specific issues around a chronic problem.

Current behavior

Currently, some files are read entirely into memory. When dealing with large files and/or many files, the process can use too much memory and possibly crash due to Out Of Memory (OOM) errors.

Reading from a large number of archived files was the original motivating use-case for this project, i.e. many "tars of logs" from related systems. It's too bad this use-case can cause OOMs.

Specifically, these circumstances read much or all of a file into memory at one time (memory use is O(n)):


XZ

.xz files must be entirely read into memory during BlockReader::new.
This is due to lzma_rs::xz_decompress reading the entire file in one function call. There is no API that reads chunks of data for some requested number of bytes.

Noted in https://github.com/gendx/lzma-rs/issues/110
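
For illustration, here is a minimal sketch (not the project's code; the path is hypothetical) of the whole-file read that lzma_rs::xz_decompress forces: all decompressed output lands in one buffer, so memory use grows with the decompressed size.

```rust
use std::fs::File;
use std::io::BufReader;

fn main() -> std::io::Result<()> {
    // hypothetical path; any .xz file would behave the same way
    let file = File::open("logs.xz")?;
    let mut reader = BufReader::new(file);
    // xz_decompress writes every decompressed byte into this one buffer,
    // so memory use is O(decompressed size) -- there is no chunked API
    let mut decompressed: Vec<u8> = Vec::new();
    lzma_rs::xz_decompress(&mut reader, &mut decompressed)
        .expect("xz decompression failed");
    println!("decompressed {} bytes", decompressed.len());
    Ok(())
}
```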


BZ2

The entire .bz2 file is decompressed to get its uncompressed size before processing.

See #300
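
A minimal sketch of that behavior, assuming the bzip2 crate's BzDecoder (whether s4 uses this exact crate is not asserted here): the whole file is streamed through the decompressor just to learn the decompressed size.

```rust
use std::fs::File;
use std::io::Read;

use bzip2::read::BzDecoder;

/// Stream through the whole .bz2 file only to learn its decompressed size.
fn decompressed_size(path: &str) -> std::io::Result<u64> {
    let file = File::open(path)?;
    let mut decoder = BzDecoder::new(file);
    let mut buf = [0u8; 64 * 1024];
    let mut total: u64 = 0;
    loop {
        let n = decoder.read(&mut buf)?;
        if n == 0 {
            break;
        }
        // every byte is decompressed just to be counted, then discarded
        total += n as u64;
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    // hypothetical path
    println!("{} bytes decompressed", decompressed_size("syslog.bz2")?);
    Ok(())
}
```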


GZ

Fixed in 02261be4a6779a73ee08a3075bccb5effc31818f. See https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.7.73/src/readers/blockreader.rs#L2861-L2884

~.gz files must be entirely read into memory up to the requested offset. This may be a problem when a user passes a datetime filter --dt-after for a large file. The searching algorithm for plain text files is a binary search, and that same algorithm is used for the decompressed data. The search for the first datetime after the --dt-after value may require reading from disk at least half of the uncompressed file (if the first datetime in block 0 is before the passed --dt-after value). However, before that search among the decompressed data can occur, all data prior to the requested file offset must be read (and is held in memory). This is due to how gzip compresses data as a stream; in other words, you cannot decompress a block of compressed data at an arbitrary offset without first decompressing all preceding blocks of compressed data in the "gzip data stream".~

~If a user does not pass --dt-after then gzip-decompressed data is read from block offset 0, and as further blocks are read, old blocks are dropped (so memory usage does not scale with the size of the file).~


TAR

A file contained within a .tar archive must be entirely read into memory during the first call to BlockReader::read_block_FileTar, e.g. the entire contained file syslog from logs.tar is read, but not the entire logs.tar.

See #13
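
A minimal sketch of the described behavior using the tar crate (the archive and member names are hypothetical; this is not the project's code): the matching entry is read to its end, so the entire contained file is buffered, though the rest of the archive is not.

```rust
use std::fs::File;
use std::io::Read;

use tar::Archive;

/// Read one contained file (e.g. "syslog") out of an archive (e.g. "logs.tar").
fn read_contained_file(archive_path: &str, member: &str) -> std::io::Result<Vec<u8>> {
    let mut archive = Archive::new(File::open(archive_path)?);
    for entry in archive.entries()? {
        let mut entry = entry?;
        if entry.path()?.to_string_lossy() == member {
            // the entire contained file is buffered here,
            // though the rest of the .tar archive is not
            let mut data = Vec::new();
            entry.read_to_end(&mut data)?;
            return Ok(data);
        }
    }
    Err(std::io::Error::new(std::io::ErrorKind::NotFound, "member not found"))
}

fn main() -> std::io::Result<()> {
    let data = read_contained_file("logs.tar", "syslog")?; // hypothetical names
    println!("read {} bytes", data.len());
    Ok(())
}
```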


EVTX

*.evtx files must be entirely read into memory due to "out of order" events.

Relates to #86
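
A rough sketch of why out-of-order events force buffering, assuming the evtx crate (the path and the sort-by-timestamp step are illustrative of the described behavior, not the project's code): every record is collected before the events can be ordered by datetime.

```rust
use evtx::EvtxParser;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // hypothetical path
    let mut parser = EvtxParser::from_path("Security.evtx")?;
    // every record is collected into memory at once so that events can be
    // ordered by datetime before any are emitted
    let mut records: Vec<_> = parser.records().filter_map(|r| r.ok()).collect();
    records.sort_by_key(|r| r.timestamp);
    println!("{} events buffered", records.len());
    Ok(())
}
```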


Suggested behavior

Have some O(1) ceiling for memory usage for all cases.

Relates to #14

jtmoon79 commented 10 months ago

The current implementation for xz files

SyslineReader::find_sysline_at_datetime_filter binary search

SyslogProcessor::find_sysline_between_datetime_filters calls SyslineReader::find_sysline_at_datetime_filter, which does a binary search over the entire file during the "find datetime" stage (Stage2FindDt). This stage occurs before any log messages are printed to the console (printing occurs during stage Stage3StreamSyslines). The intention is that when a user passes a --dt-after value, the first qualifying sysline can be found quickly by skipping the reading and processing of unnecessary data (any syslines before the --dt-after datetime).
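
As a simplified, hypothetical illustration of that "find datetime" idea (not the project's implementation), a lower-bound binary search needs only O(log n) probes to find the first sysline at or after the --dt-after value:

```rust
/// Lower-bound binary search: index of the first datetime >= dt_after.
fn find_first_at_or_after(sysline_datetimes: &[i64], dt_after: i64) -> Option<usize> {
    // each probe stands in for reading and parsing one sysline from disk
    let mut lo = 0usize;
    let mut hi = sysline_datetimes.len();
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if sysline_datetimes[mid] < dt_after {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    if lo < sysline_datetimes.len() {
        Some(lo)
    } else {
        None
    }
}

fn main() {
    // Unix-second timestamps standing in for parsed sysline datetimes
    let datetimes = [100, 200, 300, 400, 500];
    assert_eq!(find_first_at_or_after(&datetimes, 250), Some(2));
    assert_eq!(find_first_at_or_after(&datetimes, 600), None);
}
```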

This was one of the original insights that motivated writing a custom BlockReader and its attendant classes; very large log files read over slow communication links could be searched very quickly for log messages within some requested period of time.

underlying BlockReader and drops

The underlying BlockReader does not know which stage is currently in progress (as it should not; stages are only a concept of the SyslogProcessor). So the BlockReader does not know when it should drop any held blocks. Nor could the BlockReader decide to drop its own blocks, because the user of the BlockReader, the LineReader, holds BlockP pointers to the Blocks held by the BlockReader. In other words, even if the BlockReader decided to be clever and drop some of the Blocks it is storing, there would still be pointers to those Blocks (an Arc<Vec<u8>>) held by the LineReader instance. And the BlockReader should not try to get into signaling the LineReader to drop those BlockP pointers (a BlockReader should know nothing about a LineReader).
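
A minimal sketch of that pointer problem (the type aliases mirror the Arc<Vec<u8>> mentioned above; the map is a stand-in, not the project's actual storage): removing the BlockReader's entry does not free the Block while another holder keeps a cloned Arc.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

type Block = Vec<u8>;
type BlockP = Arc<Block>;

fn main() {
    // the "BlockReader" storage
    let mut blocks: BTreeMap<u64, BlockP> = BTreeMap::new();
    blocks.insert(0, Arc::new(vec![0u8; 4096]));

    // a consumer (playing the LineReader role) keeps its own pointer
    let held_elsewhere: BlockP = Arc::clone(blocks.get(&0).unwrap());

    // the "BlockReader" drops its entry, but the 4096 bytes stay allocated
    // because one strong reference remains
    blocks.remove(&0);
    assert_eq!(Arc::strong_count(&held_elsewhere), 1);
}
```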

Signaling the BlockReader to drop data is currently initiated by the driver function in bin.rs, exec_syslogprocessor. The calls to SyslogProcessor::drop_data_try are made during the log printing stage Stage3StreamSyslines.

review of current implementation

To review: during the binary search of find_sysline_at_datetime_filter, done early on during stage Stage2FindDt (again, only needed if the user passes --dt-after), there are no attempts to tell the BlockReader to drop old data. During this stage, it is possible for an entire .xz file or an entire .gz file to be held in memory if, for example, the user passes a --dt-after datetime value that is after all log messages in the decompressed log data. Or, if a user passes a --dt-after datetime value that is after the first sysline of the file, then the binary search of find_sysline_at_datetime_filter will jump to the middle of the file.

Idea: BlockReader only holds N blocks at a time

So the next idea would be to modify the BlockReader so that it only ever holds at most N Blocks at a time.
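
A hypothetical sketch of that idea (the names and eviction policy are illustrative, not the project's API): cap the number of stored Blocks and evict the oldest offset when the cap is exceeded.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

type BlockOffset = u64;
type BlockP = Arc<Vec<u8>>;

const MAX_BLOCKS: usize = 4;

/// Store a Block, then evict the lowest (oldest) offsets beyond the cap.
fn store_block(blocks: &mut BTreeMap<BlockOffset, BlockP>, offset: BlockOffset, block: BlockP) {
    blocks.insert(offset, block);
    while blocks.len() > MAX_BLOCKS {
        // the evicted Vec<u8> is only freed if no other BlockP clones are
        // outstanding -- the problem noted below
        let oldest = *blocks.keys().next().unwrap();
        blocks.remove(&oldest);
    }
}

fn main() {
    let mut blocks: BTreeMap<BlockOffset, BlockP> = BTreeMap::new();
    for offset in 0..8u64 {
        store_block(&mut blocks, offset, Arc::new(vec![0u8; 4096]));
    }
    // only the newest MAX_BLOCKS offsets remain
    assert_eq!(blocks.len(), MAX_BLOCKS);
}
```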

Problem

But that's difficult because, remember, the controlling LineReader holds its own copies of pointers to those Blocks as well (BlockP instances), so the underlying Vec<u8> wouldn't drop as there are outstanding pointers to it. Additionally, it would be a bad design to have the BlockReader call back to the LineReader (now the BlockReader needs to understand what a LineReader is, and that gets too messy).

Idea: BlockReader proactively drops blocks when handling xz files

Another idea is to modify BlockReader::read_block_FileXz such that it proactively drops blocks that are not needed. Function read_block_FileXz does a linear search, storing Blocks as they are decompressed, and passes back a single BlockP for the requested Block at a given BlockOffset. So the BlockReader could quietly drop "old" Blocks of data as new Blocks are decompressed during the read_block_FileXz function.

For example, given a call to read_block_FileXz(9) to read the tenth Block of data, that requires decompressing all prior Blocks of data (as the xz algorithm can only linearly stream compression/decompression). The function read_block_FileXz could be modified such that after Block two is read then Block one is dropped, and so on, up to the tenth Block (so the prior nine decompressed Blocks would have been dropped).
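
A hypothetical sketch of that modification (decompress_next_block is a stand-in closure, not a real API): while linearly decompressing toward the requested Block, only the most recent Block is retained.

```rust
use std::sync::Arc;

type BlockP = Arc<Vec<u8>>;

/// Linearly "decompress" toward the requested block, holding only one block.
fn read_block_linear(requested: usize, decompress_next_block: &mut dyn FnMut() -> BlockP) -> BlockP {
    let mut current: BlockP = decompress_next_block(); // block 0
    for _ in 1..=requested {
        // the previous block is dropped when `current` is overwritten,
        // so at most one decompressed block is held at a time
        current = decompress_next_block();
    }
    current
}

fn main() {
    // stand-in "decompression": each call produces the next 4 KiB block
    let mut next = 0u8;
    let mut decompress_next_block = || {
        next = next.wrapping_add(1);
        Arc::new(vec![next; 4096])
    };
    let block = read_block_linear(9, &mut decompress_next_block);
    assert_eq!(block[0], 10); // the tenth block; the nine before it were dropped
}
```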

Problem

But then the problem is that another call from the overlying SyslineReader within the binary search function find_sysline_at_datetime_filter may later call read_block_FileXz(5). If Blocks one through six were dropped previously, then the entire file must be read and decompressed again (reading the sixth Block requires reading the prior five Blocks). So the initial binary search with O(log(n)) time now becomes O(n*log(n)) time.

Idea: .xz files are only searched linearly

If the binary search during stage Stage2FindDt within SyslineReader::find_sysline_at_datetime_filter is instead a linear search, then the underlying BlockReader could be modified to only keep N Blocks at a time. This would satisfy the .xz decompression need for streaming the data (reading all preceding data linearly). The same could be done for .gz files.