Open jtmoon79 opened 10 months ago
## `SyslineReader::find_sysline_at_datetime_filter` binary search

`SyslogProcessor::find_sysline_between_datetime_filters` calls `SyslineReader::find_sysline_at_datetime_filter`, which does a binary search over the entire file during the "find datetime" stage (`Stage2FindDt`). This stage occurs before any log messages are printed to the console (that happens during stage `Stage3StreamSyslines`).

The intention here is that when a user passes a `--dt-after` value, the first qualifying sysline can be found quickly by skipping the reading and processing of unnecessary data (any syslines before the `--dt-after` datetime).

This was one of the original insights that motivated writing a custom `BlockReader` and its attendant classes; very large log files read over slow communication links could be searched very quickly for log messages within some requested period of time.
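The skip-ahead can be sketched as a binary search over syslines ordered by datetime. This is a simplified model (a flat slice of `(offset, epoch-seconds)` pairs and a hypothetical function signature), not the actual `SyslineReader` API:

```rust
// Simplified model of the Stage2FindDt skip-ahead: given syslines in file
// order, binary-search for the first sysline whose datetime is at or after
// the `--dt-after` filter, without reading earlier syslines.
// (Hypothetical types; the real code works with `BlockOffset`s and `SyslineP`s.)

/// A sysline reduced to (file offset, datetime as epoch seconds).
type Sysline = (usize, i64);

/// Return the index of the first sysline with datetime >= `dt_after`,
/// or `None` if every sysline is before the filter.
fn find_sysline_at_datetime_filter(syslines: &[Sysline], dt_after: i64) -> Option<usize> {
    // `partition_point` is a binary search: O(log n) comparisons, so only
    // O(log n) syslines (and their underlying blocks) need to be read.
    let i = syslines.partition_point(|&(_offset, dt)| dt < dt_after);
    if i < syslines.len() { Some(i) } else { None }
}

fn main() {
    let syslines: Vec<Sysline> = vec![(0, 100), (50, 200), (120, 300), (200, 400)];
    assert_eq!(find_sysline_at_datetime_filter(&syslines, 250), Some(2));
    assert_eq!(find_sysline_at_datetime_filter(&syslines, 999), None);
}
```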
## `BlockReader` and drops

The underlying `BlockReader` does not know which stage is currently in progress (as it should not; stages are only a concept of the `SyslogProcessor`). So the `BlockReader` does not know when it should drop any held blocks. Nor could the `BlockReader` decide to drop its own blocks, because the user of the `BlockReader`, the `LineReader`, holds `BlockP` pointers to the `Block`s held by the `BlockReader`. In other words, even if the `BlockReader` decided to be clever and drop some of the `Block`s it is storing, there would still be pointers to those `Block`s (an `Arc<Vec<u8>>`) held by the `LineReader` instance. And the `BlockReader` should not try to get into signaling the `LineReader` to drop those `BlockP` pointers (a `BlockReader` should know nothing about a `LineReader`).
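The shared-ownership problem is exactly `Arc` reference counting. A minimal demonstration, using a plain `Arc<Vec<u8>>` standing in for `BlockP`:

```rust
use std::sync::Arc;

// Why the BlockReader cannot unilaterally free a Block: a `BlockP` is an
// `Arc<Vec<u8>>`, so dropping the BlockReader's copy only decrements the
// reference count; the allocation lives on while the LineReader still
// holds its clone.
fn main() {
    let block: Arc<Vec<u8>> = Arc::new(vec![0u8; 4096]); // BlockReader's copy
    let line_reader_copy = Arc::clone(&block);           // LineReader's BlockP

    drop(block); // BlockReader "drops" the Block...
    // ...but the 4096 bytes are still allocated, owned by the other pointer:
    assert_eq!(Arc::strong_count(&line_reader_copy), 1);
    assert_eq!(line_reader_copy.len(), 4096);
}
```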
Signaling the `BlockReader` to drop data is currently initiated by the driver function in `bin.rs`, `exec_syslogprocessor`. The calls to `SyslogProcessor::drop_data_try` are made during the log printing stage, `Stage3StreamSyslines`.

To review: during the binary search of `find_sysline_at_datetime_filter`, done early on during stage `Stage2FindDt` (again, only needed if the user passes `--dt-after`), there are no attempts to tell the `BlockReader` to drop old data.
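The stage gating described above can be modeled minimally (hypothetical `Stage` enum and `drop_allowed` helper; the real driver is `exec_syslogprocessor` calling `SyslogProcessor::drop_data_try`):

```rust
// Toy model of the driver's behavior: data is only dropped during the
// streaming stage, never during the Stage2FindDt binary search, which is
// exactly why that search can accumulate the whole file in memory.
#[derive(PartialEq)]
enum Stage {
    Stage2FindDt,
    Stage3StreamSyslines,
}

/// Whether the driver would attempt `drop_data_try` in this stage.
fn drop_allowed(stage: &Stage) -> bool {
    *stage == Stage::Stage3StreamSyslines
}

fn main() {
    assert!(!drop_allowed(&Stage::Stage2FindDt)); // no drops while searching
    assert!(drop_allowed(&Stage::Stage3StreamSyslines)); // drops while printing
}
```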
And during this stage, it is possible for an entire `.xz` file or an entire `.gz` file to be held in memory if, for example, the user passes a `--dt-after` datetime value that is after all log messages in the decompressed log data. Or, if a user passes a `--dt-after` datetime value that is after the first sysline of the file, then the binary search of `find_sysline_at_datetime_filter` will jump to the middle of the file.
So the next idea would be to modify the `BlockReader` so that it only ever holds some N amount of `Block`s. But that's difficult because, remember, the controlling `LineReader` holds its own copies of pointers to those `Block`s as well (`BlockP` instances), so the underlying `Vec<u8>` wouldn't drop as there are outstanding pointers to it. Additionally, it would be a bad design to have the `BlockReader` call back to the `LineReader` (now the `BlockReader` needs to understand what a `LineReader` is, and that gets too messy).
## `xz` files

Another idea is to modify `BlockReader::read_block_FileXz` such that it proactively drops blocks that are no longer needed. Function `read_block_FileXz` does a linear search, storing `Block`s as they are decompressed, and passes back a single `BlockP` for the requested `Block` at a `BlockOffset`. So the `BlockReader` could quietly drop "old" `Block`s of data as new `Block`s are decompressed during the `read_block_FileXz` function.
For example, given a call to `read_block_FileXz(9)` to read the tenth `Block` of data, all prior `Block`s of data must first be decompressed (the xz algorithm can only linearly stream compression/decompression). The function `read_block_FileXz` could be modified such that after `Block` two is read, `Block` one is dropped, and so on, up to the tenth `Block` (so the prior nine decompressed `Block`s would have been dropped).

But then the problem is that another call from the overlying `SyslineProcessor`, within the binary search function `find_sysline_at_datetime_filter`, may later call `read_block_FileXz(5)`. If `Block`s one through six were dropped previously, then the entire file must be read and decompressed again (reading the sixth `Block` requires reading the prior five `Block`s). So the initial binary search with O(log(n)) time now becomes O(n*log(n)) time.
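The cost blow-up can be sketched with a toy model (a hypothetical `StreamingReader`, not the real `read_block_FileXz`): a drop-behind streaming decoder must restart from block 0 whenever the binary search jumps backward, so the total number of blocks decompressed grows well beyond n.

```rust
// Toy model of the cost blow-up: a streaming decoder that keeps only the
// most recently decompressed block must restart from block 0 whenever an
// earlier (already-dropped) block is requested.
struct StreamingReader {
    next: usize,    // next block the linear stream would produce
    decodes: usize, // total blocks decompressed; our cost metric
}

impl StreamingReader {
    fn new() -> Self {
        StreamingReader { next: 0, decodes: 0 }
    }

    /// Decompress up to and including block `b`, dropping blocks behind us.
    fn read_block(&mut self, b: usize) {
        if b < self.next {
            self.next = 0; // block was dropped: restart the whole stream
        }
        self.decodes += b + 1 - self.next;
        self.next = b + 1;
    }
}

/// Total blocks decompressed by a binary-search access pattern over `n` blocks.
fn binary_search_cost(n: usize) -> usize {
    let mut r = StreamingReader::new();
    let (mut lo, mut hi) = (0, n);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        r.read_block(mid);
        if mid < n / 3 { lo = mid + 1 } else { hi = mid } // arbitrary target
    }
    r.decodes
}

fn main() {
    // With a full cache, the search would decompress at most n blocks once;
    // with drop-behind, every backward jump re-decompresses from block 0.
    let n = 1024;
    assert!(binary_search_cost(n) > 2 * n);
}
```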
If the binary search during stage `Stage2FindDt` within `SyslineProcessor::find_sysline_at_datetime_filter` were instead a linear search, then the underlying `BlockReader` could be modified to only keep N `Block`s at a time. This would satisfy the `.xz` decompression need for streaming the data (reading all preceding data linearly). The same could be done for `.gz` files.
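A minimal sketch of that N-block ceiling under a linear scan, assuming a hypothetical `BoundedBlockReader` (the real change would also have to account for `BlockP` clones held by the `LineReader`, which keep popped blocks alive until they too are dropped):

```rust
use std::collections::VecDeque;
use std::sync::Arc;

// Sketch of a BlockReader that keeps at most N blocks during a *linear*
// scan: as each new block is decompressed, the oldest held block is
// released. (Hypothetical simplification of the proposed change.)
struct BoundedBlockReader {
    held: VecDeque<(usize, Arc<Vec<u8>>)>, // (block offset, block data)
    max_held: usize,
}

impl BoundedBlockReader {
    fn new(max_held: usize) -> Self {
        BoundedBlockReader { held: VecDeque::new(), max_held }
    }

    /// Store a freshly "decompressed" block and enforce the N-block ceiling.
    fn push_block(&mut self, offset: usize, data: Vec<u8>) -> Arc<Vec<u8>> {
        let block = Arc::new(data);
        self.held.push_back((offset, Arc::clone(&block)));
        while self.held.len() > self.max_held {
            self.held.pop_front(); // drop the oldest block
        }
        block
    }

    fn held_count(&self) -> usize {
        self.held.len()
    }
}

fn main() {
    let mut reader = BoundedBlockReader::new(4);
    for offset in 0..100 {
        reader.push_block(offset, vec![0u8; 512]);
    }
    // Memory stays bounded at N blocks no matter the file size.
    assert_eq!(reader.held_count(), 4);
}
```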
## Summary

This is a "meta issue" linking to specific issues around a chronic problem.

### Current behavior

Currently, some files are entirely read into memory. When dealing with large files and/or many files, the process can use too much memory and possibly crash due to Out Of Memory (OOM) errors.

Reading from a large number of archived files was the original motivating use-case for this project, i.e. many "tars of logs" from related systems. It's too bad this use-case can cause OOMs.

Specifically, these circumstances read much or all of a file into memory at one time (memory use is O(n)):
#### XZ

`.xz` files must be entirely read into memory during `BlockReader::new`. This is due to `lzma_rs::xz_decompress` reading the entire file in one function call. There is no API that reads chunks of data for some requested amount of bytes.

Noted in https://github.com/gendx/lzma-rs/issues/110
#### BZ2

The entire `.bz2` file is decompressed to get its uncompressed size before processing.

See #300
#### GZ

Fixed in 02261be4a6779a73ee08a3075bccb5effc31818f. See https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.7.73/src/readers/blockreader.rs#L2861-L2884

~~`.gz` files must be entirely read into memory up to the requested offset. This may be a problem when a user passes a datetime filter `--dt-after` for a large file. The searching algorithm for plain text files is a binary search algorithm. That searching algorithm is used for the decompressed data. The search for the first datetime after that `--dt-after` value may require reading from disk at least half of the uncompressed file (if the first datetime in block 0 is before the passed `--dt-after` value). However, before that search among the decompressed data can occur, all data prior to the requested file offset must be read (and is held in memory). This is due to how gzip compresses data as a stream. In other words, you cannot decompress a block of compressed data at an arbitrary offset without first decompressing all preceding blocks of compressed data in the "gzip data stream".~~

~~If a user does not pass `--dt-after` then gzip decompressed data is read from block offset 0 and then, as further blocks are read, old blocks are dropped (so memory usage does not scale with the size of the file).~~
#### TAR

A file contained within a `.tar` file must be entirely read into memory during the first call to `BlockReader::read_block_FileTar`, e.g. the entire file `syslog` from `logs.tar`, but not the entire file `logs.tar`.

See #13
#### EVTX

`.evtx` files must be entirely read into memory due to "out of order" events.

Relates to #86
### Suggested behavior

Have some O(1) ceiling on memory usage in all cases.

Relates to #14