Current Behavior

This project's original design used a binary search algorithm to quickly find log messages within a user-passed datetime window. The main benefit of this complication was avoiding unnecessary disk reads: instead of reading the entire file from disk, only a subset of Blocks was read (defaulting to 64 KB-sized Blocks). This supported the author's originating scenario, where very large log files (GB-sized) were read over a low-bandwidth, high-latency network connection from an overburdened SMB share.

This binary search design affects the structs SyslogProcessor, SyslineReader, and BlockReader.
For a file that only supports sequential reads (e.g. a compressed file, which must be decompressed from its beginning), this creates a problem when performing a binary search: any "jump backwards" must either
A. read cached data Blocks, or
B. re-read the file starting from the beginning.
Option A implies that nearly the entire file may be kept in memory at one time (a problem when dealing with, for example, multiple GB-sized log files, which was the situation in the author's originating scenario; see Issue #182).
Option B implies an O(log2(n) * n) read time instead of an O(log2(n)) (binary search) or O(n) (linear search) read time, n being the size of the file.
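The cost of option B can be made concrete with a small cost-model sketch (an illustration, not the project's code; the cost model is an assumption): each probe of block `mid` is charged `mid + 1` block reads, because a sequential-only file must be re-read from its start to reach that block.

```rust
/// Total blocks read by a binary search for `target` among `nblocks` blocks,
/// when every probe at block `mid` forces a re-read of blocks 0..=mid
/// (option B above: no caching, sequential-only source).
fn binary_search_blocks_read(nblocks: usize, target: usize) -> usize {
    let (mut lo, mut hi) = (0usize, nblocks);
    let mut total = 0usize;
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        total += mid + 1; // re-read blocks 0..=mid from the file beginning
        if mid < target {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    total
}

/// Blocks read by a linear scan to the same target block: just target + 1.
fn linear_blocks_read(target: usize) -> usize {
    target + 1
}

fn main() {
    let nblocks = 1 << 16; // e.g. a 4 GB file in 64 KB blocks
    let target = 1000;
    let binary = binary_search_blocks_read(nblocks, target);
    let linear = linear_blocks_read(target);
    // The first probe alone re-reads half the file, so for an early target
    // the "fast" binary search reads far more than a plain linear scan.
    assert!(binary > linear);
    println!("binary search re-reads {binary} blocks; linear scan reads {linear}");
}
```

Under this model, a single backward jump already costs more than the linear scan it was meant to avoid, which is why the choice of search strategy must depend on whether the file supports random seeks.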
Suggested behavior
Some entity should decide if a file can be read with "random seeks" or "sequential seeks".
The overlying users of BlockReader (SyslogProcessor and its SyslineReader) must know the datetime search strategy to use, so some entity must decide on the "search mode" ahead of time, most likely a function in filepreprocessor.rs.
The entities that search for the datetime nearest the beginning of the user-passed datetime window would then apply the appropriate strategy: "sequential read mode" implies a linear search from the file beginning, while "random read mode" implies a binary search.
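A minimal Rust sketch of this decision and dispatch (all names here — ReadMode, read_mode_for_path, find_first_in_window — are hypothetical illustrations, not the project's API; deciding by file extension is likewise an assumption):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum ReadMode {
    /// Plain files support backward seeks, so a binary search is usable.
    Random,
    /// Compressed/streamed files must be read front-to-back: linear search.
    Sequential,
}

/// Decide the mode ahead of time, e.g. in a filepreprocessor.rs-style helper.
fn read_mode_for_path(path: &str) -> ReadMode {
    let sequential_only = [".gz", ".xz", ".bz2", ".zst", ".tar"];
    if sequential_only.iter().any(|ext| path.ends_with(ext)) {
        ReadMode::Sequential
    } else {
        ReadMode::Random
    }
}

/// Find the first index whose timestamp is >= `after` (the start of the
/// user-passed datetime window), using the strategy the mode allows.
fn find_first_in_window(timestamps: &[i64], after: i64, mode: ReadMode) -> Option<usize> {
    match mode {
        // Random seeks allowed: binary search over the sorted timestamps.
        ReadMode::Random => {
            let i = timestamps.partition_point(|&t| t < after);
            (i < timestamps.len()).then_some(i)
        }
        // Sequential only: linear scan from the beginning.
        ReadMode::Sequential => timestamps.iter().position(|&t| t >= after),
    }
}

fn main() {
    let ts = [10, 20, 30, 40];
    assert_eq!(read_mode_for_path("/var/log/syslog.gz"), ReadMode::Sequential);
    // Both strategies find the same sysline; only the read pattern differs.
    assert_eq!(
        find_first_in_window(&ts, 25, ReadMode::Random),
        find_first_in_window(&ts, 25, ReadMode::Sequential),
    );
    println!("first in window: {:?}", find_first_in_window(&ts, 25, ReadMode::Random));
}
```

The point of deciding the mode up front is that both strategies return the same answer; the mode only constrains which read pattern is legal for the underlying file.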
Other
Given a sequential read mode, this is one step toward fixing the problem highlighted by Issue #12, as the file size of the uncompressed file would not need to be known. If the file size does not need to be known then this is one step toward implementing Chained Block Reads (Issue #14).
Also, reading in a "forward only" linear read mode with an incremental approach will be necessary for Issue #7.
Also, a sequential/linear read mode means Blocks can be progressively dropped as the file is processed. This would alleviate the high memory usage seen with compressed files (Issue #182).
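The progressive-dropping idea can be sketched as follows (an assumed design for illustration, not the project's BlockReader; the type and method names are hypothetical): once reads are forward-only, every Block behind the read cursor can be freed, bounding resident memory to a constant number of Blocks.

```rust
use std::collections::BTreeMap;

const BLOCK_SZ: usize = 64 * 1024; // the 64 KB default Block size

struct SequentialBlocks {
    blocks: BTreeMap<usize, Vec<u8>>, // block index -> block data
}

impl SequentialBlocks {
    fn new() -> Self {
        Self { blocks: BTreeMap::new() }
    }

    fn insert(&mut self, index: usize, data: Vec<u8>) {
        self.blocks.insert(index, data);
    }

    /// Drop every Block strictly before `index`. Safe only in a forward-only
    /// read mode, because dropped Blocks can never be revisited.
    fn drop_before(&mut self, index: usize) {
        self.blocks = self.blocks.split_off(&index);
    }

    fn resident_bytes(&self) -> usize {
        self.blocks.values().map(Vec::len).sum()
    }
}

fn main() {
    let mut reader = SequentialBlocks::new();
    // Stream 100 Blocks (~6.4 MB of file), keeping only the current and
    // previous Block resident at any time.
    for i in 0..100 {
        reader.insert(i, vec![0u8; BLOCK_SZ]);
        if i >= 2 {
            reader.drop_before(i - 1);
        }
    }
    assert_eq!(reader.resident_bytes(), 2 * BLOCK_SZ);
    println!("resident after streaming: {} bytes", reader.resident_bytes());
}
```

With a binary search over the same file, no such bound exists: a backward jump may need any earlier Block, so nothing can be dropped safely.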
Note that having random file seeks (seeks that may go backwards) does not dovetail with compressed files; a compressed file must always be read from its beginning up to the requested file offset (see https://github.com/jtmoon79/super-speedy-syslog-searcher/issues/12#issuecomment-2016681186).
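This constraint can be shown with a toy model of a decompressing reader (an illustration; `RestartableStream` is hypothetical, not a real API): the stream is forward-only, so a backward "seek" degenerates into restarting decompression from byte 0.

```rust
/// Stands in for a decompressing reader: forward-only, no real Seek.
struct RestartableStream {
    pos: usize,        // current decoded offset in the stream
    bytes_read: usize, // cumulative bytes decoded (total work done)
}

impl RestartableStream {
    fn new() -> Self {
        Self { pos: 0, bytes_read: 0 }
    }

    /// Position the stream at `offset`. A forward move decodes only the gap;
    /// a backward move must restart decoding from the beginning of the file.
    fn seek_to(&mut self, offset: usize) {
        if offset < self.pos {
            self.pos = 0; // cannot seek backwards in compressed data: restart
        }
        self.bytes_read += offset - self.pos;
        self.pos = offset;
    }
}

fn main() {
    let mut s = RestartableStream::new();
    s.seek_to(1_000_000); // forward: decode 1,000,000 bytes
    s.seek_to(10);        // backward: restart, decode 10 more bytes
    assert_eq!(s.bytes_read, 1_000_010);
    println!("decoded {} bytes just to reach offset 10", s.bytes_read);
}
```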
This search strategy difference is also hinted at in various code comments; see syslogprocessor.rs#L822-L828 and syslinereader.rs#L2474-L2520.