jtmoon79 / super-speedy-syslog-searcher

Speedily search and merge log messages by datetime
MIT License

refactor datetime searching and file processing to support "forward seek" mode or "random seek" mode #283

Closed · jtmoon79 closed 5 months ago

jtmoon79 commented 6 months ago

Current Behavior

This project's original design used a binary search algorithm to quickly find log messages within a user-passed datetime window. The main benefit of this complication was avoiding unnecessary disk reads: instead of reading the entire file from disk, only a subset of Blocks (64KB-sized by default) were read. This supported the author's originating scenario, where very large log files (GB-sized) were read over a low-bandwidth, high-latency network connection from an overburdened SMB share. This binary search design affects structs SyslogProcessor, SyslineReader, and BlockReader.
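As a rough illustration (not the project's actual API), a binary search over fixed-size Blocks might look like the following, where a pre-sampled list of per-Block earliest timestamps stands in for reading and parsing each Block from disk on demand:

```rust
// Hedged sketch: binary search over fixed-size Blocks to locate the first
// Block whose earliest log-message timestamp falls at or after the window
// start. `block_timestamps` stands in for sampling a Block from disk; the
// real BlockReader reads 64KB Blocks on demand. Names are illustrative.

const BLOCK_SIZE: u64 = 64 * 1024;

/// Return the index of the first Block whose earliest timestamp is at or
/// after `window_start`, or None if every Block precedes the window.
fn find_start_block(block_timestamps: &[u64], window_start: u64) -> Option<usize> {
    let (mut lo, mut hi) = (0usize, block_timestamps.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if block_timestamps[mid] < window_start {
            lo = mid + 1; // the window starts after this Block's first message
        } else {
            hi = mid;
        }
    }
    if lo < block_timestamps.len() { Some(lo) } else { None }
}

/// Byte offset at which reading would begin for a given Block index.
fn block_offset(index: usize) -> u64 {
    index as u64 * BLOCK_SIZE
}
```

Only O(log n) Blocks need to be sampled this way, which is what keeps disk reads low for plain (seekable) files.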

However...

Random file seeks (seeks that may go backwards) do not dovetail with compressed files: a compressed file must always be read from the beginning up to the requested file offset (see https://github.com/jtmoon79/super-speedy-syslog-searcher/issues/12#issuecomment-2016681186).

This creates a problem when performing a binary search: any "jump backwards" must either re-read and decompress the file from its beginning up to the target offset, or retain all previously decompressed Blocks in memory.
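A minimal sketch of why backward jumps are costly: a decompressor typically exposes only `Read`, not `Seek`, so reaching any offset means consuming every byte before it. The function name and the 8KB scratch buffer here are illustrative, not the project's code:

```rust
use std::io::{self, Read};

/// Advance a non-seekable stream (e.g. a decompressor's Read adapter) to
/// `target` bytes from the start by consuming and discarding bytes.
/// A compressed stream offers no Seek; the only way to reach an offset is
/// to decompress everything before it. Jumping *backwards* would mean
/// reopening the file and repeating this from offset zero.
fn skip_to_offset<R: Read>(reader: &mut R, target: u64) -> io::Result<u64> {
    let mut remaining = target;
    let mut buf = [0u8; 8192];
    while remaining > 0 {
        let want = remaining.min(buf.len() as u64) as usize;
        let n = reader.read(&mut buf[..want])?;
        if n == 0 {
            break; // stream ended before reaching the target offset
        }
        remaining -= n as u64;
    }
    Ok(target - remaining) // bytes actually skipped
}
```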

Suggested behavior

Some entity should decide whether a file can be read with "random seeks" or only "sequential seeks". The users of BlockReader (SyslogProcessor and its SyslineReader) must know which datetime-search strategy to use, so some entity must decide on the "search mode" ahead of time, most likely a function in filepreprocessor.rs.

Then the entities that search for the datetime nearest the beginning of the user-passed datetime window would apply the appropriate strategy: "sequential read mode" implies a linear search from the file's beginning, while "random read mode" implies a binary search.
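One way such a decision could be sketched is a mode chosen once during file preprocessing and then consulted by the datetime search. The enum, function, and extension list below are hypothetical, not the project's actual code:

```rust
use std::path::Path;

/// Hypothetical: the issue only says "some entity", likely a function in
/// filepreprocessor.rs, must pick the mode before processing begins.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SeekMode {
    /// Plain files: Seek is cheap, so binary-search the datetime window.
    Random,
    /// Compressed files: only forward reads, so linear-search from offset 0.
    Sequential,
}

/// Decide the seek mode from the file name, as a file-preprocessing step
/// might. The extensions listed are illustrative.
fn choose_seek_mode(path: &Path) -> SeekMode {
    match path.extension().and_then(|e| e.to_str()) {
        Some("gz") | Some("xz") | Some("bz2") | Some("lz4") => SeekMode::Sequential,
        _ => SeekMode::Random,
    }
}
```

Deciding once, up front, keeps SyslogProcessor and SyslineReader free of per-read branching; they simply run the search algorithm their mode dictates.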

Other

Given a sequential read mode, this is one step toward fixing the problem highlighted by Issue #12, as the file size of the uncompressed file would not need to be known. If the file size does not need to be known then this is one step toward implementing Chained Block Reads (Issue #14).

Also, a "forward only" linear read mode with an incremental approach will be necessary for Issue #7.

Also, a sequential/linear read mode means Blocks can be progressively dropped as the file is processed. This would improve the problem of high memory usage for compressed files (Issue #182).
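A sketch of progressive Block dropping under sequential reads, assuming Blocks live in a map keyed by block index (illustrative names, not the actual BlockReader internals):

```rust
use std::collections::BTreeMap;

/// Sketch of progressive Block dropping under a sequential read mode.
/// Once processing has advanced past a Block, forward-only reads mean it
/// can never be requested again, so its memory can be freed immediately.
struct BlockStore {
    blocks: BTreeMap<u64, Vec<u8>>,
}

impl BlockStore {
    fn new() -> Self {
        BlockStore { blocks: BTreeMap::new() }
    }

    fn insert(&mut self, index: u64, data: Vec<u8>) {
        self.blocks.insert(index, data);
    }

    /// Drop every Block strictly below `index`. Returns how many Blocks
    /// were dropped.
    fn drop_before(&mut self, index: u64) -> usize {
        let keep = self.blocks.split_off(&index); // keys >= index survive
        let dropped = self.blocks.len();
        self.blocks = keep;
        dropped
    }

    fn len(&self) -> usize {
        self.blocks.len()
    }
}
```

Under this scheme memory usage stays bounded by the read-ahead window rather than growing with the decompressed file size, which is the heart of the Issue #182 problem.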

This search-strategy difference is also hinted at in various code comments; see syslogprocessor.rs#L822-L828 and syslinereader.rs#L2474-L2520.