Open jtmoon79 opened 1 year ago
Similar to Issue #13
The current code: https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.0.32/src/readers/blockreader.rs#L932-L943
It uses https://github.com/gendx/lzma-rs/releases/tag/v0.2.0
The problem is due to the lzma-rs crate not providing the uncompressed file size. But the uncompressed file size must be known before BlockReader::new returns.
The xz format description reads
Uncompressed Size This field is present only if the appropriate bit is set in the Block Flags field (see Section 3.1.2).
So a decent partial fix is to manually check if the uncompressed size is available, that is, check not using lzma-rs, but by jumping to different bit and byte offsets and processing the raw data.
The format has the caveat...
It should be noted that the only reliable way to determine the real uncompressed size is to uncompress the Block, because the Block Header and Index fields may contain (intentionally or unintentionally) invalid information.
In this sense, the current hacky implementation is guaranteed to be correct.
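A rough sketch of that manual check, for the first Block only. It assumes the layout from the xz format description: a 12-byte Stream Header, then a Block Header whose second byte is Block Flags, with bit 0x40 marking Compressed Size present and bit 0x80 marking Uncompressed Size present. The function names are hypothetical, CRC32 values are not verified, and per the caveat above the returned value is only a hint.

```rust
/// Magic bytes at the start of every xz Stream Header.
const XZ_MAGIC: [u8; 6] = [0xFD, b'7', b'z', b'X', b'Z', 0x00];
/// Stream Header length: 6 magic bytes + 2 Stream Flags bytes + 4 CRC32 bytes.
const STREAM_HEADER_LEN: usize = 12;

/// Decode an xz "multibyte integer": 7 bits per byte, least-significant
/// group first, high bit set on every byte except the last, 9 bytes at most.
/// Returns the value and the number of bytes consumed.
fn decode_multibyte(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in buf.iter().take(9).enumerate() {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            // A multi-byte encoding must not end with a zero byte.
            if byte == 0 && i > 0 {
                return None;
            }
            return Some((value, i + 1));
        }
    }
    None
}

/// Hypothetical helper: return the Uncompressed Size stored in the first
/// Block Header, if the file declares one. CRC32 fields are NOT checked.
fn first_block_uncompressed_size(data: &[u8]) -> Option<u64> {
    if data.len() < STREAM_HEADER_LEN + 2 || data[..6] != XZ_MAGIC {
        return None;
    }
    let block = &data[STREAM_HEADER_LEN..];
    // 0x00 here is the Index Indicator: the Stream has no Blocks.
    if block[0] == 0 {
        return None;
    }
    // Real header size is (encoded value + 1) * 4 bytes.
    let header_len = (usize::from(block[0]) + 1) * 4;
    let header = block.get(..header_len)?;
    let flags = header[1];
    let mut pos = 2;
    if flags & 0x40 != 0 {
        // Compressed Size is present; skip over it.
        let (_, used) = decode_multibyte(header.get(pos..)?)?;
        pos += used;
    }
    if flags & 0x80 != 0 {
        let (size, _) = decode_multibyte(header.get(pos..)?)?;
        return Some(size);
    }
    None
}
```

A `None` result means the size is not declared in the first Block Header (or the data is not xz), so the caller would still need the decompress-everything fallback. A file with several Blocks would need the same check per Block, or a read of the Index.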
Another way this issue manifests is reading too many blocks for files without syslines.
File eipp.log.xz has decompressed content like
Package: software-properties-common
Architecture: all
Version: 0.99.9.8
APT-ID: 71737
Status: installed
Depends: ca-certificates, gir1.2-glib-2.0, gir1.2-packagekitglib-1.0 (>= 1.1.0-2), packagekit, python-apt-common (>= 0.9), python3, python3-dbus, python3-gi, python3-requests-unixsocket, python3-software-properties (= 0.99.9.8), python3:any
Breaks: python-software-properties (<< 0.85), python3-software-properties (<< 0.85)
Package: liberror-perl
Architecture: all
Version: 0.17029-1
APT-ID: 2280
Multi-Arch: foreign
Status: installed
Depends: perl:any
Package: libpng16-16
Architecture: amd64
Version: 1.6.37-2
APT-ID: 3339
Multi-Arch: same
Status: installed
Depends: libc6 (>= 2.29), zlib1g (>= 1:1.2.11)
s4 reads all 4 blocks (after decompression) from this file.
• s4 /var/log/apt/eipp.log.xz -s
WARNING: no syslines found "/var/log/apt/eipp.log.xz"
Files:
File: /var/log/apt/eipp.log.xz (XZ) MimeGuess(["application/x-xz"])
Summary Printed:
bytes 0
lines 0
syslines 0
datetime first None Found
datetime last None Found
Summary Processed:
file size compressed 31592 (0x7B68) (bytes)
file size uncompressed 201425 (0x312D1) (bytes)
bytes 201425
bytes total 201425
block size 65535 (0xFFFF)
blocks 4
blocks total 4
blocks high 4
lines 2334
lines high 2334
syslines 0
syslines high 0
Notice Summary Processed: blocks 4.
For plain log files, the BlockZero analysis would stop processing after the zeroth block (first block) did not have any apparent syslines, e.g. Summary Processed: blocks 1.
For very large files, this is a lot of overhead for naught, and may cause problems where computer memory is constrained.
~Reading too many blocks increases likelihood of an errant match, e.g. a datetime string within some message that is mistakenly interpreted as a sysline.~ (should be fixed; only zeroth block is analyzed for datetime substrings).
lzma-rs feature "Expose a new rawdecoder API"
Update: see Issue #283
A good solution for this Issue and Issue #13 would be having a "sequential read mode" for SyslogProcessor that is also handed down to SyslineReader, to LineReader, and to BlockReader.
In "sequential read mode", there is no binary search for syslines, only reading the file from start to finish. This would allow "progressive" dropping of data at different points. The BlockReader would, during the search for datetime filter A, somehow know to drop Blocks from N - 2 ago... or something like that. Essentially, it's during the phase of finding the first syslog message acceptable to datetime filter A that Blocks would be dropped while searching (and Lines, Syslines).
This should be relatively clean to implement. There would be two paths for searching for the datetime filter A, binary and linear/sequential.
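A minimal sketch of the "drop Blocks from N - 2 ago" idea, using only the standard library. The type `SequentialBlocks` and its fields are hypothetical and are not part of the existing BlockReader; the sketch only shows a forward-only reader that keeps the most recent `keep` blocks and discards anything older.

```rust
use std::collections::VecDeque;
use std::io::{self, Read};

/// Hypothetical forward-only block reader that retains a sliding window
/// of the most recently read blocks.
struct SequentialBlocks<R: Read> {
    reader: R,
    /// Maximum bytes per block.
    block_sz: usize,
    /// How many of the most recent blocks to retain.
    keep: usize,
    /// Retained blocks as (block index, block data), oldest first.
    window: VecDeque<(u64, Vec<u8>)>,
    /// Index that the next block read will receive.
    next_index: u64,
}

impl<R: Read> SequentialBlocks<R> {
    fn new(reader: R, block_sz: usize, keep: usize) -> Self {
        SequentialBlocks {
            reader,
            block_sz,
            keep,
            window: VecDeque::new(),
            next_index: 0,
        }
    }

    /// Read the next block into the window and drop blocks that fall
    /// outside of it. Returns `Ok(false)` at end of input.
    fn read_next(&mut self) -> io::Result<bool> {
        let mut block = vec![0u8; self.block_sz];
        let mut filled = 0;
        // `read` may return fewer bytes than requested, so loop until the
        // block is full or the input is exhausted.
        while filled < block.len() {
            let n = self.reader.read(&mut block[filled..])?;
            if n == 0 {
                break;
            }
            filled += n;
        }
        if filled == 0 {
            return Ok(false);
        }
        block.truncate(filled);
        self.window.push_back((self.next_index, block));
        self.next_index += 1;
        // Progressive dropping: anything older than `keep` blocks goes away.
        while self.window.len() > self.keep {
            self.window.pop_front();
        }
        Ok(true)
    }
}
```

With `keep` set to 2, memory use stays near two blocks no matter how long the search for datetime filter A runs. Lines and Syslines that reference a dropped block would need the same treatment, which this sketch does not cover.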
...
except for this one complicating detail from my comment above:
The problem is due to the lzma-rs crate not providing the uncompressed file size. But the uncompressed file size must be known before BlockReader::new returns.
I should just grab that raw data myself. It would simplify stuff. From xz format definition 1.2.0
3.1.4. Uncompressed Size
The Uncompressed Size field contains the size of the Block after uncompressing. ... It should be noted that the only reliable way to determine the real uncompressed size is to uncompress the Block, because the Block Header and Index fields may contain (intentionally or unintentionally) invalid information.
Maybe just decompress the entire file once without saving it, to get the uncompressed size. Currently, the entire file is read once and saved during BlockReader::new.
This proposed implementation means the entire file is read twice, at most. However, the amount of runtime memory required would be a constant of the BlockSz, instead of at least the size of the uncompressed file. I think that's a smarter trade-off.
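A sketch of the "decompress once without saving" pass, using only the standard library. `CountingSink` and `count_decompressed_bytes` are hypothetical names. The sink discards every byte and only counts them, so what it holds in memory does not grow with the file.

```rust
use std::io::{self, Read, Write};

/// Hypothetical writer that throws away everything written to it and
/// only remembers how many bytes passed through.
#[derive(Default)]
struct CountingSink {
    count: u64,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.count += buf.len() as u64;
        // Report the whole buffer as consumed; nothing is stored.
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Drain any decoder that implements `Read` (for example a streaming xz
/// decoder) and return the number of decompressed bytes it produced.
fn count_decompressed_bytes<R: Read>(mut decoder: R) -> io::Result<u64> {
    let mut sink = CountingSink::default();
    io::copy(&mut decoder, &mut sink)?;
    Ok(sink.count)
}
```

With lzma-rs, which writes into a `Write` rather than exposing a `Read`, the sink would presumably be passed directly as the output argument of xz_decompress. Whether the decoder itself buffers the whole output internally is a separate question that this sketch does not answer.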
Also, I could delete one bullet point from the README.md
- Entire .xz files are read into memory before printing (https://github.com/jtmoon79/super-speedy-syslog-searcher/issues/12)
Cannot read the xz file in chunks/blocks. The crate lzma-rs does not provide API xz_decompress_with_options. See https://github.com/gendx/lzma-rs/issues/110
Consider https://docs.rs/xz2/latest/xz2/read/struct.XzDecoder.html for .xz. However, the problem remains of reading the entire file during an open.
Problem
An .xz file is entirely read during BlockReader::new. This may cause problems for very large compressed files (the s4 program will hold the entire uncompressed file in memory; it would use too much memory).
The crate lzma-rs does not provide API xz_decompress_with_options which would allow limiting the bytes returned per call. It only provides xz_decompress which decompresses the entire file in one call. See https://github.com/gendx/lzma-rs/issues/110
Solution
Read an .xz file per block request, as done for normal files.
Update: see Issue #283
Meta-Issue #182