jtmoon79 / super-speedy-syslog-searcher

Speedily search and merge log messages by datetime
MIT License

read `.xz` file by requested block #12

Open jtmoon79 opened 1 year ago

jtmoon79 commented 1 year ago

Problem

An `.xz` file is entirely read during `BlockReader::new`. This may cause problems for very large compressed files: the s4 program will hold the entire uncompressed file in memory, which would use too much memory.

The crate lzma-rs does not provide an API such as `xz_decompress_with_options` that would allow limiting the bytes returned per call. It only provides `xz_decompress`, which decompresses the entire file in one call. See https://github.com/gendx/lzma-rs/issues/110

Solution

Read an `.xz` file per block request, as is done for normal files.


Update: see Issue #283

Meta-Issue #182

jtmoon79 commented 1 year ago

Similar to Issue #13

jtmoon79 commented 1 year ago

The current code: https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.0.32/src/readers/blockreader.rs#L932-L943

It uses https://github.com/gendx/lzma-rs/releases/tag/v0.2.0

The problem is that the lzma-rs crate does not provide the uncompressed file size. But the uncompressed file size must be known before `BlockReader::new` returns.

The xz format description reads

Uncompressed Size This field is present only if the appropriate bit is set in the Block Flags field (see Section 3.1.2).

So a decent partial fix is to manually check if the uncompressed size is available; that is, check not using lzma-rs, but by jumping to specific bit and byte offsets and processing the raw data.
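A minimal sketch of that manual check, following the xz file format description (version 1.2.0): the Block Header's flags byte has a bit (0x80) indicating whether the optional Uncompressed Size field is stored, encoded as a variable-length integer. The function names here are illustrative, not s4's actual code, and the fabricated header bytes in `main` are for demonstration only.

```rust
/// Decode an xz variable-length integer: 7 bits per byte, least-significant
/// group first, high bit set means another byte follows (max 9 bytes).
fn decode_vli(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &b) in bytes.iter().enumerate().take(9) {
        value |= u64::from(b & 0x7F) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Given a buffer starting at the first Block Header (i.e. just past the
/// 12-byte Stream Header), return the Uncompressed Size if the header
/// stores it; `None` means only decompressing the Block reveals the size.
fn block_header_uncompressed_size(buf: &[u8]) -> Option<u64> {
    // buf[0] is the encoded Block Header Size; buf[1] is the Block Flags.
    let flags = *buf.get(1)?;
    let mut offset = 2;
    if flags & 0x40 != 0 {
        // Compressed Size field is present; skip over its VLI.
        let (_, n) = decode_vli(buf.get(offset..)?)?;
        offset += n;
    }
    if flags & 0x80 != 0 {
        let (size, _) = decode_vli(buf.get(offset..)?)?;
        return Some(size);
    }
    None
}

fn main() {
    // Fabricated header: size byte, flags 0x80 (Uncompressed Size present),
    // then 201425 (0x312D1) encoded as a VLI.
    let header = [0x02, 0x80, 0xD1, 0xA5, 0x0C];
    assert_eq!(block_header_uncompressed_size(&header), Some(201425));
    // Flags 0x00: neither size field is stored.
    assert_eq!(block_header_uncompressed_size(&[0x01, 0x00]), None);
    println!("ok");
}
```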


The format has the caveat...

It should be noted that the only reliable way to determine the real uncompressed size is to uncompress the Block, because the Block Header and Index fields may contain (intentionally or unintentionally) invalid information.

In this sense, the current hacky implementation is guaranteed to be correct.

jtmoon79 commented 1 year ago

Another way this issue manifests is reading too many blocks for files without syslines.

File eipp.log.xz has decompressed content like

Package: software-properties-common
Architecture: all
Version: 0.99.9.8
APT-ID: 71737
Status: installed
Depends: ca-certificates, gir1.2-glib-2.0, gir1.2-packagekitglib-1.0 (>= 1.1.0-2), packagekit, python-apt-common (>= 0.9), python3, python3-dbus, python3-gi, python3-requests-unixsocket, python3-software-properties (= 0.99.9.8), python3:any
Breaks: python-software-properties (<< 0.85), python3-software-properties (<< 0.85)

Package: liberror-perl
Architecture: all
Version: 0.17029-1
APT-ID: 2280
Multi-Arch: foreign
Status: installed
Depends: perl:any

Package: libpng16-16
Architecture: amd64
Version: 1.6.37-2
APT-ID: 3339
Multi-Arch: same
Status: installed
Depends: libc6 (>= 2.29), zlib1g (>= 1:1.2.11)

s4 reads all 4 blocks (after compression) from this file.

• s4 /var/log/apt/eipp.log.xz  -s
WARNING: no syslines found "/var/log/apt/eipp.log.xz"

Files:

File: /var/log/apt/eipp.log.xz (XZ) MimeGuess(["application/x-xz"])
  Summary Printed:
      bytes          0
      lines          0
      syslines       0
      datetime first None Found
      datetime last  None Found
  Summary Processed:
      file size compressed   31592 (0x7B68) (bytes)
      file size uncompressed 201425 (0x312D1) (bytes)
      bytes          201425
      bytes total    201425
      block size     65535 (0xFFFF)
      blocks         4
      blocks total   4
      blocks high    4
      lines          2334
      lines high     2334
      syslines       0
      syslines high  0

Notice Summary Processed: blocks 4.

For plain log files, the BlockZero analysis would stop processing after the zeroth block (first block) did not have any apparent syslines, e.g. Summary Processed: blocks 1.

For very large files, this is a lot of overhead for naught, and may cause problems where computer memory is constrained.

~~Reading too many blocks increases the likelihood of an errant match, e.g. a datetime string within some message that is mistakenly interpreted as a sysline.~~ (should be fixed; only the zeroth block is analyzed for datetime substrings).

jtmoon79 commented 4 months ago

Update: see Issue #283


A good solution for this Issue and Issue #13 would be having a "sequential read mode" for SyslogProcessor that is also handed down to SyslineReader, to LineReader, and to BlockReader.

In "sequential read mode" mode, there is no binary search for syslines, only reading the file from start to finish. This would allow "progressive" dropping of data at different points. The BlockReader would, during the search for datetime filter A, somehow know to drop Blocks from N - 2 ago... or something like that. Essentially, it's during the phase of finding the first syslog message acceptable to datetime filter A that Blocks would be dropped while searching (and Lines, Syslines).

This should be relatively clean to implement. There would be two paths for searching for the datetime filter A, binary and linear/sequential.
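The block-dropping idea above can be sketched roughly as follows. This is an assumption about how "sequential read mode" might look, not s4's actual `BlockReader`; the type names and the retention window of 2 blocks are illustrative.

```rust
use std::collections::VecDeque;

// Hypothetical sketch of "sequential read mode": while linearly scanning
// for the first sysline passing datetime filter A, keep only the most
// recent few Blocks resident and drop older ones (i.e. drop Blocks from
// N - 2 and earlier). Memory stays O(max_retained * BlockSz).
type Block = Vec<u8>;

struct SequentialBlockReader {
    retained: VecDeque<(u64, Block)>, // (block offset, decompressed data)
    max_retained: usize,
}

impl SequentialBlockReader {
    fn new(max_retained: usize) -> Self {
        Self { retained: VecDeque::new(), max_retained }
    }

    /// Accept the next decompressed block, evicting the oldest if the
    /// retention window is full.
    fn push_block(&mut self, offset: u64, data: Block) {
        if self.retained.len() == self.max_retained {
            self.retained.pop_front();
        }
        self.retained.push_back((offset, data));
    }
}

fn main() {
    let mut reader = SequentialBlockReader::new(2);
    for n in 0..4u64 {
        reader.push_block(n * 0xFFFF, vec![0u8; 16]);
    }
    // Only blocks 2 and 3 remain resident after the linear scan.
    let offsets: Vec<u64> = reader.retained.iter().map(|(o, _)| *o).collect();
    assert_eq!(offsets, vec![2 * 0xFFFF, 3 * 0xFFFF]);
    println!("ok");
}
```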

...

except for this one complicating detail from my comment above:

The problem is that the lzma-rs crate does not provide the uncompressed file size. But the uncompressed file size must be known before `BlockReader::new` returns.

I should just grab that raw data myself. It would simplify stuff. From xz format definition 1.2.0

3.1.4. Uncompressed Size

The Uncompressed Size field contains the size of the Block after uncompressing. ... It should be noted that the only reliable way to determine the real uncompressed size is to uncompress the Block, because the Block Header and Index fields may contain (intentionally or unintentionally) invalid information.

Maybe just decompress the entire file once without saving it, to get the uncompressed size. Currently, the entire file is read once and saved during `BlockReader::new`.

This proposed implementation means the entire file is read twice, at most. However, the amount of runtime memory required would be a constant multiple of the BlockSz, instead of at least the size of the uncompressed file. I think that's a smarter trade-off.
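One way to decompress without saving, as a sketch: feed the decompressor a `Write` sink that counts bytes and discards them. `CountingSink` is an illustrative name; with lzma-rs 0.2 it would be passed as the output of `lzma_rs::xz_decompress`. The example below exercises the sink with a plain write so it stays dependency-free.

```rust
use std::io::Write;

/// A Write sink that counts bytes and discards them, so the uncompressed
/// size can be learned without holding the decompressed data in memory.
struct CountingSink {
    bytes: u64,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.bytes += buf.len() as u64; // count, do not store
        Ok(buf.len())
    }
    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

fn main() {
    // With lzma-rs this would look something like:
    //   let mut sink = CountingSink { bytes: 0 };
    //   lzma_rs::xz_decompress(&mut std::io::BufReader::new(file), &mut sink)?;
    //   let uncompressed_size = sink.bytes;
    // Demonstrated here with a plain write to keep the example runnable:
    let mut sink = CountingSink { bytes: 0 };
    sink.write_all(&[0u8; 201425]).unwrap();
    assert_eq!(sink.bytes, 201425);
    println!("ok");
}
```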


Also, I could delete one bullet point from the README.md

jtmoon79 commented 2 months ago

Cannot read the xz file in chunks/blocks. The crate lzma-rs does not provide API xz_decompress_with_options. See https://github.com/gendx/lzma-rs/issues/110

Consider https://docs.rs/xz2/latest/xz2/read/struct.XzDecoder.html
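If xz2 were adopted: its `XzDecoder` implements `std::io::Read`, so the BlockReader could pull one BlockSz-sized chunk of decompressed data per block request instead of decompressing everything up front. A sketch of that pattern, written against the generic `Read` trait so it runs without the xz2 dependency; substituting `xz2::read::XzDecoder::new(file)` for the `Cursor` is the idea.

```rust
use std::io::{Cursor, Read};

/// Read up to `blocksz` bytes from any Read source into one Block.
/// Loops because Read::read may return fewer bytes than requested.
fn read_one_block<R: Read>(reader: &mut R, blocksz: usize) -> std::io::Result<Vec<u8>> {
    let mut block = vec![0u8; blocksz];
    let mut filled = 0;
    while filled < blocksz {
        let n = reader.read(&mut block[filled..])?;
        if n == 0 {
            break; // EOF: the final block may be short
        }
        filled += n;
    }
    block.truncate(filled);
    Ok(block)
}

fn main() {
    // Stand-in for an XzDecoder: 100 bytes of "decompressed" data.
    let mut source = Cursor::new(vec![7u8; 100]);
    let b0 = read_one_block(&mut source, 64).unwrap();
    let b1 = read_one_block(&mut source, 64).unwrap();
    assert_eq!(b0.len(), 64); // full block
    assert_eq!(b1.len(), 36); // short final block
    println!("ok");
}
```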

jtmoon79 commented 1 month ago

Issue #283 refactors handling of `.xz`. However, the problem remains of reading the entire file during an open.