google / mtail

extract internal monitoring data from application logs for collection in a timeseries database
Apache License 2.0

I'm trying to read files from a GCS bucket in gzip format #867

Closed JWThorne closed 1 month ago

JWThorne commented 3 months ago

This may be an unsupported way to use mtail, but here goes.

I have a GCS bucket mounted in a container as a file system. Files from an external system show up there as gzipped JSON files.

Filenames are like logs_20240521_20240521T003517Z_20240521T003619Z_d0afe812.json.gz

I'm seeking a way to get mtail to read each file in its entirety as it shows up.

Problems are:
1. mtail doesn't directly read gzip encoding.
2. When a new file shows up, mtail seeks to the end of the file instead of reading it in its entirety. It ALWAYS seeks to the end instead of reading new files from the beginning. This behaviour is counterintuitive to me, since several log rotation systems use daily or hourly filenames.

I have a perfectly working mtail program, and the statistics work fine. I just need to figure out what to do about the compression and about reading files from the beginning.

The options I see are either gzcat filename | /proc/pidof mtail/fd/1 with a shell script in a loop monitoring the filesystem, or gzcat filename > constant_file_that_mtail_actually_monitors.

Perhaps with inotify to find new files.

jaqx0r commented 3 months ago

Yep, mtail doesn't support reading any sort of compression. The assumption there is that compression happens after log rotation.

mtail also goes to the end of the file because of log rotation: it assumes that when it starts, it is reading an append-only log. It will read from the start of the file when it detects that a log rotation has occurred, e.g. by a rename or a truncation.

It looks like you want mtail to read logs not directly from the source but after an archival stage, where the logs have already been compressed, timestamped and uploaded to the GCS bucket, right?

JWThorne commented 1 month ago

Yes. We ended up writing a bit of Go to find the files and gzcat them into a named pipe for the instance. That's working fine once we handled the buffer size issue.

This is working now. Our expectations of what mtail can do have been reset, and we'll use it in the manner it was intended.