hashicorp / raft-wal

experimental raft storage
Mozilla Public License 2.0

Build a CLI tool and/or read-only mode #25

Closed banks closed 1 year ago

banks commented 1 year ago

Users of HashiCorp products currently have workflows with the BoltDB-based LogStore that involve using the bbolt CLI tool to read logs.

BoltDB takes an exclusive file lock when opening for ReadWrite, though (unlike LMDB, which allows other processes to read concurrently and uses shared memory to coordinate external readers with a single writer process). So this means stopping a server just to debug activity on the cluster.

Long term, our products have other plans to solve this observability issue, such as improved audit logging and application-level events, but it could be useful operationally to have a tool that can read log entries directly from a WAL, ideally without stopping the writing process.

Since we use BoltDB for metadata, we still have the same problem as the BoltDB LogStore: the process writing the WAL holds an exclusive lock. But since we only use it for limited metadata, it might be possible to work around this.

Possible Solution Sketch

I considered a model where we duplicate the metadata into another file format that can be read concurrently. This is possible and not even that hard, but it does add a bunch of additional code just for this.

A simpler approach would be to assume that most of the time (except right after a truncation or crash) the log segments in the filesystem are all going to be valid, and in any case they all represent "stuff that got written to this log" even if it was later truncated. So for most users' purposes (i.e. debugging operations) it's not that important that we strictly limit ourselves to returning only entries that are transactionally consistent with the primary process performing truncations.

So we could allow simply reading the segment files as they are. 99.99% of the time this will work, and we can probably just detect the times when it doesn't and return an error like "inconsistent log segments found, try again" or something.
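A minimal sketch of that "just read the files" flow, with hypothetical names and only a token consistency check (not existing raft-wal code):

```go
// Hypothetical package name for the sketch below.
package waldump

import (
	"errors"
	"path/filepath"
	"sort"
	"strings"
)

// listSegments gathers the segment files in a WAL directory in file-name order
// and applies a token sanity check. A real check would compare each file's
// header against its neighbours; here we only reject duplicate base-index
// prefixes in the file names, one possible symptom of a half-finished
// truncation, and surface the "try again" error suggested above.
func listSegments(dir string) ([]string, error) {
	names, err := filepath.Glob(filepath.Join(dir, "*.wal"))
	if err != nil {
		return nil, err
	}
	if len(names) == 0 {
		return nil, errors.New("no segment files found")
	}
	sort.Strings(names)

	seen := make(map[string]bool)
	for _, n := range names {
		prefix, _, _ := strings.Cut(filepath.Base(n), "-")
		if seen[prefix] {
			return nil, errors.New("inconsistent log segments found, try again")
		}
		seen[prefix] = true
	}
	return names, nil
}
```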

Cases where straight reading of the log segments wouldn't work:

I imagine this could be implemented as a separate package called debug or logdump or something that makes it clear that its intended use is for "seeing what's inside" and not necessarily transactionally correct read access to a live log.

That package could probably just use the segment package interfaces to list the segment files, read their metadata, decide which ones are the "active" ones, and then read the requested range.
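As a sketch, the exported surface could be as small as a single interface (hypothetical package and names; nothing like this exists in raft-wal today):

```go
// Package logdump is a hypothetical name for the debug-read package described
// above; none of these identifiers exist in raft-wal today.
package logdump

import "github.com/hashicorp/raft"

// Dumper lists the segment files in a WAL directory, reads their headers to
// decide which ones are "active", and streams entries from them in file-name
// order without taking the writer's locks.
type Dumper interface {
	// DumpLogs calls fn once per entry whose index is greater than after and
	// (if before is non-zero) less than before, in ascending index order.
	// fn returning false stops iteration early.
	DumpLogs(after, before uint64, fn func(log *raft.Log) (bool, error)) error
}
```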

A CLI UX might be something like:

waldump [-after 12345]
<< JSON log entry per line >>

The log entry data could then be pulled out with jq or other command line tools and interpreted by application-aware scripts, e.g. to decode Consul byte prefixes and msgpack/proto messages.
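For illustration, the one-JSON-object-per-line output falls out naturally from encoding/json's Encoder; the only contract the scripts would rely on is the raft.Log field names (sketch only, not the eventual tool):

```go
package waldump

import (
	"encoding/json"
	"io"

	"github.com/hashicorp/raft"
)

// writeEntries emits each log entry as a single line of JSON on w, ready to be
// piped through jq. json.Encoder appends a newline after every value, so "one
// entry per line" falls out for free; the Data field is base64-encoded by
// encoding/json, which application-aware scripts can decode further.
func writeEntries(w io.Writer, logs []*raft.Log) error {
	enc := json.NewEncoder(w)
	for _, l := range logs {
		if err := enc.Encode(l); err != nil {
			return err
		}
	}
	return nil
}
```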

We could implement live tailing, where we poll for new entries on the last segment file and print them until it is sealed, then go back and reload the whole dir to find the new segment, etc. But this would be quite a lot more work, and since this is only for occasional debugging, operators could emulate it by shelling out to the CLI above in a loop every few seconds, increasing the -after index each time to the highest one seen so far...
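For illustration only, that emulation loop could be as dumb as this. It assumes the hypothetical waldump CLI sketched above, its -after flag, and an Index field in each JSON line:

```go
package waldump

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os/exec"
	"strconv"
	"time"
)

// pollTail emulates live tailing by re-running the (hypothetical) waldump CLI
// every interval, bumping -after to the highest index seen so far.
func pollTail(dir string, interval time.Duration) error {
	var after uint64
	for {
		out, err := exec.Command("waldump", "-after", strconv.FormatUint(after, 10), dir).Output()
		if err != nil {
			return err
		}
		dec := json.NewDecoder(bytes.NewReader(out))
		for {
			var entry struct{ Index uint64 }
			if err := dec.Decode(&entry); err == io.EOF {
				break
			} else if err != nil {
				return err
			}
			if entry.Index > after {
				after = entry.Index
			}
			fmt.Printf("index=%d\n", entry.Index)
		}
		time.Sleep(interval)
	}
}
```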

banks commented 1 year ago

One complication here is that the current interface assumes the WAL knows the segment info and provides it up front for each segment.

We don't need to change that, but we would need something like a SegmentDumper that could take a segment file and infer the metadata by reading it. That also means it wouldn't know whether the segment is sealed, and wouldn't know the index offset even if it was. We could try to work that out by reading backwards, but the format wasn't designed for that. The simpler alternative would be to just read every segment as if it were a tail, with logic similar to Writer.recoverTail. That would be less efficient though.
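The shape might be roughly this (hypothetical interface, not an existing raft-wal API):

```go
// Hypothetical sketch only; this interface does not exist in raft-wal.
package segmentdump

// SegmentDumper infers a segment's info by reading the file itself rather than
// being handed it by the WAL.
type SegmentDumper interface {
	// DumpSegment opens the segment file at path, reads its header to recover
	// the base index and codec, then scans frames forwards (much as the
	// tail-recovery logic does), calling fn for each entry it can decode.
	// It stops at the first torn or unwritten frame, or when fn returns false.
	DumpSegment(path string, fn func(index uint64, data []byte) (bool, error)) error
}
```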

We could fix the missing-metadata problem in a backwards-compatible way, e.g. by also writing an INDEX_TRAILER and COMMIT frame at the very end of the file whenever a segment is sealed. Current WAL reading code would simply ignore it.

If we try to read a segment without metadata we can:

  1. See if the last few bytes of the file contain an INDEX_TRAILER and COMMIT frame (this will be typical for a sealed segment after the changes are made).
  2. If there is an INDEX_TRAILER, open with the normal Reader logic and use the index for direct record access.
  3. If there is not, either the file is not sealed yet, or it was sealed before the change and so has no INDEX_TRAILER, or it was possibly sealed before it was full due to a truncation or something (so there is an index but it's not at the end of the preallocated file space). In any case, revert to reading through it forwards like we do for the tail file.

... but we probably don't need to optimize this right now for a debugging use-case. Reading 64MiB per file on a modern server with SSD for debugging purposes is not that big a deal!
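As a rough, purely heuristic sketch of step 1: because segment files are pre-allocated, peeking at the last few bytes is enough to decide whether a trailer could even be there (the helper name is hypothetical and a real implementation would decode and verify the frames):

```go
package segmentdump

import "os"

// looksSealed is a crude heuristic for step 1 above: pre-allocated space that
// was never written reads back as zeros, so an all-zero window at the end of
// the file means there is no trailer and we should fall back to the forward
// scan in step 3. A real implementation would decode the final frames and
// verify an INDEX_TRAILER followed by a COMMIT frame; that frame layout is
// deliberately not reproduced here.
func looksSealed(path string, window int64) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return false, err
	}
	if info.Size() < window {
		window = info.Size()
	}
	buf := make([]byte, window)
	if _, err := f.ReadAt(buf, info.Size()-window); err != nil {
		return false, err
	}
	for _, b := range buf {
		if b != 0 {
			return true, nil
		}
	}
	return false, nil
}
```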

banks commented 1 year ago

After a bit more thought, a CLI UX like this might be simplest:

# Dumps metadata about each segment based on the segment file headers (not the actual WAL metadata)
$ waldump segment-info <wal dir>

# Dumps all records in all segments in file-name order (most of the time this is the whole log, but if there are truncations or odd things we just output them anyway and the operator can deal with it).
$ waldump all-segments <wal dir> [-after INDEX] [-before INDEX]

This seems simple enough for the common case - you can even tail the dir by repeatedly calling the last command with -after. We'd use the segment file names as a basic filter to avoid parsing the whole set of files just to read the tail, but we'd still read in the whole tail segment each time, since that's probably not expensive enough for a debugging operation on modern hardware to be worth optimizing.
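A sketch of that file-name filter, assuming the zero-padded base index is the leading field of each segment file name (a hypothetical helper, not existing code):

```go
package waldump

import (
	"path/filepath"
	"sort"
	"strconv"
	"strings"
)

// filterAfter sketches the cheap file-name filter described above. It assumes
// segment files are named with a zero-padded base index as the leading field
// (e.g. "00000000000000001000-<id>.wal"), so we can drop any segment whose
// successor still starts at or below `after` without opening it at all. The
// tail segment (last name) is always kept and read in full each time.
func filterAfter(names []string, after uint64) []string {
	sort.Strings(names) // zero padding makes lexical order match index order

	bases := make([]uint64, len(names))
	for i, n := range names {
		prefix, _, _ := strings.Cut(filepath.Base(n), "-")
		bases[i], _ = strconv.ParseUint(prefix, 10, 64)
	}

	var keep []string
	for i, n := range names {
		// Segment i only holds indexes below the next segment's base index,
		// so if that base is <= after+1 nothing in it can be > after.
		if i+1 < len(names) && bases[i+1] <= after+1 {
			continue
		}
		keep = append(keep, n)
	}
	return keep
}
```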

We could have another command that dumps a single segment file with metadata/stats and optionally all log contents, but the two above are probably enough for most debugging needs, and their output could be post-filtered by other programs to achieve summarisation etc.

banks commented 1 year ago

Fixed by #26