oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Improve zone-bundle creation performance to reduce stop-instance response times and avoid timeout situations #5236

Open askfongjojo opened 8 months ago

askfongjojo commented 8 months ago

#5235 uncovered a potential need to improve zone-bundle processing performance. The failure mode may be more common than we think: when a user spins up a large number of long-running worker instances and spins them down en masse, the result is concurrent zone-bundle requests on propolis zones that all have a large number of propolis log files to be tarred up.

Here are the relevant comments from the customer ticket that provide more context to the possible solutions to this issue:

@gjcolombo

All zone bundle collections are serialized with respect to each other: instances share a single ZoneBundler that's owned by the InstanceManager, which has an Arc<Mutex> whose lock must be held to collect a bundle.
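
To illustrate the serialization described here, the following is a minimal sketch and not the sled-agent code. `Bundler` and `max_concurrent_bundles` are hypothetical names, and it assumes a plain `std::sync::Mutex` with OS threads standing in for whatever lock and task model the real ZoneBundler uses. It only shows that when every request must hold one shared lock for the whole collection, at most one bundle is in progress at any time.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a bundler shared by all instances.
/// The real ZoneBundler is different; only the shared lock matters here.
struct Bundler {
    inner: Arc<Mutex<()>>,
}

/// Issues `n` bundle requests from separate threads and returns the largest
/// number of requests that were ever inside the locked section at once.
fn max_concurrent_bundles(n: usize) -> usize {
    let bundler = Bundler { inner: Arc::new(Mutex::new(())) };
    let active = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..n)
        .map(|_| {
            let lock = Arc::clone(&bundler.inner);
            let active = Arc::clone(&active);
            let peak = Arc::clone(&peak);
            thread::spawn(move || {
                // The lock is held for the entire (simulated) collection.
                let _guard = lock.lock().unwrap();
                let now = active.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                // Simulated work: copying log files into a tarball.
                thread::sleep(Duration::from_millis(5));
                active.fetch_sub(1, Ordering::SeqCst);
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}

fn main() {
    // With a single shared lock, the peak is always 1 regardless of `n`.
    println!("peak concurrent bundles: {}", max_concurrent_bundles(8));
}
```

In this model the total time for `n` requests grows linearly with `n`, which matches the concern about many instances being stopped at once.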

@bnaecker

Bundle generation probably can be faster, since I did nothing to guarantee their performance. It's copying a bunch of log files from the U.2s into a tarball, which is probably what takes the most time, since there tend to be a lot of files given our log rotation policy. I don't think that itself can be parallelized naively, but I'm sure there's something we can do.

The other thing that may take time is running commands in the zone. It shouldn't take too long, since we run pretty standard things like svcs and netstat -an, but I imagine there's always a way for that to take longer than we want under Tokio.

As for parallelization, I'm not sure. The zone itself needs to be owned while we're taking the bundle (at least, without some invasive changes to the code), to ensure it doesn't disappear. When I wrote this, there was a single lock around the whole sled-agent map for instances, which we also need to take for this to work. I think that stuff may have changed recently, with Sean's work to put the instance runner in a separate task, but I've not looked. There may be opportunities for parallelism now.

bnaecker commented 8 months ago

I've been thinking about this a bit in the background, and wanted to collect those ideas somewhere, even if half-baked.

Creating a zone bundle requires collecting two kinds of information: the output of a few commands, and log files. The former requires the RunningZone object used throughout the sled-agent to represent a live zone. That's so that (1) we can run commands inside that zone and (2) we ensure that the zone isn't removed while we're bundling. The log files don't strictly require the zone -- we need a few pieces of information from it to find the files, but then we create ZFS snapshots from the filesystems containing the log files and copy the files from there. Once we've created that snapshot, the RunningZone is no longer required.

That's important, because gathering the log files takes the vast majority of the time needed to make the bundle. While I was developing this, I generally found that the commands would all be completed within a few seconds. The log-file collection operates on a few (or a few tens of) files per second, and so would often take much longer than running the commands. It's also less bounded: the duration depends on the uptime of the zone, since more logs are produced and files rotated the longer the zone is alive.

All this is to say that we could decouple running the commands from collecting the log files. We might hold a reference to the RunningZone for the first part, and then drop it after we create the snapshot, but before starting to actually copy the files.
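
The decoupling proposed above could be sketched roughly as follows. This is a toy model under stated assumptions, not the sled-agent implementation: `Zone`, `Snapshot`, `Bundle`, `run_command`, `snapshot_logs`, and `create_bundle` are all hypothetical names, a `std::sync::Mutex` stands in for however the real code owns the RunningZone, and the log file names are made up. The point is only the ordering: hold the zone for the commands and the snapshot, release it, then do the slow copy from the snapshot.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for RunningZone.
struct Zone {
    name: String,
}

/// Hypothetical handle to a ZFS snapshot of the log filesystems.
struct Snapshot {
    log_files: Vec<String>,
}

impl Zone {
    /// Pretends to run a command inside the zone.
    fn run_command(&self, cmd: &str) -> String {
        format!("[{}] output of `{}`", self.name, cmd)
    }

    /// Pretends to snapshot the filesystems that contain the logs.
    fn snapshot_logs(&self) -> Snapshot {
        Snapshot {
            // Made-up file names, for illustration only.
            log_files: vec![
                format!("{}/example.log.0", self.name),
                format!("{}/example.log.1", self.name),
            ],
        }
    }
}

struct Bundle {
    command_output: Vec<String>,
    copied_logs: Vec<String>,
    /// True if the zone lock was available while logs were being copied.
    zone_free_during_copy: bool,
}

fn create_bundle(zone: &Arc<Mutex<Zone>>) -> Bundle {
    // Phase 1: needs the zone. Run the commands and take the snapshot.
    let (command_output, snapshot) = {
        let z = zone.lock().unwrap();
        let mut out = Vec::new();
        for cmd in ["svcs", "netstat -an"] {
            out.push(z.run_command(cmd));
        }
        let snapshot = z.snapshot_logs();
        (out, snapshot)
    }; // The guard is dropped here, before the slow part begins.

    // Phase 2: does not need the zone. Copy files out of the snapshot.
    let zone_free_during_copy = zone.try_lock().is_ok();
    let copied_logs = snapshot
        .log_files
        .iter()
        .map(|f| format!("copied {}", f))
        .collect();

    Bundle { command_output, copied_logs, zone_free_during_copy }
}

fn main() {
    let zone = Arc::new(Mutex::new(Zone { name: "zone-a".to_string() }));
    let bundle = create_bundle(&zone);
    println!("commands run: {}", bundle.command_output.len());
    println!("logs copied: {}", bundle.copied_logs.len());
    println!("zone free during copy: {}", bundle.zone_free_during_copy);
}
```

With this split, the time the zone is held is bounded by the commands and the snapshot creation instead of by the number of rotated log files.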