urbit / pm

Core Development project management

release postmortem: urbit-os-v2.142 #8

Open belisarius222 opened 1 year ago

belisarius222 commented 1 year ago

Timeline and Description of Events

@pkova wrote up a description of the events that happened related to the bug in this release: https://gist.github.com/pkova/cbc56cf4840adff4d14887a520ac4faf

Learnings and Future Actions (Stream of Consciousness version)

From @philipcmonk:

re: what kind of tests could have found this, I suspect that if we had let it simmer on develop for a week we would have found it? not necessarily, because most of those ships are under ~zod (and 95% of ships under ~zod are unaffected). maybe we made the wrong choice deploying the ping rewrite as a hotfix since it's such critical infrastructure, but otoh a lot of people were experiencing flaky networking because of the issue. maybe we could have deployed it only to ~dem's children, but it's not obvious that that's better process in general

@belisarius222 I'm not sure letting this bake longer would have caught this. We might need truly random deployment, i.e. 25% of randomly chosen ships get the new release.

@jalehman Letting this bake longer on the network probably would have caught this. ~lomder-librun was affected, and he chimed in to say he had connectivity issues. We should be very careful about rushing things out the door unless absolutely necessary, and this did not seem absolutely necessary.

@pkova The reason I released this the way I did was that ~dem was at 100% CPU and nothing was working. "I don't care about the risks when nothing fucking works."

@jalehman If the fix had gone out only to ~dem, it might have solved their issue, and everything else would have been just fine. We as UF should be conservative in what we release to the rest of the network. Anyone who runs their own infrastructure can do whatever they want -- Tlon could have just pushed this fix onto ~dem.

@philipcmonk We've done our own Vere builds, but we haven't done separate OTAs to ~dem, so there would have been extra activation energy to do that. I would have done it if you guys hadn't released it that day, because nothing was working for us. For us, it was an emergency; for the rest of the network, it was not. I think you're right, @jalehman -- first of all, hotfixes should be avoided unless necessary, especially for critical infrastructure (anything involving OTAs, ping, etc.). Also, this "hotfix" was really a rewrite of the ping agent, and we need to be careful about pushing out big rewrites too.

If we had put it into ~dem, we would have been confused about why the CPU usage on the galaxy went down, but we would not have caught it ourselves. If UF had let it bake for a week, that would have caught it for the rest of the network.

@joemfb We've decided against staged rollouts to the network in the past and historically just pushed hotfixes directly to ~zod. I want to sanity-check that this is still a bad idea.

@jalehman Joe is trying to sanity-check whether it would ever make sense to have different users be on different releases, even in their Clay revision history. Tlon's has historically been the major infrastructure, but now there are several, so we can no longer be sure that a fix made for one host's environment won't cause a regression in other environments. This means we need to be more conservative about pushing things out to the entire network.

@philipcmonk Within Tlon we've been thinking that we should hold back %base OTAs until the right set of people are awake to debug them, and also do staged releases within our own infrastructure. We've decided we should wait until the right people are awake, then deploy to 5% of people and see how it goes, then push it out to the rest. For ships with public but changing IPs, ships that use a swapfile, and so on -- all particular to our context -- we have better infrastructure for testing this than anyone else. Since the hotfix was just for us, ...

@jalehman Holium already does this, because they've been burned on OTAs in the past. They now have a policy of not applying OTAs until they've done their own testing. Their perspective is that %base OTAs can't be trusted, and it's unacceptable for OTAs to be that unreliable.

@joemfb That suggests an interesting possible compromise: rather than forking %base, release a pre-release more aggressively. It's still in the linear chain, but it's published earlier so more people can run it.

@philipcmonk Yeah, and then we'd load the prerelease onto ~dem. My guess is there was some stuff on the prerelease that we'd prefer to let simmer, but in this case there wasn't really anything else we would have worried about that would have stopped us from adopting a prerelease.

@belisarius222 One thing we could do is publish a release candidate to ~marnec sooner after pushing the develop branch onto core dev ships, so it has longer to bake on the network.

@philipcmonk Do we want that, or do we want...

@belisarius222 It at least makes sense to have it on core dev for a day or two, but maybe not a whole week.

@philipcmonk A day or two on core dev, then a week on prerelease.

@mopfel-winrux One thing we did at Lockheed was to run our own network as a testnet. In this case, it would be a galaxy and star with its own planets that pushes out its own OTA, as part of the CI/CD pipeline.

@philipcmonk There's some value there, but it's not uncommon for bugs to show up only on mainnet, often because of historical information that's only present on mainnet. For example, the remote scry bugs all had to do with data on mainnet that wouldn't have been on a testnet started from scratch.

Ideally you'd be able to fork some ships from mainnet onto a testnet (potentially buildable), but you're still unlikely to simulate the load that you have on mainnet, and you'd also need to simulate the dynamics between old and new versions of things.

One thing we could do within Tlon hosting, though, would be to designate 100 ships across stars and galaxies and say those ships will be used exclusively for testing. Any core dev would be able to log into them, and it would be straightforward to recycle them if they break. This could be a step before core dev ships.

@mopfel-winrux So that would help us benchmark things too, if we want to see how well some performance improvement, such as a caching change, actually does.

@philipcmonk It sounds to me like the response to the issue went well. By the time I woke up, the fix was out and the thread was posted everywhere, which was the best you can do. The other decision that was maybe questionable was to do the ping fix as a rewrite of the existing ping code. I think ultimately that was the right decision, but it certainly increased the risk of something like this happening.

The reason I rewrote it was that it became quite complex to try to think of everything as these two separate modes and keep track of the Jael subscriptions in both modes. I bounced off that a couple of times and had an increasing sense of urgency, since none of our stuff was working. I could think through the problem clearly if I treated them as two separate systems, and the code is clearer that way, but doing a rewrite, especially in a hotfix, is risky on real infrastructure, so you probably want to avoid it. I still think it was the right decision, but it's something to avoid in general.

@joemfb It was too large for me to digest immediately, and embarrassingly, there wasn't a red flag waving in my head saying "if anything goes wrong here, we will lose connectivity". I like synchronous walkthroughs: that's a fast, high-bandwidth way of reviewing something. And then long-term, obviously, we need better solutions for route discovery and maintenance.

@philipcmonk A hotfix to critical infrastructure, or maybe even to anything in %base, should require a synchronous walkthrough. Ping counts, to me, as critical infrastructure: if something goes wrong with it, it's the end of the world. The name "ping" is unassuming, though, so I should have flagged that. Any change to critical infrastructure should say so in the PR.

My broader takeaway is that we knew this ping problem was coming up for weeks ahead of time. I was largely distracted by the remote scry stuff, and it came up a little faster than we expected it to. We should try to get out ahead of scaling problems when we know they're coming, and for this one, we knew that we were within a month or two of it at least.

Tlon hosting could try to boot a thousand ships at a time periodically just to see if things break. Another thing we're planning on doing is inactivating ships that aren't in use until people log into them. This helps with scaling problems, but it also lets us turn on the whole fleet for a day or so once a month, which would give us advance notice of at least some scaling problems.

@joemfb Do you have any ideas why it hit such a cliff? I would also have expected it to degrade linearly.

@philipcmonk If your queue is so full that lane scrys can't come in every two minutes for each ship you're routing for, then nothing in your cache is from within the last two minutes, so you'll miss every forward.
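
To make the cliff concrete, the freshness check behind this behavior can be sketched roughly as below (hypothetical gate and face names, not the actual ames code): a cached lane is usable only if it was refreshed within the last two minutes, so once the backlog grows past the point where no entry can be refreshed that fast, every lookup fails at once rather than degrading gradually.

```hoon
::  sketch: is a cached peer lane fresh enough to forward on?
::  once the scry queue is so backed up that no lane can be
::  refreshed within two minutes, this is false for every peer,
::  so every forward misses: a cliff, not a linear slowdown.
|=  [now=@da last-refreshed=@da]
^-  ?
(gte last-refreshed (sub now ~m2))
```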

@joemfb This is interesting because we deployed these things at almost the same time: shrinking the ames queue dramatically and the problematic hotfix. Shrinking the queue might have alleviated the problem from the hotfix.

@philipcmonk Vere 2.10 came out a bit earlier than the Arvo hotfix, but we didn't put it onto our infrastructure until a bit later.

@pkova I put it on ~dem. I didn't see CPU usage change immediately, though.

@belisarius222 As for rewriting for clarity, I would say hotfix or not, we should be able to do that. It's often the case that a fix is basically impossible to write without making some architectural changes, and I would almost push in the opposite direction: it's good to clean things up when you fix something. There's some amount of write amplification that should be avoided when you do that -- don't rewrite all of Arvo or something -- but we should be able to release substantial changes and feel confident in them.

@philipcmonk Yeah, and to feel confident in a rewrite of anything, we should have a synchronous walkthrough. Maybe that should be the policy: if it's a hotfix or a rewrite, do a walkthrough. A rewrite is usually too big to review on its own, even if it's a small module, and if it's a hotfix, you want someone who isn't you to look at it.

@joemfb Yeah I agree. It would have been just as easy to introduce a problem here if you didn't rewrite it.

@philipcmonk It would have been especially worrying for the cases where your sponsor breached or changed sponsors. The Jael logic is simpler than it used to be, but it would have been hard to test that in both separate cases and so on.

@joemfb And in the big picture we just have to do whatever we can to simplify these subsystems.

@philipcmonk Yeah.

@joemfb The intersection of Ames and Jael is bad enough, and I almost never think of the fact that Ping is critical infrastructure in that dimension as well, in addition to route maintenance.

It's not clear that we should add a random deployment delay at the moment, since it could be a lot of engineering work to build. It would look something like `?: =(0 (mod (mug [our now release-hash]) 4)) upgrade dont-upgrade`. Or it could add a random delay based exclusively on `eny`. (A rough sketch of both variants appears after the list below.)

## Planned Process Improvements

- We will require synchronous code walkthroughs for any change to critical infrastructure. This includes, but is not limited to, the OTA system, the ping app, Ames, and especially anything that has to do with routing and OTAs. Anything that could interfere with OTAs, which includes routing, some aspects of Jael, and some other modules, is critical, because if OTAs break, users can only get fixes for other problems out of band. For any release, we need to think carefully as a group about which changes constitute critical infrastructure, and any changes to critical infrastructure need to be marked as such in the release checklist.
  - @jalehman to add the release checklist to the release template.
  - @belisarius222 to add the following to a PR template:
    - docs changes
    - release notes
    - critical infrastructure changes
    - checks to be added to the release checklist
    - description of which tests were performed by the author
- We should be less willing to perform hotfixes when there are other options available. For example, if this happened again, UF should put out a release candidate with the hotfix but let it bake for a week before pushing to master. Tlon could then release it on their own infrastructure to solve their problem, without incurring risk for the rest of the network.
- We should look into adding a pre-core-dev testing phase to the devstream process, involving a fleet of real ships, possibly in Tlon hosting, that run the new code before anyone uses it on their personal ship.
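
As a rough illustration of the random-deployment idea above (a sketch, not a proposal of actual code): both variants can be computed from data a ship already has. The arm names `apply-upgrade` and `defer-until` are hypothetical, and unlike the inline snippet above, this version keys the hash only on ship identity and release hash so the decision is stable each time it is checked.

```hoon
::  deterministic sample: roughly 25% of ships upgrade immediately,
::  chosen by hashing ship identity together with the release hash
?:  =(0 (mod (mug [our release-hash]) 4))
  apply-upgrade
::  everyone else defers by a random number of hours (0-71) drawn
::  from entropy, spreading the rest of the rollout over ~3 days
=/  delay  (mul ~h1 (~(rad og eny) 72))
(defer-until (add now delay))
```

Either way, the sampling or delay is derived locally from `our`, `eny`, and the release hash, so no coordination between ships would be needed.
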
belisarius222 commented 1 year ago

It should be noted that 200 ships appear to have been lost to this bug. We should find out how many of those are under ~dem, and also how many were ships that were shut down by Tlon for other reasons. We might also see a different number the next time the network explorer does its twice-daily ping.