oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
251 stars 39 forks source link

update to Cockroach 22.2.0 #2017

Open davepacheco opened 1 year ago

davepacheco commented 1 year ago

Cockroach have just released v22.2.0 with Go 1.19.1. It'd be good to update to this as soon as convenient.

ahl commented 10 months ago

Do we want to consider running CRDB within a Linux VM?

jclulow commented 10 months ago

No I don't think so. From my perspective the original reasons still apply:

jclulow commented 10 months ago

NB: wrt. the dependencies, Node and Go are already maintained by other people (e.g., pkgsrc and OmniOS both ship up to date versions, and we pull updates from OmniOS into Helios) it's "just" the actual Cockroach build that needs the bigger work.

davepacheco commented 10 months ago

I don't think so. I don't think much about the original calculus has changed.

There's a bunch of stuff that we get "for free" by running this as a Helios zone that we'd have to re-implement for guest VMs:

To be clear, we need all of this for whatever environment we deploy our software in. It's just that we're already doing all of this for Helios zones. It'd be quite a lot of work to build parallel infrastructure for Linux VMs.

What have been / will be the costs to running on Helios?

  1. Some amount of work to keep CockroachDB building on Helios. Modulo Bazel (see below), I think this is fairly small (certainly much smaller than the above). To get a sense of our local delta, you can see our local patches here.
  2. The cost of debugging illumos#15254 and friends. This was considerable. But I'm not sure what conclusion to draw from it. First, it was a general OS problem, and we still need the OS to work, so it's investment in more than just CockroachDB. Also, Linux had the same kind of problem; it's just that it was found and fixed by others already. That's a potential advantage in itself. Of course, if we had a similarly nasty problem again, on Linux I think we've got a lot less expertise and tooling to debug it effectively.
  3. The Bazel thing. I expect this to be largely a one-time cost (though there may be some ongoing maintenance too). We don't know how big yet. I would be surprised if it were even close to as large as the above. There would be ongoing maintenance either way (for our parallel infrastructure or for the illumos support in Bazel). Again, my guess is Bazel will be easier here because it's pretty separate from everything else. Just the sled agent support for deploying zones alone has been a tricky area with lots of churn; I can't imagine also maintaining a version of that for VMs as we keep iterating on it.

Basically, I think switching would be a ton of up-front work and a bunch of ongoing work that the whole team would keep having to deal with (i.e., everyone will have to deal with the VM and the infrastructure around it); whereas I think (but don't know!) that doing the Bazel work would be much less work up front, less ongoing, and only the people keeping it up would have to deal with it.

citrus-it commented 10 months ago

An illumos port of Bazel recently landed in OmniOS. I've opened a PR to pull this into Helios at https://github.com/oxidecomputer/helios-omnios-extra/pull/3 More details over there.

iliana commented 7 months ago

Moving to the 8 milestone aspirationally; it's likely if we want to eventually get to Cockroach v23.2 we'll want to go one release at a time, and it'd be best to do that sooner rather than later.

Fortunately it seems to build now, some notes at https://github.com/oxidecomputer/garbage-compactor/pull/7#issuecomment-2038998312. I've uploaded my build to /staff/iliana/cockroach-v22.2.19.illumos.tar.gz.

However there are a significant number of test failures after introducing v22.2, caused by trying to set the first blueprint as the target blueprint. The blueprints InsertTargetQuery CTE is returning the no-such-blueprint sentinel for reasons I can't figure out quite yet. (cargo nextest run -p nexus-db-queries -- db::datastore::deployment::tests::test_representative_blueprint is the most trivial way I've found to reproduce this.)

thread 'db::datastore::deployment::tests::test_representative_blueprint' panicked at nexus/db-queries/src/db/datastore/deployment.rs:1457:14:
called `Result::unwrap()` on an `Err` value: ObjectNotFound { type_name: Blueprint, lookup_type: ById(5234919e-1005-441a-9575-614305173dbe) }
iliana commented 7 months ago

5434 fixes the issue I mention above. All tests pass on v22.2.19 with that change.