cgwalters commented 4 years ago

See https://bugzilla.redhat.com/show_bug.cgi?id=1850057

(This content is canonically stored at https://github.com/cgwalters/workboard/tree/master/openshift/bz1850057-etcd-osupdate )

OS upgrade I/O competes with etcd

Currently the MCD does:

drain
apply updates
reboot

Now "drain" keeps both daemonsets and static pods running. Of those two, etcd is a static pod today. When we're applying OS updates, that can be a lot of I/O and (reportedly) compete with etcd.

We have two options:

kill etcd after draining (or even more strongly, stop kubelet)
"stage" updates gracefully while everything is running

I like option 2) better because we've put a whole lot of work into making the ostree stack support this "stage updates while system is running" and it'd be cool if OpenShift used it :smile: Another way to say this is - I think we want to minimize the time window in which the etcd cluster is missing a member, so the more work we can do while etcd is still running the better!

Links

Workboard:

Pull request in progress: https://github.com/openshift/machine-config-operator/pull/1957

Example failing jobs:

Note that both of those jobs jumped from RHEL 8.1 to 8.2.

Prometheus queries

histogram_quantile(0.99, sum(irate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le)) 
histogram_quantile(0.99, sum(irate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le))

cgwalters commented 4 years ago

We also had problems in the past with applying config changes to /etc (including the static pod manifest) while the static pod was still running; I tried looking at git log --grep=etcd but didn't see it. If we change the flow to ensure we're stopping etcd before we apply changes on the filesystem in /etc, that would also have fixed that earlier problem.

(Bigger picture we're being pulled in two different directions here - https://github.com/openshift/enhancements/pull/159 is all about "apply changes without rebooting!" whereas this direction is more "make updates more transactional as part of the reboot!")

smarterclayton commented 4 years ago

Apply updates without rebooting is not a system design goal yet :)

The design goal is “upgrade is completely non disruptive to the system”. Right now it’s possible that static pods like etcd might actually have to be shut down prior to any disruption. Note that it’s not clear what is going on yet.

cgwalters commented 4 years ago

Yep, we're in agreement there! I am hopeful we can mimimize the upgrade disruption - we aren't making any attempt to do so today, and I would be not all surprised to discover the oscontainer/rpm-ostree upgrade bit is disruptive as far as block IO goes.

cgwalters commented 4 years ago

To flesh this out a bit today, we implemented a lot of sophisticated logic as part of zincati that the MCO can use too. The basic thing is - if we happen to be interrupted after "staging" an update, we only want to actually have it take effect on reboot at the very end. If we e.g. fail to drain, or the kernel happens to crash and we get rebooted, we don't want that pending update to take effect!

See https://github.com/coreos/rpm-ostree/pull/1814 and https://github.com/coreos/zincati/blob/50d51ce97c63fd35fac9bf5c702e4fe299c156c6/src/rpm_ostree/cli_finalize.rs#L21

IOW the flow is like this:

if diff.oscontainer != "" {
  rpm_ostree_upgrade(diff.oscontainer, "--lock-finalization")
}
drain()
write_config_changes(diff.filesystem)
stop_kubelet()
finalize_deployment(diff.oscontainer)

Where only at that very last step do we perform the swap of the bootloader config (and that's basically it).

hexfusion commented 4 years ago

Update after triage from etcd team.

It appears that we have a direct correlation to increased I/O latencies leading to possible control-plane disruption and the chain of events that happens during upgrade.

steps performed for data experiment

manually created GCP IPI cluster 4.4.11
deployed systemd unit for atop monitoring via MachineConfig https://gist.github.com/hexfusion/400d43d511a967814f5ccd8a7e09db8c
performed upgrade to 4.5-rc-6
to ensure atop time series datafile was preserved over upgrade I copied it to ~root/ right before node reboot.
observed metrics from cluster noting timestamp for I/O spike
stepped through atop timeseries and observed load during spike.

results as observed from sbatsche-atop10-2020-vrpt6-master-1

approx 8:49 p99 fpr etcd metrics etcd_disk_wal_fsync_duration_seconds_bucket and etcd_disk_backend_commit_duration_seconds_bucket spike.

peaking I/O activity during this time series, in general, I/O remains higher

podman pull (update payload pull?)

atop 8:48:55 podman pull -q ... starts disks report 61% busy r/0 w/1847 MBw/s 266.3 avio 0.55 ms

kworker/u8:4+fl

atop 8:48:27 kworker/u8:4+fl starts disks report 78% busy r/0 w/14900 MBw/s 242.1 avio 0.32 ms

rpm-ostree

atop 8:49:21 /usr/bin/rpm-ostree start-daemon starts and instantly disks report 77% busy r/981 w/57 avio 0.74 ms
atop 8:49:25 /usr/bin/rpm-ostree start-daemon running disks report 78% busy r/404 w/1234 MBw/s 257.0 avio 0.42 ms
atop 8:49:26 /usr/bin/rpm-ostree start-daemon running disks report 91% busy r/853 w/1268 MBw/s 229.7 avio 0.41 ms
atop 8:49:27 /usr/bin/rpm-ostree start-daemon running disks report 95% busy r/1162 w/75 MBw/s 229.7 avio 0.69 ms

crio

atop 8:49:27 /usr/bin/crio wipe starts disks report 87% busy r/896 w/281 MBw/s 8.9 avio 0.74 ms

kube scheduler

atop 8:52:21 kube-scheduler spikes disks report 73% busy r/1124 w/0 MBr/s 27.3 | MBw/s 3.1 avio 0.64 ms

conclusion/open question: can we perform these actions in a way that less disruption to fs.

jlebon commented 4 years ago

https://github.com/coreos/rpm-ostree/pull/2158

cgwalters commented 4 years ago

OK so I'm playing with things like IOWeight= and IOSchedulingClass=idle etc. (see e.g. https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html http://0pointer.de/blog/projects/resources.html ) and...none of it is having any effect.

I'm playing with this etcd fio job: https://github.com/etcd-io/etcd/issues/10577#issuecomment-475624306

(Does etcd really use O_DIRECT? I need to verify that)

Also as far as I can tell, we're basically not exposing or using filesystem/block IO cgroups to pods right? (Basically just cpu/memory)
So this issue of balancing etcd with ostree is just a subset of the generic I/O control for workloads. I think this is related to cgroups v2: https://andrestc.com/post/cgroups-io/ - definitely for ostree we use buffered writes. We may have to just manually tune down our write speed.

mrunalp commented 4 years ago

We have discussed throttling image pulls as well in podman/crio till we can use cgroups v2 features.

cgwalters commented 4 years ago

I am working on a writeup around this in https://hackmd.io/WeqiDWMAQP2sNtuPRul9QA

jlebon commented 4 years ago

Run a HTTP server (kube service) that mounts the oscontainer

This makes a lot of sense to me and clearly scales better with the size of the cluster. (We could even go fancier in the future and have the service generate static deltas to trade CPU cycles on one node for even more I/O efficiency across the cluster; and at least CPU cycles are more easily accountable/schedulable.)

jlebon commented 4 years ago

Another advantage of that approach is that given that network speed is normally slower than disk speed, you also automatically get less demanding disk I/O usage.

cgwalters commented 4 years ago

More updates in the doc but to summarize: It seems that what ostree is doing today can introduce e.g. 2 second latency for etcd fdatasync. We can rework things so we fsync as we go, avoiding a single big flush. This slows down the update a lot, but also greatly reduces the max latency (from ~2s to .46s).

In other words, if we take this from:

  fsync/fdatasync/sync_file_range:
    sync (usec): min=732, max=1935.7k, avg=2594.14, stdev=19014.96
    sync percentiles (usec):
     |  1.00th=[   832],  5.00th=[   881], 10.00th=[   914], 20.00th=[   971],
     | 30.00th=[  1020], 40.00th=[  1090], 50.00th=[  2933], 60.00th=[  3032],
     | 70.00th=[  3064], 80.00th=[  3130], 90.00th=[  3228], 95.00th=[  3425],
     | 99.00th=[  6194], 99.50th=[  7963], 99.90th=[ 17433], 99.95th=[ 95945],
     | 99.99th=[843056]

to:

  fsync/fdatasync/sync_file_range:
    sync (usec): min=741, max=461949, avg=3970.41, stdev=6696.28
    sync percentiles (usec):
     |  1.00th=[   824],  5.00th=[   898], 10.00th=[   947], 20.00th=[  1029],
     | 30.00th=[  2073], 40.00th=[  2868], 50.00th=[  3032], 60.00th=[  3163],
     | 70.00th=[  4113], 80.00th=[  6194], 90.00th=[  8586], 95.00th=[ 10028],
     | 99.00th=[ 13042], 99.50th=[ 15139], 99.90th=[ 35390], 99.95th=[ 87557],
     | 99.99th=[459277]

is that sufficiently better to call this "fixed"?

cgwalters commented 4 years ago

Now that I look more closely we've mostly swapped having a huge stdev (and maximum) for a much worse average, and actually pushing the 99th percentile over 10ms. Argh.

hexfusion commented 4 years ago

TBH the 10ms is more or less arbitrary in this case. A comparison where we felt this threshold held true was with clusters without load. So don't let these results dampen hope.

cgwalters commented 4 years ago

Can you speak to how bad would be those potential huge 2s stalls for etcd be versus a longer period of increased latency? Does it sound likely that having huge spikes or two like that is the root cause for disruption?

hexfusion commented 4 years ago

when we are looking at "why are we doing this", "why does it matter" we need to consider the following. The latency on disk is directly related to election timeouts (leader elections) as the time in which it takes to persist raft transactions to the WAL file is very high. So if we go to 2s fsync with for example Azure, where the election timeout is 2500ms we are pushing the cluster towards timeout.

Where if instead, we have much lower yet elevated latency even over a longer time series we could see api with some slowness but chances are far less we introduce election as a result. So the net gain is less disruption. Also consider the operators who use leader election. They must update TTL every 10s or go through leader election process. So if etcd has huge spike in latency it could result in an operator perhaps not being able to update configmap in time.

In short its easier for etcd to tolerate elevated latency over a period of time vs very heavy spike.

cgwalters commented 4 years ago

Here's another sub-thread though: Since we know we're very shortly going to be rebooting this machine, would it help to tell etcd "if this node is a leader, please re-elect a different one"? Or to rephrase, are latency spikes much worse of a problem on the leader?

ironcladlou commented 4 years ago

Here's another sub-thread though: Since we know we're very shortly going to be rebooting this machine, would it help to tell etcd "if this node is a leader, please re-elect a different one"? Or to rephrase, are latency spikes much worse of a problem on the leader?

This is a good question, I think it has been discussed before elsewhere in more depth. I think @hexfusion has some input here re: actively designating a leader intelligently (if we can predict the reboot order)

cgwalters commented 4 years ago

Moved that to https://github.com/openshift/cluster-etcd-operator/issues/392

hexfusion commented 4 years ago

Yeah the net result of rolling reboots will be leader change right it is going to happen, leader will go down. When etcd gets SIGTERM it will try to gracefully transfer leadership vs election. But the result is still "leader change" which results in a fail-fast to clients to retry (TTL resets etc). Because technically there is a time where we have no leader even if just a few ms. So while we expect 1 leader change what we see as a result here can be multiple.

cgwalters commented 4 years ago

So while we expect 1 leader change what we see as a result here can be multiple.

Hmm...actually there can be anywhere between 1 and 3 elections - if the MCO happens to choose the leader each time to upgrade. Right? So I think perhaps instead of https://github.com/openshift/cluster-etcd-operator/issues/392 the MCO should prefer upgrading followers first...right?

hexfusion commented 4 years ago

Hmm...actually there can be anywhere between 1 and 3 elections - if the MCO happens to choose the leader each time to upgrade. Right? So I think perhaps instead of openshift/cluster-etcd-operator#392 the MCO should prefer upgrading followers first...right?

I don't disagree that chance is currently involved. Logically choosing non-leader to upgrade first would help hedge those bets.

cgwalters commented 4 years ago

I don't disagree that chance is currently involved. Logically choosing non-leader to upgrade first would help hedge those bets.

I can say for sure that the MCO code today makes no attempt to apply any kind of ordering/priority to the nodes it's going to update. It could in fact be effectively random.

Adding a little bit of logic here for the control plane to avoid the etcd leader should be pretty easy...is the current leader reflected anywhere in the API? I am not seeing it in oc describe clusteroperator/etcd or oc get etcd etc.

ironcladlou commented 4 years ago

Adding a little bit of logic here for the control plane to avoid the etcd leader should be pretty easy...is the current leader reflected anywhere in the API? I am not seeing it in oc describe clusteroperator/etcd or oc get etcd etc.

this seems like a very easy thing to add on our end, either via status field or possibly a condition — hard for me to say offhand what kind of disclaimers such a field might need (e.g. should it be considered more of a "hint" than a super accurate source of truth)

cgwalters commented 4 years ago

this seems like a very easy thing to add on our end, either via status field or possibly a condition — hard for me to say offhand what kind of disclaimers such a field might need (e.g. should it be considered more of a "hint" than a super accurate source of truth)

That'd be great! I think this is clearly just a "hint". (It's kind of...funny...storing etcd state in the API because that goes back into etcd...but eh)

ironcladlou commented 4 years ago

(It's kind of...funny...storing etcd state in the API because that goes back into etcd...but eh)

😂oh yeah... ♾

ironcladlou commented 4 years ago

@cgwalters @hexfusion moving the API discussion here https://github.com/openshift/api/pull/694

cgwalters commented 4 years ago

OK so here's my recommendations so far:

1) Ensure the MCO only upgrades etcd followers by default until there is only the leader left (I suspect this alone will be a big improvement) - this should be really easy to implement once we have the above etcd API (and if we have to backport it we can directly talk to the etcd gRPC API too or something) 2) Land https://github.com/ostreedev/ostree/pull/2152 3) Change the MCO to use bfq for the control plane only (to start), and also to inject IOSchedulingClass=idle for rpm-ostreed.service (proposed upstream), as well as the oscontainer pull ionice -c 3 podman pull (or soon ionice -c 3 oc image extract ...).

Later work

1) Serve oscontainer via HTTP and avoid pulling it per node (this would get us delta updates for free) 2) Use cgroups v2 for IO control (this is the only sustainable solution for schedulable control planes) 3) Investigate using real-time scheduling for etcd probably SCHED_FIFO

ironcladlou commented 4 years ago

@cgwalters with your WIP (https://github.com/openshift/machine-config-operator/pull/1946) have you been able to measure the before/after impact of smarter reboot ordering?

Re: the possible API addition (https://github.com/openshift/api/pull/694) it might be too late in the release to press forward this week, but having some data on what we stand to gain from some solution would be really useful and something we haven't gotten around to yet. I wonder if your WIP helps us get that data faster...

cgwalters commented 4 years ago

The tricky thing is: it seems useful to measure success (at least partially) by number of leader elections. And if we do that we need the MCO to be doing the right thing in the first place.

Anyways like I said, I think we can just hack this into the MCO code directly for now, so we're not blocked.

cgwalters commented 4 years ago

One thing I want to note on this: The "etcd leader awareness" part of this is probably going to be nontrivial in general, and might get into backport conflicts.

But I think we can pretty easily backport some tweaks/hacks to make the oscontainer update process much nicer even without that (and without the ostree per-object-fsync patch). If need be we can e.g. do something crude like strace -e write --inject write=delay_enter:5 $(pidof rpm-ostreed)) or so.

cgwalters commented 4 years ago

I had been thinking that the existing e2e-gcp-upgrade job would be a useful proxy for this, but it turns out I'd been misled by https://github.com/openshift/machine-config-operator/issues/1964 into thinking we were doing an OS update when we aren't actually.

And this really gets into the point that nothing in the MCO jobs today is actually testing OS updates. The periodics run post-merge which test upgrades between nightly release images do cover this though of course.

We could fix both problems at once by enhancing e2e-gcp-upgrade to synthesize a nontrivially-sized OS update too. I actually wrote some code for this as part of https://github.com/ostreedev/ostree/pull/2127 - basically mutate (in a benign way) a subset of ELF files is an easy way to do this.

ironcladlou commented 4 years ago

The tricky thing is: it seems useful to measure success (at least partially) by number of leader elections. And if we do that we need the MCO to be doing the right thing in the first place.

Anyways like I said, I think we can just hack this into the MCO code directly for now, so we're not blocked.

Would it help if we were able to see the p99 fsync and commit metrics aggregated across a meaningfully sized set of CI runs that included the nontrivial OS upgrade? Absolute election count is one important metric, but it seems to me that we're also interested in knowing how much headroom we're providing to assess the risk of inducing election cycles in the near future even if an election didn't actually happen during the test.

Can/should we come up with some clear SLOs against these metrics for some specific test suites?

openshift-bot commented 3 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 3 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 3 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot commented 3 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/machine-config-operator/issues/1897#issuecomment-762015662): >Rotten issues close after 30d of inactivity. > >Reopen the issue by commenting `/reopen`. >Mark the issue as fresh by commenting `/remove-lifecycle rotten`. >Exclude this issue from closing again by commenting `/lifecycle frozen`. > >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

openshift / machine-config-operator

Bug 1850057: stage OS updates (nicely) while etcd is still running #1897

OS upgrade I/O competes with etcd

Links

Example failing jobs:

Prometheus queries

peaking I/O activity during this time series, in general, I/O remains higher

podman pull (update payload pull?)

kworker/u8:4+fl

rpm-ostree

crio

kube scheduler

Run a HTTP server (kube service) that mounts the oscontainer

Later work