Closed cgwalters closed 3 years ago
We also had problems in the past with applying config changes to /etc
(including the static pod manifest) while the static pod was still running; I tried looking at git log --grep=etcd
but didn't see it. If we change the flow to ensure we're stopping etcd before we apply changes on the filesystem in /etc
, that would also have fixed that earlier problem.
(Bigger picture we're being pulled in two different directions here - https://github.com/openshift/enhancements/pull/159 is all about "apply changes without rebooting!" whereas this direction is more "make updates more transactional as part of the reboot!")
Apply updates without rebooting is not a system design goal yet :)
The design goal is “upgrade is completely non disruptive to the system”. Right now it’s possible that static pods like etcd might actually have to be shut down prior to any disruption. Note that it’s not clear what is going on yet.
Yep, we're in agreement there! I am hopeful we can mimimize the upgrade disruption - we aren't making any attempt to do so today, and I would be not all surprised to discover the oscontainer/rpm-ostree upgrade bit is disruptive as far as block IO goes.
To flesh this out a bit today, we implemented a lot of sophisticated logic as part of zincati that the MCO can use too. The basic thing is - if we happen to be interrupted after "staging" an update, we only want to actually have it take effect on reboot at the very end. If we e.g. fail to drain, or the kernel happens to crash and we get rebooted, we don't want that pending update to take effect!
See https://github.com/coreos/rpm-ostree/pull/1814 and https://github.com/coreos/zincati/blob/50d51ce97c63fd35fac9bf5c702e4fe299c156c6/src/rpm_ostree/cli_finalize.rs#L21
IOW the flow is like this:
if diff.oscontainer != "" {
rpm_ostree_upgrade(diff.oscontainer, "--lock-finalization")
}
drain()
write_config_changes(diff.filesystem)
stop_kubelet()
finalize_deployment(diff.oscontainer)
Where only at that very last step do we perform the swap of the bootloader config (and that's basically it).
Update after triage from etcd team.
It appears that we have a direct correlation to increased I/O latencies leading to possible control-plane disruption and the chain of events that happens during upgrade.
steps performed for data experiment
results as observed from sbatsche-atop10-2020-vrpt6-master-1
etcd_disk_wal_fsync_duration_seconds_bucket
and etcd_disk_backend_commit_duration_seconds_bucket
spike.podman pull -q ...
starts disks report 61% busy r/0 w/1847 MBw/s 266.3 avio 0.55 mskworker/u8:4+fl
starts disks report 78% busy r/0 w/14900 MBw/s 242.1 avio 0.32 ms/usr/bin/rpm-ostree start-daemon
starts and instantly disks report 77% busy r/981 w/57 avio 0.74 ms/usr/bin/rpm-ostree start-daemon
running disks report 78% busy r/404 w/1234 MBw/s 257.0 avio 0.42 ms/usr/bin/rpm-ostree start-daemon
running disks report 91% busy r/853 w/1268 MBw/s 229.7 avio 0.41 ms/usr/bin/rpm-ostree start-daemon
running disks report 95% busy r/1162 w/75 MBw/s 229.7 avio 0.69 ms/usr/bin/crio wipe
starts disks report 87% busy r/896 w/281 MBw/s 8.9 avio 0.74 mskube-scheduler
spikes disks report 73% busy r/1124 w/0 MBr/s 27.3 | MBw/s 3.1 avio 0.64 msconclusion/open question: can we perform these actions in a way that less disruption to fs.
OK so I'm playing with things like IOWeight=
and IOSchedulingClass=idle
etc. (see e.g. https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html http://0pointer.de/blog/projects/resources.html ) and...none of it is having any effect.
I'm playing with this etcd fio job: https://github.com/etcd-io/etcd/issues/10577#issuecomment-475624306
(Does etcd really use O_DIRECT
? I need to verify that)
Also as far as I can tell, we're basically not exposing or using filesystem/block IO cgroups to pods right? (Basically just cpu/memory)
So this issue of balancing etcd with ostree is just a subset of the generic I/O control for workloads. I think this is related to cgroups v2: https://andrestc.com/post/cgroups-io/ - definitely for ostree we use buffered writes. We may have to just manually tune down our write speed.
We have discussed throttling image pulls as well in podman/crio till we can use cgroups v2 features.
I am working on a writeup around this in https://hackmd.io/WeqiDWMAQP2sNtuPRul9QA
Run a HTTP server (kube service) that mounts the oscontainer
This makes a lot of sense to me and clearly scales better with the size of the cluster. (We could even go fancier in the future and have the service generate static deltas to trade CPU cycles on one node for even more I/O efficiency across the cluster; and at least CPU cycles are more easily accountable/schedulable.)
Another advantage of that approach is that given that network speed is normally slower than disk speed, you also automatically get less demanding disk I/O usage.
More updates in the doc but to summarize: It seems that what ostree is doing today can introduce e.g. 2 second latency for etcd fdatasync. We can rework things so we fsync as we go, avoiding a single big flush. This slows down the update a lot, but also greatly reduces the max latency (from ~2s to .46s).
In other words, if we take this from:
fsync/fdatasync/sync_file_range:
sync (usec): min=732, max=1935.7k, avg=2594.14, stdev=19014.96
sync percentiles (usec):
| 1.00th=[ 832], 5.00th=[ 881], 10.00th=[ 914], 20.00th=[ 971],
| 30.00th=[ 1020], 40.00th=[ 1090], 50.00th=[ 2933], 60.00th=[ 3032],
| 70.00th=[ 3064], 80.00th=[ 3130], 90.00th=[ 3228], 95.00th=[ 3425],
| 99.00th=[ 6194], 99.50th=[ 7963], 99.90th=[ 17433], 99.95th=[ 95945],
| 99.99th=[843056]
to:
fsync/fdatasync/sync_file_range:
sync (usec): min=741, max=461949, avg=3970.41, stdev=6696.28
sync percentiles (usec):
| 1.00th=[ 824], 5.00th=[ 898], 10.00th=[ 947], 20.00th=[ 1029],
| 30.00th=[ 2073], 40.00th=[ 2868], 50.00th=[ 3032], 60.00th=[ 3163],
| 70.00th=[ 4113], 80.00th=[ 6194], 90.00th=[ 8586], 95.00th=[ 10028],
| 99.00th=[ 13042], 99.50th=[ 15139], 99.90th=[ 35390], 99.95th=[ 87557],
| 99.99th=[459277]
is that sufficiently better to call this "fixed"?
Now that I look more closely we've mostly swapped having a huge stdev (and maximum) for a much worse average, and actually pushing the 99th percentile over 10ms. Argh.
TBH the 10ms is more or less arbitrary in this case. A comparison where we felt this threshold held true was with clusters without load. So don't let these results dampen hope.
Can you speak to how bad would be those potential huge 2s stalls for etcd be versus a longer period of increased latency? Does it sound likely that having huge spikes or two like that is the root cause for disruption?
when we are looking at "why are we doing this", "why does it matter" we need to consider the following. The latency on disk is directly related to election timeouts (leader elections) as the time in which it takes to persist raft transactions to the WAL file is very high. So if we go to 2s fsync with for example Azure, where the election timeout is 2500ms we are pushing the cluster towards timeout.
Where if instead, we have much lower yet elevated latency even over a longer time series we could see api with some slowness but chances are far less we introduce election as a result. So the net gain is less disruption. Also consider the operators who use leader election. They must update TTL every 10s or go through leader election process. So if etcd has huge spike in latency it could result in an operator perhaps not being able to update configmap in time.
In short its easier for etcd to tolerate elevated latency over a period of time vs very heavy spike.
Here's another sub-thread though: Since we know we're very shortly going to be rebooting this machine, would it help to tell etcd "if this node is a leader, please re-elect a different one"? Or to rephrase, are latency spikes much worse of a problem on the leader?
Here's another sub-thread though: Since we know we're very shortly going to be rebooting this machine, would it help to tell etcd "if this node is a leader, please re-elect a different one"? Or to rephrase, are latency spikes much worse of a problem on the leader?
This is a good question, I think it has been discussed before elsewhere in more depth. I think @hexfusion has some input here re: actively designating a leader intelligently (if we can predict the reboot order)
Yeah the net result of rolling reboots will be leader change right it is going to happen, leader will go down. When etcd gets SIGTERM it will try to gracefully transfer leadership vs election. But the result is still "leader change" which results in a fail-fast to clients to retry (TTL resets etc). Because technically there is a time where we have no leader even if just a few ms. So while we expect 1 leader change what we see as a result here can be multiple.
So while we expect 1 leader change what we see as a result here can be multiple.
Hmm...actually there can be anywhere between 1 and 3 elections - if the MCO happens to choose the leader each time to upgrade. Right? So I think perhaps instead of https://github.com/openshift/cluster-etcd-operator/issues/392 the MCO should prefer upgrading followers first...right?
Hmm...actually there can be anywhere between 1 and 3 elections - if the MCO happens to choose the leader each time to upgrade. Right? So I think perhaps instead of openshift/cluster-etcd-operator#392 the MCO should prefer upgrading followers first...right?
I don't disagree that chance is currently involved. Logically choosing non-leader to upgrade first would help hedge those bets.
I don't disagree that chance is currently involved. Logically choosing non-leader to upgrade first would help hedge those bets.
I can say for sure that the MCO code today makes no attempt to apply any kind of ordering/priority to the nodes it's going to update. It could in fact be effectively random.
Adding a little bit of logic here for the control plane to avoid the etcd leader should be pretty easy...is the current leader reflected anywhere in the API? I am not seeing it in oc describe clusteroperator/etcd
or oc get etcd
etc.
Adding a little bit of logic here for the control plane to avoid the etcd leader should be pretty easy...is the current leader reflected anywhere in the API? I am not seeing it in oc describe clusteroperator/etcd or oc get etcd etc.
this seems like a very easy thing to add on our end, either via status field or possibly a condition — hard for me to say offhand what kind of disclaimers such a field might need (e.g. should it be considered more of a "hint" than a super accurate source of truth)
this seems like a very easy thing to add on our end, either via status field or possibly a condition — hard for me to say offhand what kind of disclaimers such a field might need (e.g. should it be considered more of a "hint" than a super accurate source of truth)
That'd be great! I think this is clearly just a "hint". (It's kind of...funny...storing etcd state in the API because that goes back into etcd...but eh)
(It's kind of...funny...storing etcd state in the API because that goes back into etcd...but eh)
😂oh yeah... ♾
@cgwalters @hexfusion moving the API discussion here https://github.com/openshift/api/pull/694
OK so here's my recommendations so far:
1) Ensure the MCO only upgrades etcd followers by default until there is only the leader left (I suspect this alone will be a big improvement) - this should be really easy to implement once we have the above etcd API (and if we have to backport it we can directly talk to the etcd gRPC API too or something)
2) Land https://github.com/ostreedev/ostree/pull/2152
3) Change the MCO to use bfq
for the control plane only (to start), and also to inject IOSchedulingClass=idle
for rpm-ostreed.service
(proposed upstream), as well as the oscontainer pull ionice -c 3 podman pull
(or soon ionice -c 3 oc image extract ...
).
1) Serve oscontainer via HTTP and avoid pulling it per node (this would get us delta updates for free)
2) Use cgroups v2 for IO control (this is the only sustainable solution for schedulable control planes)
3) Investigate using real-time scheduling for etcd probably SCHED_FIFO
@cgwalters with your WIP (https://github.com/openshift/machine-config-operator/pull/1946) have you been able to measure the before/after impact of smarter reboot ordering?
Re: the possible API addition (https://github.com/openshift/api/pull/694) it might be too late in the release to press forward this week, but having some data on what we stand to gain from some solution would be really useful and something we haven't gotten around to yet. I wonder if your WIP helps us get that data faster...
The tricky thing is: it seems useful to measure success (at least partially) by number of leader elections. And if we do that we need the MCO to be doing the right thing in the first place.
Anyways like I said, I think we can just hack this into the MCO code directly for now, so we're not blocked.
One thing I want to note on this: The "etcd leader awareness" part of this is probably going to be nontrivial in general, and might get into backport conflicts.
But I think we can pretty easily backport some tweaks/hacks to make the oscontainer update process much nicer even without that (and without the ostree per-object-fsync
patch). If need be we can e.g. do something crude like strace -e write --inject write=delay_enter:5 $(pidof rpm-ostreed))
or so.
I had been thinking that the existing e2e-gcp-upgrade
job would be a useful proxy for this, but it turns out I'd been misled by https://github.com/openshift/machine-config-operator/issues/1964 into thinking we were doing an OS update when we aren't actually.
And this really gets into the point that nothing in the MCO jobs today is actually testing OS updates. The periodics run post-merge which test upgrades between nightly release images do cover this though of course.
We could fix both problems at once by enhancing e2e-gcp-upgrade
to synthesize a nontrivially-sized OS update too. I actually wrote some code for this as part of https://github.com/ostreedev/ostree/pull/2127 - basically mutate (in a benign way) a subset of ELF files is an easy way to do this.
The tricky thing is: it seems useful to measure success (at least partially) by number of leader elections. And if we do that we need the MCO to be doing the right thing in the first place.
Anyways like I said, I think we can just hack this into the MCO code directly for now, so we're not blocked.
Would it help if we were able to see the p99 fsync and commit metrics aggregated across a meaningfully sized set of CI runs that included the nontrivial OS upgrade? Absolute election count is one important metric, but it seems to me that we're also interested in knowing how much headroom we're providing to assess the risk of inducing election cycles in the near future even if an election didn't actually happen during the test.
Can/should we come up with some clear SLOs against these metrics for some specific test suites?
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen
.
If this issue is safe to close now please do so with /close
.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten
.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen
.
If this issue is safe to close now please do so with /close
.
/lifecycle rotten /remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen
.
Mark the issue as fresh by commenting /remove-lifecycle rotten
.
Exclude this issue from closing again by commenting /lifecycle frozen
.
/close
@openshift-bot: Closing this issue.
See https://bugzilla.redhat.com/show_bug.cgi?id=1850057
(This content is canonically stored at https://github.com/cgwalters/workboard/tree/master/openshift/bz1850057-etcd-osupdate )
OS upgrade I/O competes with etcd
Currently the MCD does:
Now "drain" keeps both daemonsets and static pods running. Of those two, etcd is a static pod today. When we're applying OS updates, that can be a lot of I/O and (reportedly) compete with etcd.
We have two options:
I like option 2) better because we've put a whole lot of work into making the ostree stack support this "stage updates while system is running" and it'd be cool if OpenShift used it :smile: Another way to say this is - I think we want to minimize the time window in which the etcd cluster is missing a member, so the more work we can do while etcd is still running the better!
Links
Workboard:
Pull request in progress: https://github.com/openshift/machine-config-operator/pull/1957
Example failing jobs:
Note that both of those jobs jumped from RHEL 8.1 to 8.2.
Prometheus queries