k3s-io / kine

Run Kubernetes on MySQL, Postgres, sqlite, dqlite, not etcd.
Apache License 2.0

[nats] NATS backend rewrite, perf and stability improvements #194

Closed: bruth closed this pull request 9 months ago

bruth commented 1 year ago

@brandond Quick question on desired Update semantics. The previous implementation of the NATS backend largely assumed a single node since it held a lock during the Update call. It did a read from the NATS KV, did some checks, and then applied a KV Put, which performs a blind write.

However, in a clustered setup, e.g. three k3s servers, each will be watching, competing for leases, etc., so interleaved writes can definitely be attempted. NATS KV has a native Update operation that takes the expected last sequence (version) of the key, effectively performing a conditional write. This prevents last-writer-wins behavior, i.e. it gives per-key serializability.

All that said, watching the logs to see which sets of keys get updated most aggressively, it seems this could be problematic for keys such as /registry/apiregistration.k8s.io/apiservices/v1beta1.metrics.k8s.io.

With the native etcd implementation in clustered mode, are updates serialized per key? Reads/watches are always going to be stale when a write is attempted, so it seems like if the loop is to attempt, then refetch, then retry, this would exacerbate the problem.

brandond commented 12 months ago

With the native etcd implementation in clustered mode, are updates serialized per key? Reads/watches are always going to be stale when a write is attempted, so it seems like if the loop is to attempt, then refetch, then retry, this would exacerbate the problem.

Kubernetes uses an etcd transaction to ensure that the key revision is the same as what was expected. If the revision matches, the key is written. If it does not match, the current value is returned, and the client gets a conflict error along with a copy of the current object state. The client is expected to merge its intended changes into the current object state and retry.

If NATS is not honoring this transaction behavior, then there is probably a significant potential for incorrect behavior.

You can find the client's transaction at https://github.com/kubernetes/kubernetes/blob/3cf3702d1e638548ba9fcfd957ebae43e1a573cb/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L527-L533

bruth commented 12 months ago

If NATS is not honoring this transaction behavior [...]

Thanks for the clarification. To confirm, the previous implementation of the backend was not honoring the transactional behavior, but this patch does.

bruth commented 9 months ago

There are a couple remaining items to clean up, but this is nearly complete. There are a number of changes, so I am happy to annotate the PR with comments on specific points of interest if that is preferred.

sttts commented 9 months ago

Since the resource version semantics were missing from the old implementation, it makes me wonder whether the Kubernetes conformance tests are run against any of these backends, especially new ones. There are pretty extensive tests around watches and resource versions in there, and they would likely have exposed the missing RV support.

bruth commented 9 months ago

I agree and would have appreciated those from the get-go. The previous implementation was not formally advertised as supporting HA mode, and it used mutexes to lock on the various key prefixes. This serialized writes; however, the versions returned for certain operations were not well defined.

My understanding is that the CI runs a set of benchmarks, but I am not sure whether this implicitly checks for conformance as well?

dereknola commented 9 months ago

The tests do not check for conformance; they only validate basic functionality. AFAIK, conformance checks are only done by QA with a full K3s release.

dereknola commented 9 months ago

The merging of https://github.com/k3s-io/kine/pull/233 has likely caused a merge conflict, sorry.

bruth commented 9 months ago

Understood @dereknola and no worries with the conflict.

FWIW, I did dig into the LimitedServer layer that calls into the Backend interface to see how that layer deals with errors and revisions and what not. Not a true conformance test, but did my best to understand the semantics.

brandond commented 9 months ago

Looks like there are conflicts - can you rebase, and we'll see if the tests pass?

bruth commented 9 months ago

@brandond I see the changes to the Watch API; we will rebase and update the code today.

bruth commented 9 months ago

Just an FYI: another NATS server release is likely going out tomorrow (some fixes, timed with the new Go releases). If we can delay the merge another day, I will update to this new release before it gets incorporated into the next k3s release.

brandond commented 9 months ago

Watching here for an update; will merge once updated.

bruth commented 9 months ago

2.10.5 is delayed to either tomorrow (Thursday) or Friday. Not sure if you want to move ahead; then I can open a PR to update the dependency afterward. There are some fixes relevant to the kine/k3s workload, but I did not want to block your upstream release.

brandond commented 9 months ago

I'm in no hurry to get anything out, there are no Kubernetes releases in November so we're not doing a K3s release that would even ship any of this until December.

bruth commented 9 months ago

Sounds great, will update later this week.

bruth commented 9 months ago

Updated this to 2.10.5 (released yesterday), so ready to merge from my perspective.