Open bruth opened 1 year ago
Looking forward to seeing this
cc @rancher-max @cwayne18
This is cool and a great feature suggestion! Thank you!
I have some clarifying questions to determine how deep down the proverbial rabbit hole we should go:
cluster-reset/cluster-reset-restore-path functionality?
b. Would it be a new command?
c. Does it follow NATS' approach or is it done differently?
etcd files are required even if embedded etcd is not in use.
Those are all good questions!
At the moment I see the embedded NATS as a replacement for sqlite only; while it is possible to host a multi-node cluster using the embedded NATS server, @bruth or someone on his team will need to provide instructions on how to set this up as I believe it requires a user-managed config file to accomplish.
If it is desired that K3s support multi-server clusters by managing the configuration and cluster membership, allow for backup/restore using the embedded NATS datastore, and all the other stuff that would provide complete parity with the embedded etcd datastore, I think that would also need to be driven by someone on the Synadia side.
I agree that some ops aspects need to be added or documented.
need to provide instructions on how to set this up as I believe it requires a user-managed config file to accomplish.
This can be accomplished programmatically without config files for this particular setup. The Kine integration relies on the NATS server package which makes all of the config options available to be configured.
Since this would be a k3s feature, we would likely need to add support for additional query params on the Kine endpoint to indicate "cluster-mode" for example. But that design can get worked out to prevent needing users to manually define config files. It should be opt-in if they want more control, but not required.
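To make the query-param idea concrete, here is a minimal sketch of how an endpoint string could carry an opt-in cluster flag. The `clusterMode` parameter name, the `EndpointConfig` type, and the `ParseEndpoint` helper are all hypothetical and only for illustration; they are not existing Kine or k3s options, and the actual design is still to be worked out.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// EndpointConfig holds settings parsed from a NATS-style datastore endpoint.
// ClusterMode is a hypothetical option used only for illustration.
type EndpointConfig struct {
	Servers     []string
	ClusterMode bool
}

// ParseEndpoint splits a comma-separated list of nats:// URLs and reads
// the query parameters that follow the first "?".
func ParseEndpoint(dsn string) (EndpointConfig, error) {
	var cfg EndpointConfig
	base, rawQuery, _ := strings.Cut(dsn, "?")
	for _, s := range strings.Split(base, ",") {
		u, err := url.Parse(strings.TrimSpace(s))
		if err != nil {
			return cfg, fmt.Errorf("invalid server %q: %w", s, err)
		}
		if u.Scheme != "nats" || u.Host == "" {
			return cfg, fmt.Errorf("invalid server %q: want nats://host[:port]", s)
		}
		cfg.Servers = append(cfg.Servers, u.Host)
	}
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return cfg, fmt.Errorf("invalid query: %w", err)
	}
	// Cluster mode stays off unless the user explicitly opts in.
	if v := q.Get("clusterMode"); v != "" {
		cfg.ClusterMode, err = strconv.ParseBool(v)
		if err != nil {
			return cfg, fmt.Errorf("invalid clusterMode %q: %w", v, err)
		}
	}
	return cfg, nil
}

func main() {
	cfg, err := ParseEndpoint("nats://10.0.0.1:4222,nats://10.0.0.2:4222?clusterMode=true")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Defaulting the flag to off keeps the single-node behavior unchanged for anyone who does not pass the parameter.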
the other stuff that would provide complete parity with the embedded etcd datastore, I think that would also need to be driven by someone on the Synadia side.
That is the intent for sure and why I am looking for guidance to understand the scope of complete parity! I don't want to boil the ocean in one pass if there is too much, but this is a good first list.
- Is k3s expected to supply backup/restore functionality?
If this functionality sits behind an interface, then we can hook in the standard NATS method of backing up and restoring stream/consumer state. I will need to read up on what k3s does today to compare.
- Should an operator be able to run NATS in their cluster while also using it as the embedded datastore?
They certainly should be able to run an additional server/cluster in k3s itself, independent of the embedded one, if they choose to. They shouldn't need to; however, I could understand the argument that they don't want to mix k3s and application concerns, or risk applications impacting the embedded server/cluster, and so prefer a clear boundary.
One could say the same about etcd, but one distinction with NATS is that, with its multi-tenancy support, the k3s/kine state and messaging would be completely isolated from any applications.
In terms of recommended approaches, having a set of use cases and/or considerations on whether to reuse the embedded cluster vs. running another container should be sufficient for people to make that decision.
- Should NATS certs be rotated during manual certificate rotation?
Based on the link, it looks like k3s is temporarily shut down to do the cert rotation? That would certainly work for NATS as well. Custom CAs can also be set in the NATS config.
Hey @VestigeJ, I saw you assigned this to yourself! Are you actively working on this or interested in collaborating?
Hey @bruth I DM'd you back on your home Slack if you want to work together I'd be more than happy to. :)
@bruth Did this get put onto a back burner on the Synadia side?
@VestigeJ I think we're waiting on
@VestigeJ if it has been put on a back-burner then it would be very unfortunate that @bruth chose to highlight it on a recent podcast.
@udf2457 that comment is probably best directed at @bruth himself, not anyone on the K3s team. NATS support is maintained by the Synadia folks.
@udf2457 This was on a temporary back burner; focus has been on the NATS 2.10 release for the past couple of months. The Kine PR works, but there are a couple of remaining subtle recovery issues to address (likely tweaking a couple of timeouts). Now that 2.10 is out, focus is shifting back, and I will have an update next week.
Hey folks, just giving a quick update so it doesn't get lost in the void again. I made some more progress today on the Kine PR (k3s-io/kine#194), including porting the client code to the new JetStream API. I am debugging a few remaining things, but planning to have it ready for review and merge early next week.
As it pertains to this issue, it will support HA mode without needing to change anything in k3s itself. This is a simpler option/better outcome IMO given how intertwined etcd as a dependency is (outside of kine).
Regarding backup/restore, this can be achieved out-of-band using standard NATS utilities. If there is a strong desire to get them baked into k3s utilities, I am happy to move that along.
Converted https://github.com/k3s-io/kine/pull/194 to ready for review. There are some final bits to clean up and testing a couple failure cases, but in a good spot. Docs will come in the next couple days.
Bumping this back out; embedded NATS support is still disabled by build flag. We'll need to add -tags nats to the K3s build flags to enable this.
At the moment nats only supports external servers.
@brandond Other than documentation, what would be helpful to have this be supported in v1.29?
Docs would be good, and maybe get a PR open now to add the build flag so we can see what the current size impact is?
Looks like it adds about 2MB to the K3s size. I'm seeing the binary go from 58MB to 60MB
derek@degion:~/rancher/k3s$ ls -lh ./dist/artifacts/
total 247M
-rwxr-xr-x 1 derek derek 60M Nov 16 09:54 k3s
Testing note - stalled currently for December or January releases
@brandond Is this feature still planned?
Conformance tests need to pass first:
Is your feature request related to a problem? Please describe.
Currently, embedded HA is supported only by etcd. With the option of embedded NATS that was added to Kine (as of v0.10.0/v0.10.1), NATS can be another option since it supports native clustering as well.
Describe the solution you'd like
Add native support for NATS as an alternative cluster option when doing --cluster-init.
Describe alternatives you've considered
There are no other native options; however, using an external NATS configuration (when configuring the --datastore-endpoint), the nodes can be clustered without the k3s layer being aware that it is clustered. This provides HA/FT of the KV data, but k3s is unaware of this and not technically running in clustered mode.
Additional context
I plan on contributing this, but any guidance or things to be aware of is welcome!
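For context on the HA/FT point above: JetStream replicates data using a Raft-based consensus group, so the number of server failures a clustered datastore can tolerate follows from the replica count. The sketch below just works through that arithmetic; the function names are illustrative and not part of any NATS, Kine, or k3s API.

```go
package main

import "fmt"

// Quorum returns how many replicas must be available for a
// Raft-style group of the given size to keep making progress.
func Quorum(replicas int) int {
	return replicas/2 + 1
}

// TolerableFailures returns how many replicas can be lost
// while a quorum still remains.
func TolerableFailures(replicas int) int {
	if replicas < 1 {
		return 0
	}
	return replicas - Quorum(replicas)
}

func main() {
	for _, r := range []int{1, 3, 5} {
		fmt.Printf("replicas=%d quorum=%d tolerates=%d\n",
			r, Quorum(r), TolerableFailures(r))
	}
}
```

With three replicas, one server can fail without losing availability of the KV data; a single replica tolerates none.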