jsanda opened 2 years ago
I was brainstorming with @emerkle826 about this. I think he is going to work on a small POC.
This is an interesting idea. What is our policy on separating the control and data planes? This piece of work seems to redraw the line on that separation in a different place to where Kubernetes generally draws it.
The interface boundary in Kubernetes is mostly around POSIX signals, the `command` and `args` fields in the container spec, environment variables, and the filesystem (on the command side), and whether the application is running (on the query side). In general, that's sufficient for the application to signal to the controller whether it is up, and for the controller to configure the application and signal desired state changes to the application.
This ticket seems to provide a lot more information to the controller about application state. We already have a lot of fuzziness between data and control plane in the k8ssandra ecosystem (which might be absolutely fine). But have we ever asked where we want this interface to be drawn?
Is the intention of this line of work to surface Cassandra "events" as Kubernetes events? In which case, have we considered the additional load this places on the API server in larger clusters and whether that provides advantages over a more purely data-plane oriented repair approach?
Conversely, would it make more sense for this functionality to live in Reaper given it is repair oriented and Reaper naturally speaks JMX? Are there any advantages in involving Kubernetes in this sort of repair process?
The motivation for this is to deal with corrupt commit log segments and corrupt SSTables. I am not sure if that information is available through JMX, but I don't think it is.
Can you elaborate on what you mean by a data-plane oriented approach? That's not clear to me as the remediation steps would likely involve orchestration from one or more controllers.
What is being described would have little to no load on the API server. The management-api agent would be sending event notifications to interested parties. The API server doesn't come into play there.
Can you elaborate on what you mean by a data-plane oriented approach? That's not clear to me as the remediation steps would likely involve orchestration from one or more controllers.
I think we need to be precise about what we mean by controller. Reaper orchestrates distributed repairs but isn't a Kubernetes controller in any sense. You might say that it is an application controller or a data-plane controller, but it doesn't interact with the k8s API in any way.
My line of thought is that a Kubernetes controller works against a fairly limited API which really revolves around CRs as inputs and the CRI, CSI, and CNI as outputs. The proposal here appears to be to have some inputs from Cassandra logging and outputs via JMX, nodetool, or REST to the management API.
My question is whether this set of inputs and outputs belongs in a Kubernetes controller or whether it belongs in Reaper (or a new Java application). One reason that this functionality might belong in the controller is if it needs to be exposed in the CR. Are there any changes to the K8ssandraCluster CR implied by this repair work? I don't think there would be any interaction with the CRI, CSI or CNI here, is that correct?
In the case of a corrupt SSTable I believe the remediation steps are:
- Take C* node with corrupt SSTable offline
- Run repair on the offline node's ranges on the rest of the cluster
- Do node replacement
With that in mind let me answer the questions.
My question is whether this set of inputs and outputs belongs in a Kubernetes controller or whether it belongs in Reaper (or a new Java application)
Unless Reaper or a new Java application would interact with the API server, then no, the inputs and outputs do not belong in them, since taking the node offline and doing the node replacement would be done through calls to the API server.
One reason that this functionality might belong in the controller is if it needs to be exposed in the CR. Are there any changes to the K8ssandraCluster CR implied by this repair work?
I don't foresee any API changes needed in either the CassandraDatacenter or K8ssandraCluster CRDs; however, I think it is worth exploring adding a new status condition and/or creating k8s events when the operator gets notification of a corrupt SSTable.
I don't think there would be any interaction with the CRI, CSI or CNI here, is that correct?
Correct. The functionality I am proposing would be entirely agnostic of Kubernetes.
John and I spoke offline. It looks like, due to the presence of a PVC deletion call in the node replacement logic, there is no way to separate it from the Kubernetes PersistentVolumeClaim API. As a result, Reaper would not be a suitable place for this logic to sit.
@emerkle826 I'd be interested in helping with the prototype if you want me to.
We can use a logging Appender, possibly a custom one, that can forward events to interested parties.
So the "events" we want to emit are actually log messages emitted by Cassandra, correct? I was thinking initially that we would be talking about controller events. It would be good to clarify this. Or is the expectation that we are going to somehow convert Cassandra log events into cass-operator controller events?
The management-api server would have to be configured with a set of endpoints that should receive event notifications. For cass-operator/k8ssandra deployments, cass-operator would provide the endpoint, perhaps as an env var or as a JVM system property. We would want a k8s service in front of cass-operator and pass that to the management-api server.
I am not fully understanding the design. 2 probably dumb questions:
@adutra I will take the help if you want, as I'm still not quite clear on the specifics here.
So the "events" we want to emit are actually log messages emitted by Cassandra, correct? I was thinking initially that we would be talking about controller events. It would be good to clarify this. Or is the expectation that we are going to somehow convert Cassandra log events into cass-operator controller events?
The POC design is really just about getting specific Cassandra logs and/or exceptions out to something (likely cass-operator) so that it can take action. Initially, I believe it will just be the logs and exceptions associated with corrupt SSTables.
- What would mgmt-api do with the received events? Store them? Or send them to cass-operator (but how)?
The idea is that Management API could use a Logback SocketAppender to ship the desired logs to a remote socket. The problem I have found is that the SocketAppender just serializes the log event objects (as Java objects) and sends them to the remote socket. That would mean the remote end (presumably cass-operator) would have to be able to deserialize them. That wouldn't be a terrible approach if cass-operator were a Java application, but it's not, so I don't think it's going to be a good solution here.
I think a better solution would be to create a Logback filter and have the filter implementation push the desired logs/exceptions to a remote endpoint (socket, service, etc.) in a more general format (not serialized Logback objects). But maybe we don't even need that. Because...
- Why couldn't we imagine instead a simple side-car that would tail /var/log/cassandra/stdout.log ?
Yes, a sidecar might be a better solution. If a Logback Appender could filter on what we need and get it to a remote socket in a generic parsable format, maybe that would have been easier, but it doesn't seem to be the case. To be honest, the difficult part in Management API doing anything with pushing the logs/exceptions out is going to be the whole pub-sub model underneath. Having to deal with keeping track of where things need to go, figuring out what to do if you can't send things to the remote end... All that stuff overly complicates things. It would be much easier if the entity that is interested in the logs could simply monitor the logs itself.
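To make the sidecar idea a bit more concrete, here is a very rough sketch of what the tailing loop could look like (the log path and the exception-name matching below are assumptions, and log rotation/offset handling is ignored):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Rough sketch of a sidecar tailing the Cassandra log for corruption errors.
// The path and the strings matched below are assumptions, not a settled design.
public class CorruptionLogTailer {

    public static void main(String[] args) throws IOException, InterruptedException {
        try (BufferedReader reader =
                new BufferedReader(new FileReader("/var/log/cassandra/system.log"))) {
            while (true) {
                String line = reader.readLine();
                if (line == null) {
                    // End of file for now; wait for Cassandra to append more.
                    Thread.sleep(1000);
                    continue;
                }
                if (line.contains("CorruptSSTableException")
                        || line.contains("CommitLogReadException")) {
                    handleCorruptionEvent(line);
                }
            }
        }
    }

    private static void handleCorruptionEvent(String line) {
        // Placeholder: notify cass-operator, emit a Kubernetes event, etc.
        System.err.println("Detected possible corruption: " + line);
    }
}
```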
I was thinking initially that we would be talking about controller events. It would be good to clarify this. Or is the expectation that we are going to somehow convert Cassandra log events into cass-operator controller events?
We are talking about events for Cassandra itself. Reread the problem section in the initial description. I am talking specifically about corrupt SSTable files and corrupt commit log segments. The goal for this issue is to see if there is a way for management-api to surface those errors so cass-operator can know about them.
Yes, a sidecar might be a better solution
Integration with centralized logging would definitely be nice to have. I also think it is worth exploring whether we can provide a solution to these problems without requiring centralized logging. If it looks like it will be a whole lot of work, then the time might be better spent thinking about centralized logging.
To be honest, the difficult part in Management API doing anything with pushing the logs/exceptions out is going to be the whole pub-sub model underneath.
Weren't we in agreement on fire and forget?
Having to deal with keeping track of where things need to go, figuring out what to do if you can't send things to the remote end
You don't have to keep track of where things go. The destination address(es) would be provided by cass-operator. For the purposes of a POC, if you can't send things, I would just log a message.
Integration with centralized logging would definitely be nice to have.
Don't we already have that? We output logs and centralized tools can easily gather them. We only need to fix that "why isn't my Cassandra starting and there's no logs" issue.
I also think it is worth exploring whether we can provide a solution to these problems without requiring centralized logging. If it looks like it will be a whole lot of work, then the time might be better spent thinking about centralized logging.
Why aren't we just providing Kubernetes events here? We detect the issue in management-api when it reads the logs and then output an event when something bad happens. Then it's up to whoever picks up the events to ensure the correct processing is done (cass-operator can listen to Kubernetes events and do its processing, no problem). This is what even OpenShift does internally (it has a Kubernetes events collector operator).
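As a very rough sketch (assuming the fabric8 kubernetes-client with its 6.x-style API; the reason string and other values below are placeholders, and the pod would need RBAC permission to create events), emitting such an event from management-api could look roughly like this:

```java
import io.fabric8.kubernetes.api.model.Event;
import io.fabric8.kubernetes.api.model.EventBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

// Sketch only: publish a core/v1 Kubernetes Event against the Cassandra pod
// when corruption is detected in the logs. Names and values are placeholders.
public class CorruptionEventPublisher {

    public void publish(String podName, String namespace, String message) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            Event event = new EventBuilder()
                    .withNewMetadata()
                        .withGenerateName("cassandra-corruption-")
                        .withNamespace(namespace)
                    .endMetadata()
                    .withNewInvolvedObject()
                        .withKind("Pod")
                        .withName(podName)
                        .withNamespace(namespace)
                    .endInvolvedObject()
                    .withType("Warning")
                    .withReason("CorruptSSTable") // hypothetical reason string
                    .withMessage(message)
                    .build();
            client.v1().events().inNamespace(namespace).resource(event).create();
        }
    }
}
```

cass-operator (or anything else watching the namespace) could then filter on that reason and kick off whatever remediation workflow we settle on.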
In the case of a corrupt SSTable I believe the remediation steps are:
- Take C* node with corrupt SSTable offline
- Run repair on the offline node's ranges on the rest of the cluster
- Do node replacement
Also this part: is this process correct? Why would we need to replace the entire node if we have a single corrupt SSTable? Shouldn't take node offline -> remove file -> restart -> repair the ranges be enough? Why would the whole node need to be replaced?
Also, this process might have a correct place in k8ssandra-operator and not cass-operator, if we only wish to utilize reaper for this process. If CassandraTask with repair for those ranges is enough, then cass-operator can do it.
Integration with centralized logging would definitely be nice to have.
Don't we already have that? We output logs and centralized tools can easily gather them.
No, I don't consider outputting logs integration with centralized logging. If we had an option for our controllers to query ElasticSearch for errors, I would consider that an integration.
Why aren't we just providing Kubernetes events here? We detect the issue in management-api when it reads the logs and then output an event when something bad happens.
This could work! @emerkle826 do you want to explore this?
Also this part. Is this process correct?
I got these steps directly from Jeff Jirsa, so I assume they are. It would be good to loop him back into the conversation. @pmcfadin is the one who got this ball rolling :)
Also, this process might have a correct place in k8ssandra-operator and not cass-operator, if we only wish to utilize reaper for this process. If CassandraTask with repair for those ranges is enough, then cass-operator can do it.
agreed
This could work! @emerkle826 do you want to explore this?
@jsanda I can. Let me know if you want me to create a separate issue for it, or just roll it all into this issue.
Problem
SSTables and commit logs can and do get corrupted. These types of errors can cause a Cassandra node to shutdown. In the context of Kubernetes and cass-operator in particular, we lack visibility into these errors. What's worse, the management-api will just keep restarting Cassandra with cass-operator completely unaware of what is going on. It will require outside intervention to resolve the issue. Resolving these sorts of errors is non-trivial for users. cass-operator should be able to handle them for the user.
In order for cass-operator (or anything else that is managing Cassandra in conjunction with the management-api) to remediate these errors, it needs to know about them.
For reference see https://github.com/k8ssandra/k8ssandra/issues/680.
Solution
We can use a logging Appender, possibly a custom one, that can forward events to interested parties.
Ultimately I want cass-operator to receive the events. The management-api agent could provide the Appender. This gives the agent a hook into these events. The agent would in turn notify the management-api server via an RPC call. AFAIK RPC calls only go from server to agent today, so we would have to explore making the calls bi-directional.
The management-api server would have to be configured with a set of endpoints that should receive event notifications. For cass-operator/k8ssandra deployments, cass-operator would provide the endpoint, perhaps as an env var or as a JVM system property. We would want a k8s service in front of cass-operator and pass that to the management-api server.
For an initial implementation the event notification calls can be fire and forget to keep things simple and lightweight. The event payload should be easy to consume, e.g., a JSON object that is easy to parse or maybe a gRPC endpoint.
This appender could be turned off by default and enabled with a system property, at least until we have confidence in its stability. Furthermore, the appender should be turned off (or a no-op) if there are no subscribers.
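As a very rough sketch of what such an appender might look like (the system property name, the filter condition, and the JSON shape below are all placeholders, not a settled design):

```java
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.classic.spi.IThrowableProxy;
import ch.qos.logback.core.AppenderBase;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of a custom Logback appender that forwards corruption-related log
// events as a small JSON payload, fire and forget. It is effectively a no-op
// unless the (hypothetical) mgmtapi.event.endpoint system property is set.
public class CorruptionEventAppender extends AppenderBase<ILoggingEvent> {

    private final HttpClient http = HttpClient.newHttpClient();
    private final String endpoint = System.getProperty("mgmtapi.event.endpoint");

    @Override
    protected void append(ILoggingEvent event) {
        if (endpoint == null) {
            return; // no subscribers configured
        }
        IThrowableProxy error = event.getThrowableProxy();
        // Only forward the errors we care about; the exact filter is an open question.
        if (error == null || !error.getClassName().contains("Corrupt")) {
            return;
        }
        // Crude JSON escaping, good enough for a sketch.
        String json = String.format("{\"exception\":\"%s\",\"message\":\"%s\"}",
                error.getClassName(), event.getFormattedMessage().replace("\"", "'"));
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        // Fire and forget: never block logging; just note locally if the send fails.
        http.sendAsync(request, HttpResponse.BodyHandlers.discarding())
                .exceptionally(ex -> {
                    addWarn("Failed to send corruption event notification", ex);
                    return null;
                });
    }
}
```

With something like this, the endpoint cass-operator provides (per the configuration paragraph above) would simply be the value of that system property.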
Additional Considerations
I specifically mentioned corrupt SSTables and commit logs, but this event notification mechanism could apply to other errors and interesting events.
I briefly looked to see if we could get these events through JMX. I'm not sure if we can, but it is probably worth investigating. Even if we can, consuming JMX events for non-JVM languages can be a challenge.
The solution I am proposing involves a good bit of orchestration. This would be handled by cass-operator, but it could also be done by something else if someone is using the management-api without cass-operator.
Lastly, my goal for this issue is to present a high level design and have some discussion. I anticipate the actual work to be broken out into multiple tickets.
Issue is synchronized with this Jira Story by Unito. Issue Number: MAPI-44