gluster / anthill

A Kubernetes/OpenShift operator to manage Gluster clusters
http://gluster-anthill.readthedocs.io/
Apache License 2.0
35 stars 12 forks source link

Implement Procedure execution as well as first empty procedures and actions. #56

Closed rohantmp closed 5 years ago

rohantmp commented 5 years ago

Describe what this PR does:

Is there anything that requires special attention? Do you have any questions? Did you do something clever?

rohantmp commented 5 years ago

@humblec @rohan47 @sidharthanup

humblec commented 5 years ago

@rohantmp can you please discribe the PR in detail in the commit ? If possible please fill the template, that would help reviewers and also help on recording it in the git log.

humblec commented 5 years ago

@JohnStrunk I also really want to discuss what the candidate actions are. Since order is not preserved in procedure level actions, every possible scenario I can imagine has 1 procedure level action with everything else implemented as a prereq.

Above should go as second PR comment instead of first. :)

rohantmp commented 5 years ago

@humblec is that better?

jarrpa commented 5 years ago

@rohantmp If I understand the intention of Procedures, each will be an ordered list of actions that represents the full reconciliation procedure for a given version of the Operator. The ProcedureList, then, enforces an order of Procedures based on their version attribute. Thus if a GlusterCluster has a currentVersion that is less than the actual current version of the Operator, the Operator must perform all Procedures between currentVersion and it's actual version in sequential order.

@JohnStrunk Do I have it right?

rohantmp commented 5 years ago

@jarrpa

jarrpa commented 5 years ago
* Procedure is [unordered](https://github.com/gluster/anthill/blob/eb675e1d6b48b5caa7818d6b845f48e4d51c4e5f/pkg/reconciler/procedures.go#L93).

...okay I definitely don't understand why that is. :)

* I am not sure when that sequential update should be performed. It might be when the user modifies the GlusterCluster CR or it might be automatic.

Kubernetes generates a periodic Update action on all resources every second (or two?), which would enqueue a new reconciliation for each GlusterCluster.

JohnStrunk commented 5 years ago

Good start. Let me clarify some intentions...

ProcedureList/Procedure/Action/Prereq The ProcedureList is the top-level. It defines the entire operator behavior-- across and between all versions. The NewestCompatible() call is designed to grab the Procedure that is of the highest version (i.e., the newest/most advanced) that is capable of working with the currently deployed configuration.

A Procedure defines the full reconcile behavior to achieve some desired state. The "latest" version of Anthill will have a Procedure (the one w/ the highest currentVersion) that defines how this version of the Anthill should manage a Gluster cluster. If left to reconcile indefinitely and not perturbed, the system state should converge to this. All lesser versions of Procedures that end up in the code will be there for only two reasons: (1) At one time, they were the highest version OR (2) They were added to assist in sequencing a multi-step upgrade.

Returning to NewestCompatible(): This looks at the currentVersion in the CR. An initially created CR doesn't have a currentVersion field (afterall... there's nothing deployed). Only when a Procedure achieves fullyReconciled will the currentVersion be set in the CR, meaning we have succeeded with an initial deployment. At a later time, Anthill is upgraded, and higher version Procedures are added. Instead of just jumping to Newest() and forcing that on the cluster, NewestCompatible() will choose a Procedure that is compatible with what is currently deployed. The version of the Procedure will likely be greater than CR.currentVersion, and this is desirable... we're upgrading the system. When that Procedure reaches fullyReconciled, CR.currentVersion will be updated to match Procedure.version, indicating it has been fully upgraded. This process will repeat until the chosen Procedure is the highest one in Anthill, meaning we're totally up-to-date.

Unordered Actions Action is the "unit of work", and a set of them constitute a Procedure. They are intentionally unordered in the Procedure to enforce the notion that Prereqs are to be used. If the list of Actions were guaranteed to be run top-to-bottom, it would be easy to accidentally rely on item[n-1] being true when item[n] is executed. There should be no implicit dependencies like that as it makes figuring out proper ordering much more difficult and error prone. Instead, if something is depended-upon, it must be listed as a Prereq... no exceptions. The randomization enforces that.

Is it an Action or a Prereq? If it mutates system state, it's definitely an Action and should appear in the action list of the Procedure. It may also be a Prereq for some other action... That's ok.. List it as a Prereq also. The caching infrastructure will ensure it gets executed only once, and due to the Prereq entry, the ordering will be correct also... even with the randomization! Why list it both places (i.e. why not "1 procedure level action with everything else implemented as a prereq")? That will technically work, but don't do it. Only items in the action list get entries in status.reconcileActions, so there will be no visibility into their status. Also, Prereqs are ordered, and they stop executing as soon as one fails. (This is for efficiency purposes... Put the cheap ones first.) So while all Prereqs will eventually execute over some number of reconciliation cycles, there are things that can happen in parallel, but relying on the Prereq structure will needlessly serialize them. I intentionally defined Prereqs and Actions to have the same signature for these reasons.

Concrete suggestions Right now, the target is deploying the etcd cluster.

We can also work on deploying the CSI driver (it doesn't rely on etcd)

With the above, we have 4 Actions in the "v0" Procedure.

JohnStrunk commented 5 years ago

Code structure:

sidharthanup commented 5 years ago

Why list it both places (i.e. why not "1 procedure level action with everything else implemented as a prereq")? That will technically work, but don't do it. Only items in the action list get entries in status.reconcileActions, so there will be no visibility into their status.

Nice! This cleared some doubts I was waiting to ask.

@rohantmp mentioned in one of the discussions if we could execute the procedure actions concurrently. I am of the same opinion since the procedure level actions are unordered. Although we would have to carefully use mutexes. I don't think a channel based synchronisation will work here. A discussion for another PR but I wanted to know your thoughts on concurrency and whether it would be feasible here @JohnStrunk @jarrpa ?

JohnStrunk commented 5 years ago

I wanted to know your thoughts on concurrency and whether it would be feasible here

I think it would require careful consideration of what the actual bottleneck is. All Actions that can be run, will be run for each trip through the reconcile loop. Parallelism would only help us in speeding the execution of the loop itself (and then, most likely if it's CPU-bound). Before we worry about the complications of synchronization, let's make sure we have a problem. I expect the majority of the time to reconcile will just be waiting around for the system state to converge.

jarrpa commented 5 years ago

I'm with @JohnStrunk there. Also, offhand it feels like any gains from parallelization would not be meaningful, so better to go the simple route first (in part just to get it done) and optimize later should the need arise.

rohantmp commented 5 years ago

@JohnStrunk

* I have mixed feelings on separate packages for the Procedure versions. I like the encapsulation, but I expect much of the code to get shared, making separate packages kind of clunky. This could go either way.

I think we can solve this by exporting each action in the subpackage?

rohantmp commented 5 years ago

@JohnStrunk

* Really, really minimize changes to `glustercluster_controller.go`.

We could move the entire reconcile function out to a separate file in the same package?

JohnStrunk commented 5 years ago

I think we can solve this by exporting each action in the subpackage?

With an action being a unit of work, why not just skip the subpackage and have 1 action = 1 file in pkg/controller/glustercluster/? So there would be a etcdClusterCreated.go that defines that action.

We could move the entire reconcile function out to a separate file in the same package?

That would probably be best. func (r *ReconcileGlusterCluster) Reconcile(request reconcile.Request) (reconcile.Result, error) could then just be a 1-line function that hops over to a function in (a newly created) reconcile.go that also defines allProcedures and is the place that pulls it all together.

rohantmp commented 5 years ago

@JohnStrunk could you take a look at this latest incarnation? I've been understanding more of how the individual pieces of gcs fit together and I want to see if this is how to proceed with the procedure code layout so I can start implementing it.

rohantmp commented 5 years ago

@JohnStrunk Should we pass the instance object to the actions? Is there a reason to fetch the instance each time we need to access the object spec?

JohnStrunk commented 5 years ago

@JohnStrunk Should we pass the instance object to the actions? Is there a reason to fetch the instance each time we need to access the object spec?

While that would probably be more efficient, I don't think we want to force actions to all take the same instance type. I think it's likely that there will be actions that act on other object types or even on no kube objects at all. I believe the client constructed by the sdk is a caching client that will minimize repeat hits to the kube API server, so we shouldn't be causing any real issues by repeatedly looking up the same object. (but I have not verified this)

rohantmp commented 5 years ago

@JohnStrunk Does that resolve everything? Can we merge this so we can proceed with action implementation?

Speaking of action implementation, I'm mapping out some of the objects and their ownership in a spreadsheet.

rohantmp commented 5 years ago

~~@JohnStrunk the only part of CI that's failing now is marked nolint for the linter that's failing (errcheck). This is generated code.~~

DEBUG: [Feb 21 06:11:37.491] nolint: matched errcheck:68-68 to issue cmd/manager/main.go:68:15:warning: error return value not checked (defer r.Unset() // nolint: errcheck) (errcheck)

Edit: errcheck doesn't seem to support ignoring lines, and this line is no longer present in the latest version of operator-sdk. Edit: I manually ran each linter that threw a warning. gocyclo is the one that's failing. Is the cyclomatic complexity bad in the context of the reconcile functions? Can I safely ignore it?