iter8-tools / iter8

Kubernetes release optimizer
https://iter8.tools
Apache License 2.0

[Enhancement] [Repurposed issue] Developing a new `candidate` CRD for Iter8 #692

Closed huang195 closed 2 years ago

huang195 commented 3 years ago

This is to document a Slack discussion, regarding the following issue:

I'm creating an Experiment CR without having deployed the candidate version in the cluster yet, and I was expecting the experiment to sit in some sort of paused state (IIRC that's version 1's behavior). However, what I observed was that the experiment started to run even without the candidate version deployed. Is this the current expected behavior?

$ kubectl get experiments.iter8.tools
NAME               TYPE     TARGET                STAGE       COMPLETED ITERATIONS   MESSAGE
gitops-exp-17713   Canary   default/productpage   Completed   10                     ExperimentCompleted: Experiment Completed

and the candidate pod is not running

$ kubectl get pods
NAME                                 READY   STATUS    RESTARTS   AGE
details-v1-7876c8fb77-t67ck          1/1     Running   0          2m33s
productpage-stable-897b84d99-s8z9x   1/1     Running   0          2m33s
ratings-v1-6bd96496f7-g9wn4          1/1     Running   0          2m33s
reviews-v2-8565d9b857-hfwr8          1/1     Running   0          2m33s

Summary of the discussion so far is:

@sriumcp is suggesting that the task framework is the right place to check for this scenario (i.e., candidate not yet deployed) and pause the experiment when it happens.

My view is that this is a valid suggestion, as the task framework is generic enough that such things could be done there. However, I wonder whether such complexity should be exposed to users in the form of a task; I can't think of any reason why someone would start an experiment without all the components being ready, e.g., candidate and baseline deployed. If that's true, Iter8 should check for this condition internally before starting the experiment.

sriumcp commented 3 years ago

The trouble with putting this in etc3 is: how will the "readiness check" be implemented for different Iter8 domains? For KFServing, Knative, and plain K8s deployments, readiness checks look different.

Even within a specific domain, different experiments might need different readiness checks. We want to keep etc3 domain agnostic, and have gone to great lengths to ensure this is the case.


K8s controllers are not intended to be complicated -- simple, idempotent reconcile loops are what they're intended for. etc3 is already well past this point, and tacking more and more onto it is a recipe for an unwieldy and unstable controller at the heart of Iter8.


I also agree that users should not be exposed to needless complexity.

I believe one way to accomplish both of the above is to create Helm charts that install experiments -- the complexity of the experiment is abstracted within the Helm chart (which includes readiness checks, etc.).

All the user gets to see is the values file, which can have a few (really few) parameters (for example, just the service name / namespace).
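For illustration, a minimal values file along these lines might be all the user edits (the field names here are hypothetical, not an existing Iter8 chart interface):

# values.yaml -- hypothetical user-facing interface for an experiment chart;
# readiness checks, criteria, and handler tasks live inside the chart templates
serviceName: productpage
serviceNamespace: bookinfo-iter8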

We can create as many Helm charts / templates to bake in as many scenarios as we wish to expose.

And we can also document how to create the Helm chart so that the user has the option of doing so themselves.

sriumcp commented 3 years ago

Experimentation does require templating in some form, and Helm is the top templating solution for K8s.

We can certainly explore other options like Kustomize / Jsonnet after some experience with Helming.

sriumcp commented 3 years ago

In environments like IBM Code Engine and IBM DevOps toolchains, Iter8 is essentially going to run in a separate K8s environment -- altogether different from the one where the target application is running (at least to begin with).

Specifically, in Code Engine, Iter8 will be packaged as a containerized Kubernetes-in-Docker (kind) image which can be run as a serverless application.

In toolchains, Iter8 will be packaged as a Tekton task which can be imported into a pipeline and used for various acceptance checks.


Readiness checks are required here as well... and can be accomplished through well crafted tasks intended for those purposes.


huang195 commented 3 years ago

I'm more concerned about the complexities in the Experiment CRD, and less about how Iter8 is deployed to different environments like Code Engine and IBM toolchains, for which I'm certain the use of Helm or Kustomize can improve the user experience.

Looking at the experiment CR files we're using in various quickstart tutorials, they are about 60 lines long (https://github.com/iter8-tools/iter8/blob/master/samples/kfserving/quickstart/experiment.yaml). Flagger's (https://docs.flagger.app/usage/how-it-works) is about 30 lines long, and that also includes a webhook to start a workload generator. Even the previous version of Iter8 used a much simpler Experiment CR in its quickstart -- only 20 lines (https://github.com/iter8-tools/iter8-istio/blob/master/docs/tutorials/canary.md). Without any prior knowledge of online experimentation, a user could read the Iter8 v1 and Flagger CR files and get a rough idea of what the experiment will do. However, the same user will get lost reading Iter8 v2's Experiment CR file, even the one used in the quickstart guide.

Is the intent to have users not interact with Experiment CR files directly, but rather use Helm as the main user interface? My understanding is that templating is good for configuration items, e.g., port number, backend DB address, etc. -- the kinds of things you put in a configuration file that's read by a service when it starts up. The content of an Experiment CR is much more complicated. Is it even possible to put a facade over it in the same way?

So I want to i) understand if you also view this as a problem, and ii) if so, understand how the use of Helm will solve it. If the intent is to keep adding to the handler task section of the Experiment CR to perform various mundane tasks (like checking whether the candidate is already deployed before starting the experiment), the Experiment CR files will get much more complicated than the ones we use in the quickstart guide.

huang195 commented 3 years ago

Here's an idea: is it possible to create higher-level CRDs using the existing Experiment CRD as a base, e.g., Experiment-KFServing and Experiment-Istio? The higher-level CRDs would have a cleaner interface, hiding a lot of details from users; e.g., handler tasks could be completely removed from them. When a higher-level CR is created, it always instantiates a base Experiment CR under the covers with all the gory details. This keeps new users from being intimidated by all the details, while more advanced users can still use the "raw" Experiment CRD to do a lot of customization directly.

sriumcp commented 3 years ago

@huang195 I do think Iter8's experiment CRs are getting pretty complex... but IMO, the solution is not to extend the experiment CR (and consequently, etc3) functionality for every feature we want to implement... If at all possible, I would in fact like to remove some of the etc3 / experiment CR functionality to make etc3 leaner.

If we want to extend the behavior supported in an experiment, tasks are definitely the first option to consider in general; this is especially true for the readiness check under consideration here.

Tasks are expressly meant for the purpose of extending etc3 behaviors in ways like the above.
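For example (purely a sketch -- the task name and fields below are illustrative, not necessarily the actual Iter8 task library), a readiness check exposed as a start-handler task might look roughly like this:

# hypothetical start task: hold the experiment until the candidate exists and is ready
- task: common/readiness        # illustrative task name
  with:
    objRefs:
    - apiVersion: apps/v1
      kind: Deployment
      name: productpage-candidate
      namespace: bookinfo-iter8
    timeout: 180s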


Iter8's experiment resource and Flagger's canary resource are completely different abstractions.

The latter is associated with an application; the former is associated with a one-off experiment (i.e., with an individual version).


I believe Iter8 is sorely missing this application-level abstraction. I completely agree with you that we need to find ways to simplify Iter8's experimentation.


I think your "higher-level CRD" idea is a really good one. I used to think of it as a lifecycle CR earlier, and more recently as an Iter8 Candidate CR, directly inspired by Flagger.

I have some thoughts on how this higher level CRD could in fact use helm and helm values to accomplish what you are suggesting...

The idea being, you don't have to create 10 different CRDs (Experiment-KFServing, Experiment-Istio, Experiment-Linkerd, Experiment-OpenHorizon, etc. -- all domains we definitely want to go after).

You can instead create a single Candidate CR -- which watches the resource of your choice (a Deployment for Istio, a ksvc for Knative, or an isvc for KFServing) -- and uses a Helm chart to install the experiment.


sriumcp commented 3 years ago

Actually, this is an excellent thread for us to start imagining what that higher-level CR might look like... we don't have to restrict ourselves at this early design stage; we can (and should) go wild putting out options for various possible higher-level CRs. Let's assess pros and cons and figure out how to move forward with this idea -- which I certainly think is a high-value one (in terms of simplification).

huang195 commented 3 years ago

Yes, there are disadvantages to the higher-level CRD idea, but now I understand what you mean by "templating" at the conceptual level. The thing that still bothers me with templating is imagining that once all this work is done, our quickstart guide will say "run this helm install .. command, and here's the Experiment CR file that was deployed" -- and the user will then be hit with a very long and cryptic YAML file with all its gory details. That is something the higher-level CRD idea is trying to avoid: we only need to show the user that the entity deployed to the cluster is this nice, clean higher-level CR (in the GitOps case, this is also the entity that gets stored in the Env repo). Yes, there's an ugly base Experiment CR that gets generated somewhere, but that's not something the user needs to be exposed to, especially new users.

I'd like to iterate on our various designs with you here with pros and cons, and flesh out the details once we can agree on something.

sriumcp commented 3 years ago

@huang195 absolutely agree with your above comment.

Here's how I'm thinking of Helm playing into your higher-level CRD idea. This combines elements of Flagger's canary resource (watch) with RedHat's Helm Operator (helm-chart + values).

Candidate resource 1:

apiVersion: iter8.tools/v2alpha2
kind: Candidate
metadata:
  name: productpage-app
spec:
  # target identifies the service under experimentation using its fully qualified name
  target: bookinfo-iter8/productpage
  watch: # watch and do something whenever the resource below is updated in-cluster
  - apiVersion: apps/v1
    kind: Deployment
    namespace: bookinfo-iter8
    name: productpage-candidate
  pre-empt: true
  chart:
    istio/deployment # helm repo/chart that templates Iter8 experiments for deployments over Istio
  values: # optional; target and resource being watched are always passed in as values.
    testingPattern: canary

Candidate resource 2:

apiVersion: iter8.tools/v2alpha2
kind: Candidate
metadata:
  name: flowers-app
spec:
  # target identifies the model under experimentation using its fully qualified name
  target: default/flowers
  watch: # watch and do something whenever the resource below is updated in-cluster
  - apiVersion: serving.kubeflow.org/v1beta1
    kind: InferenceService
    namespace: default
    name: flowers-candidate
  pre-empt: true
  chart:
    kfserving/isvc # helm repo/chart that templates Iter8 experiments for InferenceServices over KFServing

Main differences between the Candidate and Experiment resources:

Candidate is created once per application, and will automate the launch of multiple experiments during the lifetime of the application (think simple)

Experiment is created once per experiment (and tied to the specific baseline and candidate versions under experimentation) (think gory details)

sriumcp commented 3 years ago

From a GitOps perspective, all the user does is update the productpage-candidate deployment in Git in case 1, and flowers-candidate in case 2. Iter8 takes care of the rest (assuming the Candidate CRs above are deployed in the cluster).

Every time Iter8 detects a change to the resource it is watching, it performs a `helm upgrade --install` of the experiment, which (conceptually) deletes the old experiment (if any) and creates a new one. Helm upgrades already work this way. This is the pre-emptive experimentation behavior we have discussed so far.

huang195 commented 3 years ago

@sriumcp That is very Flagger-like indeed, and a major architecture change to support the new CRD. I wonder if there's any middle ground where we can get similar behavior but based on the existing v2 architecture -- maybe there isn't. Will think about it.

sriumcp commented 3 years ago

@huang195 Just in case it didn't come through cleanly in my comments above, this new CRD doesn't replace the experiment CRD... it builds on top of it (under the covers, an experiment CR is still getting created).


It is indeed a big addition to the current Iter8 design, but there is no change to the experiment CR itself. Besides, while the change is big in terms of functionality, I think the reconcile loop implementation for this CR can be very simple: watch the resource, construct a Helm values object, and do a `helm upgrade --install` with the given Helm chart. That's it... if things go wrong during this simple process, mark it in the Candidate CR status and call it a day.
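To make the "construct a Helm values object" step concrete, here is a sketch of what the reconciler might pass to the chart (field names are illustrative only, not a settled schema):

# hypothetical values object assembled by the Candidate reconciler
target: bookinfo-iter8/productpage    # copied from spec.target
watchedResource:                      # the resource whose update triggered this reconcile
  apiVersion: apps/v1
  kind: Deployment
  namespace: bookinfo-iter8
  name: productpage-candidate
testingPattern: canary                # passed through from spec.values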

Is this enough?

In some sense, RedHat's Helm operator does even less (it doesn't watch for anything), and it is one of the most insanely popular operators out there. So perhaps... this is all that we need to start with.

The experiment reconciler has logic that is significantly more complex (needed to support current experiment functionality; you can look into the reconciler implementation to get a sense of this complexity).


This proposal achieves the following. It:

  1. avoids the need to extend experiment reconciler or CRD -- which are complex enough already,
  2. keeps the release automation process (including CI/GitOps) drop dead simple for SREs when Iter8 runs in the same cluster as the target app, and
  3. still retains all the flexibility of the experiment object (you can use values to configure a specific experiment template; or create a brand new experiment chart altogether to support a new type of experiment).

The Candidate idea is obviously inspired by Flagger. But also keep in mind the RedHat helm-operator connection... that's where I got the idea of a Helm chart with values inside a CR.


The assumption involved in point 2 above is not true for Code Engine and many toolchain environments, so raw experiments, or just Helm-templated experiments (without the Candidate CR), have a much more direct role to play there.

sriumcp commented 3 years ago

Incidentally, I changed the issue label from [Bug] to [Enhancement] :slightly_smiling_face:

We explicitly discussed readiness checks during the design of etc3 and made a conscious decision not to support them in the experiment CR (both to limit complexity, and also to focus on multi-domain experiments), so the implementation is working exactly as designed.

Obviously, it is a new feature / enhancement we want to support in Iter8 (through tasks -- ultimately hidden under the higher-level CRD / helm / whatever abstraction we choose).

huang195 commented 3 years ago

@sriumcp looks good to me at a high level. So it looks like we're going to have a dependency on Helm now? But since the user is not exposed to it, and we're just using Helm internally to generate the Experiment CR from the Candidate CR, it shouldn't matter to the user either way. Once you have a working example, I look forward to seeing it.

kalantar commented 3 years ago

This also looks good to me from a high level. I also don't like a dependency on Helm. Perhaps we should think of the higher-level object as identifying a template for an experiment. Such a template might be specified in different ways -- inline, in another CR, with a Helm chart, with a Kustomize folder -- the mechanism could be variable.

In simplifying the CR, we might also consider whether the version details should be moved to another CR and just referenced. A lot of the complexity is in the versionInfo field. We could introduce a "version" or "subject" CR that records information about a single version or a set of versions, and this could be used to drive the watching/reading/modifying of weights.
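For instance, a sketch of such a CR (shape and field names are hypothetical, loosely mirroring what versionInfo carries today) could look like:

# hypothetical "Subject" CR holding version details that an experiment would reference by name
apiVersion: iter8.tools/v2alpha2   # group/version assumed for illustration
kind: Subject
metadata:
  name: productpage-versions
spec:
  baseline:
    name: productpage-stable
    weightObjRef:                  # where to read/modify this version's traffic weight
      apiVersion: networking.istio.io/v1beta1
      kind: VirtualService
      name: productpage
      fieldPath: .spec.http[0].route[0].weight
  candidates:
  - name: productpage-candidate
    weightObjRef:
      apiVersion: networking.istio.io/v1beta1
      kind: VirtualService
      name: productpage
      fieldPath: .spec.http[0].route[1].weight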

We might also create CRs for the tasks. Then even if a task is a shell script, it appears in an experiment as a simple thing.

I know there is a hesitation about creating a lot of CRDs and the controllers to manage them. However, there is a tradeoff with simplicity we should think about.

sriumcp commented 3 years ago

Re: Helm dependency, FluxV2 has a dedicated GitOps controller that is based purely on Helm. It also has a dedicated controller that is based purely on Kustomize.

RedHat has a dedicated operator for installing Helm charts.

RedHat also provides the Operator SDK for creating new operators using Helm charts.


Helm is by far the most popular templating and resource configuration tool for K8s, which is partly the reason for choosing it. I believe making a single higher-level CR do everything (Helm / Kustomize / my-favorite-sed/awk-based templating) would defeat its primary purpose, which is simplification.


What the above design does is template the experiment in the form of a Helm chart. It does not restrict how the application itself is deployed, which can be with Helm, Kustomize, kubectl YAML/JSON manifests, etc.


The above design also watches for deployments / custom resources and eliminates the need for CI/CD to create an experiment each time there is a new version. Templating is part of the solution; ensuring that there are only a few template variables to substitute dynamically during each release, and automatically figuring out what those values are (instead of user-constructed CI/CD having to do it), is a major part of the simplification.

To be explicit about the process: when the user creates an application, the user also creates a Helm chart for experimenting with that application, and installs a Candidate CR for the application in the cluster.

The CI/CD from then on doesn't have to focus on Iter8 at all -- in particular, experiment creation / setup is no longer part of the pipeline; the pipeline merely releases the candidate version, which it needs to do anyway.


I believe the goal of simple application release automation can be achieved through the above design with carefully crafted experiment charts. The values exposed by a chart need to be well thought out (and necessarily opinionated) so that we don't provide 1000 options in a single chart, but rather an intelligent way to set a few simple values (for example, 5, or at most 10, strings/ints) in the context of a single chart. Each chart will correspond to some experimentation pattern that users care about.

Again, chart creation happens once, when the application is defined. Values are derived automatically from the context of the Candidate resource. The user's CI/CD pipeline doesn't need to be aware of Iter8 for the most part.


Of course, creating a Helm chart is cheap, especially when there is a single experiment object to template. We can document this process for a few charts, so that users can craft / contribute their own charts as well.
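As a rough sketch, the single templated object in a hypothetical istio/deployment chart might be little more than the following (fields abbreviated, template variable names illustrative, not a published Iter8 chart):

# templates/experiment.yaml in a hypothetical istio/deployment experiment chart
apiVersion: iter8.tools/v2alpha2
kind: Experiment
metadata:
  name: {{ .Release.Name }}-experiment
spec:
  target: {{ .Values.target }}      # injected by the Candidate reconciler
  strategy:
    testingPattern: {{ .Values.testingPattern | default "Canary" }}
  # criteria, duration, versionInfo, and handler tasks (including readiness checks)
  # are filled in here from chart defaults and the watched resource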

huang195 commented 3 years ago

I wonder if we can simplify things even more, e.g., using Sri's example from earlier:

apiVersion: iter8.tools/v2alpha2
kind: Candidate
metadata:
  name: productpage-app
spec:
  # target identifies the service under experimentation using its fully qualified name
  target: bookinfo-iter8/productpage
  watch: # watch and do something whenever the resource below is updated in-cluster
  - apiVersion: apps/v1
    kind: Deployment
    namespace: bookinfo-iter8
    name: productpage-candidate
  pre-empt: true
  chart:
    istio/deployment # helm repo/chart that templates Iter8 experiments for deployments over Istio
  values: # optional; target and resource being watched are always passed in as values.
    testingPattern: canary

Can this simply become:

apiVersion: iter8.tools/v2alpha2
kind: Candidate
metadata:
  name: productpage-app
spec:
  # target identifies the service under experimentation using its fully qualified name
  target: bookinfo-iter8/productpage
  pre-empt: true
  chart:
    istio/deployment # helm repo/chart that templates Iter8 experiments for deployments over Istio
  values: # optional; target and resource being watched are always passed in as values.
    testingPattern: canary

The chart: istio/deployment setting would implicitly assume that the target bookinfo-iter8/productpage is a Deployment, and that its candidate version follows a certain naming convention. Furthermore, instead of exposing the use of Helm directly in the CR, I wonder if we can make it less Helm-dependent, e.g.:

apiVersion: iter8.tools/v2alpha2
kind: Candidate
metadata:
  name: productpage-app
spec:
  # target identifies the service under experimentation using its fully qualified name
  target: bookinfo-iter8/productpage
  pre-empt: true
  kind: istio/deployment
  values: # optional; target and resource being watched are always passed in as values.
    testingPattern: canary

This draws on the higher-level CRD idea. Instead of creating a whole bunch of new CRDs, we can use the kind: istio/deployment field to specify what type of higher-level experiment the user wants to instantiate. How this is implemented under the covers is up to us -- it can be Helm, Kustomize, sed/awk, or even a combination of these. Most importantly, this allows us to decouple from a particular technology and provide users a simplified interface.

sriumcp commented 3 years ago

I like the kind idea above... the only thing I would do is make it a bit more explicit: kind: helm-chart/istio/deployment vs. kind: kustomize/kfserving/service vs. kind: sed/knative/service.

This naming convention is up to us to evolve of course.

huang195 commented 3 years ago

Well, the point is that we might want to hide implementation details from the user if possible, so that what happens behind the scenes can be whatever we choose. Unless we plan to implement both kustomize/istio/deployment and helm-chart/istio/deployment, do we really want to expose this detail to users?

sriumcp commented 3 years ago

I do believe users will want to implement their own Helm charts and plug them into this CR -- using our Helm charts as a starting point.

I really don't think Kustomize is that great a fit for this use case. Fundamentally, Kustomize is a template-free resource configuration tool, while we're really trying to do templating here.

Perhaps sed/awk/your favorite shell tools/Go templates/Jsonnet are other possibilities. And maybe they should be bundled under custom.


The user still does not really have to look into the Helm chart -- as long as the level of templating the chart offers, and the way it sets up the values, is sufficient for the user's purposes.


(Updated the comment above to introduce a diversity of domains.)

huang195 commented 3 years ago

TBH, any of the above would be an improvement over the existing CRD. We can always keep improving it once it's implemented.

sriumcp commented 2 years ago

Subsumed by #948