opendatahub-io / ai-edge

ODH integration with AI at the Edge usecases

Consider triggering follow-on pipelines on success of earlier pipelines #68

Closed grdryn closed 1 year ago

grdryn commented 1 year ago

Currently we have 3 pipelines, which are roughly: Build, Test, and Rollout.

The normal flow of things is in that order: Build -> Test -> Rollout

Right now, someone has to kick off each one manually after the success of the earlier ones. I think that we should consider one of the following:

* just having one pipeline that does all of the steps
* have a separate outer pipeline that triggers and monitors the existing ones
* have triggers that will kick off subsequent pipelines after the success of earlier ones

piotrpdev commented 1 year ago

just having one pipeline that does all of the steps

Pipelines in Pipelines (Experimental) could work well in this scenario. I just don't know how I feel about having to fill out like 15+ params in one pipeline, and I also don't know how painful debugging problems would be when you're dealing with multiple pipelines in one. Otherwise, this seems like the best approach.
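For illustration, here is roughly what the outer pipeline could look like with that controller installed (a sketch based on my reading of the PiP docs; the wrapper pipeline name and the rollout pipeline name below are just placeholders):

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: edge-end-to-end                # hypothetical wrapper pipeline
spec:
  tasks:
    - name: test
      taskRef:
        apiVersion: tekton.dev/v1beta1
        kind: Pipeline                 # custom task kind resolved by the pipelines-in-pipelines controller
        name: test-mlflow-image
    - name: rollout
      runAfter:
        - test
      taskRef:
        apiVersion: tekton.dev/v1beta1
        kind: Pipeline
        name: gitops-update-pipeline   # placeholder name for the rollout pipeline

All of the params of both inner pipelines would still have to be declared on the wrapper and passed through, which is where the 15+ params concern comes from.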

have triggers that will kick off subsequent pipelines after the success of earlier ones.

As this StackOverflow answer says: "Tekton indeed doesn't really offers any easy way to trigger one pipeline from another". Most solutions to this that I can find involve one of two approaches, both of which boil down to adding a new final task to the earlier pipeline that kicks off the next one.

Since both of these approaches involve modifying the underlying pipelines, I would strongly recommend guarding the new final task using a new param (e.g. runNextPipelineOnSuccess) so that a user isn't forced to always run all pipelines when running one.
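A rough sketch of what that guard could look like on the testing pipeline (the runNextPipelineOnSuccess param is the one suggested above; the final task name and whatever mechanism it uses to start the next pipeline are just placeholders):

  params:
    # ... existing params ...
    - name: runNextPipelineOnSuccess
      description: 'set to "true" to kick off the next pipeline when this one succeeds'
      type: string
      default: "false"
  tasks:
    # ... existing tasks ...
    - name: kick-off-rollout               # hypothetical new final task
      when:
        - input: "$(params.runNextPipelineOnSuccess)"
          operator: in
          values: ["true"]
      runAfter:
        - skopeo-copy                      # i.e. only once the tested image has been pushed
      taskRef:
        name: trigger-rollout-pipeline     # placeholder for whichever of the two approaches is used

That way the default behaviour stays exactly what it is today, and the chaining is strictly opt-in.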

have a separate outer pipeline that triggers and monitors the existing ones

I'm not sure what the benefits of this approach would be compared to the other two.

LaVLaS commented 1 year ago

I think the test -> rollout should be combined into a single pipeline since the expectation is that any successful test should automatically trigger a push and PR to the gitops repo for final review.

A single pipeline has the benefit of being visually pleasing for a demo, showing the end-to-end workflow for how an intelligent application makes it into production. I do agree that I would not want to wait for a container build pipeline just to debug some issue in the test or rollout steps of the pipeline.

piotrpdev commented 1 year ago

Moving discussion about this from Slack to here:

"Hello folks, I have a question in https://github.com/opendatahub-io/ai-edge/issues/92#issuecomment-1747202159 -- basically it seems to me that the "GitOps" pipeline which does not seem to be doing much should be merged to the test-mlflow-image-pipeline pipeline. We would be able to just get the information about the built/pushed container image more easily, plus filing the PR shouldn't really be something that the MLOps engineer initiates manually -- it should be a direct result of successful test and push to Quay. Am I way off the intent of the PoC? I'd appreciate your comments, there or here." - @adelton

adelton commented 1 year ago

I'd like to point out that currently, the rollout pipeline requires so much manual input (see https://github.com/opendatahub-io/ai-edge/issues/92) that its value is significantly diminished.

adelton commented 1 year ago
* What if you want to create a PR in multiple repos to change an image? What if you have separate repos/branches for dev/staging/prod?

We have to walk before we run. This is a PoC. It can be as limited in options / scope as reasonable, but it should be correct in what it is doing. Today, the PR that gets filed has nothing in common with all the previous steps that pipelines/README.md describes.

* What if the gitops pipeline fails e.g because an access token expired, you mistyped a url, etc. and you don't want to retest the image

Then we are talking about a different pipeline, which should take the previous PipelineRun id (or something like that) as its input, to be able to pull information about the container image from it (I assume that can be done, somehow).

Which, frankly, might be the most practical quick-fix solution for now.

* What if you wanted to test the image and upload it to Quay but not necessarily make a PR and change the image being used in your app?

Make the PR-filing Tasks optional, based on some parameters.

But again, this is a PoC, we show what can be done in its entirety, not claiming that it is perfect for every use-case.

* Separation of concerns is useful, especially for development. A bug introduced in the gitops pipeline shouldn't break testing

Unless everything is green, there is some problem there; something unexpected happened. If we are talking about GitOps, testing is just a gating step to CD.

We can by all means have separate smaller Pipelines for people who prefer them, with the manual inputs. But the PoC should show as much automation and using results of the previous steps as possible.

piotrpdev commented 1 year ago

I'd like to point out that currently, the rollout pipeline requires so much manual input (see #92) that its value is significantly diminished.

Sorry, I know you linked the issue but could you be more specific on "requires so much manual input"?

If you mean these params:

https://github.com/opendatahub-io/ai-edge/blob/f148f63a4c4288ad917428e63f6bfeac2f5355a2/pipelines/tekton/gitops-update-pipeline/example-pipelineruns/gitops-update-pipelinerun-bike-rentals.yaml#L8-L20

adelton commented 1 year ago

The target-imagerepo should be reflected into newName, and imageDigest should be automatically derived from the last tested and pushed image.
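(For context: assuming the ACM app repo pins the image via a kustomize images override, the two fields in question would look roughly like the sketch below, with the rollout PR rewriting both values. The names and values here are only illustrative, loosely based on the bike-rentals example.)

images:
  - name: bike-rentals-auto-ml                          # placeholder for the image name used in the manifests
    newName: quay.io/rhoai-edge/bike-rentals-auto-ml    # should follow target-imagerepo
    digest: sha256:<digest-of-the-last-tested-image>    # should follow the image the test pipeline pushed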

piotrpdev commented 1 year ago

We have to walk before we run. This is a PoC. It can be as limited in options / scope as reasonable, but it should be correct in what it is doing.

I believe that by combining those two pipelines into one we will make scaling/expansion in the future unnecessarily hard; there's no point in combining the pipelines now and separating them again later. If we're trying to be "correct", I think using one of the approaches I suggested will be faster, easier, and more scalable.


Today, the PR that gets filed has nothing in common with all the previous steps that pipelines/README.md describes.

I totally agree; we can create an issue for this. The intention was for somebody to copy over whatever image-ref they wanted to use into the PipelineRun ("change the parameter values in the PipelineRun definition to match."):

https://github.com/opendatahub-io/ai-edge/blob/f148f63a4c4288ad917428e63f6bfeac2f5355a2/pipelines/README.md?plain=1#L148

However, I can see how that's confusing, and having an easy copy/paste command would be nice (see below). Ideally this will of course be automated once we resolve this issue.


Then we are talking about a different pipeline, which should take the previous PipelineRun id (or something like that) as its input, to be able to pull information about the container image from it (I assume that can be done, somehow).

Which, frankly, might be the most practical quick-fix solution for now.

I agree, we could use the openshift-client ClusterTask to get the latest successful testing pipelinerun and get the image-ref from that. I was able to get that to work using this monstrous command (one of many ways to do so):

$ PIPELINERUN_NAME=$(oc get pipelinerun \
  --selector=tekton.dev/pipeline=test-mlflow-image \
  --sort-by=.status.startTime \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[0].type}{" "}{.status.conditions[0].status}{"\n"}{end}' | \
  awk '$2=="Succeeded" && $3=="True" {print $1}' | \
  tail -n 1) && \
oc get pipelinerun $PIPELINERUN_NAME \
  -o jsonpath='{.status.pipelineSpec.tasks[?(@.name=="skopeo-copy")].params[?(@.name=="destImageURL")].value}'

docker://quay.io/rhoai-edge/bike-rentals-auto-ml:1-3c1e170d-0144-452c-96de-f60aad045a39


Make the PR-filing Tasks optional, based on some parameters.

We could, or we could just keep the pipelines separate ¯\_(ツ)_/¯. Those optional tasks can now mean your testing flow stops working because somebody accidentally introduced a bug in the pipeline while trying to change the git flow.


But again, this is a PoC, we show what can be done in its entirety, not claiming that it is perfect for every use-case.

In that case I still like the use of Pipelines in Pipelines more than combining the steps from multiple pipeline files into one.


Unless everything is green, there is some problem there; something unexpected happened. If we are talking about GitOps, testing is just a gating step to CD.

The world works in mysterious ways, especially when programming is involved.



We can by all means have separate smaller Pipelines for people who prefer them, with the manual inputs. But the PoC should show as much automation and using results of the previous steps as possible.

I 100% agree, but don't think combining all of the steps from every pipeline file into one is a good idea for multiple reasons (listed above).

piotrpdev commented 1 year ago

The target-imagerepo should be reflected into newName

Can you elaborate on this please? I agree they should match ideally, but for example the user could have multiple ACM apps running that use the same quay model but with different digests e.g. an old version and a new version.

Even if we assume that's never going to be the case, what do you think we should do? Have the testing pipeline check the repo files to see if they match? Use oc to check where ACM got the image from?

imageDigest should be automatically derived from the last tested and pushed image.

That may not be what the user always wants, but it's fair as a default 👍.

adelton commented 1 year ago

Can you elaborate on this please? I agree they should match ideally, but for example the user could have multiple ACM apps running that use the same quay model but with different digests e.g. an old version and a new version.

The PoC currently does not show running two different apps with the same container image but different versions of that image, so it's not like this would prevent demonstrating what the PoC demonstrates today.

But anyway: the PR goes to the repo of the ACM app that typically consumes the new image first, and the user is welcome to later merge that change to the more conservative branch that drives the second ACM app. QE / stage / prod. Whatever.

Even if we assume that's never going to be the case, what do you think we should do? Have the testing pipeline check the repo files to see if they match? Use oc to check where ACM got the image from?

If the testing pipeline pushed the container image somewhere, that's the value that the sed substitution should put into newName. If the values already match, great -- no change will be needed.
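(Roughly something like this, as part of the PR-filing task -- the file path and variable names are only illustrative:)

$ sed -i \
    -e "s|newName: .*|newName: ${TARGET_IMAGEREPO}|" \
    -e "s|digest: .*|digest: ${IMAGE_DIGEST}|" \
    path/to/acm/app/kustomization.yaml    # whatever file the rollout PR actually touches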

piotrpdev commented 1 year ago

If the testing pipeline pushed the container image somewhere, that's the value that the sed substitution should put into newName. If the values already match, great -- no change will be needed.

Sorry, I'm still not understanding. Are you suggesting that the testing pipeline should change newName in the repo's ACM files to match target-imagerepo?

adelton commented 1 year ago

Overall: any manual step we can remove from the process, we should try to remove. Doing something manually, be it copying SHA-256s or creating PipelineRuns, is error-prone; it depends on the human factor rather than supporting humans by automating what can be automated.

The approaches that you suggested all seem to require technology or techniques that we currently don't have in the PoC. I'd prefer to fully use what we depend on today before adding something new to the mix.

I see nothing wrong with having primarily a single Pipeline in the PoC, plus possibly instructions on what bits to remove if the user wants to avoid / skip some steps. Maybe they don't want to test. Maybe they don't want to file the PR. They can always edit it out.

adelton commented 1 year ago

Sorry, I'm still not understanding. Are you suggesting that the testing pipeline should change newName in the repo's ACM files to match target-imagerepo?

Yes. Well, the rollout pipeline via the PR, just like the digest. But in the context of this issue, it's basically the same pipeline.

piotrpdev commented 1 year ago

The approaches that you suggested all seem to require technology or techniques that we currently don't have in the PoC. I'd prefer to fully use what we depend on today before adding something new to the mix.

Pipelines in Pipelines is extremely simple though, it's like one small file...

adelton commented 1 year ago

Pipelines in Pipelines is extremely simple though, it's like one small file...

The README at https://github.com/tektoncd/experimental/tree/main/pipelines-in-pipelines starts with

Install and configure ko.

Not gonna happen.

piotrpdev commented 1 year ago

Pipelines in Pipelines is extremely simple though, it's like one small file...

The README at tektoncd/experimental@main/pipelines-in-pipelines starts with

Install and configure ko.

Not gonna happen.

???

https://github.com/tektoncd/experimental/blob/98b409ee8278601f767990d63c92d18f0b802830/pipelines-in-pipelines/README.md?plain=1#L39-L43

[screenshot of the linked README install instructions]

adelton commented 1 year ago

Fair enough.

But it is yet another thing that the user would have to ask their cluster-admin to do for them:

$ oc apply --filename https://storage.googleapis.com/tekton-releases-nightly/pipelines-in-pipelines/latest/release.yaml
Error from server (Forbidden): error when retrieving current configuration of:
Resource: "/v1, Resource=namespaces", GroupVersionKind: "/v1, Kind=Namespace"
Name: "tekton-pip-run", Namespace: ""
from server for: "https://storage.googleapis.com/tekton-releases-nightly/pipelines-in-pipelines/latest/release.yaml": namespaces "tekton-pip-run" is forbidden: User "user" cannot get resource "namespaces" in API group "" in the namespace "tekton-pip-run"
Error from server (Forbidden): error when retrieving current configuration of:
Resource: "/v1, Resource=serviceaccounts", GroupVersionKind: "/v1, Kind=ServiceAccount"
Name: "pip-controller", Namespace: "tekton-pip-run"
from server for: "https://storage.googleapis.com/tekton-releases-nightly/pipelines-in-pipelines/latest/release.yaml": serviceaccounts "pip-controller" is forbidden: User "user" cannot get resource "serviceaccounts" in API group "" in the namespace "tekton-pip-run"
Error from server (Forbidden): error when creating "https://storage.googleapis.com/tekton-releases-nightly/pipelines-in-pipelines/latest/release.yaml": clusterroles.rbac.authorization.k8s.io is forbidden: User "user" cannot create resource "clusterroles" in API group "rbac.authorization.k8s.io" at the cluster scope
Error from server (Forbidden): error when retrieving current configuration of:
Resource: "rbac.authorization.k8s.io/v1, Resource=roles", GroupVersionKind: "rbac.authorization.k8s.io/v1, Kind=Role"
Name: "pip-controller", Namespace: "tekton-pip-run"
from server for: "https://storage.googleapis.com/tekton-releases-nightly/pipelines-in-pipelines/latest/release.yaml": roles.rbac.authorization.k8s.io "pip-controller" is forbidden: User "user" cannot get resource "roles" in API group "rbac.authorization.k8s.io" in the namespace "tekton-pip-run"
[...]

IIRC, the last thing I had to do as an admin for this pipelines part was to install the Red Hat OpenShift Pipelines operator, which says "provided by Red Hat". For some admins and some clusters, the confinement and separation of duties that the current PoC approach provides might be the dealbreaker.

adelton commented 1 year ago

@grdryn As the requestor of this issue, are you OK with closing this one with https://github.com/opendatahub-io/ai-edge/issues/146 superseding it, or do we want to keep the "trigger the follow-on" idea alive as well?

grdryn commented 1 year ago

Sure, sounds fine to me. I'll close this one for now then :+1: