kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[feature] Build and test V2 driver / launcher images against incoming PRs #11239

Open CarterFendley opened 1 month ago

CarterFendley commented 1 month ago

Feature Area

/area backend
/area samples

What feature would you like to see?

The V2 backend driver / launcher images should be built and tested against incoming PRs through integration tests, etc.

What is the use case or pain point?

Ensure the stability of the driver / launcher.

Is there a workaround currently?

Trust people to test driver / launcher locally.

More details

Currently, the kfp-cluster action, used by the workflows listed below, uses build-images.sh to build a set of images and push them to the kind registry.

The set of images built by build-images.sh does not currently include the V2 driver and launcher.

Even if this is changed, additional work would be required to ensure the built images are actually used by the backend during testing. The backend has defaults for which images to use (see here), which normally point to gcr.io locations. These defaults would need to be overridden so that, during PR testing, the freshly built images are used instead of the ones previously deployed to gcr.io.

Discussion of implementation

  1. Updating build-images.sh would likely be fairly straightforward.
  2. The Argo compiler accepts V2_DRIVER_IMAGE / V2_LAUNCHER_IMAGE environment variables to override the gcr.io defaults (configured via the deployment.apps/ml-pipeline deployment). @hbelmiro has suggested possibly using a Kustomize layer to set these during testing (see the sketch after this list).
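
As a rough illustration of the Kustomize idea, a minimal overlay could patch the ml-pipeline deployment to inject the two environment variables. This is only a sketch: the base path, container index, and kind-registry image references are assumptions and would need to match the actual manifests and whatever build-images.sh pushes.

```yaml
# Hypothetical PR-testing overlay; base path, container index, and image
# references are assumptions.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../env/platform-agnostic        # assumed base used by the kind deployment
patches:
  - target:
      kind: Deployment
      name: ml-pipeline
    patch: |-
      # JSON6902 ops; assumes the target container already defines an env list
      - op: add
        path: /spec/template/spec/containers/0/env/-
        value:
          name: V2_DRIVER_IMAGE
          value: kind-registry:5000/kfp-driver:pr-test     # assumed tag pushed by build-images.sh
      - op: add
        path: /spec/template/spec/containers/0/env/-
        value:
          name: V2_LAUNCHER_IMAGE
          value: kind-registry:5000/kfp-launcher:pr-test   # assumed tag pushed by build-images.sh
```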

What about releases?

Although it makes sense to build the driver / launcher images and test them during PRs, it may make sense NOT to override the V2_DRIVER_IMAGE / V2_LAUNCHER_IMAGE defaults when validating releases, and instead test against the gcr.io deployments. Since users are unlikely to override these values, they will be running the gcr.io images, so it is reasonable to test in that configuration.

I am not aware of the extent to which kfp-samples.yml (or other workflows consuming the kfp-cluster action) are executed during release processes. Please let me know if others have more info on this :)


Related slack thread: https://cloud-native.slack.com/archives/C073N7BMLB1/p1727104197895549

Love this idea? Give it a 👍.

CarterFendley commented 1 month ago

Noting that this was surfaced during development of #11196.

Due to this issue, we are currently unable to add integration tests to that PR to ensure the stability of the patch.

Thoughts / feedback very welcome as always!

CarterFendley commented 1 month ago

Information from the 9/25/24 community meeting:

The current process for releases is for the person cutting the release to verify that workflows on the master branch have completed successfully. There was no mention of workflows being executed against release branches. Image updates to GCR are manually executed as well.

Release is documented here.


It seems that there is consensus on building images and testing them in PRs through overrides. The remaining questions are what to do with the master branch and the rest of the release process.

If, for example, a PR such as #11196 were merged with updated tests to ensure that this functionality remains intact, those tests would fail against older versions of the driver / launcher. The options to keep tests passing on the master branch would be:

  1. Also build these images and test them in master branch executions.
  2. Deploy new images to GCR immediately after merge and update the defaults afterward.
     a. Or set the defaults such that they don't need to be updated.

The current guidance for releases assumes a pattern most similar to the second option. If the first option were taken, we would need to ensure that, at some point before release, the overrides are disabled and the workflows are run again.
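
To make that concrete, the overrides could be applied conditionally so that only pull_request runs use the freshly built images, while post-merge and release runs exercise the default (gcr.io) images. A minimal sketch of such a step, assuming a kubeflow namespace and kind-registry image names (both assumptions):

```yaml
# Hypothetical workflow step: apply the env var overrides only on PR runs,
# so post-merge / release runs test the default images.
- name: Override V2 driver / launcher images (PR runs only)
  if: github.event_name == 'pull_request'
  run: |
    kubectl -n kubeflow set env deployment/ml-pipeline \
      V2_DRIVER_IMAGE=kind-registry:5000/kfp-driver:pr-test \
      V2_LAUNCHER_IMAGE=kind-registry:5000/kfp-launcher:pr-test
```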

HumairAK commented 1 month ago

Thanks for driving this @CarterFendley !

As far as the PR suggestions are concerned, I think your proposal in the OP makes sense. IMO we should just go with setting the env vars.

> Also build these images and test them in master branch executions.

Is this not currently the case? If the kfp-cluster action runs post-merge against the master branch, then build-images.sh is used there as well, so in that scenario the same env vars could be set to reference the kind images.

Granted, if this is the flow, the tests can be a bit misleading, since the hard-coded images are not what we are actually testing against.

What if we just reference latest where we hard-code the images, and for releases we update the images in the manifests (the env vars) instead of the manual step of updating the hard-coded images in code (Step 2 here)?

This way you are effectively doing (1), and (2) is handled implicitly if we create an automated build/push to main. We can tie this in with the newly added workflow that pushes to GHCR here, and have it run on PR merge with a push to main and latest. This would require us to treat latest == main (or we hard-code to main, but that might be strange).
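
A rough sketch of what tying this to the GHCR workflow might look like; the trigger, Dockerfile path, and image name below are assumptions rather than the current contents of image-builds.yml (the launcher would be handled analogously):

```yaml
# Hypothetical extension of the GHCR build workflow: on every merge to master,
# build the driver image and push it tagged as latest.
on:
  push:
    branches: [master]
jobs:
  push-driver:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          file: backend/Dockerfile.driver            # assumed Dockerfile path
          push: true
          tags: ghcr.io/kubeflow/kfp-driver:latest   # assumed image name
```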

CarterFendley commented 1 month ago

Adding a note about the urgency of this change: #11246 introduced a bug that would have been caught by the sample tests if the driver / launcher had been built and tested. Specifically, the two_step_pipeline_containerized sample would have surfaced it; however, the sample tests passed on the PR without issue.

[Screenshot attached: 2024-10-08, 3:19:42 PM]

@zazulam took a two day side quest finding a solution to the issue 😂

Will work with my team to allow me to pick this up next (hopefully tomorrow).

HumairAK commented 1 month ago

@CarterFendley just to make sure, are you saying the PR introduced a bug in the main branch, or in the context of the PR? If the latter, can we prioritize a fix as a separate commit/PR?

Regardless, let us know what we can do to help support you in this effort; we too feel this is very important!

CarterFendley commented 1 month ago

@HumairAK That the PR introduced a bug in the main branch. @zazulam is the expert there, I think he will be on the community call tomorrow and will have better information than me.

Yes, the integration testing PR will be separate.

zazulam commented 1 month ago

@HumairAK Prior to that PR, forcePathStyle defaulted to true, so the bucket URL for a local deployment would resolve to http://minio_ip:9000/bucket_name/... Now the driver does not pass a value for forcePathStyle when creating the SessionInfo used for the root DAG, so when it is resolved downstream it is effectively false, resulting in URLs of the form http://bucket_name.minio_ip:port/bucket_name/..., which causes an error when attempting to store artifacts. This error arises when building the launcher image off the current master branch. Lmk if you think it should be added as a separate PR.

HumairAK commented 1 month ago

Thanks folks @CarterFendley / @zazulam!

> Lmk if you think it should be added as a separate PR.

Yes, it sounds like a separate issue that we should fix ASAP if possible. I've created the following issue to track it: https://github.com/kubeflow/pipelines/issues/11280#issue-2575899093

Let us know if you guys can take this and we can assign it.

CarterFendley commented 1 month ago

Okay so the PR above starts work on this.

Long term we should probably be using something different from kubectl patch commands, but this gets things working to stop any regressions for the moment.
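
For reference, the kind of override those patch commands apply could look like the following strategic-merge patch; the namespace, container name, and image references are assumptions:

```yaml
# Hypothetical patch file, applied with something like:
#   kubectl -n kubeflow patch deployment ml-pipeline --patch-file v2-image-override.yaml
spec:
  template:
    spec:
      containers:
        - name: ml-pipeline-api-server               # assumed container name
          env:
            - name: V2_DRIVER_IMAGE
              value: kind-registry:5000/kfp-driver:pr-test
            - name: V2_LAUNCHER_IMAGE
              value: kind-registry:5000/kfp-launcher:pr-test
```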

Looping back to the process / design comments raised by @HumairAK: I don't mind the idea of publishing latest images after merge, with subsequent testing. We would still need to override during PR testing and disable those overrides during post-merge testing, which is totally doable.

Also, I'm slightly curious about trying to bring build-images.sh and image-builds.yml together; it would be nice to have one solution for this. We already have at least 4 different ways to build the launcher and driver images 😅