CarterFendley opened this issue 1 month ago
Noting that this was surfaced during development of #11196.
We are currently unable to add integration tests assuring the stability of this patch in that PR due to this issue.
Thoughts / feedback very welcome as always!
Information from the 9/25/24 community meeting:
The current process for releases is for the person cutting the release to verify that workflows on the master branch have completed successfully. There was no mention of workflows being executed against release branches. Image updates to GCR are manually executed as well.
The release process is documented here.
It seems that there is consensus on building images and testing them in PRs through overrides. The remaining questions are what to do with the `master` branch and the rest of the release process.
If, for example, a PR such as #11196 were merged in with updated tests to assure that this functionality remains intact, those tests would fail with older versions of the driver / launcher. So the options to keep tests passing in the `master` branch would be:

1. Also build these images and test them (via the overrides) in `master` branch executions.
2. Leave the overrides off and test the `master` branch against the default driver / launcher images.

The current guidance for releases assumes a pattern most similar to the second option. If the first option were taken, we would need to assure that at some point before release the overrides are disabled and the workflows executed.
Thanks for driving this @CarterFendley!
As far as the PR suggestions are concerned, I think your proposal in the OP makes sense. IMO we should just go with setting the env vars.

> Also build these images and test them in `master` branch executions.

Is this not currently the case? If the kfp-cluster action runs post-merge against the `master` branch, then `build-images.sh` is used there as well, so in that scenario the same env vars could be set to reference the kind images.
Granted, if this is the flow, then the tests can be a bit misleading, since the hard-coded images are not what we are testing against.
What if we just reference `latest` where we hardcode the images, and for releases we just update the images in the manifests (the env vars) instead of the manual step of updating the hardcoded images in code (Step 2 here)?
This way you are effectively doing (1), and (2) is handled implicitly if we create an automated build/push to `main`. We can tie this in with the newly added workflow that pushes to GHCR here, and have it run on PR merge with a push to `main` and `latest`. This would require us to treat `latest` == `main` (or we hardcode to `main`, but that might be strange).
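To make the "push to `main` and `latest` on merge" idea concrete, here is a minimal sketch of what such a post-merge job could look like; the image names, Dockerfile path, and trigger below are assumptions, and the existing GHCR workflow may do this differently:

```yaml
# Illustrative sketch only -- not the contents of the existing image-builds.yml workflow.
name: build-driver-launcher-images
on:
  push:
    branches: [main]
jobs:
  build-and-push:
    runs-on: ubuntu-latest
    permissions:
      packages: write
    steps:
      - uses: actions/checkout@v4
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push the driver image, tagged both "main" and "latest"
        run: |
          docker build -f backend/Dockerfile.driver \
            -t ghcr.io/kubeflow/pipelines/kfp-driver:main \
            -t ghcr.io/kubeflow/pipelines/kfp-driver:latest .
          docker push --all-tags ghcr.io/kubeflow/pipelines/kfp-driver
      # The launcher image would get the same treatment.
```

With something like this in place, hardcoding the defaults to `latest` would mean post-merge executions always pick up the most recently merged driver / launcher.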
Adding a note about the urgency of this change: #11246 introduced a bug which would have been caught by the sample tests if the driver / launcher had been built and tested. Specifically, the `two_step_pipeline_containerized` sample would have shown this; however, the sample tests passed on the PR without issue.
@zazulam took a two day side quest finding a solution to the issue 😂
Will work with my team to allow me to pick this up next (hopefully tomorrow).
@CarterFendley just to make sure, are you saying the PR introduced a bug in the main branch, or in the context of the PR? If the latter, can we prioritize a fix as a separate commit/PR?
Regardless, let us know what we can do to help support you in this effort, we too feel this is very important!
@HumairAK That the PR introduced a bug in the main branch. @zazulam is the expert there, I think he will be on the community call tomorrow and will have better information than me.
Yes the integration testing PR will be separate.
@HumairAK prior to that PR, `forcePathStyle` defaulted to true, so for a local deployment the bucket URL would resolve to `http://minio_ip:9000/bucket_name/...`.
Now the driver does not pass a value for `forcePathStyle` when creating the `SessionInfo` used for the root DAG, so when it gets resolved downstream it is effectively false, resulting in the format `http://bucket_name.minio_ip:port/bucket_name/...`, which causes the error when attempting to store artifacts. This error arises when building the launcher image off the current `master` branch. Lmk if you think it should be added as a separate PR.
Thanks folks @CarterFendley / @zazulam!
> Lmk if you think it should be added as a separate PR.
Yes it sounds like a separate issue which we should fix asap if possible, I've created the following issue to track it: https://github.com/kubeflow/pipelines/issues/11280#issue-2575899093
Let us know if you guys can take this, we can assign it.
Okay so the PR above starts work on this.
Long term we should probably be using something different than `kubectl patch` commands, but this gets things working to stop any regressions for the moment.
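For reference, a minimal sketch of the kind of override those patch commands amount to; the container name, registry address, and tags below are assumptions for illustration, not the exact patch used in the PR:

```yaml
# driver-launcher-override.yaml -- a strategic-merge patch, applied with e.g.:
#   kubectl -n kubeflow patch deployment ml-pipeline --patch-file driver-launcher-override.yaml
# Container name and image references are assumptions for illustration.
spec:
  template:
    spec:
      containers:
        - name: ml-pipeline-api-server
          env:
            - name: V2_DRIVER_IMAGE
              value: localhost:5001/kfp-driver:pr-test    # image built by build-images.sh; tag illustrative
            - name: V2_LAUNCHER_IMAGE
              value: localhost:5001/kfp-launcher:pr-test  # image built by build-images.sh; tag illustrative
```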
Looping back to the process / design comments raised by @HumairAK: I don't mind the idea of having `latest` images published after merge, with subsequent testing. We would still need to override during PR testing and disable those overrides during post-merge testing, which is totally doable.
Also, I'm slightly curious about trying to bring `build-images.sh` and `image-builds.yml` together; it would be nice to have one solution for this. We already have at least 4 different ways to build launcher and driver images 😅
Feature Area
/area backend /area samples
What feature would you like to see?
The V2 backend driver / launcher images being built and tested against incoming PRs through integration tests / etc.
What is the use case or pain point?
Assure stability of driver / launcher.
Is there a workaround currently?
Trust people to test driver / launcher locally.
More details
Currently, the `kfp-cluster` action, used by the workflows listed below, uses `build-images.sh` to build a set of images and push them to the kind registry (a rough sketch of this wiring follows the list).
- `e2e-test.yml`
- `kfp-kubernetes-execution-tests.yml`
- `kfp-samples.yml` <-- My primary focus at the moment
- `kubeflow-pipelines-integration-v2.yml`
- `periodic.yml`
- `sdk-execution.yml`
- `upgrade-test.yml`
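For orientation, a rough sketch of how one of these workflows might wire in the cluster setup; the action path, step names, and test entry point below are assumptions rather than the actual workflow contents:

```yaml
# Illustrative only -- not the actual contents of kfp-samples.yml.
jobs:
  samples:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Create KFP cluster (runs build-images.sh and pushes images to the kind registry)
        uses: ./.github/actions/kfp-cluster   # assumed path of the composite action
      - name: Run the sample tests against that cluster
        run: ./backend/src/v2/test/sample-tests.sh   # hypothetical entry point
```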
The set of images which are built by `build-images.sh` does not currently include the V2 driver and launcher. Even if this is changed, there would still be additional work required to assure these built images would be used by the backend during testing. Namely, the backend has defaults for which images to use (see here) which normally point to `gcr.io` locations. Work would need to be done to override these defaults so that during PR testing, the built images would be used instead of the ones previously deployed on `gcr.io`.
Discussion of implementation

- Adding the driver / launcher images to `build-images.sh` would likely be pretty straightforward.
- The `V2_DRIVER_IMAGE` / `V2_LAUNCHER_IMAGE` environment variables can be used to override the `gcr.io` defaults (configured via the `deployment.apps/ml-pipeline` deployment). @hbelmiro has suggested maybe using a Kustomize layer for updating these during testing.
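If the Kustomize route is taken, a test-only overlay could carry the same override; a rough sketch, where the base path, patch file name, and target are illustrative assumptions:

```yaml
# kustomization.yaml for a hypothetical test-only overlay
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                            # assumed location of the regular KFP manifests
patches:
  - path: driver-launcher-override.yaml   # e.g. a strategic-merge patch setting the env vars above
    target:
      kind: Deployment
      name: ml-pipeline
```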
What about releases?

Although it makes sense to build driver / launcher images and test them during PRs, it may make sense to NOT override the `V2_DRIVER_IMAGE` / `V2_LAUNCHER_IMAGE` defaults and instead test against the `gcr.io` deployments when validating releases. Since users are unlikely to override these values and will be using `gcr.io`, it is reasonable to test in that configuration.

I am not aware of the extent to which `kfp-samples.yml` (or other workflows consuming the `kfp-cluster` action) are executed during release processes. Please let me know if others have more info on this :)

Related Slack thread: https://cloud-native.slack.com/archives/C073N7BMLB1/p1727104197895549
Love this idea? Give it a 👍.