kubeflow / manifests

A repository for Kustomize manifests
Apache License 2.0

[kubeflow 1.8] Kubeflow 1.8 Tracking Issue #2442

Closed DnPlas closed 1 year ago

DnPlas commented 1 year ago

This issue will provide high-level updates on the Kubeflow 1.8 release.

TODO:

cc: @kubeflow/release-team @jbottum

helloericsf commented 1 year ago

After evaluating our engineering roadmap and priorities, the BentoML team has decided to pause the integration with Kubeflow Pipelines in the 1.8 release. We value our collaboration with Kubeflow and apologize for any disruption this causes. We believe pausing the integration temporarily is the right decision to ensure we deliver quality features. We look forward to resuming our work together in future releases. I wanted to let you know about this decision openly and transparently. Please reach out if you have any questions or concerns. As the serving work group liaison, I do not anticipate any changes in my role or responsibilities. cc @DnPlas

DnPlas commented 1 year ago

Hi community, as we approach our feature freeze (Aug 2nd), I think it is worth asking whether there is anything that you think will require more time to be completed before that date. The release team liaisons have been doing an excellent job at communicating with WG leads, but I extend the question to the rest of the community.

cc: @kubeflow/wg-automl-leads @kubeflow/wg-manifests-leads @kubeflow/wg-notebooks-leads @kubeflow/wg-pipeline-leads @jbottum

juliusvonkohout commented 1 year ago

@tzstoyanov will commit the Istio 1.18 upgrade in https://github.com/kubeflow/manifests/pull/2455 tomorrow, and I hope that the additional rootless work will not be relevant for the feature freeze, as written down in the PR description.

DnPlas commented 1 year ago

@adriangonz FYI, just so you have the dates and all information about the upcoming release.

kimwnasptd commented 1 year ago

Hey @DnPlas, I'd like to sincerely ask that we have a 1 week delay. The situation with Notebooks WG is the following:

  1. I'd really like to include https://github.com/kubeflow/kubeflow/pull/7179, since it's the last piece to include @TobiasGoerke's effort
  2. There's an issue with our CI and it's failing to build images, which we are tracking in https://github.com/kubeflow/notebooks/issues/82

The plan I have in mind for Notebooks WG is to evaluate as soon as possible how we can unblock the CI and do the review on the PVCViewer integration. I'm expecting this to take until this Friday, 4th August.

Then on Monday we cut our v1.8-branch in the kubeflow/kubeflow repo and make some small PRs we need to build all the images and update the manifests. Even though the PRs are small, I'd expect this to take 1 day due to the async communication.

Lastly, these are the PRs that I'd also want to finalise during the feature freeze, but am OK to not delay the release for those and cherry-pick afterwards:

DnPlas commented 1 year ago

ACK @kimwnasptd, I'll share this information with the release team.

kimwnasptd commented 1 year ago

@DnPlas @NohaIhab from Notebooks WG side we've:

  1. Merged the PR we wanted for the Volumes UI https://github.com/kubeflow/kubeflow/pull/7179
  2. Fixed the CI https://github.com/kubeflow/kubeflow/pull/7231

Over the next couple of days we'll proceed with cutting the release branch and updating our manifests for the RC.

DnPlas commented 1 year ago

@kimwnasptd thanks for the update, please keep the team posted as we are planning to finish the manifest sync next week (Wednesday).

cc: @NohaIhab

yhwang commented 1 year ago

@DnPlas I just created a PR to update the Kubeflow Tekton Pipelines manifest to 2.0.0: kubeflow/kubeflow#2500 cc @Tomcli

DnPlas commented 1 year ago

Hi folks, I would like to announce that the Kubeflow 1.8 RC.0 is out 🎉 and that we have started with manifest testing. We expect to finish this process by the end of this week (September 15th).

I'd like to encourage community members to start testing the release and provide feedback, as well as file issues if any. I would also like to remind all Distribution owners that once Manifest Testing ends we will begin Distribution Testing on September 15th (as soon as we have the results of manifest testing). Please get your Distributions and infrastructure ready for that stage.

I also want to take the opportunity to thank all the community members who have helped with getting to this stage. Let's keep working toward a successful release!

DnPlas commented 1 year ago

The release team is happy to announce that we have released Kubeflow 1.8 RC.1.

This is now the time for Distributions to start with their testing. The release team kindly asks for feedback by the EOW next week. Feel free to submit issues and comment on the various WGs repositories.

I'd like to encourage community members to start testing the release and provide feedback, as well as file issues if any.

I also want to take the opportunity to thank all the community members who have helped with getting to this stage. Let's keep working toward a successful Release!

pmuilu commented 1 year ago

So is it so that kfp v2 will be still broken with KF 1.8? (At least #8733 is still open)

DnPlas commented 1 year ago

> So is it so that kfp v2 will be still broken with KF 1.8? (At least #8733 is still open)

hey @chensun @zijianjoy pinging you folks for getting more accurate information. Do you think this issue deserves more attention?

kromanow94 commented 1 year ago

Hello, I have a few issues with 1.8 based on my tests on Kubeflow 1.8 RC.1:

In general I feel that the previous version of the KF Pipeline UI was more informative. For example, if the Pods' Step was Pending, it was possible to see the reason for the Pending state in the Pipeline Run Page.

Is there a way to re-enable the main-logs artifact?

Davidnet commented 1 year ago

Tagging @Linchin for visibility. I don't know if we should create an issue in kfp, but I think kfp is working as expected? Let me know what you think.

kromanow94 commented 1 year ago

@Davidnet from the perspective of running the Steps in the Pipeline Run, I think kfp is working as expected. It's more about the dashboard, although I'm not sure whether the exception `Cannot get MLMD objects from Metadata store` is expected. Is the Pipeline Run dashboard maintained by the Pipelines WG or another one?

Linchin commented 1 year ago

Hi @kromanow94, thank you for your feedback. I haven't completed testing rc.1 yet, but I think with my rc.0 deployment I could answer some of your questions.

> If the Pipeline Run Steps are Pending, there is an error shown Cannot get MLMD objects from Metadata store.

I have reproduced this error and I will investigate further into it.

> The Pipeline Run Page doesn't show the main-logs

This is a v1 feature that is not implemented in v2. I have created an issue in the KFP repo about this.

> it's not possible to access the input and output artifacts due to RBAC Access Denied error.

I haven't been able to reproduce this on an rc.0 deployment, but I will double check on rc.1.

thesuperzapper commented 1 year ago

@kimwnasptd we need to make sure that ARM support gets merged before the final 1.8 RC:


Also, given how many issues and pending PRs are not going to make it for Kubeflow 1.8, I propose we plan to decouple versions of the kubeflow/kubeflow repo components from the overall Kubeflow 1.X versions.

This will allow us to cut a 1.9.0 release (of the Notebooks WG components) with some of the important fixes/features without waiting literally months. Anyone interested can discuss this proposal here:

DnPlas commented 1 year ago

> Hi @kromanow94, thank you for your feedback. I haven't completed testing rc.1 yet, but I think with my rc.0 deployment I could answer some of your questions.
>
> > If the Pipeline Run Steps are Pending, there is an error shown Cannot get MLMD objects from Metadata store.
>
> I have reproduced this error and I will investigate further into it.
>
> > The Pipeline Run Page doesn't show the main-logs
>
> This is a v1 feature that is not implemented in v2. I have created an issue in the KFP repo about this.
>
> > it's not possible to access the input and output artifacts due to RBAC Access Denied error.
>
> I haven't been able to reproduce this on an rc.0 deployment, but I will double check on rc.1.

hey @Linchin , thanks for your reply. Should we consider this issue as a blocker for the release? If so, should we expect an RC2 from the pipelines WG to fix this?

kimwnasptd commented 1 year ago

Regarding Notebooks, we are very close to merging the following 2 and would like to ask we wait one day tops to get those in.

yhwang commented 1 year ago

@DnPlas I'd like to update kfp-tekton from 2.0.0 to 2.0.1 and here is the PR: kubeflow/kubeflow#2545 . Thanks!

DnPlas commented 1 year ago

Thanks for the update @kimwnasptd, @yhwang ! I'll wait for those two items to cut the next RC.

DnPlas commented 1 year ago

During testing, one of our users also ran into https://github.com/kubeflow/kubeflow/issues/7273. Leaving this comment for future reference as it should be fixed for the 1.8 release. cc: @kimwnasptd @NohaIhab

DnPlas commented 1 year ago

Hi folks, as we are starting the bug fixing phase of the release, I'd like to make a quick update on the status.

The date when we release RC2 depends directly on the readiness of the above. Could we have an update for the pending PR in kubeflow/kubeflow? @kimwnasptd @thesuperzapper

juliusvonkohout commented 1 year ago

Maybe https://github.com/kubeflow/kubeflow/pull/7322 is also of interest as a bugfix.

kimwnasptd commented 1 year ago

Hey @DnPlas, the ARM PR will not be cherry-picked for this RC since the build fails on merged PRs.

For https://github.com/kubeflow/kubeflow/pull/7310 I'll try to update the PR today

DnPlas commented 1 year ago

Thanks @kimwnasptd, about the https://github.com/kubeflow/kubeflow/pull/7220 feature, should we expect this to be merged into the 1.8 branch at some point in the next two weeks or is it not going to be at all included in the release?

chensun commented 1 year ago

https://github.com/kubeflow/pipelines/issues/8733 is not a blocker for the release. We're good from the Pipelines side.

DnPlas commented 1 year ago

Thanks for the update @chensun !

juliusvonkohout commented 1 year ago

@chensun will https://github.com/kubeflow/pipelines/pull/9946 be backported for 1.8 ? Otherwise we cannot run v2 pipelines as non-root.

chensun commented 1 year ago

> @chensun will kubeflow/pipelines#9946 be backported for 1.8 ? Otherwise we cannot run v2 pipelines as non-root.

I plan to cut a KFP release (2.0.2) today, and we can include that into KF 1.8

chensun commented 1 year ago

@DnPlas

KFP 2.0.2 tag is out: https://github.com/kubeflow/pipelines/releases/tag/2.0.2 Can we pull its manifests into the next RC? Thanks!

DnPlas commented 1 year ago

Will do, thanks @chensun !

sachdevayash1910 commented 1 year ago

> it's not possible to access the input and output artifacts due to RBAC Access Denied error.

I have been working on setting up 1.8-rc1 as well and have observed the same things @kromanow94 mentioned above.

[Screenshot, 2023-10-17 2:39:46 PM]

In addition to this, in KF <=1.7 we could also see pod IDs when we clicked on a component. This was extremely helpful from a debugging and monitoring point of view. However, I don't see this anymore.

If this is intentional, is there some other way of getting this info? Happy to raise an issue if needed. For reference:

[Screenshot, 2023-10-17 2:38:13 PM]

juliusvonkohout commented 1 year ago

@sachdevayash1910 you need to be on the latest 1.8 branch. There is a bug in the KFP profile controller, where they forgot to add the serviceaccount of pipelines-ui to their rolebindings. We hacked this into the normal profile controller, but it might not be in RC1.

paravatha commented 1 year ago

I am seeing RBAC: access denied even on notebooks page when I try to open a notebook instance notebook/{user}/test-vscode/

I deployed on EKS 1.26 using https://github.com/kubeflow/manifests/tree/v1.8-branch with oauth2-proxy

thesuperzapper commented 1 year ago

@paravatha which oauth2-proxy manifests are you using?

The issue will probably be related to the oauth2-proxy manifests not being updated yet to support the new security features of ensuring that only the istio gateway can talk to the notebook servers (preventing in-cluster access hacking).

See here for more info: https://github.com/kubeflow/kubeflow/pull/7310

I am not sure who is responsible for maintaining the oauth2-proxy manifests, because they are not technically officially "released" yet.
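
For readers unfamiliar with the mechanism described above, here is a rough sketch of what a "gateway only" rule looks like as an Istio `AuthorizationPolicy`. It is not the manifest from the linked PR. The policy name, the namespace, and the gateway service account principal are all assumptions chosen for illustration, and identity-based rules like this assume mutual TLS is enabled in the mesh. With an `ALLOW` policy in place, a request from any other identity is rejected, which is where a response such as `RBAC: access denied` comes from.

```python
import json


def gateway_only_policy(namespace, gateway_principal):
    """Build an Istio AuthorizationPolicy dict that allows a single caller identity.

    With no workload selector, the policy applies to every workload in the
    namespace. Requests that match no ALLOW rule are denied by Istio.
    """
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "AuthorizationPolicy",
        # Hypothetical policy name, used only for this sketch.
        "metadata": {"name": "allow-gateway-only", "namespace": namespace},
        "spec": {
            "action": "ALLOW",
            "rules": [
                {"from": [{"source": {"principals": [gateway_principal]}}]}
            ],
        },
    }


policy = gateway_only_policy(
    namespace="example-user",  # hypothetical profile namespace
    # Assumed identity of the ingress gateway; check your own deployment.
    gateway_principal=(
        "cluster.local/ns/istio-system/sa/istio-ingressgateway-service-account"
    ),
)
print(json.dumps(policy, indent=2))
```

If an auth setup routes traffic to notebook pods under a different identity than the one the policy allows, the same denial would be expected, so comparing the allowed principal against the actual gateway service account is a reasonable first check.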

paravatha commented 1 year ago

@thesuperzapper just the alternate manifests in the same branch mentioned here https://github.com/kubeflow/manifests/tree/v1.8-branch#authservice and here https://github.com/kubeflow/manifests/tree/v1.8-branch#dex

DnPlas commented 1 year ago

hey @paravatha, we identified that issue in the previous RC. Could you please try with RC2?

Please refer to https://github.com/kubeflow/kubeflow/pull/7310 for more information on the issue.

DnPlas commented 1 year ago

Hi folks,

The release team is happy to announce that we have released Kubeflow 1.8 RC.2.

As we are approaching the release date on October 25th, I'd like to encourage community members to continue testing the release and provide feedback.

paravatha commented 1 year ago

Hi @DnPlas, I tested using https://github.com/kubeflow/manifests/tree/v1.8-branch, which has the 1.8 rc2 changes. It seems that there are 2 places where RBAC: access denied is happening:

  1. on pipelines page (this may have been fixed in rc2, I have not seen it)
  2. on notebooks page (I came across this in rc2, not sure if others encountered the same issue)

chensun commented 1 year ago

> on pipelines page (this may have been fixed in rc2, I have not seen it)

Just tested rc2, I don't see such issue.

kimwnasptd commented 1 year ago

Hey folks, I've seen those 2 issues in 1.8.0-rc.2 and we are working on them:

  1. https://github.com/kubeflow/kubeflow/issues/7373
  2. https://github.com/kubeflow/kubeflow/issues/7374

thesuperzapper commented 1 year ago

I need a root approver for the website to approve this one, as it required updating hugo:

DnPlas commented 1 year ago

Hi community,

As discussed in yesterday's community meeting, the following is expected to happen:

  1. Kubeflow 1.8 GA release date Nov 1st
  2. Kubeflow 1.8 RC3 and RC4 are expected to be ready today
  3. Distributions and users are expected to test with RC4, but if there is any issue related to the recently added images, we should roll back to RC3 and use that for the actual release.

As always, we encourage the community to keep testing and providing feedback. Thanks everyone for your contributions!

thesuperzapper commented 1 year ago

@DnPlas there was a slight CI issue with the tagging, which prevented the tags for notebook images from being created for 1.8.0-rc.4. Let's quickly resolve it by merging https://github.com/kubeflow/kubeflow/pull/7386, and then cutting 1.8.0-rc.5.
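
As a side note for anyone scripting around these tags, a minimal sketch of ordering and bumping release-candidate tags follows. It assumes tags shaped like `1.8.0-rc.4` (the shape used in this thread); the helper names `tag_key` and `next_rc` are hypothetical and are not part of any Kubeflow tooling.

```python
import re

# Matches tags shaped like "1.8.0-rc.4" or a final "1.8.0", with an optional "v" prefix.
TAG_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:-rc\.(\d+))?$")


def tag_key(tag):
    """Return a sort key so that rc.2 < rc.10 < final release of the same version."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognised tag: {tag!r}")
    major, minor, patch, rc = match.groups()
    # A final release sorts after every release candidate of the same version.
    rc_rank = float("inf") if rc is None else int(rc)
    return (int(major), int(minor), int(patch), rc_rank)


def next_rc(tag):
    """Return the tag of the release candidate that follows the given RC tag."""
    major, minor, patch, rc = tag_key(tag)
    if rc == float("inf"):
        raise ValueError(f"{tag!r} is a final release, not a release candidate")
    return f"{major}.{minor}.{patch}-rc.{rc + 1}"
```

For example, `next_rc("1.8.0-rc.4")` returns `"1.8.0-rc.5"`, and sorting with `key=tag_key` avoids the plain string-sort pitfall where `rc.10` would come before `rc.2`.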

TristanGreathouse commented 1 year ago

I've been experimenting with 1.8-rc2 and kfp=2.3.0 and have run into several issues that are blockers for me and may be for others as well. A list of them is as follows:

  1. Unable to use `.after()` when referencing a component within a `ParallelFor`. [Issue](https://github.com/kubeflow/pipelines/issues/10050) with more details here.
  2. Cannot assign variables from kubernetes metadata to be environment variables as we could with `kfp=1.8.21` and earlier backend versions. [Issue](https://github.com/kubeflow/pipelines/issues/10155) with more details here.
  3. Cannot assign dynamic node_selector, cpu/memory requests/limits as we could with `kfp=1.8.21` and earlier backend versions. [Issue](https://github.com/kubeflow/pipelines/issues/10154) with more details here.
  4. Unable to pass data artifacts from parent node outside a `ParallelFor` to a child node within a `ParallelFor`. [Issue](https://github.com/kubeflow/pipelines/issues/10149) with more details here.

Happy to provide more info if any of these are not clear!

DnPlas commented 1 year ago

Hi folks,

Just so you know, RC4 is ready!

DnPlas commented 1 year ago

> I've been experimenting with 1.8-rc2 and kfp=2.3.0 and have run into several issues that are blockers for me and may be for others as well. A list of them is as follows:
>
> 1. Unable to use `.after()` when referencing a component within a `ParallelFor`. [Issue](https://github.com/kubeflow/pipelines/issues/10050) with more details here.
> 2. Cannot assign variables from kubernetes metadata to be environment variables as we could with `kfp=1.8.21` and earlier backend versions. [Issue](https://github.com/kubeflow/pipelines/issues/10155) with more details here.
> 3. Cannot assign dynamic node_selector, cpu/memory requests/limits as we could with `kfp=1.8.21` and earlier backend versions. [Issue](https://github.com/kubeflow/pipelines/issues/10154) with more details here.
> 4. Unable to pass data artifacts from parent node outside a `ParallelFor` to a child node within a `ParallelFor`. [Issue](https://github.com/kubeflow/pipelines/issues/10149) with more details here.
>
> Happy to provide more info if any of these are not clear!

Hi @TristanGreathouse, thanks for the feedback and for filing those issues. It sounds like an SDK issue, but I'd suggest you try RC4 to make sure you have the latest version of pipelines (2.0.2).

Soft ping to @chensun as he is the WG lead.

chensun commented 1 year ago

> I've been experimenting with 1.8-rc2 and kfp=2.3.0 and have run into several issues that are blockers for me and may be for others as well. A list of them is as follows:
>
> 1. Unable to use `.after()` when referencing a component within a `ParallelFor`. [Issue](https://github.com/kubeflow/pipelines/issues/10050) with more details here.
> 2. Cannot assign variables from kubernetes metadata to be environment variables as we could with `kfp=1.8.21` and earlier backend versions. [Issue](https://github.com/kubeflow/pipelines/issues/10155) with more details here.
> 3. Cannot assign dynamic node_selector, cpu/memory requests/limits as we could with `kfp=1.8.21` and earlier backend versions. [Issue](https://github.com/kubeflow/pipelines/issues/10154) with more details here.
> 4. Unable to pass data artifacts from parent node outside a `ParallelFor` to a child node within a `ParallelFor`. [Issue](https://github.com/kubeflow/pipelines/issues/10149) with more details here.
>
> Happy to provide more info if any of these are not clear!

Thank you @TristanGreathouse for the detailed issues.