kubeflow / website

Kubeflow's public website
Creative Commons Attribution 4.0 International
145 stars 752 forks

Implement a Reusable E2E Kubeflow ML Lifecycle #3728

Closed andreyvelich closed 3 weeks ago

andreyvelich commented 2 months ago

Based on our recent discussion with @franciscojavierarceo, I updated the ML lifecycle diagram in the architecture guides: https://github.com/kubeflow/website/pull/3719#discussion_r1578194240

We can re-use this ML lifecycle diagram in each Kubeflow Component and explain the user value of that component.

I like the existing diagrams, but they are a little bit out of date. I am happy to improve my diagrams based on your feedback.

Also, I removed unused images.

/assign @franciscojavierarceo @kubeflow/kubeflow-steering-committee @thesuperzapper @StefanoFioravanzo @hbelmiro

/hold for review

StefanoFioravanzo commented 2 months ago

@andreyvelich What exactly are you trying to accomplish? I didn't fully get this part:

> We can re-use this ML lifecycle diagram in each Kubeflow Component and explain the user value of that component.

Is this diagram supposed to be re-used by each component, and if so, how do you envision that?

andreyvelich commented 2 months ago

> @andreyvelich What exactly are you trying to accomplish? I didn't fully get this part
>
> We can re-use this ML lifecycle diagram in each Kubeflow Component and explain the user value of that component.
>
> Is this diagram supposed to be re-used by each component, and if so, how do you envision that?

That's right. Please check these examples:

We can do the same for Model Registry, Spark Operator, Notebooks if other WGs agree with that.

What do you think about it @StefanoFioravanzo ?

StefanoFioravanzo commented 2 months ago

Oh Ok now I understand your approach, I like this. You are proposing we build a canonical Kubeflow ML lifecycle diagram and then highlight what parts of the diagram each component covers.

So, based on this, I propose two things:

  1. Rename this PR to better represent what we are doing (e.g. Implement a reusable E2E ML lifecycle diagram or something like that).
  2. Consider using and adapting an existing diagram. There are many open source E2E ML lifecycle diagrams that are widely used and promoted by large organizations. One option is to overlay Kubeflow and Kubeflow components on top of one of these.

If you want to keep the focus smaller and have a quicker iteration on the existing diagram, I am fine with it and you can ignore the two points above.

StefanoFioravanzo commented 2 months ago

cc @Chasecadet can probably provide some good insight on this

StefanoFioravanzo commented 2 months ago

@andreyvelich a very good open source diagram that we can reuse is this one by the AI Infrastructure Alliance. See here https://github.com/ai-infrastructure-alliance/blueprints

There is no explicit license, but they do write in the README:

> Please retain the AIIA Logo on the diagrams when you use them, otherwise you are free to modify them in any way you see fit.

I think this would be a pretty good starting point for a reusable diagram. They have an editable figma file, and even an interactive version. Take a look at all the folders, there's various versions.

We could fork the repository under the Kubeflow org and adapt it to the various components. If we want, we could embed the interactive diagram in our website. If we are unsure about licensing and reusability of that content, I can reach out to a couple of folks at AIIA.

StefanoFioravanzo commented 2 months ago

I can see us doing something similar to this interactive version https://ai-infrastructure-alliance.github.io/blueprints/interactive-stack-diagram/stack.html where each option is one of the Kubeflow components. So you can see how the entire Kubeflow platform (we can have an "all" picker) covers the E2E ML lifecycle, or how much of it your a-la-carte choice of components covers.
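A minimal sketch of how such a picker could work, assuming a canonical list of lifecycle stages and a mapping from each component to the stages it covers. The stage names, the component-to-stage mapping, and the function names `covered_stages` and `render` are all hypothetical and chosen only for illustration; they are not taken from the PR or from the AIIA templates.

```python
# Hypothetical canonical lifecycle stages, in diagram order.
LIFECYCLE_STAGES = [
    "data preparation",
    "model development",
    "model training",
    "model registry",
    "model serving",
]

# Assumed mapping from a Kubeflow component to the stages it covers.
COMPONENT_STAGES = {
    "Spark Operator": {"data preparation"},
    "Notebooks": {"model development"},
    "Training Operator": {"model training"},
    "Model Registry": {"model registry"},
    "KServe": {"model serving"},
}


def covered_stages(selection):
    """Return the stages, in diagram order, covered by the picked components.

    `selection` is either the string "all" or an iterable of component names.
    """
    picked = COMPONENT_STAGES.keys() if selection == "all" else selection
    covered = set()
    for name in picked:
        covered |= COMPONENT_STAGES[name]
    return [stage for stage in LIFECYCLE_STAGES if stage in covered]


def render(selection):
    """Render the lifecycle as text, marking covered stages with [x]."""
    covered = set(covered_stages(selection))
    return " -> ".join(
        f"[{'x' if stage in covered else ' '}] {stage}"
        for stage in LIFECYCLE_STAGES
    )
```

Calling `render(["KServe"])` would mark only the serving stage, while `render("all")` would mark every stage. The same mapping could drive a static per-component diagram or an interactive toggle.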

andreyvelich commented 1 month ago

> rename this PR to better represent what we are doing (e.g. Implement a reusable E2E ML lifecycle diagram or something like that)

That makes sense, renamed it.

andreyvelich commented 1 month ago

> If you want to keep the focus smaller and have a quicker iteration on the existing diagram, I am fine with it and you can ignore the two points above.

To be honest, I have concerns about the existing diagram: it was created ~5 years ago and is very out of date. E.g. it doesn't include model fine-tuning, which is the modern approach to model development, and it doesn't have an online feature store. WDYT @StefanoFioravanzo @franciscojavierarceo?

andreyvelich commented 1 month ago

> a very good open source diagram that we can reuse is this one by the AI Infrastructure Alliance. See here https://github.com/ai-infrastructure-alliance/blueprints

I like their diagrams, but they look similar to what we have in this PR, don't they?

E.g. the differences:

Maybe we can improve our diagram with additional stages? WDYT @franciscojavierarceo @StefanoFioravanzo

franciscojavierarceo commented 1 month ago

> I can see us doing something similar to this interactive version https://ai-infrastructure-alliance.github.io/blueprints/interactive-stack-diagram/stack.html where each option is one of the Kubeflow components. So you can see how the entire Kubeflow platform (we can have an "all" picker) covers the E2E ML lifecycle, or how much of it your a-la-carte choice of components covers.

I agree the old diagram is outdated.

I much prefer a diagram that reflects the view of a Data Scientist and the needs in their workflow, which the diagram you proposed does. I think the AI Infrastructure Alliance diagram frames things around the needs of different companies with different structures and, while that's helpful, I don't think it makes the value of Kubeflow clear.

chasecadet commented 1 month ago

@StefanoFioravanzo finally getting to this! Before I say too much I'd like to take a step back because as we allll know "tactics without vision is just noise before defeat". I like the idea of an ML diagram. I would love to know what our vision for these documents is and how we are approaching this. For example: someone reads the diagram, learns X, then starts building using Y and delivers Z value to their project/org.

Allow me to free associate here a bit on what I think would be interesting. I like the idea of talking about use cases for specific components, but I struggle with the idea of telling users what to do. I want to help them envision using these tools and enable them to solve problems creatively. Another way to say this is that I would love it if the users told us what they use these components for, in collaboration with our vision for these components. We as a community can provide guidance. If we act as a ground-truth authority on use cases, we might lose out on the value of new community members using the tools in powerful but unexpected ways that we can later integrate into more robust use cases.

Questions I'd love to have answers to are:

We can touch on trying to, say, run an XGBoost job with KFP alone versus integrating the Training Operator, to show that you "can" do things in MANY ways but may lose out on overall value by redoing our engineering efforts through your own means.

That being said, *stands on soap box*, I love calling out the model development lifecycle according to this community and placing components within that lifecycle as suggestions. Some placements are more concrete than others (you can't use KServe to train a model), but we should also show that we have a flexible, composable, and integrated solution you can port anywhere to run MLOps at scale. I think @jbottum said it very well: the power of KF is not just our components but also the community. As we grow, we benefit from continuing to demonstrate the tribal community knowledge we are building and sharing with the world, so teams can "Go with the Kubeflow" knowing they are part of a community that writes code with a purpose, using learnings from many orgs, communities, and perspectives to build a world-class MLOps solution that vastly democratizes access to ML/AI across the industry. Showing others "what's in it for them" when using KF will bring them into the community, keep it healthy, and fuel the next generation of contributors as we go from incubation to graduation and beyond. *hops off soap box*

Maybe I missed the point of the CC. I also have a chapter on the model dev lifecycle in that class I built. I officially own the content, and we can use it however we see fit to create some MLOps-like documents.

andreyvelich commented 1 month ago

@StefanoFioravanzo @franciscojavierarceo I've made a few updates to the lifecycle diagram based on the feedback. Does it look good to you? I think we can merge this PR before the Kubeflow 1.9 release.

franciscojavierarceo commented 1 month ago

> @StefanoFioravanzo @franciscojavierarceo I've made a few updates to the lifecycle diagram based on the feedback. Does it look good to you? I think we can merge this PR before the Kubeflow 1.9 release.

Looks great!

andreyvelich commented 1 month ago

/hold cancel

andreyvelich commented 1 month ago

I updated the main page to re-use the same statement as we added to the introduction section: https://www.kubeflow.org/docs/started/introduction/

StefanoFioravanzo commented 4 weeks ago

@andreyvelich thanks for driving this. I think it looks pretty good and we should go ahead and merge this.

For the future, I still think we should try and create more polished and beautiful diagrams. Having an interactive version where we can toggle components and third party integrations as I mentioned above would be pretty cool. I'd love to play around with a fork of the AIIA templates. Maybe in the near future I'll find some time :)

StefanoFioravanzo commented 4 weeks ago

/lgtm

rimolive commented 4 weeks ago

/lgtm

@andreyvelich Given the number of lgtm's can we work on approvals to merge this PR?

andreyvelich commented 4 weeks ago

Thanks for the review @thesuperzapper!

I addressed your comments.

/assign @StefanoFioravanzo @kubeflow/release-managers @thesuperzapper @kubeflow/kubeflow-steering-committee

andreyvelich commented 3 weeks ago

@thesuperzapper @StefanoFioravanzo @franciscojavierarceo @hbelmiro I removed the start page changes from this PR; I will create a separate PR to update it. Are we ready to merge this PR?

vikas-saxena02 commented 3 weeks ago

@andreyvelich my two cents:

Happy to help with making the changes if you need some help.

andreyvelich commented 3 weeks ago

> The architecture diagram under Kubeflow Ecosystem that we modified previously as part of my PR should be updated to include SparkOperator, since the other diagrams in this PR have it. I think the big block on the right under Integrations is the best place. The other option would be to put it under Kubeflow Components.

We will include Spark Operator + Model Registry in this diagram once we make the first official release for these components.

chasecadet commented 3 weeks ago

I'm just adding some details here. I have a ton of content around the ML Lifecycle we can use from the course, and it's free. I own it. https://docs.google.com/document/d/1t2gTTQolI7DfLQJUbhSqd8bxhrIVqZOIU8dKGiTrHoo/edit?usp=sharing @andreyvelich @StefanoFioravanzo, feel free to take a look and see what we can use. I included model monitoring as part of serving and also mentioned model retiring.

chasecadet commented 3 weeks ago

also @andreyvelich keep me posted on this. I can update the course with our official ML lifecycle as well as updated architecture diagrams.

andreyvelich commented 3 weeks ago

> I'm just adding some details here. I have a ton of content around the ML Lifecycle we can use from the course, and it's free. I own it. https://docs.google.com/document/d/1t2gTTQolI7DfLQJUbhSqd8bxhrIVqZOIU8dKGiTrHoo/edit?usp=sharing @andreyvelich @StefanoFioravanzo, feel free to take a look and see what we can use. I included model monitoring as part of serving and also mentioned model retiring.

That's great @chasecadet, it would be nice if you could present it sometime in one of our community calls and collect feedback.

andreyvelich commented 3 weeks ago

@franciscojavierarceo @thesuperzapper @vikas-saxena02 @chasecadet @StefanoFioravanzo @hbelmiro @kubeflow/kubeflow-steering-committee I think we can merge this PR if you don't have any strong objections. As @franciscojavierarceo and I said before, we can always iterate on our architecture page to better explain the value of Kubeflow components.

vikas-saxena02 commented 3 weeks ago

@andreyvelich no strong objection. Just another recommendation to add the CNCF paper as a reference.

vikas-saxena02 commented 3 weeks ago

/approve

thesuperzapper commented 3 weeks ago

@andreyvelich While we can always make improvements (and I am sure we will in future PRs) this update is a significant improvement to the architecture page and I think it's worth merging now.

/lgtm

@andreyvelich you will probably need to approve this, as it needs a root approver given the number of files changed.

google-oss-prow[bot] commented 3 weeks ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, franciscojavierarceo, vikas-saxena02

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubeflow/website/blob/master/OWNERS)~~ [andreyvelich]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment