kubeflow / community

Information about the Kubeflow community including proposals and governance information.
Apache License 2.0

WG Data proposal #673

Open tarilabs opened 11 months ago

tarilabs commented 11 months ago

I'm following up on action item: raise WG proposal to Kubeflow per yesterday's Model Registry meeting (recording timestamp).

As discussed in KF community meeting.

Main links:

👉 I'm opening a draft PR to "seed/bootstrap" the work of raising the request to form the WG. Using a draft PR gives us a branch we can collaborate on between stakeholders @andreyvelich @Tomcli @dhirajsb @rimolive

This also gives us a medium we can keep tabs on, so we can report back on progress during Tuesday's community plenary meetings, wdyt?

thesuperzapper commented 11 months ago

I am very strongly opposed to using the name WG-Lifecycle, because that implies that the working group is related to the lifecycle of Kubeflow itself.

My proposal for the name is: WG-Data

Where "data" can mean both actual data (spark) and metadata (model registry). We can also split it up in the future, if the members who are maintaining these components diverge.

tarilabs commented 11 months ago

My proposal for the name is: WG-Data

Very well noted @thesuperzapper, as also marked here: https://github.com/kubeflow/community/pull/673/files#diff-11b55409b3d27f083915bd4b910672caaf0e9550cf34d77fe76e8b6b9515023dR524

I just wanted to have a branch where we can start collecting this kind of feedback in a non-sparse way, and also to report back to you and the group on progress at the Tuesday meetings.

dhirajsb commented 11 months ago

@thesuperzapper how about we make it more explicit: WG ML Model Data?

thesuperzapper commented 11 months ago

As it currently stands, this WG does not meet the requirement for diverse leadership, given that all chairs come from one company (IBM, which owns Red Hat).

dhirajsb commented 11 months ago

@thesuperzapper Andrey is listed as a Chair, he's from Apple

tarilabs commented 11 months ago

I'm noticing only now that this was not marked as a Draft PR, despite that being my intent:

Using a draft PR gives us a branch we can collaborate on

My sincerest apologies.

Marked as Draft PR per the original message in this thread.

rimolive commented 11 months ago

@thesuperzapper Is there a minimum number of companies to compose the chair to make the WG eligible?

thesuperzapper commented 11 months ago

While there is no specific number requirement, the steering committee (currently @jbottum and @james-jwu) must approve the new WG in line with the community's interests. I would expect at least some concern with having 4 leads from one company and only 1 from another.

For reference, here is the lifecycle and other info about forming a working group:

Also, there are only meant to be 2-3 chairs. Some other WGs have more, but in most cases there are 2 active members and we just need to formally clean up the inactive chairs.

thesuperzapper commented 11 months ago

Also, some of the proposed chairs are not even current Kubeflow org members, so are ineligible unless they go through that process first:

rimolive commented 11 months ago

Thank you for the references! Those are valid points, and I'll see how we can work on the eligibility topic as well as your other concerns.

tarilabs commented 11 months ago

As Ricardo noted, thanks!

Is there guidance on deputies to keep WG work going during leaves, please? The reason for listing >3 chairs is that I was going through this point earlier today and, seeing that other WGs have >3, I assumed that was the purpose.

As noted, I will work to account for all the feedback received; thank you, it is very helpful.

andreyvelich commented 11 months ago

Thank you for starting this @tarilabs! Let's collaborate together on this PR for the WG Charter and Name.

Please provide your suggestions on how we should name this WG, which will initially include the Spark Operator and Model Registry components.

A few initial suggestions if WG Lifecycle is too ambitious:

I would expect at least some concern with having 4 leads from one company and only 1 from another.

This is a valid concern @thesuperzapper. We can add folks from the Spark Operator maintainers to this WG. cc @mwielgus @vara-bonthu @yuchaoran2011

andreyvelich commented 11 months ago

cc @kubeflow/wg-training-leads @kubeflow/wg-pipeline-leads @kubeflow/wg-deployment-leads @kubeflow/wg-notebooks-leads @kubeflow/wg-manifests-leads

bigsur0 commented 11 months ago

I would request "WG ML Lifecycle" if the purpose of the group is to house things in the MLOps orbit that don't have a more specific working group yet, so they can "incubate". Data Preparation, Feature Store, and Model Registry are 3 recently discussed examples that likely aren't big enough yet to have their own working group. I guess one key aspect here is to consider how new efforts can happen without the overhead of setting up a new working group for each one until it is truly merited and bandwidth is available.

Is there a process that exists for refactoring a topic out of one working group to a new working group?

jbottum commented 11 months ago

Kubeflow seems to be entering a new growth phase. The community needs a structure to support add-on components (Spark, Ray, Model Registry, Feature Store, etc). We want to encourage contributors and users to meet, discuss, experiment, decide, store code and produce documentation with a goal that integrations will help both Kubeflow and the add-on projects. We need to minimize overhead. We need to set expectations (of support...to/from Kubeflow and for users) especially if we are experimenting and trying to find market acceptance. Most importantly, we need active user participation, comment and leadership. I want to move this forward...I am a +1 to adding a single umbrella WG for all of these projects to get things moving. @james-jwu would you please provide your thoughts

thesuperzapper commented 11 months ago

I think that the name WG Data will happily encompass the various categories proposed:

Also, WG Data follows the convention of being a single word, like all other working group names.

I am still very much against WG Lifecycle. At best it's like calling it WG Other, because the whole point of Kubeflow is to map across the MLOps lifecycle, so it's just confusing.


Separately to the discussion around names, I think we should confirm that the maintainers of these various components are actually overlapping, otherwise it will make it difficult for this "mega working group" to function.

vara-bonthu commented 11 months ago

+1 to @thesuperzapper

I would suggest voting for WG Data, as it seems most appropriate for the Spark Operator. This is because it is primarily used for data processing, both batch and streaming, as well as some ML processing.

tarilabs commented 11 months ago

New commit ae188fe incorporates some feedback received around:

I will keep everyone posted during the KF Community meeting on any further updates.

thesuperzapper commented 11 months ago

Just so we are clear, I think WG Data should be the name, not WG ML Data as the PR currently stands.

tarilabs commented 10 months ago

FYI, I've added a draft of the charter for this WG, at the suggestion of other members, in commit: https://github.com/kubeflow/community/pull/673/commits/f77d17b4ea598951f538e6afa2e52511bc205c28

According to the KF process, the charter is to be submitted afterwards:

Add WG-related docs like charter.md, schedules, roadmaps, etc. to your new kubeflow/community/wg-foo directory once the above PR is merged

from here: https://github.com/kubeflow/community/blob/master/wgs/wg-lifecycle.md#:~:text=Add%20WG%2Drelated%20docs%20like%20charter.md%2C%20schedules%2C%20roadmaps%2C%20etc.%20to%20your%20new%20kubeflow/community/wg%2Dfoo%20directory%20once%20the%20above%20PR%20is%20merged

The group, however, pointed out that in more recent WG creations the charter was submitted with the WG creation PR. Example: https://github.com/kubeflow/community/pull/358

Therefore, I'm advancing the charter proposal at the same time in this PR. I'm going to "migrate" some comments as review comments on the Markdown.

StefanoFioravanzo commented 4 months ago

@kubeflow/kubeflow-steering-committee @tarilabs what is stopping us from merging this PR?

tarilabs commented 4 months ago

I think some comments from Andrey need to be taken care of; also, given the time that has passed since, the proposed ACL would likely need to be refreshed. I didn't really have a chance (at least personally) to dedicate time to this recently, but I can look back into it, maybe as soon as 1.9 has reached its milestone.

tarilabs commented 3 months ago

@andreyvelich per KF community meeting 2024-08-06, Model Registry is supportive of additional leads as required to progress on the Spark Operator 👍

franciscojavierarceo commented 3 months ago

I'm late to the party here but I'd be happy to be involved from the Feast perspective. 👋

I was a maintainer at a previous company (before joining Red Hat), so my perspective may be a little less Red Hat centric. :)

google-oss-prow[bot] commented 3 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign james-jwu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubeflow/community/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

tarilabs commented 3 months ago

@andreyvelich

@ChenYi015 I added your name and, if I understood correctly from the past meeting, listed you as affiliated with Alibaba Cloud; kindly let me know if I misunderstood something.

@franciscojavierarceo I added your name as discussed in this PR.

ChenYi015 commented 3 months ago

@tarilabs That is correct, thanks for adding my name.

andreyvelich commented 3 months ago

Please review the charter. /assign @kubeflow/wg-pipeline-leads @kubeflow/wg-training-leads @kubeflow/wg-automl-leads @kubeflow/wg-notebooks-leads @kubeflow/wg-manifests-leads @kubeflow/wg-data-leads @kubeflow/kubeflow-steering-committee

juliusvonkohout commented 3 months ago

Would this working group be relevant for the minio replacement (seaweedfs) as well?

I am currently working on a PoC in Kubeflow/manifests.

tarilabs commented 3 months ago

I've added all comments pertaining to Feast in a single commit, fa3c3187d1d731bfe723c9c9b633a868baad051c, to make it easier to manage that addition to this WG charter if required, or based on feedback from the SC.

tarilabs commented 3 months ago

Would this working group be relevant for the minio replacement (seaweedfs) as well?

Not entirely sure; to me that is more of a "storage"-related concern, while the "data"-related concerns expressed here are orthogonal to the actual medium.

I am currently working on a PoC in Kubeflow/manifests.

I'm very happy to engage in discussions, however, since "storage" is also a dimension we're exploring for Model Registry (bringing in OCI as first class, but potentially others with an abstraction layer). Let me know your thoughts!

andreyvelich commented 3 months ago

Thank you for addressing the feedback @tarilabs!

Given that we still have a discussion going on around WG governance and which projects WGs should maintain (https://github.com/kubeflow/community/pull/673#discussion_r1715256715), should we handle the Feast addition in a separate PR after a follow-up discussion?

From my point of view, we should initially just establish the Data WG with 2 Kubeflow components, Spark Operator and Model Registry, and after that we can update the charter to include Feast and other projects that we want to maintain under this WG.

Any thoughts @franciscojavierarceo @kubeflow/kubeflow-steering-committee @tarilabs ?

franciscojavierarceo commented 3 months ago

I would love for Feast to be included as I think the Data WG is a great opportunity to validate Feast's relevance and drive some urgency to closing the discussion on adding new projects, but I'll respect the outcome either way, of course.

See PR here: https://github.com/kubeflow/community/pull/741

CC @jbottum

andreyvelich commented 3 months ago

I would love for Feast to be included as I think the Data WG is a great opportunity to validate Feast's relevance and drive some urgency to closing the discussion on adding new projects

I agree with you @franciscojavierarceo, but should we include Feast in the Data WG once we make Feast part of the Kubeflow core components?

jbottum commented 3 months ago

Per my comment in the Community meeting, I support Feast as part of the WG Data and as a core KF component. I am glad to pursue that path or another, if that cannot be accomplished (as I believe a defined relationship would help both communities).

franciscojavierarceo commented 3 months ago

@andreyvelich I am okay with including Feast before making it a core component. :)

juliusvonkohout commented 3 months ago

I'm very happy however to engage in discussions, since "storage" is also a dimension we're exploring for Model Registry (bringing in OCI as first class, but potentially others with an abstraction layer). Let me know your thoughts!

Then https://github.com/kubeflow/manifests/pull/2826 and https://github.com/kubeflow/pipelines/pull/10998 might be interesting for you.

andreyvelich commented 3 months ago

Would this working group be relevant for the minio replacement (seaweedfs) as well?

I am currently working on a PoC in Kubeflow/manifests.

@juliusvonkohout This issue is related to Kubeflow Pipelines (e.g. Pipelines WG), isn't it?

juliusvonkohout commented 3 months ago

Would this working group be relevant for the minio replacement (seaweedfs) as well?

I am currently working on a PoC in Kubeflow/manifests.

@juliusvonkohout This issue is related to Kubeflow Pipelines (e.g. Pipelines WG), isn't it?

It is relevant for anyone who needs S3 storage in Kubeflow, but especially for Pipelines.

rimolive commented 1 month ago

Bumping this PR. What is missing to get this merged?

andreyvelich commented 1 month ago

Bumping this PR. What is missing to get this merged?

I think we need to make a decision on Feast. @kubeflow/kubeflow-steering-committee What are your thoughts on this?