kubeflow / community

Information about the Kubeflow community including proposals and governance information.

Model Registry proposal (ref KF community meeting 20240102) #682

Open tarilabs opened 11 months ago

tarilabs commented 11 months ago

Following feedback received during the KF community meeting held on 20240102, I am raising the Model Registry proposal (the Google doc previously shared with the community: (link)) as Markdown in the form of a Pull Request (this PR).

google-oss-prow[bot] commented 11 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tarilabs

Once this PR has been reviewed and has the lgtm label, please assign james-jwu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubeflow/community/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

rchincha commented 8 months ago

Hi All, not sure if folks here are aware of the work going on over at Open Containers Initiative (OCI) wrt standardizing registry interfaces and general-purpose "artifacts".

https://github.com/opencontainers/distribution-spec

OCI is a sibling org under Linux Foundation and its specs closely interoperate with Kubernetes.

https://opencontainers.org/

Just bringing to your attention that it is now possible to colocate arbitrary (content-addressable) artifacts, along with relationships and provenance, in an OCI dist-spec v1.1.0 conformant image registry, and not just container images.

The main motivation in this forum would be:

  1. Image registries are already a required part of the Kubernetes ecosystem and app lifecycle
  2. Many problems are likely already solved
  3. So if possible, why not re-use instead of implementing and maintaining something entirely new
  4. Excellent community tooling already available

A registry you can quickly spin up and play with: https://zotregistry.dev/

Azure, AWS and others should hopefully announce support soon.

Thoughts?

Full disclosure: I am an OCI TOB member and zot author.

rchincha commented 8 months ago

Talk is cheap, here is an example (using only a subset of OCI dist-spec v1.1.0 features) https://github.com/project-zot/zot/pull/2332

Also, an upcoming blog post clarifying most things. https://github.com/opencontainers/opencontainers.org/blob/395bc5f98777a72082bfe300a167b563af234ef0/content/posts/blog/2024-03-13-image-and-distribution-1-1.md

rareddy commented 8 months ago

@rchincha thank you for reaching out

see this about oci reference in the proposal https://github.com/kubeflow/community/pull/682/files#diff-aaf54745ecb36016135c83a5a41a03025574ecb492aec56ef6d2c7c902abfe17R180

Can I recommend you open another PR in the model registry and can we collaborate on a proposal on how you see the integration working? We have verified that we can store and retrieve models without any issues, but we have not explored how (or whether) we should spread the metadata and query it back. Whether that is needed at all is another question, as the metadata can be stored in the DB. We also have not explored how we can influence consumption of the model for inference directly from an OCI repo, as is done in projects like KServe.

rchincha commented 8 months ago

> Can I recommend you open another PR in the model registry and can we collaborate on a proposal

Would love to. Also folks over at CNCF artifacts, ORAS and OCI would certainly be interested.

An initial grokking of the KServe project suggests that there could be a couple of ways to do this:

  1. an "initContainer" approach that pulls required artifacts and lays them out so it can be consumed

  2. A CSI approach like so: https://github.com/converged-computing/oras-csi https://kserve.github.io/website/0.8/modelserving/storage/pvc/pvc/#create-pv-and-pvc
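The first option can be sketched as a Pod manifest. This is only an illustration under stated assumptions, not how KServe does it: the `oras` CLI image and tag, the `/mnt/models` path, the Pod and container names, and the helper name `build_pod_with_model_init` are all hypothetical choices made for the example.

```python
import json


def build_pod_with_model_init(model_ref: str, server_image: str) -> dict:
    """Build a Pod manifest whose init container pulls an OCI artifact
    into a volume shared with the model server container."""
    # Both containers mount the same emptyDir volume at the same path.
    shared_mount = {"name": "model-dir", "mountPath": "/mnt/models"}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "model-server"},
        "spec": {
            "volumes": [{"name": "model-dir", "emptyDir": {}}],
            "initContainers": [
                {
                    "name": "fetch-model",
                    # Assumption: an image whose entrypoint is the oras CLI.
                    "image": "ghcr.io/oras-project/oras:v1.2.0",
                    "args": ["pull", model_ref, "--output", "/mnt/models"],
                    "volumeMounts": [shared_mount],
                }
            ],
            "containers": [
                {
                    "name": "server",
                    "image": server_image,
                    # The server only needs to read the laid-out files.
                    "volumeMounts": [{**shared_mount, "readOnly": True}],
                }
            ],
        },
    }


if __name__ == "__main__":
    pod = build_pod_with_model_init(
        "registry.example.com/models/mnist:v1", "example/model-server:latest"
    )
    print(json.dumps(pod, indent=2))
```

The printed JSON can be fed to `kubectl apply -f -`. The CSI option would replace the init container and `emptyDir` with a volume backed by a CSI driver.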

rchincha commented 8 months ago

https://github.com/kubeflow/model-registry/pull/48 ^ fyi, thanks.

rareddy commented 8 months ago

> Also folks over at CNCF artifacts, ORAS and OCI would certainly be interested.

@rchincha we are collaborating with the ORAS maintainers and the model-car initiative inventors; let's see if we can bring their attention to this storage effort. We have already put in some work towards KServe Storage Containers, which will be another way of providing models for inferencing.

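A couple of requests for the proposal:

* We need to be able to support multiple storage backends, as S3 is currently the most widely preferred method in the AI communities.
* The OCI plugin must work at the OCI-Dist level, so that users can choose among Zot, Harbor, Quay, etc.
* It must be deployable in Kube, as a lot of users want to be able to deploy all infra on their own cloud rather than always connecting to an external SaaS offering.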

I did look at the ArtifactHub project in CNCF a couple of months ago. It looked very interesting in terms of how they use OCI and metadata scraping, but I did not draw any conclusions about whether that could be folded into the mix to bridge the metadata portion. That could be very interesting IMO. Is this the CNCF project you mentioned above?

rchincha commented 8 months ago

> Also folks over at CNCF artifacts, ORAS and OCI would certainly be interested.
>
> @rchincha we are collaborating with the ORAS maintainers and the model-car initiative inventors; let's see if we can bring their attention to this storage effort. We have already put in some work towards KServe Storage Containers, which will be another way of providing models for inferencing.

wrt kserve, maybe this as a contract? https://github.com/kserve/kserve/pull/3539

> A couple of requests for the proposal:
>
> * We need to be able to support multiple storage backends, as S3 is currently the most widely preferred method in the AI communities.

This is best left to the registry implementations, which may or may not choose to support an S3 backend (speaking only for zot, it does support S3). The proposal should make it clear, though, that S3 support is an additional requirement for compatibility with Kubeflow.

> * The OCI plugin must work at the OCI-Dist level, so that users can choose among Zot, Harbor, Quay, etc.

The OCI plugin must be registry-agnostic, of course, and this calls out the role that OCI dist-spec v1.1.0 plays as a contract.

> * It must be deployable in Kube, as a lot of users want to be able to deploy all infra on their own cloud rather than always connecting to an external SaaS offering.

Another additional requirement, and one that comes with the territory.

> I did look at the ArtifactHub project in CNCF a couple of months ago. It looked very interesting in terms of how they use OCI and metadata scraping, but I did not draw any conclusions about whether that could be folded into the mix to bridge the metadata portion. That could be very interesting IMO. Is this the CNCF project you mentioned above?

As I understand it, ArtifactHub predates OCI dist-spec v1.1.0, but there may be interest in standardizing on this dist-spec.

> metadata scraping

OCI dist-spec v1.1.0 has explicit provisions for this. But can you kindly point to some concrete examples?

Will update https://github.com/kubeflow/model-registry/pull/48

tarilabs commented 8 months ago

> metadata scraping
>
> OCI dist-spec v1.1.0 has explicit provisions for this. But can you kindly point to some concrete examples?

Personally, I'm very curious about examples on this topic! :) It is very interesting in the context of potentially indexing/querying a manifest of metadata (a "model registry" use case) by means of OCI artifacts.

rchincha commented 8 months ago

> metadata scraping
>
> OCI dist-spec v1.1.0 has explicit provisions for this. But can you kindly point to some concrete examples?
>
> Personally, I'm very curious about examples on this topic! :) It is very interesting in the context of potentially indexing/querying a manifest of metadata (a "model registry" use case) by means of OCI artifacts.

https://github.com/opencontainers/opencontainers.org/blob/395bc5f98777a72082bfe300a167b563af234ef0/content/posts/blog/2024-03-13-image-and-distribution-1-1.md#describing-associations

^ this is how the OCI community has addressed this. Note that the original use case was container images and associated metadata such as SBOMs etc.

So in this case ...

  1. upload model data (of a particular media-type)
  2. upload model metadata (of a particular media-type and subject:=1. above)
  3. download 1.
  4. download "artifacts referring to 1." and optionally "of a particular media-type"
tarilabs commented 8 months ago

Thanks @rchincha, is there a way to avoid downloading the associated metadata only to query it locally, and instead run the query on the OCI registry server side?

Example: here I have 3 different ML models stored as OCI artifacts: https://quay.io/repository/mmortari/mnist?tab=tags

I know some metadata for each of those. I'm looking for a solution, if possible, that doesn't require me to download the associated metadata manifest of each artifact locally in order to query that metadata. For a concrete example: if each of the models defines accuracy=0.987 or the like, I want to query which ML artifact in the mmortari/mnist repo above has max(accuracy).

I hope the example conveys the question I'm curious about. Edit: that is why @rareddy was referring to an analogue of ArtifactHub; from a capability and usage point of view, it seems a very similar use case, in a way.

rchincha commented 8 months ago

@tarilabs

> For a concrete example: if each of the models defines accuracy=0.987 or the like, I want to query which ML artifact in the mmortari/mnist repo above has max(accuracy).

In the OCI dist-spec world, one way would be to list all tags in a repository, get their manifests and compare annotations (== accuracy=0.987) - no need to download actual data.

I was more concerned about the following: https://github.com/MarquezProject/marquez https://github.com/google/ml-metadata

tarilabs commented 8 months ago

Thanks @rchincha, it is reassuring to hear that it doesn't need to download the actual data. I will be looking for a chance to understand in more detail from you how OCI dist-spec works for this use case in practice.

We have Model Registry biweekly meetings: https://www.kubeflow.org/docs/about/community/#kubeflow-community-calendars

Do you think you'll be able to join one, so we could discuss it live in more detail? Thanks!

rchincha commented 8 months ago

https://kccnceu2024.sched.com/event/1YeLi ^ This idea is spreading around I suppose ... @Kubecon EU 2024

Your next meeting is Apr 1. Will try to make that.

rchincha commented 6 months ago

https://github.com/kubernetes/enhancements/pull/4642 some overlapping work/groups ...

tarilabs commented 6 months ago

> kubernetes/enhancements#4642 some overlapping work/groups ...

iiuc this would allow "materializing" OCI artifacts as a mounted volume in a container, effectively making the "files" inside an OCI artifact available for inference, say in a running container of a model server. Is this a fair summary?

rchincha commented 6 months ago

> kubernetes/enhancements#4642 some overlapping work/groups ...
>
> iiuc this would allow "materializing" OCI artifacts as a mounted volume in a container, effectively making the "files" inside an OCI artifact available for inference, say in a running container of a model server. Is this a fair summary?

Still a preliminary KEP, but would seem so.

rhuss commented 6 months ago

> kubernetes/enhancements#4642 some overlapping work/groups ...
>
> iiuc this would allow "materializing" OCI artifacts as a mounted volume in a container, effectively making the "files" inside an OCI artifact available for inference, say in a running container of a model server. Is this a fair summary?

For reference, KServe implements a workaround for directly accessing files within an OCI image via a sidecar approach ("modelcar"): it leverages root filesystem access via the /proc filesystem when shareProcessNamespace: true is set on the Pod. You can find details in the KServe documentation and in the Design Document. It actually implements the desired behavior with current means, but of course it is more or less just a workaround for the missing OCI volume type (as discussed already a long time ago in https://github.com/kubernetes/kubernetes/issues/831).
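To illustrate the mechanism only, not KServe's actual implementation, here is a minimal sketch of a Pod with a shared process namespace and of the path under which one container can see another container's root filesystem. The container names, the `sleep infinity` command (which assumes the model image ships a `sleep` binary) and both helper names are assumptions made for the example. In practice, reading `/proc/<pid>/root` typically also requires compatible user IDs or extra privileges.

```python
import json


def build_modelcar_pod(model_image: str, server_image: str) -> dict:
    """Pod where a sidecar built from the model image stays alive so that
    its root filesystem is reachable from the server container."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "modelcar-demo"},
        "spec": {
            # Lets containers in the Pod see each other's processes in /proc.
            "shareProcessNamespace": True,
            "containers": [
                {
                    "name": "modelcar",
                    "image": model_image,
                    # Assumption: keep the sidecar running; it does no work.
                    "command": ["sleep", "infinity"],
                },
                {"name": "server", "image": server_image},
            ],
        },
    }


def path_via_proc(pid: int, path_in_sidecar: str) -> str:
    """Path, as seen from another container in the same Pod, of a file
    inside the root filesystem of the sidecar process `pid`."""
    return f"/proc/{pid}/root/" + path_in_sidecar.lstrip("/")


if __name__ == "__main__":
    pod = build_modelcar_pod(
        "registry.example.com/models/mnist:v1", "example/model-server:latest"
    )
    print(json.dumps(pod, indent=2))
    print(path_via_proc(42, "/models/model.onnx"))
```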

tarilabs commented 6 months ago

> For reference, KServe implements a workaround for directly accessing files within an OCI image via a sidecar approach ("modelcar"): it leverages root filesystem access via the /proc filesystem when shareProcessNamespace: true is set on the Pod. You can find details in the KServe documentation and in the Design Document. It actually implements the desired behavior with current means, but of course it is more or less just a workaround for the missing OCI volume type (as discussed already a long time ago in kubernetes/kubernetes#831).

Thank you @rhuss, to me it is about providing user choice, given an opportunity to have an OCI Artifact with an ML model asset:

wdyt?

rchincha commented 3 months ago

https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-image-volume-source/ ^ fyi
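For readers who have not opened the post: the image volume source lets a Pod mount an OCI image or artifact directly as a read-only volume. Below is a small sketch of such a Pod, assuming a cluster with the alpha `ImageVolume` feature gate enabled and a container runtime that supports it. The Pod name, mount path and helper name are made up for the example.

```python
import json


def build_image_volume_pod(model_ref: str, server_image: str) -> dict:
    """Pod that mounts an OCI reference as a volume for the model server."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "image-volume-demo"},
        "spec": {
            "containers": [
                {
                    "name": "server",
                    "image": server_image,
                    "volumeMounts": [{"name": "model", "mountPath": "/mnt/models"}],
                }
            ],
            "volumes": [
                {
                    "name": "model",
                    # The `image` volume source introduced as alpha in Kubernetes 1.31.
                    "image": {"reference": model_ref, "pullPolicy": "IfNotPresent"},
                }
            ],
        },
    }


if __name__ == "__main__":
    pod = build_image_volume_pod(
        "registry.example.com/models/mnist:v1", "example/model-server:latest"
    )
    print(json.dumps(pod, indent=2))
```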

tarilabs commented 3 months ago

Thank you @rchincha , we indeed noted that blog post as well :)

Fyi, we have it in our live-roadmap as a proposal for integration as a preferred storage solution for the ML model, to complement current Model Registry.

Orthogonal research work in this area is captured here.