zijianjoy opened 2 years ago
I would clarify it a bit more:
It is open source and uses a license that can be accepted by KFP, something similar to Apache 2.0.
An orchestration engine (like Argo Workflows) can interact with this object store (S3 compatibility).
It has fine-grained S3 permission control, e.g. ${aws:username} (https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_variables.html#policy-vars-wheretouse). Please see https://github.com/kubeflow/pipelines/pull/7725 for an example implementation, and the policy sketch after this list.
It has the same or a similar feature set as MinIO. We need at least one bucket called mlpipeline, plus multiple users and policies to manage which user can access which folders. Easy scalability would also be good.
Each container must run as non-root and unprivileged.
https://www.zenko.io/cloudserver/ : no user management at all in the free version.
https://longhorn.io/ : does it even support S3? https://github.com/longhorn/longhorn/issues/353
https://github.com/openstack/swift : I cannot find information about ACLs.
ceph-rook: it seems to support ${aws:username}, but only in bucket policies, so pipeline definitions must not be stored on S3 then. It should reside in a separate namespace since it runs a lot of pods. Nevertheless, it is the most promising software.
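To make the fine-grained permission requirement concrete, here is a minimal sketch of the kind of per-user policy meant above, applied with the AWS SDK for Go against any S3-compatible endpoint that evaluates ${aws:username} in bucket policies. The bucket name, prefix, and action list are illustrative assumptions, not an agreed-upon layout:

```go
// Sketch: confine each authenticated user to their own prefix via the
// ${aws:username} policy variable. The object store substitutes the
// variable at evaluation time, so one policy covers all users.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// Hypothetical per-user policy for the shared mlpipeline bucket.
const perUserPolicy = `{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["*"]},
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": ["arn:aws:s3:::mlpipeline/private-artifacts/${aws:username}/*"]
  }]
}`

func main() {
	// Endpoint and credentials come from the usual AWS environment/config.
	sess := session.Must(session.NewSession())
	svc := s3.New(sess)

	_, err := svc.PutBucketPolicy(&s3.PutBucketPolicyInput{
		Bucket: aws.String("mlpipeline"),
		Policy: aws.String(perUserPolicy),
	})
	if err != nil {
		log.Fatalf("putting bucket policy: %v", err)
	}
}
```

With a policy like this, user isolation is enforced by the object store itself instead of sharing one set of root credentials across all profiles.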
@zijianjoy, is the MinIO license change to AGPLv3 the only motivation for looking for alternatives? As you wrote, "using AGPL software requires that anything it links to must also be licensed under the AGPL", but this should not be a limitation for upgrading to the latest MinIO: the ObjectStoreInterface interface can be implemented by invoking some S3 client, which can be shipped with the apiserver image. That way KFP will not depend on a specific object store service, as long as the service supports the S3 API.

MinIO is only used as a storage gateway. That feature has been deprecated in MinIO for over a year: https://blog.min.io/deprecation-of-the-minio-gateway/
To me the simplest replacement is to use an actual S3 client (which hopefully is also an S3-compatible client, because not everyone is on AWS). It needs to support custom endpoints, regions, and path-style or virtual-host-style URLs, as well as the various auth methods. The problem with that is that it obviously requires an actual object store service to run... which is where MinIO could still be used as the in-cluster object store solution (but not as a gateway).
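For illustration, here is a minimal sketch of those client options using the AWS SDK for Go; the endpoint, region, and credentials are placeholders and would come from configuration in practice:

```go
// Sketch: one generic S3 client that works against MinIO, Ceph RGW, or any
// other S3-compatible service, because endpoint, region, addressing style,
// and credentials are all explicit options rather than AWS defaults.
package objectstore

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func newObjectStoreClient() *s3.S3 {
	sess := session.Must(session.NewSession(&aws.Config{
		Endpoint:         aws.String("http://minio-service.kubeflow:9000"), // placeholder in-cluster endpoint
		Region:           aws.String("us-east-1"),                          // must be configurable, not hardcoded
		S3ForcePathStyle: aws.Bool(true),                                   // path-style vs. virtual-host-style URLs
		Credentials:      credentials.NewStaticCredentials("ACCESS_KEY", "SECRET_KEY", ""),
	}))
	return s3.New(sess)
}
```

Pointing this client at a different backend is then purely a configuration change, which is exactly the flexibility the current client code lacks.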
No, your assessment is not correct. MinIO usage differs by distribution, and the most important default one used by Kubeflow is NOT the gateway mode. We need a reliable storage and gateway replacement. For further information, read the whole conversation here: https://github.com/kubeflow/pipelines/pull/7725
My bad, I wasn't clear. MinIO is used as the local storage in the vanilla distro, and likely that's what is used on-premises when no storage backend is defined or available; otherwise it is used as a 'gateway' for pretty much every cloud distro, whether in actual gateway mode or against an S3-compatible backend. The problem is that it is old and doesn't support several options, which makes it hard to work with clouds other than AWS. One of the main issues for me is that it doesn't support the region parameter. Anyway, regardless of how it is used, my suggestion was to update the object storage client code in pipelines to use a native, modern, S3-compatible client that supports all options, so there is no need for MinIO at all when using a cloud object storage backend. For the vanilla distro that still requires an in-cluster storage solution, MinIO's S3 compatibility still does the job, and it's only a matter of pointing the S3 client at it.
I agree with @streamnsight: the ObjectStoreInterface interface can be implemented using an S3-compatible client, either in Go or by invoking a standalone binary. There should be no MinIO dependency in the KFP code. We can still use MinIO, Rook, or whatever object store we choose.
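As a rough illustration, a simplified S3-backed implementation could look like the sketch below. The method set is an assumption for the sake of the example, not the actual ObjectStoreInterface definition in the KFP codebase; the point is that it depends only on the generic S3 API:

```go
// Sketch of an S3-backed object store; any service speaking the S3 API
// (MinIO, Ceph RGW, AWS S3, ...) can sit behind it.
package objectstore

import (
	"bytes"
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3iface"
)

// Illustrative method set, not the real KFP interface definition.
type ObjectStoreInterface interface {
	AddFile(file []byte, filePath string) error
	GetFile(filePath string) ([]byte, error)
	DeleteFile(filePath string) error
}

type s3ObjectStore struct {
	client s3iface.S3API // satisfied by *s3.S3, or by a mock in tests
	bucket string
}

func (o *s3ObjectStore) AddFile(file []byte, filePath string) error {
	_, err := o.client.PutObject(&s3.PutObjectInput{
		Bucket: aws.String(o.bucket),
		Key:    aws.String(filePath),
		Body:   bytes.NewReader(file),
	})
	return err
}

func (o *s3ObjectStore) GetFile(filePath string) ([]byte, error) {
	out, err := o.client.GetObject(&s3.GetObjectInput{
		Bucket: aws.String(o.bucket),
		Key:    aws.String(filePath),
	})
	if err != nil {
		return nil, err
	}
	defer out.Body.Close()
	return io.ReadAll(out.Body)
}

func (o *s3ObjectStore) DeleteFile(filePath string) error {
	_, err := o.client.DeleteObject(&s3.DeleteObjectInput{
		Bucket: aws.String(o.bucket),
		Key:    aws.String(filePath),
	})
	return err
}
```

Accepting s3iface.S3API instead of a concrete client also keeps the implementation easy to mock in unit tests.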
This issue here is about the server, not the client.
"For the vanilla distro that still requires a in-cluster storage solution, Minio S3 compat still does the job, and it's only a matter of pointing the S3 client to it." No, that is the reason for this issue. We need a server side replacement.
"This issue here is about the server, not the client." Currently there is a dependency between the server and the client; the server cannot be changed easily without client modifications. The first step is to replace the client with a generic S3-compatible client, without changing the server. After that we can replace the server with any object store that supports the S3 API.
Why do you think that the MinIO server needs to be replaced? According to the issue description, "using AGPL software requires that anything it links to must also be licensed under the AGPL", but obviously the MinIO server does not link to any KFP code. The MinIO client links to KFP, but as it remains under the Apache 2.0 License, there are no licensing restrictions on using the latest MinIO server and client versions.
@tzstoyanov "but as it remains under Apache 2.0 License, there is no licensing restrictions to use the latest MinIO server and client versions." Googles Lawyers say otherwise. I have discussed it with their team several times.
Please also read everything from https://github.com/kubeflow/pipelines/pull/7725 to fully understand the long-term goals for the artifact storage within Kubeflow.
You can already use the MinIO server with most S3 clients, but yes, it may be problematic to use the MinIO client with other S3 servers.
We are always looking for volunteers. I have already mentored several people who now have their first commits to the Kubeflow project.
@juliusvonkohout, I looked at the comments on #7725. Thanks for pointing me to that; now I have a better idea of the problem and the work you did in the context of that PR. I have a few questions, and will be happy if you or someone from the community can answer:
"Google's lawyers say otherwise. I have discussed it with their team several times."
I wonder if these discussions were public? I'm curious to see their arguments against this specific use case. I'm not a license expert, but the AGPL restrictions are pretty clear. From my understanding, and according to the description of this specific issue, I cannot see any license violation in our use case: we do not modify any AGPL code, nor do we link to any AGPL code. Moreover, the KFP image is already based on Debian, and a lot of GPL binaries are distributed as part of it.
"New image has to pass license review." Do you know what that license review process is, and is it described somewhere?
In any case, replacing the MinIO client with some generic S3 client is a good idea. This will make KFP more flexible and is the first step toward replacing the MinIO server. Do you know if there is a specific issue about the client?
I can contribute to that: implement the ObjectStoreInterface interface using a generic S3 client. We can use the official AWS SDK, or invoke a command-line client such as s5cmd or s3cmd, as sketched below.
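For the standalone-binary variant, here is a minimal sketch of shelling out to s5cmd from Go. It assumes s5cmd is on PATH and picks up credentials from the usual AWS environment variables; the endpoint URL, bucket, and file paths are placeholders:

```go
// Sketch: delegate the actual S3 transfer to the s5cmd binary, so the
// apiserver depends only on the S3 protocol, not on any client library.
package main

import (
	"log"
	"os/exec"
)

func uploadWithS5cmd(localPath, s3URL string) error {
	cmd := exec.Command("s5cmd",
		"--endpoint-url", "http://minio-service.kubeflow:9000", // placeholder endpoint
		"cp", localPath, s3URL)
	out, err := cmd.CombinedOutput()
	if err != nil {
		log.Printf("s5cmd output: %s", out)
	}
	return err
}

func main() {
	if err := uploadWithS5cmd("./model.tar.gz", "s3://mlpipeline/artifacts/model.tar.gz"); err != nil {
		log.Fatal(err)
	}
}
```

The trade-off is an extra binary in the apiserver image versus no Go dependency on any particular S3 SDK.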
@tzstoyanov please reach out on LinkedIn or Slack for discussion. I am already working with one of your colleagues, @difince: https://kccnceu2023.sched.com/event/1HyY8/hardening-kubeflow-security-for-enterprise-environments-julius-von-kohout-dhl-deutsche-telekom-diana-dimitrova-atanasova-vmware
Maybe these here are lower-hanging fruit: https://github.com/kubeflow/kubeflow/pull/7032#issuecomment-1505277720. We really need to focus on what to work on first, because getting stuff into KFP is difficult.
Do we have any update on this? :)
CubeFS looks promising.
@gsoec can you articulate why CubeFS looks promising?
I think we need an assessment similar to what @juliusvonkohout did in https://github.com/kubeflow/pipelines/issues/7878#issuecomment-1169070899. Based on that, ceph-rook appears to be the most relevant comparison. From what I see there, it appears to be the favored solution...
@lehrig please take a look at the last comments of https://github.com/kubeflow/pipelines/pull/7725. I think we need to use Istio or something else for the authentication part, since only a few S3 providers fully support enterprise-level user management and authorization. Furthermore, we could then plug and play any basic S3-compatible storage backend and get rid of passwords and the necessary rotation altogether.
Just so you all know, sticking to this release means carrying three different security vulnerabilities that are rated high or critical.
So anyone running this release would be exposed to these vulnerabilities, potentially resulting in data loss.
What does it take for Kubeflow to upgrade the MinIO container version? I can help.
AGPL is not allowed, so Google denied an update. Please join the meetings at https://www.kubeflow.org/docs/about/community/ to discuss it, especially the KFP meeting.
Another alternative could be https://ozone.apache.org
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
not stale
Another alternative could be https://ozone.apache.org. It might be suitable for the multi-bucket approach: https://ozone.apache.org/docs/1.3.0/feature/s3-multi-tenancy-access-control.html
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen
We are actively working on this.
Feature Area
/area backend /area frontend
What is the use case or pain point?
Currently, KFP uses MinIO as the default object store for artifact payloads and pipeline templates. However, MinIO fully changed its license to AGPLv3 in 2021. This change has prevented KFP from upgrading to the latest MinIO: using AGPL software requires that anything it links to must also be licensed under the AGPL. Since we are not able to adopt the latest changes from MinIO, we are seeking alternatives to replace it in future KFP deployments.
What feature would you like to see?
This new object store solution should have the following properties:
Exploration
We are currently considering the following options, and we would like to hear your opinions as well.
cc @chensun @IronPan @james-jwu
Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.