kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0
3.63k stars 1.63k forks

S3 support in Kubeflow Pipelines #3405

Closed Jeffwan closed 8 months ago

Jeffwan commented 4 years ago

Creating this parent issue to collect all known S3 problems in KFP; I would also like to discuss more feature requests in this ticket.

Use cases

Replace Minio with S3 for Argo artifact store and KFP Pipeline store.

[] Manifest changes in kubeflow/manifests and the standalone kubeflow/pipelines manifests to easily switch from minio to S3
[] IRSA - bump the minio-sdk-go version to support IRSA
[] IRSA - bump the Argo Workflows version to make sure the runner can persist artifacts to S3

Related issues and PRs

UI

[x] KFP UI should be able to read source files in S3, for example:

mlpipeline-ui-metadata -> minio://mlpipeline/artifacts/pipeline-A/pipeline-A-run-id/mlpipeline-ui-metadata.tgz

{
  "outputs": [
    {
      "source": "s3://your_bucket/README.md",
      "type": "markdown"
    }
  ]
}

[] KFP UI should be able to read artifact files in S3

mlpipeline-ui-metadata -> s3://mlpipeline/artifacts/pipeline-A/pipeline-A-run-id/mlpipeline-ui-metadata.tgz

{
  "outputs": [
    {
      "source": "s3://your_bucket/README.md",
      "type": "markdown"
    }
  ]
}

[] IRSA - Bump minio-js version to support IRSA

SDK

[x] User can declare a custom S3 artifact location inside the pipeline.
[x] User can apply AWS credentials to pipeline pods to get access to S3.
[x] User can specify a service account for the pipeline. (IRSA)

More examples

/kind improvement /area frontend /area backend

/cc @eterna2 @discordianfish

k8s-ci-robot commented 4 years ago

@Jeffwan: The label(s) kind/improvement cannot be applied, because the repository doesn't have them

In response to [this](https://github.com/kubeflow/pipelines/issues/3405). Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
Ark-kun commented 4 years ago

SDK: Use can declare custom S3 artifact location inside the pipeline.

This is deprecated. The supported method is to configure the artifact location in the workflow-configmap.

There is also an option to use Minio gateway to proxy requests to the actual storage system (Google Cloud Storage or Amazon S3). WDYT?
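For reference, the configmap-based setup mentioned above looks roughly like this (a sketch only; the bucket, region, keyFormat, and secret names are placeholders, and the exact fields should be checked against the Argo workflow-controller documentation for your version):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: kubeflow
data:
  config: |
    artifactRepository:
      s3:
        bucket: my-kfp-artifacts          # placeholder bucket name
        endpoint: s3.amazonaws.com
        region: us-west-2                 # placeholder region
        insecure: false
        keyFormat: "artifacts/{{workflow.name}}/{{pod.name}}"
        accessKeySecret:                  # omit when using instance roles / IRSA
          name: mlpipeline-minio-artifact
          key: accesskey
        secretKeySecret:
          name: mlpipeline-minio-artifact
          key: secretkey
```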

Jeffwan commented 4 years ago

@Ark-kun artifact location in workflow-configmap works for us now. Do you have any docs or code reference on minio gateway?

Ark-kun commented 4 years ago

Do you have any docs or code reference on minio gateway?

Here is the Minio doc: https://docs.minio.io/docs/minio-gateway-for-s3.html
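For reference, running MinIO in S3 gateway mode is essentially a one-liner (sketch; the credentials are placeholders and must map to an IAM identity with access to the target buckets):

```shell
# MinIO proxies incoming MinIO/S3 API calls through to Amazon S3.
export MINIO_ACCESS_KEY=<aws-access-key-id>      # placeholder
export MINIO_SECRET_KEY=<aws-secret-access-key>  # placeholder
minio gateway s3 https://s3.amazonaws.com
```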

eterna2 commented 4 years ago

@Jeffwan

Update the metadata writer service to support S3 artifacts. The metadata writer service currently hardcodes the metadata artifact URL to minio. We need to investigate how the metadata service can infer whether the artifact store endpoint is S3 or minio.

https://github.com/kubeflow/pipelines/blob/6b358d09a763d2b172c49cd389f34c5c683daa9b/backend/metadata_writer/src/metadata_writer.py#L85

See also
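The inference described above could look something like the following (a hypothetical sketch, not the actual metadata_writer code; the endpoint defaults are placeholders):

```python
from urllib.parse import urlparse

# Hypothetical helper: infer which object-store endpoint to record based on
# the artifact URI scheme, instead of hardcoding the in-cluster MinIO service.
def artifact_endpoint(uri: str,
                      minio_endpoint: str = "minio-service.kubeflow:9000",
                      s3_endpoint: str = "s3.amazonaws.com") -> str:
    scheme = urlparse(uri).scheme
    if scheme == "s3":
        return s3_endpoint
    if scheme == "minio":
        return minio_endpoint
    raise ValueError(f"unsupported artifact scheme: {scheme!r}")
```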

Ark-kun commented 4 years ago

The metadata writer service currently hardcodes the metadata artifact URL to minio.

The frontend translates this to ReadArtifact backend calls.

Regarding S3, I think there are two distinct cases:

- Artifacts, whose URIs should be based on the workflow-controller-configmap S3 bucket configuration.
- mlpipeline-ui-metadata entries, which might have any kind of URIs.

Jeffwan commented 4 years ago

@Ark-kun the minio gateway seems to be an extra layer on top of minio? Users need to run the minio service in gateway mode to delegate requests to S3. Is that correct? I think that solves some of the requirements and is fairly easy, but it adds an extra layer.

mlpipeline-ui-metadata entries that might have any kind of URIs.

Yes. The problem is that it's always minio here; if a user wants to replace minio with S3, it currently doesn't work well.

discordianfish commented 4 years ago

I'm confused as well. If somebody could outline what steps need to be taken to support s3, I can help implementing them. But reading the comments here leaves me none the wiser.

rmgogogo commented 4 years ago

https://github.com/e2fyi/kubeflow-aws/tree/master/pipelines would this Kustomization help?

As for the GCS-backed setup, it's here: https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/sample

Ark-kun commented 4 years ago

@Ark-kun the minio gateway seems to be an extra layer on top of minio? Users need to run the minio service in gateway mode to delegate requests to S3. Is that correct? I think that solves some of the requirements and is fairly easy, but it adds an extra layer.

Yes. It serves as a proxy. The advantage of using it is that all services (UX, backend, Argo, etc.) only need to talk to one kind of service: the in-cluster minio. And the configuration (credentials, etc.) can be specified in one place. @IronPan or @numerology might have more information.

Jeffwan commented 4 years ago

@Ark-kun I checked the gateway and ran some tests; it works fine. I can file a PR for the AWS overlay. The remaining question is that some users don't want to use minio as a gateway: do we want to support direct object store access? The PRs listed above aim to solve this problem.

@rmgogogo the e2fyi repo cannot replace minio entirely; it still needs some changes.

discordianfish commented 4 years ago

Isn't minio supposed to be compatible with S3? I don't understand the need for the gateway.

How do we authenticate clients connecting to the minio gateway? Does that require an additional layer of authentication? Or is this enforced by istio on a networking level?

Not convinced this is better than granting individual components IAM permissions.

Jeffwan commented 4 years ago

@discordianfish The gateway is fairly easy with the current manifest structure. KFP just needs S3 permission, and minio becomes a proxy to S3. There's no need to make code changes everywhere to check the protocol; if you look at the PRs above, it's kind of painful right now. Otherwise we also need to provide credentials or attach roles to every KFP component that talks to S3.

I agree this won't meet everyone's needs; we definitely also need low-level object storage configuration.

discordianfish commented 4 years ago

@Jeffwan What I don't understand is why we need to distinguish between minio and s3 at all, instead of just tracking the bucket name and the endpoint. The minio client would be configured with either auth credentials or fall back to the instance role. That should be the same config for all components, no?

But okay, if people want to go down the gateway route: How do all components that need s3/minio access authenticate against the gateway? Not using auth here would be a no go for us.

Jeffwan commented 4 years ago

@discordianfish

What I don't understand is why we need to distinguish between minio and s3 at all, instead of just tracking the bucket name and the endpoint.

I see. It's not consistent across all components. KFP has lots of references with a fixed protocol like minio://, which won't work for S3.

But okay, if people want to go down the gateway route: How do all components that need s3/minio access authenticate against the gateway? Not using auth here would be a no go for us.

The only difference between minio and the minio gateway to S3 is that the user needs to give the minio pod credentials with an S3 policy; an instance profile will work as well. I have not checked IRSA or kube2iam for fine-grained control yet.

eterna2 commented 4 years ago

Just want to add a note.

One main reason why we moved to S3 instead of the minio gateway (which was our original deployment) is that tensorboard doesn't work with minio.

It only supports GCS and S3. It was using boto3 rather than the minio client the last time I checked (maybe 6 months ago?).

Other than that, I think the minio gateway is fine as long as the throughput is fine. But the concern is that you can't have fine-grained RBAC control with the minio gateway.

And it is confusing to our data scientists when they have to mix S3 and minio (i.e. S3 for our data, minio for data generated within KFP).

discordianfish commented 4 years ago

I see. It's not consistent across all components. KFP has lots of references with a fixed protocol like minio://, which won't work for S3.

Yes, but why? The protocol part of that URL is just for the client to know where to connect. So for both minio and S3, I suggest we always use the endpoint, bucket name, and path to refer to an object. It shouldn't matter to the client whether it is S3 or minio. Does that make sense?

The only difference between minio and the minio gateway to S3 is that the user needs to give the minio pod credentials with an S3 policy; an instance profile will work as well. I have not checked IRSA or kube2iam for fine-grained control yet.

I understand that, but we need to authenticate access from Kubeflow components to the minio gateway. If I understand you correctly, with the minio gw everything that is able to reach minio will have access to all S3 content without auth. That's not acceptable for us.

Jeffwan commented 4 years ago

I see. It's not consistent across all components. KFP has lots of references with a fixed protocol like minio://, which won't work for S3.

Yes, but why? The protocol part of that URL is just for the client to know where to connect. So for both minio and S3, I suggest we always use the endpoint, bucket name, and path to refer to an object. It shouldn't matter to the client whether it is S3 or minio. Does that make sense?

https://github.com/kubeflow/pipelines/pull/3531/files#diff-4e5a9c52bbed342532f527bae3aae002R15-R20 This is one example; a lot of code relies on the protocol for custom logic. I get your point, and that would be the ideal case, but we don't always have the endpoint passed along with the artifact URL everywhere. If there were a refactor to have an object with endpoint, bucket, and path, then all the customization could be changed to rely on the endpoint. It would work, but it takes more effort.

The only difference between minio and the minio gateway to S3 is that the user needs to give the minio pod credentials with an S3 policy; an instance profile will work as well. I have not checked IRSA or kube2iam for fine-grained control yet.

I understand that, but we need to authenticate access from Kubeflow components to the minio gateway. If I understand you correctly, with the minio gw everything that is able to reach minio will have access to all S3 content without auth. That's not acceptable for us.

Right, that's what @eterna2 mentioned as well. Fine-grained control doesn't work with the minio gateway.

discordianfish commented 4 years ago

If there were a refactor to have an object with endpoint, bucket, and path, then all the customization could be changed to rely on the endpoint. It would work, but it takes more effort.

Yes, that's what I'm suggesting. Using the gateway, and therefore allowing every pod in the cluster (including user notebooks etc.) access to any S3 resource via the gateway without further auth, is (hopefully?) a no-go for most users.
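The endpoint+bucket+path idea discussed here could be sketched as a small value object (hypothetical; these are not KFP's actual types, and the example endpoints are placeholders):

```python
from dataclasses import dataclass

# Sketch: identify an artifact purely by endpoint + bucket + key, so no
# component needs minio:// vs s3:// special-casing.
@dataclass(frozen=True)
class ArtifactLocation:
    endpoint: str   # e.g. "minio-service.kubeflow:9000" or "s3.amazonaws.com"
    bucket: str
    key: str
    secure: bool = True

    def object_url(self) -> str:
        # The scheme only says how to reach the endpoint, not what kind of
        # store sits behind it.
        scheme = "https" if self.secure else "http"
        return f"{scheme}://{self.endpoint}/{self.bucket}/{self.key}"
```

Any S3-compatible client could then be pointed at `endpoint`, with credentials resolved from its normal chain, regardless of whether the store is minio, S3, or Ceph.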

xaniasd commented 4 years ago

I have used the minio gateway as a front for Azure Blob Storage, so it does have value until that protocol is supported as well. But I'm definitely in favor of making minio optional ASAP.

discordianfish commented 4 years ago

So what would be the right way to get consensus on how to go forward? Is this gw based approach on the roadmap? Do we want/need to go down that route first, before eventually having native support?

To address the access control concern, another alternative would be running the gw as a sidecar in each pod that needs it. But at that point we could probably just refactor this to track endpoint+bucket+path.

discordianfish commented 4 years ago

So this becomes a bigger concern for us. Can I submit a PR that would make the endpoint configurable? At least as a stop gap? It seems like that would be sufficient to allow the frontend to fetch the outputs. Currently I'm getting:

Response for path: artifacts/get?source=minio&bucket=xx&key=yy%2Fzz%2Fmlpipeline-ui-metadata.tgz was not 'ok'

Response was: Failed to get object in bucket xx at path yy/zz/mlpipeline-ui-metadata.tgz: Error: getaddrinfo ENOTFOUND minio-service.kubeflow
discordianfish commented 4 years ago

Hrm actually it looks like I've configured the apiserver already in /etc/ml-pipeline-config/config.json:

{
  "DBConfig": {
    "DriverName": "mysql",
    "DataSourceName": "",
    "DBName": "mlpipeline"
  },
  "ObjectStoreConfig": {
    "Secure": true,
    "Host": "s3.amazonaws.com",
    "BucketName": "xx"
  },
  "InitConnectionTimeout": "6m",
  "DefaultPipelineRunnerServiceAccount": "pipeline-runner",
  "ML_PIPELINE_VISUALIZATIONSERVER_SERVICE_HOST": "ml-pipeline-ml-pipeline-visualizationserver",
  "ML_PIPELINE_VISUALIZATIONSERVER_SERVICE_PORT": 8888
}
eterna2 commented 4 years ago

So this becomes a bigger concern for us. Can I submit a PR that would make the endpoint configurable? At least as a stop gap? It seems like that would be sufficient to allow the frontend to fetch the outputs. Currently I'm getting:

Response for path: artifacts/get?source=minio&bucket=xx&key=yy%2Fzz%2Fmlpipeline-ui-metadata.tgz was not 'ok'

Response was: Failed to get object in bucket xx at path yy/zz/mlpipeline-ui-metadata.tgz: Error: getaddrinfo ENOTFOUND minio-service.kubeflow

It is configurable. The frontend UI has a Node.js server which does a few things: mostly proxying to ml-pipeline, but artifact retrieval is done in the Node.js server itself.

https://github.com/kubeflow/pipelines/blob/master/frontend/server/configs.ts

I am thinking there are many places to configure in KFP. Should we have a unified ConfigMap for everything? Or a CRD to configure the KFP deployment?

discordianfish commented 4 years ago

Ah, thanks! Maybe consider moving this functionality into the apiserver? It seems excessive to have pipelines split into a frontend, a frontend server, and a backend server.

SatwikBhandiwad commented 4 years ago

I am trying to use ceph in place of minio.

I have set appropriate values in ml-pipeline-config and workflow-controller-configmap configmap.

I have also set the env variables below in the ml-pipeline-ui deployment: ARGO_ARCHIVE_ARTIFACTORY='s3', AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY

When I run the pipeline I can see the data is stored in ceph (verified by connecting via client)

But from UI the source still shows minio: pipeline/artifacts/get?source=minio&bucket=mlpipelinejun8&key=artifacts%2Fconditional-execution-pipeline-f9m2m%2Fconditional-execution-pipeline-f9m2m-4255012536%2Fflip-coin-output.tgz

Also, when I manually changed the source in the above URI to s3, it goes to some default port and not the one configured in the configmap: Failed to get object in bucket mlpipelinejun8 at path artifacts/conditional-execution-pipeline-f9m2m/conditional-execution-pipeline-f9m2m-4255012536/flip-coin-output.tgz: Error: connect ECONNREFUSED xx.xx.xx.xx:443

Am I missing something? @eterna2 @discordianfish @Jeffwan

eterna2 commented 4 years ago

@SatwikBhandiwad I am not very familiar with Ceph. Is it fully compatible with minio/s3, i.e. will the minio client work with Ceph?

You should probably look at these env vars instead:

    MINIO_ACCESS_KEY = 'minio',
    MINIO_SECRET_KEY = 'minio123',
    MINIO_PORT = '9000',
    MINIO_HOST = 'minio-service',
    MINIO_NAMESPACE = 'kubeflow',
    MINIO_SSL = 'false',

Because for S3, we hardcoded the official AWS endpoint (actually I forgot what I wrote; it is either hardcoded or some regex check). The differentiation is specifically for AWS IAM instance profile credentials (i.e., if you deploy in AWS, we can get the credentials from the metadata endpoint directly).

In a non-AWS env, you should be using the MINIO_* settings. There is no integration with the workflow controller configmap (sadly; maybe we should? I will raise a ticket for this). So you will need to set the 6 env vars to be the same as the configmap.

ARGO_ARCHIVE_ARTIFACTORY is specifically for the pod log archive, and only as a failover when we can't get the pod logs in the usual manner (from k8s or from the wf CR status).
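Concretely, matching the configmap usually means setting something like the following on the ml-pipeline-ui container (a sketch; the Ceph endpoint, port, and secret names are placeholders):

```yaml
env:
  - name: MINIO_HOST
    value: ceph-rgw.storage.svc      # placeholder: your S3-compatible endpoint
  - name: MINIO_PORT
    value: "7480"                    # placeholder port
  - name: MINIO_NAMESPACE
    value: ""                        # leave empty when the host is fully qualified
  - name: MINIO_SSL
    value: "false"
  - name: MINIO_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: object-store-secret    # placeholder secret
        key: accesskey
  - name: MINIO_SECRET_KEY
    valueFrom:
      secretKeyRef:
        name: object-store-secret
        key: secretkey
```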

SatwikBhandiwad commented 4 years ago

@eterna2 Thanks for the info. I configured those env vars and it worked.

I did not understand the workflow controller configmap part

There is no integration with the workflow controller configmap (sadly, maybe we should? - I will raise a ticket for this).

I have set workflow controller configmap with appropriate values and ran a workflow creating artifacts and it worked fine. Am I missing something?

eterna2 commented 4 years ago

Nope, you are not missing anything. I am just commenting that the problem you encountered is not uncommon: missing the correct configs.

E.g. ml-pipeline, ml-pipeline-ui, and Argo (aka the workflow controller) all share the same artifact repository settings, but you still need to configure them separately.

Ideally, all KFP components should be able to read the configmap to learn the artifact repository settings.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

discordianfish commented 4 years ago

@Jeffwan What's the current state of this?

lucinvitae commented 4 years ago

Creating this parent issue to collect all known S3 problems in KFP; I would also like to discuss more feature requests in this ticket.

I'm adding a link to tie this "parent" S3 support issue to a related "child" issue we recently encountered with getting tensorboard visualizations that specify S3 sources to work within KFP using AWS IAM roles: https://github.com/kubeflow/pipelines/issues/4364

It took our team a few months of trial-and-error configuration changes; we periodically abandoned and then resumed the task of getting tensorboard visualizations working between KFP version bumps (we're using v1.0.1 now).

I think the biggest gap that this tensorboard issue surfaced is missing documentation, examples, and support for configuring KFP to work on top of AWS rather than GCP. In addition, there appear to be some general inaccuracies in the Python SDK documentation (see the boldface portion of my comment here), which further complicated things.

Jeffwan commented 4 years ago

@discordianfish There are some dependency issues on the minio JavaScript side with using IAM Roles for Service Accounts. If you use credentials, there are no blocking issues.

Jeffwan commented 4 years ago

@lucinvitae Nice work, thanks for the contribution. I see the issue you filed.

jphuynh commented 4 years ago

What's the current state of S3 support in Kubeflow Pipelines? It's a little hard to follow, with some work that seems to have been done towards it and the discussion around using minio as a gateway.

I'm still struggling to get ml-pipeline to initialize correctly against S3 instead of minio, using IRSA instead of AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, and I can't seem to get more logs than this even when running the apiserver with -v=10:

$ kubectl logs -f ml-pipeline-7c6d7486d5-dglmh -n kubeflow
I1124 14:26:23.480484       6 client_manager.go:134] Initializing client manager
I1124 14:26:23.480626       6 config.go:50] Config DBConfig.ExtraParams not specified, skipping
F1124 14:26:23.884780       6 client_manager.go:356] Failed to check if Minio bucket exists. Error: Access Denied.

Is IRSA supposed to be supported, as suggested in e2fyi/kubeflow-aws?

initMinioClient in pipelines/backend/src/apiserver/client_manager.go seems to suggest that it isn't. Or am I missing something obvious?

Jeffwan commented 4 years ago

@jphuynh see https://www.kubeflow.org/docs/aws/pipeline/. Besides IRSA, all the rest of the support has been added. IRSA requires minio client support; I think the KFP service backend supports it as well. However, the JavaScript part is not supported (https://github.com/minio/minio-js/issues/841), which means the pipeline UI won't be able to use it at this moment.

jphuynh commented 4 years ago

Thanks for your quick answer @Jeffwan.

I think KFP service backend supports it as well.

I don't think it does. It looks like the project is using github.com/minio/minio-go v6.0.14+incompatible, and we would need at least v6.0.45:

https://github.com/minio/minio-go/commit/a544e9f394b0e3e9a6f2b97da1084dc3a559b4b0

I'm not sure if it would be easy to bump the version.

Jeffwan commented 4 years ago

@jphuynh My mistake, I thought it got merged. Yeah, the backend side needs to bump the minio version as well. I documented it in this issue: https://github.com/kubeflow/pipelines/issues/3398. I will see if I can find time to continue on the stale PR.

PatrickXYS commented 4 years ago

Let me put some effort in here to solve the issue.

PatrickXYS commented 4 years ago

/assign

karlschriek commented 3 years ago

I would like to bump this. Has there been any progress here, @PatrickXYS? At the moment it is really hard to figure out what is needed to make pipelines work with S3. There seem to be several approaches that community members have hacked together themselves, but no clear consensus yet on the standard way to do this.

HassanOuda commented 3 years ago

Also, why is this line hardcoding the S3 endpoint as the Amazon endpoint? We have a different S3 instance that is not in Amazon, and the minio endpoint doesn't work.

https://github.com/kubeflow/pipelines/blob/1577bdb41913613f6268366b6e6e20fdfddde693/backend/metadata_writer/src/metadata_writer.py#L89

and here:

https://github.com/kubeflow/pipelines/blob/1577bdb41913613f6268366b6e6e20fdfddde693/frontend/server/aws-helper.ts#L55

lastly:

https://github.com/kubeflow/pipelines/blob/1577bdb41913613f6268366b6e6e20fdfddde693/frontend/src/lib/AwsHelper.ts#L23

aaron-arellano commented 3 years ago

Is assuming a web identity role via IAM Roles for Service Accounts still an open issue? The thread is not clear on whether this works or not. I am looking to use IAM Roles for Service Accounts to connect ML Pipeline and Argo artifacts to S3 using the annotation eks.amazonaws.com/role-arn: <your-arn-role>. @Jeffwan

karlschriek commented 3 years ago

It doesn't work yet. The minio version used by pipelines is currently still sitting at github.com/minio/minio-go v6.0.14+incompatible. I think this needs to be merged: https://github.com/kubeflow/pipelines/pull/3549

karlschriek commented 3 years ago

This issue also needs to be resolved: https://github.com/minio/minio-js/issues/841. As far as I can tell from piecing together the various PRs and issues on this topic, we need both minio-go and minio-js to implement this, and then we need the versions to be bumped on KFP's side.

aaron-arellano commented 3 years ago

Thanks for the response. Hopefully there is a resolution to this soon. I know a lot of people would love to use IRSA for S3 to avoid using credentials or third-party K8s IAM providers...

Bobgy commented 3 years ago

Hi everyone, for KFP v2 compatible mode, we've switched to using https://gocloud.dev/ to access S3. Can someone help me verify whether it supports your use cases, especially IRSA, and how? You can find more about the design in bit.ly/kfp-v2-compatible

karlschriek commented 3 years ago

Hey @Bobgy, I could help with this. Could you give me some more info on what you would like to verify exactly? I guess this will replace the Minio Go and JS clients that are currently in use?

The link bit.ly/kfp-v2-compatible gives me a 404 error.

Bobgy commented 3 years ago

Thanks @karlschriek! Yes, we'd like to replace MinIO Go with Go CDK (also KFP tasks in pipelines will use Go CDK to upload/download from storage). So it'll be helpful to verify whether Go CDK (as an alternative client for S3) can support the IRSA you are talking about.

Re the 404 error, sorry I made a mistake, fixed the link.
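For anyone evaluating this, opening an S3 bucket through the Go CDK looks roughly like the sketch below (the bucket URL and object key are placeholders). The s3blob driver resolves credentials through the AWS SDK's default chain, which is the mechanism IRSA's injected web identity token plugs into, so verifying IRSA mostly means checking that the vendored AWS SDK version supports web identity credentials:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"gocloud.dev/blob"
	_ "gocloud.dev/blob/s3blob" // registers the s3:// URL scheme
)

func main() {
	ctx := context.Background()

	// Placeholder bucket; credentials come from the AWS default chain
	// (env vars, shared config, instance profile, or IRSA web identity).
	bucket, err := blob.OpenBucket(ctx, "s3://my-kfp-artifacts?region=us-west-2")
	if err != nil {
		log.Fatal(err)
	}
	defer bucket.Close()

	data, err := bucket.ReadAll(ctx, "artifacts/some-run/mlpipeline-ui-metadata.tgz")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("read %d bytes\n", len(data))
}
```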

karlschriek commented 3 years ago

Ok, it looks promising at least: https://github.com/google/go-cloud/pull/2773.

IRSA works by attaching annotations to the ServiceAccounts that get attached to Pods. So all we would need to do is attach the right annotations to the ServiceAccounts for the pipelines api etc, as well as whatever sa is used for the pipeline jobs.

I'll provide you with some small examples and also try out the Go SDK.
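In the meantime, a hedged sketch of the annotation setup being discussed (the name, namespace, and role ARN are placeholders); the annotation only helps once the S3 clients in the pods can actually consume the injected web identity credentials:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pipeline-runner
  namespace: kubeflow
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/kfp-s3-access  # placeholder ARN
```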