treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0

Add support to AWS Java SDKs s3Client.doesBucketExistV2(bucketName) call #8187

Open tpang-cxl opened 1 month ago

tpang-cxl commented 1 month ago

Currently, we use Apache DolphinScheduler to schedule our pipelines. But when we set the S3 endpoint to our lakeFS server, we get the following error:

com.amazonaws.services.s3.model.AmazonS3Exception: This operation is not supported in LakeFS (Service: Amazon S3; Status Code: 405; Error Code: ERRLakeFSNotSupported; Request ID: 40a1e78b-f23e-4b04-8294-4338345d7c74; S3 Extended Request ID: CB07BFE5E44B0E5F; Proxy: null)

This error doesn't happen if we point our storage directly at MinIO, which is the workaround we are using now. However, this workaround is not desirable, as we don't want to expose our MinIO storage directly (bypassing lakeFS) to our API calls. If lakeFS can support this call, we can disallow direct access to MinIO again.

itaiad200 commented 1 month ago

Root cause is probably as described here

arielshaqed commented 1 month ago

Root cause is probably as described here

If true, that root cause says the issue is the getBucketAcl call. Uggh.

The doesBucketExistV2 docs also state that it performs this call.
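To make the failure mode concrete, here is a rough model of how the v1 client appears to turn the getBucketAcl response into a boolean. It is a sketch, not the SDK source: the class and method names are hypothetical, and the mapping (success, a 301 redirect, or AccessDenied mean "exists"; 404 means "does not exist"; anything else is rethrown) is an assumption based on the SDK's documented behaviour. Under that assumption, the 405 ERRLakeFSNotSupported response falls through to the rethrow, which matches the exception reported above.

```java
// Hypothetical stand-in for the decision logic behind doesBucketExistV2.
// Nothing here calls the AWS SDK; it only models the status handling.
class BucketExistsSketch {

    /** Stands in for AmazonS3Exception being rethrown to the caller. */
    static class UnsupportedResponse extends RuntimeException {
        final int statusCode;

        UnsupportedResponse(int statusCode, String errorCode) {
            super("status " + statusCode + ", error code " + errorCode);
            this.statusCode = statusCode;
        }
    }

    /**
     * Interprets the result of a getBucketAcl call.
     * Assumed mapping: 2xx, 301, or AccessDenied -> true; 404 -> false;
     * every other response is surfaced as an exception.
     */
    static boolean interpretGetBucketAcl(int statusCode, String errorCode) {
        if (statusCode >= 200 && statusCode < 300) {
            return true; // ACL returned, so the bucket exists
        }
        if (statusCode == 301 || "AccessDenied".equals(errorCode)) {
            return true; // bucket exists elsewhere, or exists but is not readable
        }
        if (statusCode == 404) {
            return false; // no such bucket
        }
        throw new UnsupportedResponse(statusCode, errorCode);
    }

    public static void main(String[] args) {
        System.out.println(interpretGetBucketAcl(200, null));
        System.out.println(interpretGetBucketAcl(404, "NoSuchBucket"));
        try {
            interpretGetBucketAcl(405, "ERRLakeFSNotSupported");
        } catch (UnsupportedResponse e) {
            System.out.println("rethrown: " + e.getMessage());
        }
    }
}
```

If this model is right, any successful getBucketAcl response, however minimal, would be enough for the existence check to return true.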

So I would like to scope this clearly: we can add a workaround that makes doesBucketExistV2 of the deprecated AWS Java SDK v1 work. It will involve sending a minimal response to getBucketAcl back to the client. Of course, that response may cause other S3 clients to give strange results when used with lakeFS, but I feel reasonably confident that a response of "this bucket allows anything to authorized users" or similar will give good results.
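For illustration only, a minimal getBucketAcl response body could look like the XML built below. This is a sketch of the AccessControlPolicy shape that S3 documents for GetBucketAcl, with a single FULL_CONTROL grant to the owner; it is not what lakeFS returns or will return. The class name, method names, and the "lakefs" owner values are all hypothetical.

```java
// Hypothetical builder for a minimal S3-style ACL response body.
// The element names follow the S3 GetBucketAcl response format;
// the owner ID and display name are placeholder assumptions.
class AclResponseSketch {

    /** Escapes the five XML special characters in text content. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /** Builds an AccessControlPolicy granting FULL_CONTROL to the owner. */
    static String minimalAcl(String ownerId, String displayName) {
        String id = escapeXml(ownerId);
        String name = escapeXml(displayName);
        return String.join("\n",
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
            "<AccessControlPolicy xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\">",
            "  <Owner><ID>" + id + "</ID><DisplayName>" + name + "</DisplayName></Owner>",
            "  <AccessControlList>",
            "    <Grant>",
            "      <Grantee xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:type=\"CanonicalUser\">",
            "        <ID>" + id + "</ID><DisplayName>" + name + "</DisplayName>",
            "      </Grantee>",
            "      <Permission>FULL_CONTROL</Permission>",
            "    </Grant>",
            "  </AccessControlList>",
            "</AccessControlPolicy>");
    }

    public static void main(String[] args) {
        // Placeholder owner values, chosen only for this example.
        System.out.println(minimalAcl("lakefs", "lakefs"));
    }
}
```

A fixed body like this says nothing about real permissions, which is why other clients that actually read the ACL might behave oddly.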

The AWS Java SDK v2 does not have such a method, and instead says to use headBucket. So we believe that code using the v2 SDK will work today.

tpang-cxl commented 1 month ago

Thanks for the update, @arielshaqed. Yes, please add the SDK v1 support; that would be very helpful. We have quite a few third-party tools, apps, and libraries that still use AWS SDK v1, and there is no sign that they will all upgrade their AWS SDK version, so we have no choice but to live with AWS SDK v1 for some time.

I very much appreciate the time you spent looking into this and proposing a solution that I believe will work. Looking forward to hearing back from you and learning which version will include this support, so we can upgrade our lakeFS installations.