pangeo-data / pangeo-eosc

Pangeo for the European Open Science cloud
https://pangeo-data.github.io/pangeo-eosc/
MIT License

How to link a jupyterlab instance running with EGI-dev-checkin to object store at CESNET? #14

Closed tinaok closed 1 year ago

tinaok commented 1 year ago

On the JupyterLab interface of our FOSS4G test EOSC instance, we have the S3 Object Storage Browser.

I guess the endpoint URL would be 'https://object-store.cloud.muni.cz', but what would the Access Key ID and Secret Access Key be? I tried using a token from https://aai.egi.eu/token/ as the session token, but I got the following error:

S3 Authentication Error
An error occurred (InvalidArgument) when calling the ListBuckets operation: Unknown-

sebastian-luna-valero commented 1 year ago

Hi,

Here are the docs regarding S3 access: https://docs.egi.eu/users/data/storage/object-storage/#access-via-the-s3-protocol

However, at the moment the access and secret keys for write access need to be configured manually by the OpenStack admins at each site, which is not ideal. Read-only access is possible, as explained in the docs.

Instead, we have updated our documentation to include Rclone: https://docs.egi.eu/users/data/storage/object-storage/#access-via-rclone

Could we try this out with an example notebook running on https://pangeo-foss4g.vm.fedcloud.eu/jupyterhub/ ?

guillaumeeb commented 1 year ago

Hi @sebastian-luna-valero @tinaok,

I think there is a more general question than Tina's above: how can we get read/write access to an object store from JupyterLab, be it from a browser or from a notebook cell?

The ultimate goal is to be able to write Zarr/NetCDF/GeoTIFF datasets to this object store using Dask clusters deployed on Kubernetes and the Xarray APIs. Being able to browse them with JupyterLab is maybe a bit less important.

If we can have S3 keys, this is great, even if this is manual for each of us. But I guess this means that only trusted people (and not trainees from a workshop) could have one.

Using Rclone is not an option. It could be useful for browsing and exploring the object store, but it can't be used to write with Xarray.

Finally, we could dig into the possibility of using Swift directly through the corresponding fsspec implementation.

sebastian-luna-valero commented 1 year ago

Hi,

> Finally, we could dig into the possibility of using directly Swift through the corresponding fsspec implementation.

I had a brief look at https://pypi.org/project/swiftspec/ (is this the fsspec implementation for Swift?) and I wasn't convinced. You need to specify the user account in the path to get the object:

import fsspec

with fsspec.open("swift://server/account/container/object.txt", "r") as f:
    print(f.read())
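
To make the concern concrete, here is a minimal sketch of how such a URL breaks down into server, account, container and object. The helper name `split_swift_url` is hypothetical and this is not swiftspec's own parsing code, only an illustration of the layout shown above.

```python
from urllib.parse import urlsplit


def split_swift_url(url):
    """Split swift://server/account/container/object into its parts.

    Hypothetical helper for illustration; not part of swiftspec.
    """
    parts = urlsplit(url)
    if parts.scheme != "swift":
        raise ValueError(f"not a swift:// URL: {url!r}")
    # The path is /account/container/object; the object key may itself
    # contain slashes, so split at most twice.
    account, container, obj = parts.path.lstrip("/").split("/", 2)
    return {
        "server": parts.netloc,
        "account": account,
        "container": container,
        "object": obj,
    }


info = split_swift_url("swift://server/account/container/object.txt")
print(info)
```

The account segment is part of every object path, which is why a pipeline written against one account cannot be pointed at another without editing its URLs.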

Instead I successfully tested https://pypi.org/project/zarr-swiftstore/ with:

import os

# Pre-authenticated Swift options, read from the environment
auth = {
    "preauthurl": os.environ["OS_STORAGE_URL"],
    "preauthtoken": os.environ["OS_AUTH_TOKEN"],
}

The values for these environment variables can be obtained the same way as for rclone in: https://docs.egi.eu/users/data/storage/object-storage/#access-via-rclone
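
For reference, a minimal sketch of how the `auth` dictionary above might be passed to zarr-swiftstore. The `SwiftStore(container=..., prefix=..., storage_options=...)` call is assumed from the package README and should be checked against the installed version; the helper names `swift_auth_from_env` and `open_swift_store` are hypothetical.

```python
import os


def swift_auth_from_env(env=os.environ):
    """Build the pre-authenticated options from OS_STORAGE_URL / OS_AUTH_TOKEN."""
    missing = [k for k in ("OS_STORAGE_URL", "OS_AUTH_TOKEN") if k not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "preauthurl": env["OS_STORAGE_URL"],
        "preauthtoken": env["OS_AUTH_TOKEN"],
    }


def open_swift_store(container, prefix, env=os.environ):
    """Return a Zarr store backed by a Swift container (assumed README API)."""
    # Imported lazily so the helper above works without the package installed.
    from zarrswift import SwiftStore

    return SwiftStore(
        container=container,
        prefix=prefix,
        storage_options=swift_auth_from_env(env),
    )
```

With an Xarray dataset `ds`, something like `ds.to_zarr(open_swift_store("my-container", "demo.zarr"), mode="w")` would then write to the store; the container and prefix names here are placeholders.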

Could you please give it a try and let me know how it goes?

> If we can have S3 keys, this is great, even if this is manual for each of us. But I guess this means that only trusted people (and not trainees from a workshop) could have one.

I thought the goal was to enable read/write access to an object store for everybody. If only trainers should need write access, then asking the site admins for S3 credentials is an option.

guillaumeeb commented 1 year ago

> I had a brief look at https://pypi.org/project/swiftspec/ (is this the fsspec implementation for Swift?)

Yes, I guess this is the one, with the source code here: https://github.com/fsspec/swiftspec. The advantage of an fsspec implementation is that you can use it with Zarr, but also with other file formats. We could also consider contributing to this fsspec package to improve it. Why is it a problem to have an account in the URL? I see in the README that it also uses the same environment variables (https://github.com/fsspec/swiftspec#authentication).

Anyway, the zarr-swiftstore package also looks really promising and might solve part of the problem!

We need to find all the resources for using Pangeo on a standard OpenStack deployment, and this is certainly one of them.

I will try the two approaches in the next few days.

> I thought the goal was to enable read/write access to an object store for everybody. If only trainers should need write access, then asking the site admins for S3 credentials is an option.

My personal point of view is that it is really not crucial for trainees to have write access to the object store. However, we will also develop real use cases at scale on this platform, and application developers such as @pl-marasco, @acocac or @tinaok really need to be able to write to this store. I think we should explore the S3 credentials option too!

sebastian-luna-valero commented 1 year ago

Hi,

> How is it a problem to have an account in the URL?

Maybe I am missing something basic here, but if I write a pipeline to create:

"swift://server/<sebastian-account>/container/object.txt"

Would you need my credentials to access it? And if I upload my pipeline to GitHub, will somebody else be able to re-run it?

I did try using OS_STORAGE_URL and OS_AUTH_TOKEN, but the tests didn't work for me. If you get an example with them that does not use <account> in the URL, please share it and I will give it a go.

> I think we should explore the S3 credentials option too!

Again, the problem I see here is with reproducibility and self-service. This might not be relevant during the testing phase, but it will become a major issue when moving to production. So in my personal opinion, this would be the last resort.

sebastian-luna-valero commented 1 year ago

xref: https://github.com/fsspec/swiftspec/issues/6

guillaumeeb commented 1 year ago

Thanks @sebastian-luna-valero for all the work and discussion here!

So yes, after testing a bit, it seems that https://github.com/fsspec/swiftspec is not really compatible with our needs or with the CESNET Swift store. It uses a URL to pass a lot of arguments and just parses it to make the request to the object store, but this doesn't work with our settings.

From what I can see on the OpenStack Dashboard, we should access Swift with a URL like https://object-store.cloud.muni.cz/swift/v1/pangeo-test/. So I guess our server is https://object-store.cloud.muni.cz/swift.

So this currently leaves us with zarr-swiftstore (which I was not able to test due to my connection problems), which should solve distributed writing and reading of Zarr, and also with S3 anonymous access for reads through s3fs.

I think a really good thing for the community would be to write a proper fsspec implementation for Swift, either building on swiftspec or starting from scratch. We could probably get a bit of support from the Pangeo community and the fsspec maintainers. But I've no idea how hard this would be.

Finally, about the S3 credentials, I understand your concern, but I'm not sure we want to reach something like "production" here, at least not in the short to medium term. The infrastructure is made for workshops and for scientific research, which rarely have production concerns. Could we at least get some credentials to test? How can we request them?

sebastian-luna-valero commented 1 year ago

Hi,

Revisiting this topic today, I just realised that CESNET provides these instructions (which are not available across all EGI Cloud providers): https://docs.cloud.muni.cz/cloud/advanced-features/#s3-credentials

Therefore, getting S3 credentials is self-service:

Use fedcloudclient following instructions in https://github.com/pangeo-data/pangeo-eosc/pull/15:

fedcloud openstack --site CESNET-MCC --vo vo.pangeo.eu ec2 credentials create 
fedcloud openstack --site CESNET-MCC --vo vo.pangeo.eu ec2 credentials list

I just tested it with the S3 Object Storage browser on https://pangeo-foss4g.vm.fedcloud.eu/jupyterhub/ and it works.

Please give it a try and let me know how it goes.
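
For use from a notebook cell rather than the browser extension, here is a minimal sketch of how the key pair could be turned into s3fs/fsspec storage options. It assumes the `Access` and `Secret` columns printed by `ec2 credentials list` map to the `key` and `secret` options of s3fs, and that the endpoint is the one mentioned at the top of this issue; the helper names `s3_storage_options` and `write_zarr` are hypothetical.

```python
def s3_storage_options(endpoint_url, access=None, secret=None):
    """Return s3fs storage options; anonymous (read-only) if no key pair is given."""
    options = {"client_kwargs": {"endpoint_url": endpoint_url}}
    if access and secret:
        options.update(key=access, secret=secret)
    else:
        options["anon"] = True
    return options


def write_zarr(ds, url, options):
    """Write an xarray.Dataset to an s3:// URL (requires xarray, zarr and s3fs)."""
    ds.to_zarr(url, mode="w", storage_options=options)


# Anonymous options for the CESNET endpoint mentioned in this issue.
read_only = s3_storage_options("https://object-store.cloud.muni.cz")
```

Calling `s3_storage_options` with the access and secret values instead returns options that `write_zarr` can use for writes.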

guillaumeeb commented 1 year ago

I just tested the Jupyterlab object storage browser extension with these instructions, and it works perfectly well.

I'm going to close this issue as I merged #15.

Instructions on how to obtain an S3 access/secret key pair can be found here: https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI-CLI-Swift-S3.md#retrieve-s3-credentials.

Thanks a lot @sebastian-luna-valero!