malariagen / datalab

Repo for files and issues related to cloud deployment of JupyterHub.
MIT License

zarr/scikit-allel/dask demo on Sanger OpenStack #56

Closed alimanfoo closed 5 years ago

alimanfoo commented 5 years ago

Get a demo of zarr, scikit-allel and dask all working together on the Sanger OpenStack deployment. Basically same as #8 but on Sanger OpenStack instead of GCP.

alimanfoo commented 5 years ago

cc @roamato, @slejdops

Good example notebook to aim to get running is here:

https://github.com/malariagen/datalab/blob/master/examples/scikit-allel-example.ipynb

I've included some commented-out code which you would need for reading data from Ceph S3 instead of GCS. That should give you a starting point, although the exact parameters might need tweaking.
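For reference, the S3-specific parameters boil down to pointing the client at the Ceph gateway instead of AWS. The helper below is my own sketch (the function name is hypothetical; the endpoint is the Sanger Ceph gateway used elsewhere in this thread):

```python
def ceph_s3_client_kwargs(endpoint_url="https://cog.sanger.ac.uk"):
    """Client kwargs for s3fs.S3FileSystem pointed at a Ceph S3 gateway
    rather than AWS. The region is a placeholder: Ceph does not use it,
    but botocore expects one to be set."""
    return dict(region_name="us-east-1", endpoint_url=endpoint_url)


# Usage with s3fs (not run here; requires network access):
#   import s3fs
#   s3 = s3fs.S3FileSystem(anon=True, client_kwargs=ceph_s3_client_kwargs())
#   store = s3fs.S3Map(root=storage_path, s3=s3, check=False)
```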

roamato commented 5 years ago

@alimanfoo

Data are now accessible (see #55); let me know if you have time to give it a spin or whether you'd like me to do it.

It would be great to have a working example in place by next Wed, when @slejdops and @idwright will meet with the Sanger web team.

alimanfoo commented 5 years ago

Cool, I'll catch up with @slejdops tomorrow and see if we can get something running.

roamato commented 5 years ago

@alimanfoo

I get the following error when loading the file:

>>> import zarr
>>> import s3fs
>>> storage_path = 'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1'
>>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='us-east-1', endpoint_url="https://cog.sanger.ac.uk"))
>>> store = s3fs.S3Map(root=storage_path, s3=s3, check=False)
>>> callset = zarr.Group(store)
ValueError: group not found at path None

The bucket is accessible though:

>>> s3.ls(storage_path)
['ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/.zattrs',
 'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/.zgroup',
 'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/2L',
 'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/2R',
 'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/3L',
 'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/3R',
 'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/UNKN',
 'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/X',
 'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/Y_unplaced',
 'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/samples']

Did I forget to upload something?

alimanfoo commented 5 years ago

Hi @roamato, can you try the following and tell me what you get:

>>> storage_path = 'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1'
>>> import s3fs
>>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='us-east-1', endpoint_url="https://cog.sanger.ac.uk"))
>>> store = s3fs.S3Map(root=storage_path, s3=s3, check=False)
>>> list(store)[:10]
>>> store['.zgroup']

When @slejdops and I were looking at this the other day, we found that although we had permission to list the bucket, we did not have permission to read any of the objects. So you will probably need to make all the objects in the bucket world-readable.
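This failure mode can be reproduced without any cloud access. `ListOnlyStore` below is a hypothetical stand-in (my own sketch, not s3fs) for a bucket whose listing is public but whose objects are not readable: listing keys works, like `s3.ls(...)` did, but fetching the `.zgroup` metadata that `zarr.Group` needs fails, which zarr then surfaces as "group not found":

```python
class ListOnlyStore:
    """Hypothetical mapping mimicking a bucket where listing is public
    but object reads are denied (the situation described above)."""

    def __init__(self, keys):
        self._keys = set(keys)

    def __iter__(self):
        # Listing the store succeeds, just as `s3.ls(...)` did.
        return iter(self._keys)

    def __getitem__(self, key):
        # But reading any object is denied, so zarr cannot fetch the
        # '.zgroup' metadata and reports "group not found".
        raise PermissionError("access denied: " + key)


store = ListOnlyStore({".zgroup", ".zattrs", "2L/.zgroup"})
print(sorted(store))  # listing works: ['.zattrs', '.zgroup', '2L/.zgroup']
try:
    store[".zgroup"]
except PermissionError as exc:
    print(exc)  # access denied: .zgroup
```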

roamato commented 5 years ago

TIL that making a bucket public only makes the bucket itself public, not its contents. The following seems to solve the problem:

$ s3cmd setacl --acl-public -r s3://ag1000g-release
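For the record, the same recursive ACL change can be sketched in Python. This is an illustration, not the command we ran: `make_objects_public` is a hypothetical helper that takes a boto3-style S3 client as a parameter (so the logic can be exercised without network access) and sets `ACL='public-read'` on each listed object, mirroring what `s3cmd setacl --acl-public -r` does:

```python
def make_objects_public(client, bucket, prefix=""):
    """Grant 'public-read' ACL to every object under `prefix`.

    `client` is assumed to be a boto3-style S3 client; returns the
    number of objects updated.
    """
    paginator = client.get_paginator("list_objects_v2")
    count = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            client.put_object_acl(Bucket=bucket, Key=obj["Key"],
                                  ACL="public-read")
            count += 1
    return count
```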
alimanfoo commented 5 years ago

Cool :+1:

Are you able to run the full notebook now?

roamato commented 5 years ago

@slejdops and I just did, and we're satisfied that it works using the Sanger S3 bucket. Closing the issue.

There is something odd with the dashboard but if it persists I'll open a separate issue.

alimanfoo commented 5 years ago

Great, that's a nice milestone to have achieved :beer: