alimanfoo closed this issue 5 years ago
cc @roamato, @slejdops
A good example notebook to aim to get running is here:
https://github.com/malariagen/datalab/blob/master/examples/scikit-allel-example.ipynb
I've included some commented-out code that you would need for reading data from Ceph S3 instead of GCS. That should give you a place to start, although the exact parameters might need tweaking.
@alimanfoo
Data are now accessible (see #55). Let me know if you have time to give it a spin, or if you'd like me to do it.
It would be great to have a working example in place by next Wednesday, when @slejdops and @idwright will meet with the Sanger webteam.
Cool, I'll catch up with @slejdops tomorrow and see if we can get something running.
@alimanfoo
I get the following error when loading the file:
>>> storage_path = 'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1'
>>> import s3fs
>>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='us-east-1', endpoint_url="https://cog.sanger.ac.uk"))
>>> store = s3fs.S3Map(root=storage_path, s3=s3, check=False)
>>> callset = zarr.Group(store)
ValueError: group not found at path None
The bucket is accessible though:
>>> s3.ls(storage_path)
['ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/.zattrs',
'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/.zgroup',
'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/2L',
'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/2R',
'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/3L',
'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/3R',
'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/UNKN',
'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/X',
'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/Y_unplaced',
'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1/samples']
Did I forget to upload something?
Hi @roamato, can you try the following and tell me what you get:
>>> storage_path = 'ag1000g-release/phase2/AR1/variation/main/zarr2/ag1000g.phase2.ar1'
>>> import s3fs
>>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='us-east-1', endpoint_url="https://cog.sanger.ac.uk"))
>>> store = s3fs.S3Map(root=storage_path, s3=s3, check=False)
>>> list(store)[:10]
>>> store['.zgroup']
When @slejdops and I were looking at this the other day, we found that although we had permission to list the bucket, we did not have permission to read any of the objects. So you will probably need to make all the objects in the bucket world-readable.
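To tell "can list the bucket" apart from "can read the objects", a small check along these lines might help. It is only a sketch: `diagnose_store` is a hypothetical helper (not part of zarr or s3fs) that takes any mapping-style store, such as the `S3Map` built above, and tries listing and reading as two separate steps.

```python
import json
from itertools import islice


def diagnose_store(store, key=".zgroup", n=10):
    """Check separately whether a mapping-style store can be listed and read.

    `store` is any mapping of keys to bytes, e.g. an s3fs.S3Map.
    This helper is hypothetical, not part of zarr or s3fs.
    """
    report = {"listable": False, "readable": False, "error": None}
    try:
        # Iterating the store exercises list permission on the bucket.
        report["sample_keys"] = list(islice(store, n))
        report["listable"] = True
    except Exception as exc:  # broad on purpose: we only want a report
        report["error"] = repr(exc)
        return report
    try:
        # Fetching one object exercises read permission on that object.
        report["metadata"] = json.loads(store[key])
        report["readable"] = True
    except Exception as exc:
        report["error"] = repr(exc)
    return report
```

Calling `diagnose_store(store)` on the `S3Map` should come back with `listable` true and `readable` false if the objects themselves are not world-readable; the exact exception recorded in `error` will depend on the s3fs version.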
TIL that when you make a bucket public, you only make the bucket itself public and not its contents. The following seems to solve the problem:
$ s3cmd setacl --acl-public -r s3://ag1000g-release
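For anyone who would rather do this from Python, here is a rough equivalent of the recursive `setacl`. It is a sketch only and untested against the Sanger endpoint: `make_objects_public` is a hypothetical helper, and it assumes `client` behaves like a boto3 S3 client (using its `get_paginator("list_objects_v2")` and `put_object_acl` calls) and that the Ceph gateway accepts those requests.

```python
def make_objects_public(client, bucket, prefix=""):
    """Set a public-read ACL on every object under `prefix`; return the count.

    `client` is assumed to behave like a boto3 S3 client created with
    credentials that are allowed to change ACLs on the bucket.
    """
    paginator = client.get_paginator("list_objects_v2")
    count = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # An empty page has no "Contents" entry at all.
        for obj in page.get("Contents", []):
            client.put_object_acl(Bucket=bucket, Key=obj["Key"], ACL="public-read")
            count += 1
    return count
```

With boto3 installed, the client would presumably be created with something like `boto3.client("s3", endpoint_url="https://cog.sanger.ac.uk")` plus the owner's credentials, then passed in as `make_objects_public(client, "ag1000g-release")`. Objects uploaded later would still need their ACL set separately.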
Cool :+1:
Are you able to run the full notebook now?
@slejdops and I just did, and we are satisfied that it works using the Sanger S3 bucket. Closing the issue.
There is something odd with the dashboard, but if it persists I'll open a separate issue.
Great, that's a nice milestone to have achieved :beer:
Get a demo of zarr, scikit-allel and dask all working together on the Sanger OpenStack deployment. Basically the same as #8, but on Sanger OpenStack instead of GCP.