alimanfoo opened 5 years ago
Hi @roamato, @tnguyensanger, I've been chatting with @slejdops and he tells me you're experimenting with using an NFS server to get shared storage for datalab at Sanger. He also mentioned that there are some constraints imposed by the network that mean it's currently difficult to get data from lustre to the NFS store. Just wanted to say happy to discuss or talk through options if it would help (although I'm going on leave for 2 weeks so apologies for radio silence for a while).
If you are only using Python then one possibility is to use the s3fs package directly, which can be used to open, read and write objects with the same API as opening files. E.g., this kind of thing works:
>>> import gzip, s3fs
>>> import pandas as pd
>>> s3 = s3fs.S3FileSystem()  # picks up credentials from the environment
>>> with s3.open('mybucket/my-file.csv.gz', 'rb') as f:
...     g = gzip.GzipFile(fileobj=f)  # decompress data with gzip
...     df = pd.read_csv(g)  # read CSV file with pandas
We're going down this route for the Google datalab deployment, using the equivalent gcsfs package. I.e., we're not trying to mount any storage, just interacting with object storage directly. Although that only works if you're using Python; if you need to use R or run command-line programs then I imagine you'd need a different approach. In any case I'd be interested to know more about your use cases.
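To illustrate the point about a shared API: s3fs and gcsfs both implement the same fsspec filesystem interface, so the read/write pattern above is identical across backends. A minimal sketch of that pattern, using fsspec's in-memory backend so it runs without any cloud credentials (the bucket and file names here are made up; with gcsfs you would construct `gcsfs.GCSFileSystem()` instead):

```python
import gzip

import fsspec  # s3fs and gcsfs are both built on this interface
import pandas as pd

# In-memory filesystem stands in for S3/GCS purely for illustration;
# swap in fsspec.filesystem('s3') or fsspec.filesystem('gcs') for real
# object storage (credentials permitting).
fs = fsspec.filesystem("memory")

# Write a gzipped CSV "object" (hypothetical bucket/key).
with fs.open("mybucket/my-file.csv.gz", "wb") as f:
    f.write(gzip.compress(b"a,b\n1,2\n3,4\n"))

# Read it back exactly as in the s3fs example above.
with fs.open("mybucket/my-file.csv.gz", "rb") as f:
    g = gzip.GzipFile(fileobj=f)
    df = pd.read_csv(g)

print(df.shape)
```

The useful property is that switching between S3, GCS and local storage then only changes the filesystem construction, not the read/write code.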
Thanks @alimanfoo. Reality is a bit more cumbersome and basically boils down to OS being a very closed sandbox with limited (and tightly controlled) ingress and egress points. We already know that S3 works in principle, but also that it is not quite as simple as one would like. I'll do some investigation, but a really usable solution might require some redesign. I'll keep you posted.
Hi Rob, no worries, thought I would check in. Catch you in a couple of weeks.
Raising this issue as a place to discuss storage options and configuration for the Sanger datalab deployment.