pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Google Drive on Pangeo #288

Closed tjcrone closed 4 years ago

tjcrone commented 6 years ago

During yesterday's sprint there was some discussion of Google Drive as a place to store data, since many people with academic accounts have unlimited storage on Drive. I have been playing with this today, configuring it according to these instructions, and I'm finding that it works really well and is relatively fast. I'm getting about 400 mb/s transfers from my Drive storage into a stand-alone GCE VM. Not sure how it would scale across a cluster. I have noticed that there may be a bug that could cause the cache to grow very large, which would be bad, but it's nothing that rm cannot fix. Anyway, I can envision many beneficial uses of this for Pangeo users.

rsignell-usgs commented 6 years ago

And just a reminder that the free and easy-to-install Rclone can sync between Google Drive and S3, GCS, or a Linux filesystem.

rabernat commented 6 years ago

Does this use fuse to expose the google drive to the compute nodes?

tjcrone commented 6 years ago

I haven't yet connected worker pods to my Drive, but I don't see any obvious reason why it wouldn't be relatively easy to do. The only difficult part might be authentication, but I'm pretty sure we could find a way to pass all the needed tokens and IDs to the workers when they are created.

tjcrone commented 6 years ago

@rsignell-usgs, rclone looks really powerful, but its main function appears to be copying files from one place to another, and that might not be the way to go if the Drive has many TBs of data. It does appear to have an experimental "mount" functionality which looks like it might offer something similar to google-drive-ocamlfuse. Perhaps we should investigate which one of these works better?

rabernat commented 6 years ago

By compute nodes I mean notebooks and workers. My question is just "does this use fuse in some way"?

tjcrone commented 6 years ago

Pretty sure it does. It is called google-drive-ocamlfuse, and the title of the readme is "FUSE filesystem over Google Drive".

rsignell-usgs commented 6 years ago

@tjcrone, I should have provided more context. I understand the goal here is to read directly from Google Drive. I just wanted to add a footnote that if that doesn't work out, there is an easy way to sync with other storage systems using Rclone.

rsignell-usgs commented 6 years ago

@tjcrone, ah, you are talking about this experimental feature of Rclone, right? https://rclone.org/commands/rclone_mount/

tjcrone commented 6 years ago

Working off of ideas from here, I managed to get Google Drive to return a byte range from a file of any size, directly from the file ID, using simple Python. No need for any auth as long as the file is shared with everyone by link. Here is a relevant code snippet:

import requests

def _get_token(response):
    # Google sets a "download_warning" cookie for files too large to virus-scan;
    # its value is the confirmation token needed for the real download.
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def get_drive_bytes(id, byte_range):
    # Fetch a byte range from a shared-by-link file, given only its file ID.
    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    headers = {"Range": "bytes=%i-%i" % byte_range}
    response = session.get(URL, params={'id': id}, stream=True, headers=headers)
    token = _get_token(response)
    if token:
        # Retry with the confirmation token if Google interposed a warning page.
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True, headers=headers)
    return response.content

byte_range is a (start, end) tuple of byte offsets. It's easy to share lots of files just by sharing the folder they are in (anyone with the link can view), and a mapping of filenames to file IDs can be obtained with rclone:

rclone lsf --format pi --csv drive:folder

rclone is a pretty nice tool. I would definitely recommend checking it out.

In terms of performance, I am getting about 1 Gb/s from Drive to my Azure cluster using only the notebook. Have not yet tried scaling out to workers. Also I do not know if Drive will limit my requests or request rate. Would be great to hear from others who try this.
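For anyone who wants to try this, here is a rough usage sketch tying the pieces together (the folder, file name, and remote name are made up; it assumes rclone is installed and configured with a "drive:" remote):

import csv
import io
import subprocess

# Build a {path: file_id} mapping from the rclone listing ("pi" = path, id).
listing = subprocess.run(
    ["rclone", "lsf", "--format", "pi", "--csv", "drive:folder"],
    capture_output=True, text=True, check=True,
).stdout
id_map = {row[0]: row[1] for row in csv.reader(io.StringIO(listing)) if len(row) == 2}

# Grab the first kilobyte of one (made-up) file using get_drive_bytes above.
header_bytes = get_drive_bytes(id_map["some_file.nc"], (0, 1023))
print(len(header_bytes))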

rabernat commented 6 years ago

This sounds amazing, Tim. Could solve our problem of "how do users access their data". Most people have Google Drive and are comfortable putting data in there.

Can you figure out how to make it work with zarr? 😉

tjcrone commented 6 years ago

@rabernat, write to Drive is a whole different ball of wax. However, I am seriously considering expanding my example code into a Drive API that would provide the limited functionality needed by zarr.
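Just to sketch what I mean (a hypothetical, read-only shape; zarr can read from any dict-like store, and each zarr chunk would be one Drive file fetched with the get_drive_bytes helper above):

from collections.abc import MutableMapping

class DriveStore(MutableMapping):
    """Hypothetical read-only zarr store backed by shared-by-link Drive files.

    id_map maps zarr keys (e.g. ".zarray", "0.0") to Drive file IDs, as
    produced by the rclone listing above.
    """

    def __init__(self, id_map):
        self.id_map = id_map

    def __getitem__(self, key):
        # Each zarr chunk is a separate Drive file, so fetch the whole object.
        # (A whole-file download helper without the Range header would be cleaner.)
        return get_drive_bytes(self.id_map[key], (0, 2**40))

    def __iter__(self):
        return iter(self.id_map)

    def __len__(self):
        return len(self.id_map)

    def __setitem__(self, key, value):
        raise NotImplementedError("read-only store")

    def __delitem__(self, key):
        raise NotImplementedError("read-only store")

Something like zarr.open_array(DriveStore(id_map), mode='r') might then work for reads, assuming id_map covers the .zarray metadata key and all chunk keys.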

rabernat commented 6 years ago

I'm just talking about reading for now.

The basic scenario I have in mind is

However, writing back to google drive would also be great!

Is it possible to just make a generic fuse mount of a google drive folder and access it as a regular filesystem?

rabernat commented 6 years ago

@martindurant: I'm wondering if you could provide some advice about the best way to proceed with this. You have already developed similar things for gcs / s3. Is it worthwhile trying to do something similar for google drive?

martindurant commented 6 years ago

There does seem to be something of a Python client, which may be usable for downloading whole files, and a much better documented REST API, which looks general enough to be used in a gcsfs-like wrapping. Reusing the auth in gcsfs did not immediately get me permission to Drive, so presumably it would take some effort to get things working.

martindurant commented 6 years ago

Correction, the gcs auth method can work, but the project needs to have the drive API enabled.

import gcsfs

gcs = gcsfs.GCSFileSystem()
gcs.scope = 'https://www.googleapis.com/auth/drive'
gcs.connect(method='browser')
url = 'https://www.googleapis.com/drive/v3'  # assumed Drive v3 base URL; not shown in the original snippet
gcs.session.get(url + '/files').json()
{'kind': 'drive#fileList',
 'nextPageToken': '~!!~AI9FV7Sn_DzWm3xg7rur8FXBqx5LddpyGAfLSlDQWpqlsA35pSorwn8yqGjQt7m3uFi5v7bDWMx3sDob9cl6v4kZ4PDFElzsSHJBaCUMf282cxlJeaK99Y_VdO2tsf4dfzj_OluP8n8GNSoHYmKAgyjVYjSnaYyVtkv700shvLamPHvrAnOHPyEaJjhFoq4DpMGcwdwRuQQJQLlWCYN4qzY2UsDFQO0j7Q5n-yVyoqYTFF7OJ6Ctre043ieNkVneIr1s8SwBRRDBUOuUc_JQC3x9f4LEEmQXDV0hElEC33ltQqN6ZT4CpV24IrUBpHz-7OqLuRNbQyD20Xat764CcyRlMfRwsrZRkz1bMuwgE96PphTlpR9JgyviCgKgM0CTWyjTYzgN-14D',
 'incompleteSearch': False,
 'files': [{'kind': 'drive#file',
   'id': '1nNZgioBntUogyfVFZ0o1twHDtU77huYv94Ih-Wk9x6g',
   'name': 'Community vs. Enterprise Repo Build Delivery',
   'mimeType': 'application/vnd.google-apps.document'},
...

(does not, apparently, give size of files straight off)
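(If I'm reading the v3 docs right, the size may just need to be requested explicitly via the fields parameter; something like the following, untested, and size is reportedly only populated for binary content, not native Google Docs:)

resp = gcs.session.get(url + '/files', params={'fields': 'files(id,name,mimeType,size)'}).json()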

rabernat commented 6 years ago

Just for some context here, all Columbia affiliates get unlimited storage on google drive for free. We know of people who have 100s of TB of files stored this way. I imagine many other universities have similar agreements.

martindurant commented 6 years ago

How do they access such data currently? I can't imagine downloading many GB via a browser, and I imagine it's not simple with wget/curl.

rabernat commented 6 years ago

Globus is the best option for large transfers: https://www.globus.org/connectors/google-drive

martindurant commented 6 years ago

So that means always copying the datasets in their entirety to local storage? I suppose if it provided FUSE-like mounting, then we wouldn't be having this discussion.

rabernat commented 6 years ago

Globus transfers can come from any valid globus endpoint. Doesn't have to be a local system. I frequently use globus to transfer data from one HPC to another. It never touches the computer from which I initiate the transfer.

martindurant commented 6 years ago

Understood, but the data gets copied so that it's local to the processing. That is not the gcsfs model, which accesses data as files but only downloads chunks on demand, to be kept ephemerally in memory. That is why you would be interested in a Google Drive filesystem ("gdrivefs"), perhaps with FUSE?

tjcrone commented 6 years ago

There are at least a couple of Google Drive FUSE solutions, including rclone mount (https://rclone.org/commands/rclone_mount/) and google-drive-ocamlfuse (https://github.com/astrada/google-drive-ocamlfuse). Both work pretty well.

Nonetheless, in my experience GCS/Drive authentication flows can range from complicated to very complicated, and I think there might be room for a no-auth, read-only solution, especially for new users. I'm worried about the possibility of no-auth rate limits, but we need to test to learn what those limits might be.
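A crude way to probe those limits might be a timed loop of unauthenticated range reads against a single shared-by-link file (the file ID below is a made-up placeholder, and the heuristic for spotting throttling is just a guess):

import time

file_id = "FILE_ID_HERE"  # placeholder for a real shared-by-link file ID
t0 = time.time()
for i in range(200):
    chunk = get_drive_bytes(file_id, (0, 1023))
    if chunk[:1] == b"<":  # looks like an HTML error page rather than file bytes
        print("possible throttling at request %d after %.1f s" % (i, time.time() - t0))
        break
else:
    print("200 requests in %.1f s with no apparent throttling" % (time.time() - t0))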

pbranson commented 6 years ago

This is really cool. We also have unlimited storage on Google Drive at the University of Western Australia.

https://stackoverflow.com/questions/14156781/where-can-i-find-the-price-list-for-google-drive-api

Seems requests are unlimited in practical terms!


tjcrone commented 6 years ago

I can see authenticated request limits being very high, with easy quota increases as well, obviously tied to your account. I am not yet convinced that unauthenticated limits will be this high. Would be great to test this.

martindurant commented 6 years ago

rclone mount, in particular, seems like a pretty complete solution for multiple cloud services. Maybe no one should be relying on gcsfs/fuse?

rabernat commented 6 years ago

Can anyone think how we could incorporate rclone mount with proper authentication into pangeo.pydata.org? How would users specify their credentials? We would also need the mount on the worker nodes in order to access data from there.
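One possible shape, just to make the question concrete (very much a sketch; it assumes rclone is baked into the image, each user supplies their own rclone.conf with a "drive:" remote, and the pods are allowed to use /dev/fuse):

import subprocess
from dask.distributed import Client

def mount_drive(conf_path="/home/jovyan/.config/rclone/rclone.conf",
                mountpoint="/mnt/drive"):
    # Assumes rclone is installed in the image and the user's rclone.conf
    # defines a remote named "drive:".
    subprocess.run(["mkdir", "-p", mountpoint], check=True)
    subprocess.run(
        ["rclone", "mount", "drive:", mountpoint,
         "--config", conf_path, "--read-only", "--daemon"],
        check=True,
    )

client = Client()        # or attach to the existing cluster/scheduler
mount_drive()            # mount on the notebook pod
client.run(mount_drive)  # and on every worker pod

Getting /dev/fuse into the worker containers and distributing the credentials (a Kubernetes secret, perhaps) would be the real open questions.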

It seems that Columbia's unlimited storage plan excludes public sharing:

In order to maintain a secure collaboration and sharing environment, some risky sharing options are not available on LionMail Drive. Most notably, Drive items cannot be set to any level of “Public” sharing. This is because the "public on the web" setting makes the document completely public and available for indexing, which means it will appear in search engine results and would be available to every user who has a Google account.

You may still generate sharable links and share these with users outside of Columbia, however you must share the document with the email addresses of the specific users that you would like to have access to your file.

So if we want to take advantage of that, we need authentication.

rabernat commented 6 years ago

There is also the google drive for jupyter notebook extension: https://github.com/jupyter/jupyter-drive

From a user point of view, the ideal thing would be the following:

tjcrone commented 6 years ago

I have not found any restrictions regarding public sharing. I am able to set my LDEO Drive files to "Public on the web", which includes indexing, and "Anyone with the link", which appears to be essentially the same but without search engine indexing.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 6 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

sofroniewn commented 4 years ago

I know it's been a while since the last comment on this thread, but I was wondering what the current best practice is for reading zarr files (or other files) from Google Drive. Like many people on this thread I have access to large Google Drive resources, but to the best of my understanding tools like gcsfs work with Google Cloud Storage, which is different. I haven't really checked out jupyter-drive, but I'm not sure that's really what I'm looking for either.

Ideally I'd take my URL https://drive.google.com/drive/u/0/folders/path_to_my_zarr_file, do something like zarr.open(url) or dask.array.from_zarr(url), and it would all work (maybe after providing some credentials), but I don't think it will right now. I can cross-post this on the zarr or dask issue trackers too, but thought I'd ask this community first given the content of this thread.

mrocklin commented 4 years ago

When I've had to deal with large datasets stored on Google Drive I've tended to download everything locally, and then push it up to a proper object store. I haven't found a nice way to interact with these files directly from Python. I didn't do an exhaustive search though, and would be happy to learn of something new.

sofroniewn commented 4 years ago

Yeah, makes sense. Googling didn't come up with anything for me. If there was a way to make this work, it could make cloud hosting of datasets accessible to a much larger group of people who don't have accounts with the major cloud providers.

mrocklin commented 4 years ago

If you find a good Python API the way to integrate it with Dask is here: https://docs.dask.org/en/latest/remote-data-services.html

sofroniewn commented 4 years ago

Ok cool, thanks. Maybe I'll give it a go! https://github.com/gsuitedevs/PyDrive could be a good place to start, as it is built on top of and simplifies https://github.com/googleapis/google-api-python-client

martindurant commented 4 years ago

Yep, basically it comes down to writing an implementation for fsspec, which should be fairly easy, and there are several examples. The real question is to what extent gdrive functions like a file system: does it support listing a hierarchical folder structure, and for a given file, can you load only specified byte ranges from a file? For the zarr case, it might already work as an "HTTPS file system" if the URL of the target can be generated and the URLs of the components can be logically constructed using relative paths, particularly if the dataset has "consolidated" metadata.
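For the consolidated-over-HTTPS case, usage might be as simple as the following (sketch only; the URL is a made-up placeholder for a plain direct-download base, which is exactly the part Drive may not provide):

import fsspec
import zarr

# Only works if every component of the store is reachable at a relative path
# under one plain HTTPS base URL.
mapper = fsspec.get_mapper("https://example.com/path/to/mystore.zarr")
group = zarr.open_consolidated(mapper, mode="r")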

The first thing I saw on the pydrive site, though, is that you must log into the google dev console and create your own oauth client. If you can do that, then you should probably be using GCS...

sofroniewn commented 4 years ago

Indeed @martindurant, it turns out I was unable to get past the login stuff for PyDrive (I lacked permission to create the necessary project), so I don't think that was the solution I was looking for!

Thanks for the tip on fsspec

martindurant commented 4 years ago

gcsfs maintains its own public client for this kind of purpose, so it may be that the problem is fairly easily surmountable.

martindurant commented 4 years ago

( ^ that's specifically for browser-based auth. It may be that the whole of the auth stuff in gcsfs is reusable)

martindurant commented 4 years ago

See @rabernat's work, which is already usable (read-only), at https://github.com/rabernat/fsspec-google-drive/blob/master/gdrivefs.py. We can iterate there on making a full package.

rabernat commented 4 years ago

Binder

😉

martindurant commented 4 years ago

Got it to work locally, will be fun to work with. I would need to run browser auth on binder, right?

Quick note (maybe this should be an issue on the new repo): whereas ls(path) produces the same result whether or not path starts or ends with "/" (and sometimes misses the cache), ls("/") inserts an additional leading "/" compared to ls(""). We should decide whether we think "/" is the root of the FS or "".

mrocklin commented 4 years ago

This seems like it might still be live. Reopening.

What is left to do here to get something that is easy to play with? Are things ready to publish on PyPI?

martindurant commented 4 years ago

Stuck on https://github.com/intake/gdrivefs/issues/7, unfortunately. I have tried a little to get Google to approve the similar client for gcsfs, but haven't had feedback from a real person yet. We do not meet their typical use case, i.e., a website or mobile app.

The dropbox implementation seems to allow you a normal way to sign in.

mrocklin commented 4 years ago

@lila do you know anyone at GCP or Google Drive who could help unstick things here? This is stopping a lot of academic HPC science users from using Dask on Google for biomedical imaging workloads (and presumably others)

martindurant commented 4 years ago

As far as I know from brief testing, though, the gdrive package is usable. We did not come up with a reasonable way to run tests, and I would prefer not to go down the vcrpy route again.

rabernat commented 4 years ago

gdrive works. It's just incredibly slow. If anyone wants to play around with it and try to understand why, that would be useful.

martindurant commented 4 years ago

That sounds solid and actionable! Now simply to find someone with some time on their hands...

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.