Provide an R interface to AARNet CloudStor

timchurches commented 7 years ago

Related to issue #38.

For Australian researchers, AARNet provides a range of wonderful resources, the best known of which is their high-speed fibre optic research network linking all universities and many research institutes and organisations across Australia.

One of the services they provide to researchers is CloudStor, which provides high-reliability data storage and interchange facilities with high-speed access for researchers in institutions connected to AARNet (all Australian universities), as well as internet accessibility. In particular CloudStor provides up to 100GB of storage to each researcher, and up to 1 TB for research groups, free of charge (and more if required for a modest charge).

There is a RESTful API for the FileSender interface to CloudStor. Is it feasible to build an R interface to leverage this to do useful things?

adamhsparks commented 7 years ago

While I cannot make it to this year's Unconf, I'd sure appreciate something like this and would contribute where I could.

timchurches commented 7 years ago

OK, after a quick look at this:

FileSender is pretty nifty, in that it allows you to send arbitrarily large files to recipients in the form of an email containing links to download the files (Google and Apple now use similar methods when you email large attachments). Having programmatic access to that capability from R could be handy. However, it doesn't really address the problem of helping a team of researchers synchronise versioned data files via some cloud storage mechanism. And the API is a bit complex (at least I found it confusing).
However, the basic CloudStor service of in-the-cloud file storage and sharing, for individuals and groups, has a webDAV interface, so that should be easier to use. A CloudStor user can set up a directory (folder) in CloudStor which they can then share on a read-only or read-write basis with other CloudStor users whom they nominate. Thus, a research team can easily be set up to both read and write files to a CloudStor folder in one (or more) of the team members' CloudStor accounts.
webDAV access via cURL documentation: https://doc.owncloud.org/server/8.0/user_manual/files/access_webdav.html#accessing-files-using-curl

OK, that looks promising!

So, leveraging this gist by a viking, let's see if we can access CloudStor from R. Note that the password parameter is NOT the password you use to log in to the CloudStor web interface (you authenticate to that via AAF using your university credentials), it's the special "Sync" password you set in the My Account page of the web interface.

> library(curl)
> library(XML)
> 
> listFiles <- function(username, password, relPath = "/", dav = "https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/") {
+   uri <- URLencode(paste(dav, relPath, sep=""))
+ 
+   # fetch directory listing via curl and parse XML response
+   h <- new_handle()
+   handle_setopt(h, customrequest = "PROPFIND")
+   handle_setopt(h, username = username)
+   handle_setopt(h, password = password)
+   response <- curl_fetch_memory(uri, h)
+   text <- rawToChar(response$content)
+   doc <- xmlParse(text, asText=TRUE)
+ 
+   # calculate relative paths
+   base <- paste(paste("/", strsplit(uri, "/")[[1]][-1:-3], sep="", collapse=""), "/", sep="")
+   result <- unlist(
+     xpathApply(doc, "//d:response/d:href", function(node) {
+       sub(base, "", URLdecode(xmlValue(node)), fixed=TRUE)
+     })
+   )
+   result[result != ""]
+ }
> 
> listFiles(username="Tim.Churches@inghaminstitute.org.au", password="XXX-XXX-XXX-XXX")

So, if we run that, the result is...

 [1] "/plus/remote.php/webdav/"                                      
 [2] "/plus/remote.php/webdav/Documents/"                            
 [3] "/plus/remote.php/webdav/Photos/"                               
 [4] "/plus/remote.php/webdav/Screen Shot 2017-10-22 at 17.31.04.png"
 [5] "/plus/remote.php/webdav/Screen Shot 2017-10-22 at 17.32.58.png"
 [6] "/plus/remote.php/webdav/Screen Shot 2017-10-24 at 09.05.30.png"
 [7] "/plus/remote.php/webdav/Screen Shot 2017-10-24 at 09.06.35.png"
 [8] "/plus/remote.php/webdav/Screen Shot 2017-10-24 at 13.38.42.png"
 [9] "/plus/remote.php/webdav/Shared/"                               
[10] "/plus/remote.php/webdav/ingham-building.png"                   
[11] "/plus/remote.php/webdav/ownCloud_User_Manual.pdf"              
>

Bingo!
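
One thing to note about the listing above: the paths still carry the /plus/remote.php/webdav/ prefix. That is probably because the default relPath = "/" is pasted onto a dav URL that already ends in a slash, so the computed base ends in "//" and never matches the hrefs. Here is a minimal sketch of a prefix-stripping helper that tolerates doubled slashes. strip_dav_prefix() is a hypothetical name, not part of the script above, and it uses base R only.

```r
# Hypothetical helper: turn the hrefs from a PROPFIND response into paths
# relative to the folder that was requested.
strip_dav_prefix <- function(hrefs, uri) {
  # Keep only the path part of the request URI (drop scheme and host).
  path <- sub("^[a-z]+://[^/]+", "", uri)
  # Collapse repeated slashes and make sure there is exactly one trailing slash.
  path <- paste0(sub("/+$", "", gsub("/{2,}", "/", path)), "/")

  rel <- vapply(hrefs, function(h) {
    h <- utils::URLdecode(h)  # hrefs come back percent-encoded
    if (startsWith(h, path)) substring(h, nchar(path) + 1) else h
  }, character(1), USE.NAMES = FALSE)

  # The requested folder itself becomes "" and is dropped.
  rel[rel != ""]
}

# Example with made-up hrefs and the doubled slash that relPath = "/" produces.
strip_dav_prefix(
  c("/plus/remote.php/webdav/",
    "/plus/remote.php/webdav/Documents/",
    "/plus/remote.php/webdav/my%20file.png"),
  "https://cloudstor.aarnet.edu.au/plus/remote.php/webdav//"
)
```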

So it should be just a matter of using other webDAV verbs together with the R curl library (as used in the example above) to send and fetch files. The CloudStor Sync client (for your operating system) can also be used to automatically fetch files if you prefer, but fetching them on-demand by name from within R is probably preferable.
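
As a rough idea of what fetching and sending could look like, here is a minimal sketch built on the curl package. The names dav_url(), dav_get() and dav_put() are hypothetical and are not the functions used elsewhere in this thread. It assumes a version of the curl package that provides curl_upload(), and that the server accepts a plain HTTP PUT for uploads, as webDAV servers normally do.

```r
cloudstor_dav <- "https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/"

# Join the webDAV base URL and a relative path with exactly one slash,
# then percent-encode characters such as spaces.
dav_url <- function(dav, rel_path = "") {
  utils::URLencode(paste0(sub("/+$", "", dav), "/", sub("^/+", "", rel_path)))
}

# Fetch a remote file with an ordinary HTTP GET and write it to local_path.
dav_get <- function(remote, local_path, username, password, dav = cloudstor_dav) {
  h <- curl::new_handle(username = username, password = password)
  curl::curl_download(dav_url(dav, remote), path.expand(local_path), handle = h)
}

# Send a local file; curl_upload() performs an HTTP PUT for http(s) URLs.
dav_put <- function(local_path, remote, username, password, dav = cloudstor_dav) {
  curl::curl_upload(path.expand(local_path), dav_url(dav, remote),
                    verbose = FALSE, username = username, password = password)
}

# Building a URL needs no network access:
dav_url(cloudstor_dav, "Shared/some folder/data.csv")
```

The password argument here would be the Sync password, not the AAF login.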

Request: could someone else with access to CloudStor (if you have an Australian university username and password you should be able to just log in and get access, immediately) set up a folder in their CloudStor space, put some junk files in it, and share that folder with me (see my username in the script above), and let me know. Then it will be possible to confirm that the webDAV interface permits me to access your files in CloudStor, and vice-versa. I'd be surprised if it doesn't, but if it doesn't, then that's a showstopper.

mdsumner commented 7 years ago

junk-ahoy, I shared some stuff - this is very interesting :) @timchurches

dfalster commented 7 years ago

And more from me

timchurches commented 7 years ago

OK, thanks to @mdsumner and @dfalster for setting up shared folders on CloudStor. Conveniently, they appear in the Shared/ folder in my CloudStor account (and v-v if I share one of my folders with others).

So, can we programmatically fetch and store files in those shared folders? Yes we can! I defined two more functions to handle GET and PUT, just wrappers around curl calls, and they just worked.

> library(curl)
> library(XML)

> listFiles <- function(username, password, relPath = "/", dav = "https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/") {
+   uri <- URLencode(pas .... [TRUNCATED] 

> getFile <- function(filename, outPath, username, password, relPath = "", dav = "https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/") {
+   uri  .... [TRUNCATED] 

> putFile <- function(filename, inPath, username, password, relPath = "", dav = "https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/") {
+   uri < .... [TRUNCATED] 

> pw <- "XXX-XXX-XXX-XXX"

> listFiles(relPath="Shared/testforunconf", username="Tim.Churches@inghaminstitute.org.au", password=pw)
[1] "01-australia.png" "horse.jpeg"       "horse2.jpeg"      "horse3.jpeg"      "horse4.jpeg"     
[6] "text1.md"         "text2.md"        

> getFile(filename="Rplot.png", outPath="~/Cloudstorr/downloads/", relPath="Shared/junk-ahoy/", username="Tim.Churches@inghaminstitute.org.au", passwo .... [TRUNCATED] 
[1] "https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/Shared/junk-ahoy/Rplot.png"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   242    0   242    0 100   242    0   242    0     0    382      0 --:--:-- --:--:-- --:--0 --:--:-- --:--:-- --:--:--   382
  0     0 100 56387  100 56387    0     0  13258      0  0:00:04  0:00:04 --:--:-- 21554

> getFile(filename="horse.jpeg", outPath="~/Cloudstorr/downloads/", relPath="Shared/testforunconf/", username="Tim.Churches@inghaminstitute.org.au", p .... [TRUNCATED] 
[1] "https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/Shared/testforunconf/horse.jpeg"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   242    0   242    0100   242    0   242    0     0    799      0 --:--:-- --:--:-- --:--:--   798
  0     0    0     0    0100  8036  100  8036    0     0   3228      0  0:00:02  0:00:02 --:--:--  6897

> putFile(filename="01-australia.png", inPath="~/Desktop/", relPath="Shared/testforunconf/", username="Tim.Churches@inghaminstitute.org.au", password= .... [TRUNCATED] 
[1] "https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/Shared/testforunconf/01-australia.png"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   242    0   242    0     0    463      0 --:--:-- --:--:-- ---:-- --:--:--   464
  0     0    0     0    0     0       0     0    0     0    0     0      0      0 --:--:--  0:00:05 --:--:--     0

> putFile(filename="01-australia.png", inPath="~/Desktop/", relPath="Shared/junk-ahoy/", username="Tim.Churches@inghaminstitute.org.au", password=pw)
[1] "https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/Shared/junk-ahoy/01-australia.png"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   242    0   242    0     0    564      0 --:--:-- --:--:-- --:--:--   565
  0     0    0     0    0     0      0     0    0     0    0     0      0      0 --:--:--  0:00:06 --:--:--     0
>

OMG ponies!

[Image: horse]

So that all looks very promising. It should be simple to create wrapper functions for the other webDAV verbs, which allow you to copy or move files from one location to another on the remote storage (without downloading them), delete files etc.
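
For what it's worth, here is a minimal sketch of what such wrappers might look like. All function names are hypothetical. The Destination and Overwrite headers follow the webDAV specification (RFC 4918); whether CloudStor honours every verb has not been tested here, so treat this as a starting point only.

```r
cloudstor_dav <- "https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/"

# Join base URL and relative path with one slash and percent-encode it.
dav_url <- function(dav, rel_path = "") {
  utils::URLencode(paste0(sub("/+$", "", dav), "/", sub("^/+", "", rel_path)))
}

# Send one webDAV request with the given verb and optional extra headers.
dav_request <- function(verb, remote, username, password,
                        headers = list(), dav = cloudstor_dav) {
  h <- curl::new_handle(customrequest = verb,
                        username = username, password = password)
  if (length(headers) > 0) curl::handle_setheaders(h, .list = headers)
  res <- curl::curl_fetch_memory(dav_url(dav, remote), handle = h)
  if (res$status_code >= 400) {
    stop(verb, " failed with HTTP status ", res$status_code)
  }
  invisible(res$status_code)
}

# MOVE and COPY name their target in a Destination header (an absolute URL);
# Overwrite is "T" or "F".
dav_destination_headers <- function(to, overwrite = FALSE, dav = cloudstor_dav) {
  list(Destination = dav_url(dav, to), Overwrite = if (overwrite) "T" else "F")
}

dav_delete <- function(remote, username, password, dav = cloudstor_dav) {
  dav_request("DELETE", remote, username, password, dav = dav)
}

dav_mkdir <- function(remote, username, password, dav = cloudstor_dav) {
  dav_request("MKCOL", remote, username, password, dav = dav)
}

dav_move <- function(from, to, username, password,
                     overwrite = FALSE, dav = cloudstor_dav) {
  dav_request("MOVE", from, username, password,
              headers = dav_destination_headers(to, overwrite, dav), dav = dav)
}

dav_copy <- function(from, to, username, password,
                     overwrite = FALSE, dav = cloudstor_dav) {
  dav_request("COPY", from, username, password,
              headers = dav_destination_headers(to, overwrite, dav), dav = dav)
}
```

Routing everything through one dav_request() keeps the authentication and error handling in a single place, so each verb is a one-liner.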

That said, CloudStor also provides a nice Sync app (for Windows and MacOS, not sure about Linux) which works just like Dropbox: you point it at a local directory and everything in that directory is synced both to and from your CloudStor space, including the Shared/ folder. Thus, it would be easy for everyone in a research team to install the Sync app, and then have a shared data folder which automatically stays in sync across the team. I suspect that the Sync app handles very large file uploads and downloads more robustly than cURL, by using re-startable chunked transfers, but that remains to be seen (or rather, to be empirically tested). Anyway, having explicit programmatic control over uploads and downloads seems somehow better.

OK, so all this is using shared folders in individual user CloudStor accounts. The beauty of that is that there is almost no overhead or delay in setting it up - all you and your collaborators need are your Australian university credentials and you are good to go, and you each get 100GB of storage - so a team of 10 will have 1TB at their disposal, albeit in ten chunks, free-of-charge and no forms to fill in at all.

However, AARNet also offers team spaces (group drives) on CloudStor, with 1TB of storage for free, more on request (and for a modest fee, I think) - see:

Group drives seem like the best option for ongoing projects. So let's request one and test it out. Could everyone who would like to participate send me their CloudStor email address at timothy.churches@unsw.edu.au by midday 25th Oct 2017, and I will send off a request for a group drive which hopefully will be furnished in time to test it during the unconference. The email address needs to be the one that CloudStor uses for your account, so log in to CloudStor first, using your normal university credentials (I think access for non-university personnel is also offered), and check your profile in the CloudStor web interface to see what email address it is using (for some reason I have several university email addresses, all linked to one account, so I had to check which one it is using).

dfalster commented 7 years ago

Nice @timchurches

I can login online but am failing to connect via the desktop app.

Anyway, if you weren't aware, @karthik has been working on a package to interact with Dropbox, called rdrop. Some of that might be helpful here.

timchurches commented 7 years ago

@dfalster

Anyway, if you weren't aware, @karthik has been working on a package to interact with Dropbox, called rdrop. Some of that might be helpful here.

It looks like Dropbox uses a more modern (or post-modern, if you'll pardon the pun) API comprising HTTP POST requests with JSON arguments and JSON responses, with request authentication via OAuth 2.0, rather than webDAV as used by CloudStor. But the overall design of the package might be useful.
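
To make the contrast concrete, here is a rough sketch of a Dropbox-style call using the curl package: an HTTP POST with a JSON body and an OAuth 2.0 bearer token, where webDAV would use a PROPFIND with a username and password. The function names are hypothetical, the token is assumed to have been obtained elsewhere, and the JSON is built naively with sprintf(), so it would break on paths containing quotes.

```r
# Headers for a token-authenticated JSON request (hypothetical helper).
dropbox_headers <- function(token) {
  list(Authorization = paste("Bearer", token),
       "Content-Type" = "application/json")
}

# List a Dropbox folder; "" stands for the root folder in this endpoint.
dropbox_list_folder <- function(token, path = "") {
  # Setting a request body turns the request into an HTTP POST.
  h <- curl::new_handle(copypostfields = sprintf('{"path": "%s"}', path))
  curl::handle_setheaders(h, .list = dropbox_headers(token))
  res <- curl::curl_fetch_memory(
    "https://api.dropboxapi.com/2/files/list_folder", handle = h
  )
  # The response body is JSON text; parsing it is left to the caller.
  rawToChar(res$content)
}
```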

@dfalster

I can login online but am failing to connect via the desktop app.

Did you set a specific Sync password on the My Account page in the web interface? It doesn't work with the credentials you use to authenticate to CloudStor via AAF.
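
Since the Sync password is a separate credential, one option is to keep it out of scripts entirely and read it from environment variables. This is only a sketch: the variable names CLOUDSTOR_USER and CLOUDSTOR_SYNC_PASSWORD and the function cloudstor_credentials() are made up for illustration.

```r
# Read the CloudStor username and Sync password from environment variables
# (for example set in ~/.Renviron), failing loudly if either is missing.
cloudstor_credentials <- function() {
  user <- Sys.getenv("CLOUDSTOR_USER")
  pw <- Sys.getenv("CLOUDSTOR_SYNC_PASSWORD")
  if (!nzchar(user) || !nzchar(pw)) {
    stop("Set CLOUDSTOR_USER and CLOUDSTOR_SYNC_PASSWORD ",
         "(the Sync password from the My Account page, not your AAF login).")
  }
  list(username = user, password = pw)
}
```

The returned list can then be passed on to whichever functions need the username and password arguments.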

raymondben commented 7 years ago

https://github.com/karthik/rdrop2 is the current dropbox package. But, yes, having similar syntax to that would be great, since then users could swap back-end storage from dropbox to cloudstor with minimal code changes. Or, if that's not convenient because of the differences between the dropbox and filesender APIs, maybe there is scope for a meta-package that provides a single set of upload/download/authentication/etc functions that work across a range of supported storage providers, essentially acting as an interface layer to the dropbox/cloudstor/otherprovider-specific packages.
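
A minimal sketch of what such an interface layer could look like, using S3 generics so that each provider only has to supply methods. Everything here is hypothetical: the generics store_put(), store_get() and store_list() are invented names, and the only backend shown is a stand-in that copies files to a local directory, so the sketch runs without any cloud account.

```r
# Provider-neutral verbs; each backend package would supply its own methods.
store_put  <- function(store, local_path, remote_name) UseMethod("store_put")
store_get  <- function(store, remote_name, local_path) UseMethod("store_get")
store_list <- function(store) UseMethod("store_list")

# Stand-in backend: a local directory playing the role of remote storage.
local_store <- function(root) {
  dir.create(root, recursive = TRUE, showWarnings = FALSE)
  structure(list(root = root), class = "local_store")
}

store_put.local_store <- function(store, local_path, remote_name) {
  file.copy(local_path, file.path(store$root, remote_name), overwrite = TRUE)
}

store_get.local_store <- function(store, remote_name, local_path) {
  file.copy(file.path(store$root, remote_name), local_path, overwrite = TRUE)
}

store_list.local_store <- function(store) {
  list.files(store$root)
}

# User code only touches the generics, so swapping the backend object
# would not require changing these calls.
s <- local_store(file.path(tempdir(), "demo-store"))
src <- tempfile(fileext = ".txt")
writeLines("hello", src)
store_put(s, src, "hello.txt")
store_list(s)
```

A CloudStor or Dropbox backend would define the same three methods for its own class, wrapping webDAV or the Dropbox API respectively.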

timchurches commented 7 years ago

Huzzah! After fixing my incorrect putFile() function, I was able to programmatically send a 160MB binary data file (in feather format) to CloudStor from R, and then fetch it back again, from the same R script and lo, it survived the round-trip completely unscathed.
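
For anyone wanting to repeat that kind of round-trip check, comparing checksums of the original and the fetched copy is one simple way to do it. This is a sketch using base R's tools::md5sum(); same_file() is a hypothetical helper, and it is not necessarily how the comparison above was done.

```r
# TRUE if both files exist and have identical MD5 checksums.
same_file <- function(a, b) {
  sums <- unname(tools::md5sum(c(a, b)))  # NA for files that cannot be read
  !anyNA(sums) && sums[1] == sums[2]
}

# Local demonstration: a copy matches, a modified file does not.
original <- tempfile()
copy <- tempfile()
writeBin(as.raw(1:10), original)
file.copy(original, copy)
same_file(original, copy)
```

In a real test, `copy` would be the file fetched back from CloudStor after uploading `original`.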

@raymondben, yes, the DropBox API seems to basically recapitulate webDAV, although the DropBox API uses tokens whereas the CloudStor webDAV uses basic HTTP authentication (over https, so safe enough), so there would need to be a few differences. webDAV also offers file locking, which may be handy for file versioning if CloudStor supports locking. I don't think DropBox has that.

tarensanders commented 3 years ago

Hi folks,

Hope you don't mind me jumping into an old thread, but this is the first link that comes up if you Google "access cloudstor from R".

I just wanted to let anyone who comes across this thread know that we've wrapped @timchurches' example functions for uploading and downloading data from CloudStor into a simple package. It's available on CRAN and you can let us know about any issues on the GitHub repo.

ropensci / ozunconf17

Provide an R interface to AARNet CloudStor #39