inbo / etn

R package to access data from the European Tracking Network
https://inbo.github.io/etn/
MIT License

OpenCPU approach #242

Open peterdesmet opened 2 years ago

peterdesmet commented 2 years ago

Summary of March 8 meeting:

Create two "flavours" of all functions in the etn package: local and remote. Function names remain the same; the difference lies in the con argument. For local access, con is a database connection; for remote access, con holds credentials to be passed via OpenCPU. The intent is to keep a single R package.

For remote access, bandwidth and file size might become an issue. Potential solutions:

Next steps:

salvafern commented 2 years ago

I slightly changed one function so that it connects to ETN directly when you provide your username and password, in https://github.com/inbo/etn/commit/34254e2db2c8c048094e1566d6f4a41582f1cbe3.

Now it works fine with OpenCPU! So it seems that we have to go for this solution instead of using the connection object. I hope this is ok for you @peterdesmet?

peterdesmet commented 2 years ago

Great! For backwards compatibility, I suggest we keep using the con variable as a single parameter, e.g. as a list with:

con <- list(
  username = "x",
  password = "y"
)

That avoids shifting the parameters in function calls:

get_animals(my_con, 305)
# still calls:
get_animals(con = my_con, animal_id = 305)
# rather than:
get_animals(username = my_con, password = 305)
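
A minimal sketch of why a single con parameter keeps positional calls stable. get_animals_sketch() is a hypothetical stand-in, not the etn implementation; it only echoes what it received:

```r
# Hypothetical stand-in for an etn function: `con` stays the first parameter,
# whether it is a database connection or a list of credentials.
get_animals_sketch <- function(con, animal_id = NULL) {
  # Only the list-of-credentials flavour is handled in this sketch
  stopifnot(is.list(con), !is.null(con$username), !is.null(con$password))
  list(username = con$username, animal_id = animal_id)
}

my_con <- list(username = "x", password = "y")

# Positional call: my_con is matched to `con`, 305 to `animal_id`
res <- get_animals_sketch(my_con, 305)
```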

salvafern commented 2 years ago

I gave it a try in 8f93f5a. It works, but you have to be careful with the encoding of the equals sign (=). See https://github.com/opencpu/opencpu/issues/110.

This doesn't work:

curl -L http://localhost:8004/ocpu/library/etn/R/list_animal_ids/ -X POST -d "con=list(username='salvador.fernandez@vliz.be', password='mypassword')"
# Unparsable argument: list(username

This does:

curl -L http://localhost:8004/ocpu/library/etn/R/list_animal_ids/ -X POST -d "con=list(username%3D'salvador.fernandez@vliz.be', password%3D'mypassword')"
# /ocpu/tmp/x0829c6393d9cea/R/.val
# /ocpu/tmp/x0829c6393d9cea/R/list_animal_ids
# /ocpu/tmp/x0829c6393d9cea/stdout
# /ocpu/tmp/x0829c6393d9cea/source
# /ocpu/tmp/x0829c6393d9cea/console
# /ocpu/tmp/x0829c6393d9cea/info
# /ocpu/tmp/x0829c6393d9cea/files/DESCRIPTION

See also the opencpu documentation about passing arguments: https://www.opencpu.org/api.html#api-arguments

I haven't tested this in R, but I think passing the arguments through utils::URLencode() will be fine.
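
One caveat, sketched below with placeholder credentials: with its default reserved = FALSE, utils::URLencode() leaves reserved characters such as = untouched, so reserved = TRUE would likely be needed to obtain the %3D used in the working curl call. Whether encoding the whole value this way suits every etn function is an assumption, not something tested here.

```r
# Placeholder credentials, same shape as the curl example above
arg <- "list(username='x', password='y')"

# Default (reserved = FALSE): reserved characters such as "=" are kept as-is
default_encoded <- utils::URLencode(arg)

# reserved = TRUE: reserved characters are percent-encoded, "=" becomes "%3D"
fully_encoded <- utils::URLencode(arg, reserved = TRUE)

# The encoded value can then be pasted behind the parameter name
post_body <- paste0("con=", fully_encoded)
```

Note that URLencode() returns its input unchanged when it already contains a %XX escape, unless repeated = TRUE is set.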

peterdesmet commented 2 years ago

Ok great! Within the function(s) we can URLencode all parameters before calling the OpenCPU endpoint.

We can also extend con now to contain a remote property:

con <- list(
  user = "x",
  password = "y",
  remote = TRUE
)

if (isTRUE(con$remote)) {
  # use openCPU (with url encoded parameters)
} else {
  # use local DB connection
}
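
A slightly fuller sketch of that dispatch, under stated assumptions: choose_backend() and build_post_body() are hypothetical helper names that do not exist in etn, and the arguments are passed as R expressions in text form, as in the curl example earlier in this thread.

```r
# Hypothetical helpers, not part of etn.

# Decide which backend to use from the shape of `con`
choose_backend <- function(con) {
  if (is.list(con) && isTRUE(con$remote)) "opencpu" else "database"
}

# Build a form-encoded POST body from a named list of R expressions (as text)
build_post_body <- function(args) {
  encoded <- vapply(args, utils::URLencode, character(1), reserved = TRUE)
  paste0(names(args), "=", encoded, collapse = "&")
}

remote_con <- list(user = "x", password = "y", remote = TRUE)
backend <- choose_backend(remote_con)

body <- build_post_body(list(
  con = "list(username='x', password='y')",
  animal_id = "305"
))
```

Using isTRUE() means a con without a remote element falls through to the local database branch rather than raising an error.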

peterdesmet commented 2 years ago

@salvafern I would like to implement the OpenCPU functionality over the summer. Are all the ETN package endpoints available in OpenCPU now?

salvafern commented 2 years ago

Hi @peterdesmet we are working on it and they will be ready as soon as possible. I will let you know.

salvafern commented 1 year ago

The etn package is available at: https://opencpu.lifewatch.be/

peterdesmet commented 1 year ago

I'm getting a 403 error for https://opencpu.lifewatch.be/

salvafern commented 1 year ago

The access is forbidden for internet browsers. Try with curl or from R.

damianooldoni commented 1 year ago

Thanks @salvafern. Indeed, a connection object can be created from R (package curl):

curl::curl(url = "https://opencpu.lifewatch.be/")
A connection with
description "https://opencpu.lifewatch.be/"
class       "curl"
mode        "r"
text        "text"
opened      "closed"
can read    "yes"
can write   "no"

PietrH commented 1 year ago

Regarding @salvafern's approach above of passing the username and password in the POST body: does this method expose the credentials to anyone on the network? Or are they already encrypted somehow this way?

bart-v commented 1 year ago

Since opencpu.lifewatch.be uses HTTPS by default, the credentials are encrypted in transit.

PietrH commented 1 year ago

Excellent!

Is /ocpu/tmp exposed? I'm getting a 403 on both the https://opencpu.lifewatch.be/tmp and https://opencpu.lifewatch.be/ocpu/tmp paths. I can POST function calls just fine, but I can't retrieve the results.

The same code works on https://cloud.opencpu.org/ocpu, so it might be something in the server setup, or I might just have the address slightly wrong. I'm trying to reach https://opencpu.lifewatch.be/tmp/x0715fee402d82f/stdout

bart-v commented 1 year ago

No, /ocpu/tmp is not exposed. From the tests by @salvafern, this did not seem to be needed. Why has this changed?

PietrH commented 1 year ago

As I understand section 4.3 of the manual, a user POSTs the function call with its arguments, and the response includes a tmp path from which the user then GETs the response objects. You can also request the function output immediately as a JSON object by appending /json to the call.

That second option is less attractive to me, as some functions return rather large tabular outputs and I'd like a bit more control over the format in which they are retrieved, probably rda using gzip compression to reduce server I/O.
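
To illustrate the point about compression, here is a small base R sketch that serializes a repetitive table in memory and gzips it. The table and its column names are made up, and this is not how OpenCPU itself produces its output formats:

```r
# Hypothetical detections table; column names are made up for this sketch
detections <- data.frame(
  detection_id = seq_len(10000),
  tag_label = rep(c("tag-1", "tag-2"), times = 5000),
  stringsAsFactors = FALSE
)

# Serialize to a raw vector in memory, then gzip-compress it
raw_bytes <- serialize(detections, connection = NULL)
gz_bytes <- memCompress(raw_bytes, type = "gzip")

# Round trip: decompress and unserialize to get the table back
restored <- unserialize(memDecompress(gz_bytes, type = "gzip"))

# Compare the sizes in bytes
sizes <- c(uncompressed = length(raw_bytes), gzip = length(gz_bytes))
```

How much is saved depends on how repetitive the real data is; this toy table compresses unusually well.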

Maybe I'm missing something?

bart-v commented 1 year ago

Yes, we have been using /json all the time. Can you please start with that?

PietrH commented 1 year ago

I've adapted list_animal_ids() into list_animal_ids_api(), and it seems to work for me: a61a4bb6de1bacd297b6255dd20254885db02f05

Next, I'll adapt a more complicated function to work by directly providing username and password as arguments. I was thinking about get_acoustic_detections(), so we can test retrieving tabular data via the API.

PietrH commented 1 year ago

After further testing I'd like to argue in favor of exposing /ocpu/tmp:

PietrH commented 1 year ago

Any opinions @bart-v @salvafern ?

bart-v commented 1 year ago

There are obviously some security issues involved, i.e. people could just "steal" the output of a more privileged user by guessing the session ID. It seems this is handled by a cleaning cron job: https://github.com/opencpu/opencpu/issues/194

Can you confirm that the output is written to a random, temporary folder, i.e. /ocpu/tmp/<random>/, and not directly to /ocpu/tmp?

PietrH commented 1 year ago

I can confirm that every function call should create a new directory under /ocpu/tmp, for example /ocpu/tmp/x04e894ea2366bd. I've created a gist running on Google Colab to demonstrate:

https://gist.github.com/PietrH/14cdb3cb581a3b835221d8b641e74b51

This demo makes use of the opencpu test api (calling rnorm).

We could clean up these paths at a regular interval. I also believe brute forcing the keys would be quite a challenge: you'd need to try a lot of keys, with no guarantee about the type of result even if you manage to find a path that's in use. This risk is further mitigated by protections that might already be in place against denial-of-service attacks.
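
A rough back-of-the-envelope sketch of that brute-force argument. The number of live sessions and the request rate are hypothetical, and the calculation assumes every hexadecimal character of the key is uniformly random, which is not confirmed here:

```r
# Example key from this thread: "x" followed by hexadecimal characters
key <- "x04e894ea2366bd"
n_hex <- nchar(sub("^x", "", key))

# Assumption: every hex character is uniformly random
key_space <- 16^n_hex

# Hypothetical numbers: live sessions on the server and attacker request rate
n_live_sessions <- 100
guesses_per_second <- 1000

# Expected number of guesses before hitting any live session
expected_guesses <- key_space / n_live_sessions
expected_years <- expected_guesses / guesses_per_second / (3600 * 24 * 365)
```

Under these assumptions the search would take on the order of tens of thousands of years. The keys quoted in this thread all start with 0 after the x, so the effective key space may be smaller than assumed.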

bart-v commented 1 year ago

OK, paths like /tmp/x04e894ea2366bd/ are now exposed.

PietrH commented 1 year ago

I'm still getting a 403 on

https://opencpu.lifewatch.be/ocpu/tmp/x010f9753592ec8/R/.val/print

Are subdirectories also exposed? Is there a mistake in my domain?

bart-v commented 1 year ago

The base path is without "ocpu", so https://opencpu.lifewatch.be/tmp (...)

I thought we only downloaded files and not special paths like .val, etc...

PietrH commented 1 year ago

My apologies for the confusion. After a POST request, the client sends a GET request to one of the paths provided in the POST response body. The most common case will be /tmp/{key}/R/.val followed by the requested data type as a suffix, rds in our case. It's my understanding that we'll also be able to use this workflow to get other formats, as needed for write_dwc() and download_acoustic_dataset().

For example, you'd GET https://opencpu.lifewatch.be/ocpu/tmp/x010f9753592ec8/R/.val/rds to get an rds stream (compressed), or GET https://opencpu.lifewatch.be/ocpu/tmp/x010f9753592ec8/R/.val/csv or https://opencpu.lifewatch.be/ocpu/tmp/x010f9753592ec8/R/.val/feather (this key might have been voided by the time you read this).

I'm following the opencpu manual, section 4.3: https://opencpu.github.io/server-manual/opencpu-server.pdf

Performing a HTTP POST on a function results in a function call where the HTTP request arguments are mapped to the function call. In OpenCPU, a successful POST requests usually returns a HTTP 201 status, and the response body contains the locations of the output data

The output can then be retrieved using HTTP GET. When calling an R function, the output object is always called .val. However, calling scripts might result in other R objects.

bart-v commented 1 year ago

OK, https://opencpu.lifewatch.be/tmp/x010f9753592ec8/R/.val/rds works now. Remember to drop the /ocpu/.
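
The client-side bookkeeping for this workflow could be sketched as follows: take the session key from the POST response body, then build the GET URL without the /ocpu/ prefix. extract_session_key() and build_result_url() are hypothetical helper names, and the response body is abbreviated from the curl output earlier in this thread. No request is actually sent.

```r
# POST response body as shown earlier in this thread (one path per line)
response_body <- paste(
  "/ocpu/tmp/x0829c6393d9cea/R/.val",
  "/ocpu/tmp/x0829c6393d9cea/stdout",
  "/ocpu/tmp/x0829c6393d9cea/info",
  sep = "\n"
)

# Hypothetical helper: pull the session key out of the response body
extract_session_key <- function(response_body) {
  paths <- strsplit(response_body, "\n", fixed = TRUE)[[1]]
  # The function result is always stored under R/.val
  val_path <- grep("/R/\\.val$", paths, value = TRUE)[1]
  # The key is the path component that follows "tmp"
  sub("^.*/tmp/([^/]+)/.*$", "\\1", val_path)
}

# Hypothetical helper: build the GET URL, without the /ocpu/ prefix
build_result_url <- function(key, format = "rds",
                             base_url = "https://opencpu.lifewatch.be") {
  paste0(base_url, "/tmp/", key, "/R/.val/", format)
}

key <- extract_session_key(response_body)
result_url <- build_result_url(key)
```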

PietrH commented 1 year ago

It's working now! Thanks for all the help. I'll keep you updated with my progress.