ropensci / pangaear

R client for the Pangaea database
https://docs.ropensci.org/pangaear

Support passing bearer token for authorization (allows downloading protected datasets) and use HTTP content negotiation #79

Open uschindler opened 2 years ago

uschindler commented 2 years ago

Hi,

All HTTP requests sent to PANGAEA (important: use HTTPS only!) should allow the user to send a bearer token. This makes it possible to download protected PANGAEA datasets. In pangaeapy this is already supported by passing an auth_token parameter to the constructor of the PANGAEA API. The bearer token is a temporary, opaque string that the user can get from a web page after logging in to PANGAEA. It is valid as long as the user is logged in (the timeout is 2 weeks). PANGAEA does not support passing a username and password because this would be inherently unsafe: people may accidentally post their R script on GitHub with their password included. If they post a script with a bearer token inside, logging out from PANGAEA is enough to revoke access and prevent misuse of the account.

In the documentation, ask users to log in to PANGAEA, go to their profile page (https://www.pangaea.de/user/), and copy the token from there. Login MUST be done using a browser; all automated login attempts will be rejected by our servers. The token can be used as long as the user has not logged out. It is recommended to check the "keep logged in" box. Users can also log in to PANGAEA with ORCID (that's another reason why user/pass does not work with APIs: many users do not even have a password).
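To illustrate the token handling described above, here is a minimal, language-agnostic sketch in Python (standard library only) of how a client might attach the bearer token. The function name `build_request`, the host `doi.pangaea.de`, and the dataset ID in the usage line are illustrative assumptions, not part of {pangaear} or pangaeapy.

```python
import urllib.request


def build_request(url, auth_token=None):
    """Build a GET request; hypothetical helper, not an existing API.

    The request is refused unless the URL uses HTTPS, so a token is
    never sent over an unencrypted connection.
    """
    if not url.lower().startswith("https://"):
        raise ValueError("refusing to send a request over plain HTTP")
    request = urllib.request.Request(url, method="GET")
    if auth_token:
        # Standard bearer-token scheme: "Authorization: Bearer <token>".
        request.add_header("Authorization", "Bearer " + auth_token)
    return request


# Placeholder dataset URL and token, for illustration only.
req = build_request("https://doi.pangaea.de/10.1594/PANGAEA.000000",
                    auth_token="token-copied-from-profile-page")
```

Keeping the token as an optional argument means anonymous requests for freely accessible datasets keep working unchanged.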

But there is another change we would suggest: please send HTTP requests to PANGAEA dataset pages in a REST-like way, because the current code uses some non-standardized format= parameters that might change soon. In addition, the HTTP client should be prepared to follow redirects (coming soon). There is also a lot of if/then/else in the code, and it sometimes parses XML files to figure out whether datasets are freely accessible or whether they are parents. This is done because it needs to guess the datatype. It also looks like the code tries not to hammer PANGAEA with useless requests, but this is no problem at all: the response saying that a content type is not supported is cheap, and the HTTP status code comes back fast. I'd do the data download like this:
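A minimal sketch of such a content-negotiated request, written in Python with the standard library only. It assumes that the tab-separated data is requested with the registered media type `text/tab-separated-values`; the helper name `data_request`, the host `doi.pangaea.de`, and the dataset ID are placeholders for illustration.

```python
import urllib.request

# Registered media type for tab-separated values; that PANGAEA serves
# dataset tables under this type is an assumption of this sketch.
TSV_MEDIA_TYPE = "text/tab-separated-values"


def data_request(dataset_url, auth_token=None):
    """Build a GET request that asks for the data via the Accept header
    instead of a format= query parameter (hypothetical helper)."""
    request = urllib.request.Request(dataset_url, method="GET")
    request.add_header("Accept", TSV_MEDIA_TYPE)
    if auth_token:
        request.add_header("Authorization", "Bearer " + auth_token)
    return request


req = data_request("https://doi.pangaea.de/10.1594/PANGAEA.000000")
# Sending it would look like this (not executed here):
#     with urllib.request.urlopen(req) as response:
#         table_text = response.read().decode("utf-8")
```

`urllib.request.urlopen` follows redirects by default, which matches the requirement that the client be prepared for them.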

This should always return the normal tab-separated-values format. There is no need to cross-check the content type in the response or anything like that. The download code should only look at the status code:
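A sketch of status-code-only handling, in Python. The mapping below is an assumption based on standard HTTP semantics (200 OK, 401/403 for missing or insufficient authorization, 404 Not Found, 406 Not Acceptable when the requested content type is not available); it is not a confirmed list of the codes PANGAEA returns, and `interpret_status` is a hypothetical helper.

```python
from http import HTTPStatus


def interpret_status(status):
    """Map an HTTP status code to a coarse outcome (assumed mapping)."""
    if status == HTTPStatus.OK:
        return "ok"  # body contains the requested representation
    if status in (HTTPStatus.UNAUTHORIZED, HTTPStatus.FORBIDDEN):
        return "protected"  # a (valid) bearer token is needed
    if status == HTTPStatus.NOT_FOUND:
        return "not_found"  # no such dataset
    if status == HTTPStatus.NOT_ACCEPTABLE:
        return "format_not_available"  # e.g. no data table to serve
    return "error"  # anything else: report it to the caller
```

With this approach the client never has to parse XML first to guess whether a dataset is accessible or has data; it simply asks and reacts to the answer.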

If you want to get the native PANGAEA metadata in panmd format, please DO NOT use OAI-PMH. The native PANGAEA metadata can and should also be retrieved by content negotiation: Accept: application/vnd.pangaea.metadata+xml

And finally, to get the citation as a string, use: Accept: text/x-bibliography.
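The two Accept values above can be selected from the same dataset URL. A minimal Python sketch, in which the dictionary keys and the helper name `negotiation_headers` are illustrative assumptions and only the two media types come from this issue:

```python
# Media types quoted in this issue, keyed by hypothetical short names.
ACCEPT_TYPES = {
    "metadata": "application/vnd.pangaea.metadata+xml",  # native panmd XML
    "citation": "text/x-bibliography",                   # citation string
}


def negotiation_headers(kind, auth_token=None):
    """Return the request headers for the wanted representation."""
    if kind not in ACCEPT_TYPES:
        raise ValueError("unknown representation: " + repr(kind))
    headers = {"Accept": ACCEPT_TYPES[kind]}
    if auth_token:
        headers["Authorization"] = "Bearer " + auth_token
    return headers


citation_headers = negotiation_headers("citation")
```

The resulting dictionary can be passed to any HTTP client, so the dataset URL stays the same and only the headers change between data, metadata, and citation requests.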

See also these slides: https://docs.google.com/presentation/d/1mJEufjTK0O823Yc4zmsiNLua77_6p3UsBSXaVjx1A54/edit?usp=sharing

gavinsimpson commented 2 years ago

Thanks for this @uschindler

@naupaka and I have taken over maintaining {pangaear}, so I'm not even sure yet how everything you mention affects the package and what would be involved in implementing the behaviours you describe, but we'll take a look and figure out what needs to be done.