Open uschindler opened 2 years ago
Thanks for this @uschindler
@naupaka and I have taken over maintaining {pangaear}, so I'm not even sure yet how everything you mention affects the package and what would be involved in implementing the behaviours you describe, but we'll take a look and figure out what needs to be done.
Hi,
All HTTP requests sent to PANGAEA (important: use HTTPS only!) should allow the user to send a Bearer token. This makes it possible to download protected PANGAEA datasets. In pangaeapy this is already supported by passing an `auth_token` parameter to the constructor of the PANGAEA API. The bearer token is a temporary, opaque string that the user can get from the web page after logging in to PANGAEA. It is valid as long as the user is logged in (the timeout is 2 weeks). PANGAEA does not support passing username/password because this would be inherently unsafe: people may inadvertently post their R script on GitHub with their password included. If they post a script with a bearer token inside, it is enough to log out from PANGAEA to revoke access and prevent misuse of the account.

In the documentation, ask users to log in to PANGAEA, go to their profile page (https://www.pangaea.de/user/), and copy the token from there. Login MUST be done using a browser; all automated login attempts will be rejected by our servers. The token can be used as long as the user has not logged out. It is recommended to check the box "keep logged in". Users can also log in to PANGAEA with ORCID (that's another reason why user/pass does not work with APIs: many users do not even have a password ready).
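To illustrate the token handling described above, here is a minimal sketch in Python (pangaear itself is R, so treat this as language-neutral pseudocode that happens to run). It assumes the token is sent in the standard `Authorization` request header and is read from an environment variable so it never ends up in a shared script. The function name `bearer_headers` and the variable name `PANGAEA_TOKEN` are made up for this example and are not part of any existing API.

```python
import os


def bearer_headers(token=None, env_var="PANGAEA_TOKEN"):
    """Return request headers carrying the bearer token, or {} if none is set.

    `env_var` is a hypothetical variable name chosen for this sketch.
    """
    # Prefer an explicitly passed token, otherwise fall back to the environment.
    token = token or os.environ.get(env_var)
    if not token:
        # No token: send the request anonymously.
        return {}
    # Strip whitespace that often sneaks in when copying from a web page.
    return {"Authorization": f"Bearer {token.strip()}"}
```

Keeping the token outside the script means a published script exposes nothing, and logging out of PANGAEA still invalidates a token that leaked some other way.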
But there is another change we would suggest: please send HTTP requests to PANGAEA dataset pages in a REST-like way, because the current code uses some non-standardized `format=` parameters that might change soon. In addition, the HTTP client should be prepared to follow redirects (coming soon). The current code also contains a lot of if/then/else, and it sometimes parses XML files to figure out whether datasets are freely accessible or whether they are parents. It does this because it needs to guess the data type. It also looks like the code tries not to hammer PANGAEA with useless requests. But this is no problem at all: the response saying that a content type is not supported is cheap, and the HTTP status code comes back fast. I'd do the data download like this:

- Send `Authorization: Bearer <token>` if a token is available (see above). There is no need to check beforehand whether the dataset is login protected. Just always send it if available. If the token is invalid, it is ignored.
- Send `Accept: text/tab-separated-values` as a header. This enables content negotiation. As this header does NOT look like a plain browser, the PANGAEA code will switch to real "REST mode" and, for example, respond with correct status headers instead of redirecting to the login page if the dataset is password protected and the credentials do not match. So you don't need to guess whether you were redirected and received the HTML login page. A real REST client gets a proper status code that tells it: "unauthorized".

This should always return the normal tab-separated-values format. There is no need to cross-check the content type in the response or anything like that. The download code should only look at the status code.
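The download flow described above could be sketched roughly as follows, again in Python for illustration only. The helper names (`build_download_request`, `classify_status`, `download_tsv`) are hypothetical. The sketch assumes the standard `Authorization` header for the token, UTF-8 encoded responses, and that an unsupported format is reported with the standard 406 status; only "unauthorized" is confirmed by the description above.

```python
import urllib.error
import urllib.request

TSV = "text/tab-separated-values"


def build_download_request(dataset_url, token=None):
    """Build a content-negotiated request for a dataset page (not sent yet)."""
    if not dataset_url.startswith("https://"):
        raise ValueError("PANGAEA requests must use HTTPS")
    headers = {"Accept": TSV}
    if token:
        # Always send the token when we have one; no pre-check needed.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(dataset_url, headers=headers)


def classify_status(status):
    """Map an HTTP status code to a coarse outcome."""
    if 200 <= status < 300:
        return "ok"
    if status == 401:
        return "unauthorized"
    if status == 406:
        # Assumption: standard code for "requested format not available".
        return "format not available"
    return "error"


def download_tsv(dataset_url, token=None):
    """Fetch the tab-separated data, looking only at the status code."""
    req = build_download_request(dataset_url, token)
    try:
        # urlopen follows redirects by default.
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8")  # assumption: UTF-8 body
    except urllib.error.HTTPError as err:
        outcome = classify_status(err.code)
        raise RuntimeError(f"download failed: {outcome} (HTTP {err.code})") from err
```

Note that nothing here parses XML or inspects the response content type; the status code alone decides the outcome.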
If you want to get the native PANGAEA metadata in panmd format, please DO NOT use OAI-PMH. The native PANGAEA metadata can and should also be retrieved by content negotiation: `Accept: application/vnd.pangaea.metadata+xml`

And finally, to get the citation as a string, use: `Accept: text/x-bibliography`

See also these slides: https://docs.google.com/presentation/d/1mJEufjTK0O823Yc4zmsiNLua77_6p3UsBSXaVjx1A54/edit?usp=sharing
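Since the data, metadata, and citation requests differ only in their `Accept` header, a client could keep the three media types from this issue in one lookup table. This is a small Python sketch; the names `ACCEPT_TYPES` and `accept_header` and the keys `"data"`, `"metadata"`, and `"citation"` are invented for the example.

```python
# Media types as listed in this issue; the dictionary keys are hypothetical.
ACCEPT_TYPES = {
    "data": "text/tab-separated-values",
    "metadata": "application/vnd.pangaea.metadata+xml",
    "citation": "text/x-bibliography",
}


def accept_header(kind):
    """Return the Accept header for the requested representation."""
    try:
        return {"Accept": ACCEPT_TYPES[kind]}
    except KeyError:
        valid = ", ".join(sorted(ACCEPT_TYPES))
        raise ValueError(f"unknown kind {kind!r}; expected one of: {valid}") from None
```

The same dataset URL can then be requested three times with different headers, instead of switching between `format=` parameters or a separate metadata endpoint.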