ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez
Other
195 stars 38 forks source link

Change `getURL()` usage? #17

Closed sckott closed 9 years ago

sckott commented 10 years ago

Hey @dwinter Curious if you would consider changing use of RCurl::getURL() that I think are used to do GET requests across your pkg to something else.

If you use RCurl::getForm() or httr::GET() we can allow users to pass in curl debugging options, print verbose curl output, set a timeout, pass encoding options, ec.

I vote for httr, but either would do.

One benefit of this is in fulltext so that users can pass curl options across all sources

thoughts?

dwinter commented 10 years ago

Hi @sckott, I was looking into this the other day, and would happily do it.

Can you point me to some code using httr::GET() to have an idea of how a user might take advantage of it?

sckott commented 10 years ago

Here's a GET wrapper i have in my analogsea pkg: https://github.com/sckott/analogsea/blob/master/R/zzz.R#L12-L27

In the fxn definition, we use the ellipsis ... to allow users to pass on named arguments, so they could do do_GET( param1, parm2, config=verbose() ), where the config=verbose() is the part that's passed on to httr::GET. But it looks like you are already using ... in your functions. In that case you can just have a named parameter in the fxn defintion, e.g,:

library('httr')

foo <- function(taxonkey, limit=5, config=list()){
  args <- list(taxonKey=taxonkey, limit=limit)
  GET("http://api.gbif.org/v1/occurrence/search", config, query=args)
}

foo(taxonkey = 2, limit = 3, config = verbose()) # print verbose curl output
foo(taxonkey = 2, limit = 3, config = timeout(0.1)) # set a timeout so forces stop after certain time
foo(taxonkey = 2, limit = 3, config = c(verbose(), timeout(0.1))) # set verbose and timeout together
foo(taxonkey = 2, limit = 3, config = progress()) # progress bar

See httr_options() for options.

dwinter commented 10 years ago

Cool. Added to my todo list: :memo:

Will try and ake some time for it this week, as the open tree hackathon is the one after.

sckott commented 10 years ago

Cool, let me know if you have any questions.

dwinter commented 9 years ago

Hi @sckott , have started work on a branch that uses httr to pull stuff down.

At the moment, it uses httr in the background and passes on config options to GET. The way I have rentrez set up now the "top level" functions of all return parsed objects (mainly lists), and not the complete response. Does this work with what you had planned for fulltext? Or would it be better to have the list returned by these functions include an httr response object? (maybe optionally?)

sckott commented 9 years ago

@dwinter I think in an ideal world, across all our pkgs we would optionally allow the httr response object for users to be able to debug calls. However, I think almost all of our pkgs don't allow that (could be changed of course). I guess we've taken the approach that most of our users are not advanced programmers so may not know how to deal with stuff in the httr response object.

having said all that, it's your pkg, so you can optionally return the httr response object, or along with the data itself. Most pkgs feeding into fulltext from our suite return lists/data.frame's

What I do want to be able to do is allow users to pass in curl debugging options via the config parameter e.g.,

foo <- function(bar, ...){
  GET("http://google.com", query=list(bar=bar), ...)
}
library(httr)
foo(bar="hello world", config=verbose()) # named with config
dwinter commented 9 years ago

Cool, when I was playing around with this I'd thought you needed the response object to make use of the options, apparently not:

entrez_search(db="pubmed", term="10.1038/nature08789[doi]", httr_options=verbose() )

* Found bundle for host eutils.ncbi.nlm.nih.gov: 0x34895c0
* Connection 38 seems to be dead!
* Closing connection 38
* Hostname was NOT found in DNS cache
*   Trying 130.14.29.110...
* Connected to eutils.ncbi.nlm.nih.gov (130.14.29.110) port 80 (#39)
> GET /entrez/eutils/esearch.fcgi?=&db=pubmed&term=10.1038%2Fnature08789%5Bdoi%5D&emails=david.winter%40gmail.com&tool=rentrez HTTP/1.1
Host: eutils.ncbi.nlm.nih.gov
Accept: */*
Accept-Encoding: gzip
Cookie: WebEnv=1RcAPP%409DFC5FC84205B9D1_0355SID; ncbi_sid=9DFC5FC84205B9D1_0355SID
user-agent: curl/7.35.0 Rcurl/1.95.4.1 httr/0.3

< HTTP/1.1 200 OK
< Date: Mon, 22 Sep 2014 18:23:28 GMT
* Server Apache is not blacklisted
< Server: Apache
< Access-Control-Allow-Origin: *
< Cache-Control: private
< NCBI-SID: 9DFC5FC84205B9D1_0355SID
< Content-Type: text/xml; charset=UTF-8
* Replaced cookie ncbi_sid="9DFC5FC84205B9D1_0355SID" for domain nih.gov, path /, expire 1442946208
< Set-Cookie: ncbi_sid=9DFC5FC84205B9D1_0355SID; domain=.nih.gov; path=/; expires=Tue, 22 Sep 2015 18:23:28 GMT
< X-UA-Compatible: IE=Edge
< Transfer-Encoding: chunked
< 
* Connection #39 to host eutils.ncbi.nlm.nih.gov left intact
Entrez search result with 1 IDs (max = 1 )

Does it make sense to have config as default argument in packages feeding in to fulltext?

sckott commented 9 years ago

There are other params you could pass in to GET, POST, etc. , see the param description

so a ... would be best, but if you already are using ... for other things in your functions, then config is good