iDigBio / ridigbio

ridigbio -- an R interface to iDigBio's API (see http://www.idigbio.org/)
http://idigbio.github.io/ridigbio/

Add support for download API end point #23

Closed mjcollin closed 4 years ago

tphilippi commented 4 years ago

This would be huge for me. My use case is requesting all records for a set of species in a National Park unit (using a bounding box). ridigbio::idig_search_records() works great for summarizing the results, but I really need a downloaded DwC-A file and DOI for data provenance, especially for anything having to do with sensitive species.

When I have a list of species, it is difficult to paste them into the website to get a DwC-A with just those species. I'm trying to build this myself from the download API documentation, but I'm an old dog and POST is something of a new trick for me. I think I should be able to use some combination of the json version of my rq from the search API and the base url of the download api, but I'm not getting it, perhaps because the POST request isn't documented yet.

roncanepa commented 4 years ago

Regarding your specific case, it could also be the URL encoding that's causing your rq not to work with a GET request. If you'd like to share your query either here or as a followup to our email exchange, we can maybe help you adjust it to see if we can get it to work.

wilsotc commented 4 years ago

I have been working on our API usage documentation and this has been requested a number of times. I think it would be useful to have a working script to download a DwC-A file from iDigBio given a search API query.

I've been doing these examples using Python 3 but I'm not sure that is useful for everyone. Ideally I might try one example in multiple popular languages. Which language(s) do you use? Also, what OS do you use: Windows, OSX, iOS, Android, or Linux?

roncanepa commented 4 years ago

There are two mostly separate issues, really. 1) is the improvement of the generic download API documentation which @wilsotc has been working on.

2) is in the context of the ridigbio package itself and its ability to interact with the download API. Besides technical work, there are a number of complicated UX problems that would have to be handled in order for this to work, and given that the ridigbio package is meant to make things easier for people, I'm not inclined to try to add download API functionality at this time.

tphilippi commented 4 years ago

@roncanepa I understand what you say about ridigbio not hitting the download API. Folks who actually need it should be able to write their own call with a bit more documentation, and once I get something working I will write and post a couple-page document with examples to help others.

@wilsotc a simplified version of my request is:

rq <- list(scientificname = c("aureolaria virginica", "celastrus scandens",
                              "celastrus scandens", "spiranthes magnicamporum"),
           geopoint = list(type = "geo_bounding_box",
                           top_left = list(lat = 41.44732,
                                           lon = -81.68442),
                           bottom_right = list(lat = 41.06291,
                                               lon = -81.45506)
                           )
)

That request works fine with ridigbio and the search API:

iDig_hits <- try(idig_search_records(rq, fields = fieldlist2))  # fieldlist2: a vector of field names defined elsewhere in my script

I tried just tweaking the POST in ridigbio by changing the baseURL, then tried several variants using GET.

json <- jsonlite::toJSON(rq, auto_unbox = TRUE)

tried with and without adding one more level of nesting

rqx <- list(rq = rq)
json2 <- jsonlite::toJSON(rqx, auto_unbox = TRUE)

I tried with and without curl::curl_escape(json2), and as separate url & path vs pasted together url:

   iDigURL <- "https://api.idigbio.org/v2/download/?rq="
   bigURL <- paste0(iDigURL, curl::curl_escape(json2), "&email=tephilippi@gmail.com")

  req <- httr::GET(iDigURL, query=json, httr::accept_json(), 
                      httr::content_type_json())
  req <- httr::GET(curl::curl_escape(bigURL), httr::accept_json(), 
                      httr::content_type_json())

So I'm clearly floundering without a bit more documentation or an example.

I was able to paste my list of scientific names and the bounding box coordinates into the iDigBio Portal, and get the DwC-A I needed for data provenance for this park's needs. But with the new funding for park maintenance & restoration projects, I'm likely to start doing this hundreds of times a year, so I'd like to eliminate that copy & paste step.

roncanepa commented 4 years ago

@tphilippi , thank you for providing your example code.

Before we dig in to get this to work via R, I'd like to ask: what is your ultimate goal regarding trying to pull in the download API via R? As I mentioned in my email reply (please let me know if you didn't receive it!), the "link" that you get from an iDigBio download request is not a permanent, stable reference for citation, provenance or reproducibility purposes. For that, you would then need to upload the resulting file to a research data repository.

In addition, the iDigBio download process is not synchronous, meaning that you can't have R or other code "wait" for the reply in real time. Depending on the number of records in your search result, the process could take minutes, hours, or even a day or more. You would need to write some extra R code to handle this case and essentially "poll" a specific URL on a schedule in order to check the status of your download process. You would then request the downloaded file once the status URL indicates that it is ready.

I mention all of that because depending on what you have in mind, it might be easier to get the download file outside of R (for instance, via command-line requests with curl) and then pull the data into R once you have it.

It may be worthwhile doing this for general documentation purposes, but I'd also like to help you move forward with your analysis.

roncanepa commented 4 years ago

@tphilippi There may also be issues with accessing our download api via POST. I'm looking more into this now.

If you'd like to do this using R and a GET request, here are some things that will help.

You can use a script similar to this (you were very close; the encoding and other things get very tricky):

library(jsonlite)
library(httr)

rq <- list(scientificname = c("shortia"),
           geopoint = list(type = "geo_bounding_box",
                           top_left = list(lat = 41.44732), 
                           lon = -81.68442,
                           bottom_right = list(lat = 41.06291, 
                                               lon = -81.45506)       
           )
)

my_query <- jsonlite::toJSON(rq)

iDigURL <- "https://api.idigbio.org/v2/download/"

full_url = paste0(iDigURL, "?rq=", my_query)

req = GET(full_url)

I tried to stick close to your original source to help make it easier to plug things back into your own script. A few notes about this:
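
If the hand-assembled URL ever gives you encoding trouble, one variant (just an untested sketch; the email address below is a placeholder for your own, matching the email parameter from your earlier attempt) is to hand the pieces to httr and let it do the escaping:

# Sketch (untested): let httr/curl URL-encode the rq JSON and the optional email parameter.
req <- GET(iDigURL,
           query = list(rq = as.character(my_query),
                        email = "you@example.com"))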

You can take a look at the result of the request (via req) and you'll see some things you'll need to take note of:

str(content(req))
List of 7
 $ complete   : logi FALSE
 $ created    : chr "2020-11-04T21:00:23.640045+00:00"
 $ expires    : chr "2020-12-04T21:00:23.643345+00:00"
 $ hash       : chr "4cddb15328a298309dd682a3684ad38c592292c6"
 $ query      :List of 7
  ..$ core_source       : chr "indexterms"
  ..$ core_type         : chr "records"
  ..$ form              : chr "dwca-csv"
  ..$ mediarecord_fields: NULL
  ..$ mq                : NULL
  ..$ record_fields     : NULL
  ..$ rq                :List of 2
  .. ..$ geopoint      :List of 4
  .. .. ..$ bottom_right:List of 2
  .. .. .. ..$ lat:List of 1
  .. .. .. .. ..$ : num 41.1
  .. .. .. ..$ lon:List of 1
  .. .. .. .. ..$ : num -81.5
  .. .. ..$ lon         :List of 1
  .. .. .. ..$ : num -81.7
  .. .. ..$ top_left    :List of 1
  .. .. .. ..$ lat:List of 1
  .. .. .. .. ..$ : num 41.4
  .. .. ..$ type        :List of 1
  .. .. .. ..$ : chr "geo_bounding_box"
  .. ..$ scientificname:List of 2
  .. .. ..$ : chr "shortia"
  .. .. ..$ : chr "scandens"
 $ status_url : chr "https://api.idigbio.org/v2/download/a0d37b75-f2f3-4e9f-9177-23a24937ce33"
 $ task_status: chr "PENDING"

task_status is the first thing to note. As I mentioned earlier, this is an asynchronous process, so depending on the size of your search results, this may take some time.

status_url: this is what you'll want to load every so often to check on task_status.
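
A rough sketch of that polling step (untested; tune the wait time to your needs):

# Rough sketch: poll the status URL until the download has been built.
status_url <- content(req)$status_url

repeat {
  status <- content(GET(status_url))
  if (isTRUE(status$complete)) break
  Sys.sleep(60)  # check once a minute; large downloads can take much longer
}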

When you load the status_url, you will eventually see something like this once it has finished building your download:

{
  "complete": true, 
  "created": "2020-11-04T20:50:55.495767+00:00", 
  "download_url": "http://s.idigbio.org/idigbio-downloads/00fc27e1-059e-425f-af97-f0c630a6fb71.zip", 
  "expires": "2020-12-04T20:50:55.678456+00:00", 
  "hash": "617075861d998d2d9c5f3209d96126835f914689", 
  "query": {
    "core_source": "indexterms", 
    "core_type": "records", 
    "form": "dwca-csv", 
    "mediarecord_fields": null, 
    "mq": null, 
    "record_fields": null, 
    "rq": {
      "geopoint": {
        "bottom_right": {
          "lat": [
            41.0629
          ], 
          "lon": [
            -81.4551
          ]
        }, 
        "lon": [
          -81.6844
        ], 
        "top_left": {
          "lat": [
            41.4473
          ]
        }, 
        "type": [
          "geo_bounding_box"
        ]
      }, 
      "scientificname": [
        "shortia"
      ]
    }
  }, 
  "status_url": "https://api.idigbio.org/v2/download/00fc27e1-059e-425f-af97-f0c630a6fb71", 
  "task_status": "SUCCESS"
}

The important field here is download_url, which is where you get your results file.
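
Continuing the sketch above, once task_status is "SUCCESS" you can fetch the archive and read the occurrence file (the file and directory names here are just placeholders):

# Sketch: download the finished DwC-A zip and read its occurrence.csv.
download.file(status$download_url, destfile = "idigbio_dwca.zip", mode = "wb")
unzip("idigbio_dwca.zip", exdir = "idigbio_dwca")
occ <- read.csv(file.path("idigbio_dwca", "occurrence.csv"), stringsAsFactors = FALSE)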

roncanepa commented 4 years ago

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] httr_1.4.1     jsonlite_1.6.1

loaded via a namespace (and not attached):
[1] compiler_3.4.4 R6_2.4.0       tools_3.4.4    curl_4.2

tphilippi commented 4 years ago

1: I am fine not having a DOI. It would have been gravy.

2: I would like a Darwin Core Archive, even though iDigBio DwC-A files don't include a metadata.eml or metadata.xml metadata file. I can retain that DwC-A file as the origin for my data provenance; and it includes the exact query used.

3: I can get what I need out of the DwC-A occurrence.csv file with much less effort than from the results of ridigbio::idig_search_records(). My problem with idig_search_records() is that with fields = "all" or an explicit set of fields specified, institutionid and institutionname are all NA. The information I need is stashed in attributes, which I can parse out, then match the attribute uuid with the occurrence record's recordset. I'll stick a reproducible example in my next comment (along with sessionInfo()).

My use-case is that someone will be contracted to conduct a rare plant survey for a National Park. I want to provide them with all known occurrence records of species in their target list. By next year we will be doing over 100 such queries per year, which is why I'm trying to script as much as I can. My workflow is to take the vector of taxonomic names the park provides, expand it to all synonyms (accepted/not accepted, valid/invalid), then expand that vector to all sub-taxa down to subspecies and variety, then query GBIF and iDigBio. I am not guaranteeing that all returned records are taxa of interest: given enough information, they can drop records whose names overlap at some time or place but represent different taxonomic concepts. I want to have reproducibility and data provenance, properly cite and credit iDigBio, and include which museum or herbarium holds each specimen.

tphilippi commented 4 years ago

My rq above was a hasty paste & edit on the fly: I apologize. The coordinates for the bounding boxes are all from an sf object, and the query actually works. scientificname is correct for idig_search_records(); thanks for the pointer that it is scientificName for the download API.
Here's an example of the institutionname being all NA, and my workaround to bring it back from the attribution attribute.

library(jsonlite)
library(httr)
library(ridigbio)

rq <- list(scientificname = c("dichanthelium latifolium",
                              "glyceria canadensis",
                              "solidago ulmifolia"),
           geopoint = list(type = "geo_bounding_box",
                           top_left = list(lat = 41.44732, 
                                           lon = -81.68442),
                           bottom_right = list(lat = 41.06291, 
                                               lon = -81.45506)       
                           )
)
iDig_hits <- try(idig_search_records(rq, fields = "all"))
table(iDig_hits$institutioncode, useNA = "always") # 2 letters, not adequate for what I need
table(iDig_hits$institutionname, useNA = "always") # all NA

# repopulate those values
tmp <- attr(iDig_hits, "attribution")
fn <- function(x) {
          xx <- data.frame(name = x$name,
                           emllink = x$emllink,
                           uuid = x$uuid)
          return(xx)
}

tmp2 <- lapply(tmp, fn)
tmp3 <- do.call("rbind", tmp2)

iDig_hits$institutionname <- tmp3$name[match(iDig_hits$recordset, tmp3$uuid)]
table(iDig_hits$institutionname, useNA = "always")
sessionInfo()

And my sessionInfo(): I'm stuck with Win10 on my work machine. I get the same result with R 3.6.3, ridigbio_0.3.5, httr_1.4.1, and jsonlite_1.6.1 on Win10.

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ridigbio_0.3.5 httr_1.4.2     jsonlite_1.7.1

loaded via a namespace (and not attached):
[1] compiler_4.0.3 plyr_1.8.6     R6_2.4.1       curl_4.3       Rcpp_1.0.5    
[6] fortunes_1.5-5

tphilippi commented 4 years ago

While I'm confident your example for hitting the download API works on real operating systems with well-configured firewalls, on my government computer with Win10 and some hinky SSL certificates I get:

> req = GET(full_url)
Error in curl::curl_fetch_memory(url, handle = handle) : 
  Failure when receiving data from the peer

However, I get the full results simply by replacing the GET() call with url():

req <- jsonlite::fromJSON(url(full_url)) 
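
The same substitution seems to work for checking on the status later (untested sketch):

# Sketch: same url() workaround for the status check.
status <- jsonlite::fromJSON(url(req$status_url))
status$task_status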

So thank you very much for your help! That example is all I need, and circling back, I don't know that an explicit idig_download() function is needed in the package. I will try to write up a short how-to document and put it somewhere that folks can find it. But if @wilsotc is writing a document, even better. Please include this example.

roncanepa commented 4 years ago

@tphilippi Glad to hear that you got what you needed!

Regarding documentation from this ticket, a few things can/will happen from here:

  1. I'll take my minimal example and add it as a use case in the API WG repo: https://github.com/biodiversity-specimen-data/specimen-data-use-case
  2. We'll make sure that this example also appears in the idigbio download api wiki page as @wilsotc mentioned
  3. I'll include your note about having to use url() in some cases

PS: It occurs to me that you might be interested in an API user group that we have. It's fairly new and has a focus on R but any languages are welcome. We hold twice-monthly open office hours if you'd ever like to drop by to listen in, ask questions, or discuss what you're working on. Details on that are here: https://www.idigbio.org/wiki/index.php/IDigBio_Working_Groups#API_User_Group_.28R-based.29

In addition, if you get to the point where you'd like to share your solution, please let me know, and I'll ensure that it gets linked in the "use cases" repo for the working group. A few of us submit code examples directly to the repo but we also want to link to external solutions that other people have worked on and it sounds like yours would be a great example for someone else.

roncanepa commented 4 years ago

> 2) is in the context of the ridigbio package itself and its ability to interact with the download API. Besides technical work, there are a number of complicated UX problems that would have to be handled in order for this to work, and given that the ridigbio package is meant to make things easier for people, I'm not inclined to try to add download API functionality at this time.

Given my comment above, I'm going to mark this as a wontfix. Once we have a few bits of documentation in place, I'll come back to edit the issue description to include a summary and some links.