rsh249 / cRacle

Finally an R package dedicated to the CRACLE method for estimating climate from biological proxies.
MIT License

getextr: HTTP error 400 if genera are used #7

Open TheTams opened 2 years ago

TheTams commented 2 years ago

If genus names are used as input in getextr, there is an error in open.connection, e.g.

> #function to get GBIF data and extract climate or environmental data for each occurrence
> vec1_data <- getextr(c("Larix"), climrast, maxrec = 500000, schema = "flat", repo = c("gbif"), nmin = 15)
[1] "Larix"
GBIF Error in open.connection(con, "rb"): HTTP error 400.

[1] "There are    records in this query from  gbif  for  Larix"

If other genera are provided, these also fail.

If a species name is provided instead, no error is thrown.

> #function to get GBIF data and extract climate or environmental data for each occurrence
> vec1_data <- getextr(c("Larix occidentalis"), climrast, maxrec = 500000, schema = "flat", repo = c("gbif"), nmin = 15)
[1] "Larix occidentalis"
New names:
* NA -> ...1
* NA -> ...2
* NA -> ...3
* NA -> ...4
New names:
* NA -> ...1
* NA -> ...2
* NA -> ...3
* NA -> ...4
New names:
* NA -> ...1
* NA -> ...2
* NA -> ...3
* NA -> ...4
New names:
* NA -> ...1
* NA -> ...2
* NA -> ...3
* NA -> ...4
New names:
* NA -> ...1
* NA -> ...2
* NA -> ...3
* NA -> ...4
[1] "There are  724  records in this query from  gbif  for  Larix occidentalis"
Aggregate Raster for flat or species filtering
Begin flat aggregate sampling... 1 
Taxon list of lenth: 1 

TheTams commented 2 years ago

The error appears to occur because the requested maximum number of records (maxrec) exceeds the limit of the GBIF API. The same will be true for single species with many records, such as Larix decidua. I am looking at how to feed a file from the GBIF download service into cRacle instead.
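A minimal sketch of a pre-flight check for this, assuming the 100 000-record API cutoff quoted later in this thread. `choose_gbif_route` is a hypothetical helper, not part of cRacle; it only decides which route to take and makes no network calls.

```r
# Hypothetical helper (not part of cRacle): pick a retrieval route from maxrec.
# The default limit of 100000 is an assumption taken from the figure quoted
# in this thread, not a value read from the GBIF API.
choose_gbif_route <- function(maxrec, api_limit = 100000) {
  stopifnot(is.numeric(maxrec), length(maxrec) == 1, maxrec > 0)
  if (maxrec > api_limit) "download" else "api"
}

choose_gbif_route(500000)  # "download": the maxrec used in the original report
choose_gbif_route(5000)    # "api"
```

A check like this could run before the query is sent, so that an oversized request fails with a clear message rather than an HTTP 400.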

rsh249 commented 2 years ago

I have a replacement function that uses the GBIF data downloads feature to get a zip file of records from the GBIF API. This will avoid these limits and is the recommended route for large downloads from GBIF.

This will hopefully be provided with some new guidance on downloads here soon.

rsh249 commented 2 years ago

@TheTams It would be better practice to use:

test <- gbif_dl("Larix", gbif_user = "yourusername", gbif_email = "you@gmail.com", gbif_pw = "password")

This should generate a download request from GBIF and wait until that request is ready, download it, and then load a data frame compatible with other elements of the cRacle package.

TheTams commented 2 years ago

I was able to use this for my downloads, but I will then be removing any records whose BASIS_OF_RECORD is "LIVING_SPECIMEN". This value is defined as "public static final BasisOfRecord LIVING_SPECIMEN: An occurrence record describing a living specimen, e.g. managed animals in a zoo or cultivated plants in a garden."

I don't think these should be included by default. Many plants can be grown in a botanic garden setting (and many animals can live in a zoo) well outside their natural range thanks to human intervention, especially at critical life stages (seedlings etc.).
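A minimal sketch of that filtering step in base R. The toy data frame stands in for a GBIF download, and the column name `basisOfRecord` is an assumption: the data frame returned by gbif_dl may name it differently. `drop_basis` is a hypothetical helper, not a cRacle function.

```r
# Toy data standing in for a GBIF download (values invented for illustration).
# The column name basisOfRecord is an assumption about the download format.
occ <- data.frame(
  species = c("Larix decidua", "Larix decidua", "Larix occidentalis"),
  basisOfRecord = c("HUMAN_OBSERVATION", "LIVING_SPECIMEN", "PRESERVED_SPECIMEN"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop rows whose basis of record is listed in `exclude`.
# Rows with a missing basis of record are kept.
drop_basis <- function(df, exclude = "LIVING_SPECIMEN", col = "basisOfRecord") {
  stopifnot(col %in% names(df))
  keep <- !(df[[col]] %in% exclude)
  df[keep, , drop = FALSE]
}

wild <- drop_basis(occ)
nrow(wild)  # 2
```

Filtering after the download keeps the original file intact, so the excluded records can still be inspected later.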

rsh249 commented 2 years ago

Absolutely. I'll look at updating the downloads function to set these filters as well. This issue will be resolved if the new gbif_dl function behaves at least as well as the old API query version.

Will update here soon.

TheTams commented 2 years ago

I had some trouble using gbif_dl. The man page says you can enter taxa in the form 'genus species' or 'genus', which I took to mean it would look up either the species or the genus. Looking at the code, though, it seems to look up the genus whether you enter 'genus species' or just a genus:

keys <- sapply(taxa, function(x) rgbif::name_backbone(name = x)$genusKey, USE.NAMES = F)
....
rgbif::pred_in("taxonKey", unlist(keys)),

I found this unexpected. Should the documentation clarify that this is just for genera and not species lookups? Or should the code be edited so that "genus species" looks up the speciesKey and "genus" looks up the genusKey? Perhaps there could be two functions, gbif_dl_gen and gbif_dl_sp, one for each input type? As an example, I have a vector of species names in Pinus section Haploxylon and do not want any of Pinus section Diploxylon, but because the genus is looked up it still ends up as a massive request.

Also interesting: if you look up "Salix" using this, it doesn't return any records, but "Salix L." does work. Salix is also a genus in Animalia, so I assume that is the reason, and the same issue will apply to any other ambiguous genus name. Should guidance be added that including the authority is the most robust approach? Or should the function limit the search to a chosen kingdom?

rsh249 commented 2 years ago

@TheTams I have made some updates in commit https://github.com/rsh249/cRacle/commit/42e71c737631ba23c75c8c800d92900d608fc6c0

The main fix is that I added some logic that no longer uses the genusKey, but instead uses the taxon key for the best (hopefully only) name match. I think this works well for finding species when you give species names and genera when you give just the genus. Can you try some use cases to see if you get what is expected?

The "LIVING_SPECIMEN" basis of record should no longer be included in the results returned from gbif_dl().

Also, I added an argument "kingdom" that can be set to clarify which taxon you are looking for in the case of colliding names like Salix. The default is kingdom = "all", which ignores kingdom, but you can also set kingdom = "Plantae" to be more specific. Doing this fixes the Salix collision issue. Note that the "kingdom" argument applies to all taxa in your list, so if you need to search across multiple kingdoms you might have to set up more than one query.
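The matching logic described above might look roughly like the following. This is a sketch under stated assumptions, not the code from the commit: `resolve_taxon_keys` and its `lookup` argument are hypothetical, and the default assumes that `rgbif::name_backbone()` accepts `name` and `kingdom` and returns a `usageKey` for the best match. The stub lookup and its keys are invented so the example runs without network access.

```r
# Hypothetical sketch: resolve one taxon key per name, optionally within a kingdom.
# `lookup` defaults to rgbif::name_backbone (assumed to return a usageKey field);
# the default is only evaluated if no other lookup function is supplied.
resolve_taxon_keys <- function(taxa, kingdom = "all", lookup = rgbif::name_backbone) {
  king <- if (identical(kingdom, "all")) NULL else kingdom
  vapply(taxa, function(x) {
    hit <- lookup(name = x, kingdom = king)
    key <- hit$usageKey
    # No usable match: return NA instead of failing
    if (is.null(key) || length(key) == 0) NA_integer_ else as.integer(key[1])
  }, integer(1), USE.NAMES = FALSE)
}

# Offline stub with made-up keys; "Salix" is ambiguous unless a kingdom is given.
fake_lookup <- function(name, kingdom = NULL) {
  if (name == "Salix" && is.null(kingdom)) return(list(usageKey = NULL))
  fake_keys <- list("Salix" = 10L, "Larix occidentalis" = 20L)
  list(usageKey = fake_keys[[name]])
}

resolve_taxon_keys(c("Salix", "Larix occidentalis"), lookup = fake_lookup)
# NA 20
resolve_taxon_keys(c("Salix", "Larix occidentalis"), kingdom = "Plantae", lookup = fake_lookup)
# 10 20
```

Returning NA for an unmatched name makes collisions visible before a download request is built, which is where the Salix case would otherwise silently return nothing.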

TheTams commented 2 years ago

Hi @rsh249 Rob, I have now had a chance to re-run the same analysis that threw up all the issues, this time using gbif_dl. All of those changes (kingdom = "Plantae", taxonKey, and no "living specimen" records) seem to be working in my code. Have you added anything to the documentation of getextr() to note that it should only be used for datasets below the GBIF API cutoff of 100 000 results?

rsh249 commented 2 years ago

I have not made that note yet, but I am thinking that it will be better to integrate gbif_dl() into getextr() and deprecate the old method for downloads. This would free all functions from the API restrictions and be more in line with what the GBIF documentation requires.