ropensci / spocc

Species occurrence data toolkit for R
https://docs.ropensci.org/spocc

inat occ query failing with limit > 3000 #215

Closed keatonwilson closed 5 years ago

keatonwilson commented 5 years ago

Hi there,

Interesting issue - code worked about a week ago, but now seems non-functional. Querying inat for a butterfly species with lots of records - gbif query works great, but inat query doesn't work when setting the limit greater than 3000. Console output below

[Screenshot of console output, 2019-04-09]

sckott commented 5 years ago

thanks for the report, please include the actual code next time, and not screenshots.

sckott commented 5 years ago

reinstall with remotes::install_github("ropensci/spocc"), restart your R session, and try again

keatonwilson commented 5 years ago

Thanks for the help - worked like a charm. Will include code next time. I knew it was the wrong choice as soon as I did it. ;)

sckott commented 5 years ago

glad it works

keatonwilson commented 5 years ago

This just cropped up for me again with a different species. The reinstall solution above is now not working. Reproducible example below.

#Reproducible Example of occ with iNat failing at high limits
#Keaton Wilson
#keatonwilson@me.com
#2019-05-21

#fresh install of spocc (as per last fix suggested on this thread)
remotes::install_github("ropensci/spocc", force = TRUE)

#Restart your R session here

#loading spocc
library(spocc)

#Successful query with small limits
monarch_500 = occ("Danaus plexippus", from = "inat", limit = 500)
monarch_500

#Can we pull the total number (53,066) - keep in mind, this takes a while.
monarch_full = occ("Danaus plexippus", from = "inat", limit = 53066)
monarch_full

#Nothing there - let's see if gbif works.
monarch_full_gbif = occ("Danaus plexippus", from = "gbif", limit = 50000)
monarch_full_gbif

#Querying GBIF seems to be functional - so it's an inat problem.

sckott commented 5 years ago

thanks, will have a look - what does packageVersion("spocc") give you when you have spocc loaded?

keatonwilson commented 5 years ago

Thanks @sckott . It reads 0.9.0.9811.

sckott commented 5 years ago

i can't replicate your problem, but I only tried with up to 3200 records for inat. (tethered to phone now, will try with large limit later to see if that causes some kind of problem)

keatonwilson commented 5 years ago

Yeah, I just ran it successfully with pulling 3200 as well, so the problem must be pulling some number of records between 3200 and 53066 (or more). :)

keatonwilson commented 5 years ago

Additionally, I just found a similar issue with querying gbif. I ran a search for Danaus plexippus for all gbif records (somewhere in the 215k range). It ran overnight (over 12 hours) without finishing. Should I open a new issue for this?

sckott commented 5 years ago

having a look

sckott commented 5 years ago

> I ran a search for Danaus plexippus for all gbif records (some where in 215k range)

for that many GBIF records you're better off using the GBIF download API (https://www.gbif.org/developer/occurrence#download), available in rgbif through occ_download and related functions. GBIF downloads aren't available through spocc because the interface is different from the normal GBIF search: you submit a request, then wait for it to be completed, so it wouldn't fit in with the other data sources
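
A minimal sketch of that submit-then-wait flow, assuming a recent rgbif release (the interface available when this thread was written may have differed) and a GBIF account. `download_gbif_species()` is a hypothetical helper name, not part of rgbif or spocc, and the `hasCoordinate` filter is an illustrative choice.

```r
# Sketch only: requires the rgbif package and GBIF credentials.
# download_gbif_species() is a hypothetical wrapper, not an rgbif function.
download_gbif_species <- function(species, user, pwd, email) {
  # Resolve the name against the GBIF backbone to get a taxon key
  key <- rgbif::name_backbone(species)$usageKey

  # Submit the download request; GBIF prepares the file asynchronously
  req <- rgbif::occ_download(
    rgbif::pred("taxonKey", key),
    rgbif::pred("hasCoordinate", TRUE),
    user = user, pwd = pwd, email = email
  )

  # Wait until GBIF reports the download as finished, then fetch and read it
  rgbif::occ_download_wait(req)
  zip <- rgbif::occ_download_get(req, overwrite = TRUE)
  rgbif::occ_download_import(zip)
}

# Usage (not run here; needs real credentials):
# monarchs <- download_gbif_species("Danaus plexippus", "user", "pwd", "me@example.org")
```

Because the request is queued on GBIF's side, the waiting step can take minutes or longer for large species, which is the interaction difference described above.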

sckott commented 5 years ago

I'm still not getting the no-data problem on the iNat queries that you are getting. With larger requests I do see some warnings about combining data:

x = occ("Danaus plexippus", from = "inat", limit = 18020)
#> There were 41 warnings (use warnings() to see them)
warnings()
#> Warning messages:
#> 1: In data.table::rbindlist(x, fill = TRUE, use.names = TRUE) :
#>   Column 2 ['tag_list'] of item 2 is length 0. This (and 0 others like it) has been filled with NA (NULL for list columns) to make each item uniform.

but the data is still returned in this case.

keatonwilson commented 5 years ago

Yeah, I get those warnings too, but when you query the occ object x, it shows 0 occurrences found and returned.

[Screenshot of console output, 2019-05-28]

keatonwilson commented 5 years ago

And thanks for the tip on GBIF - I'm trying to write a function that pulls and cleans all records from inat and gbif (a common workflow in a number of projects we're working on), so it will be good to integrate the rgbif stuff for species with large numbers of occurrences.

sckott commented 5 years ago

all records meaning literally all data from GBIF and iNat?

keatonwilson commented 5 years ago

Sorry - no, nothing that crazy! All records for a particular species on both iNat and GBIF - i.e., can I get all records with lat/long for a particular species from both sources in a nice tidy data frame?
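
One way that goal could look, as a rough sketch: query both sources in one `occ()` call, flatten with `occ2df()`, and drop rows without usable coordinates. `clean_coords()` is a hypothetical helper written for illustration, and it assumes the flattened data frame has `longitude` and `latitude` columns.

```r
# clean_coords() is a hypothetical helper: keep only rows whose
# longitude and latitude can be read as numbers.
clean_coords <- function(df) {
  lon <- suppressWarnings(as.numeric(df$longitude))
  lat <- suppressWarnings(as.numeric(df$latitude))
  keep <- !is.na(lon) & !is.na(lat)
  out <- df[keep, , drop = FALSE]
  out$longitude <- lon[keep]
  out$latitude <- lat[keep]
  out
}

# Usage, assuming spocc is installed and both services are reachable:
# res <- spocc::occ("Danaus plexippus", from = c("gbif", "inat"),
#                   limit = 500, has_coords = TRUE)
# tidy <- clean_coords(spocc::occ2df(res))
```

The cleaning step is kept separate from the query so it can be tried on any data frame without hitting the network.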

keatonwilson commented 5 years ago

Also, more strange behavior on inat query limits:

#Reproducible Example of occ with iNat failing at high limits
#Keaton Wilson
#keatonwilson@me.com
#2019-05-21

#fresh install of spocc (as per last fix suggested on this thread)
remotes::install_github("ropensci/spocc", force = TRUE)

#Restart your R session here

#loading spocc
library(spocc)

#Successful query at a moderate limit
monarch_3200 = occ("Danaus plexippus", from = "inat", limit = 3200)
monarch_3200

#Try a larger limit, still well below the total number (53,066) - keep in mind, this takes a while.
monarch_bigger = occ("Danaus plexippus", from = "inat", limit = 18020)
monarch_bigger

#This is particularly strange, because it pulls fewer than the limit (limit = 18020, returned = 10041), but still works?
#What happens if we pull even more?
monarch_bigger_still = occ("Danaus plexippus", from = "inat", limit = 20000)
monarch_bigger_still

#And this is even weirder - now it pulls fewer than the total number, but slightly more than when the limit is set at 18020.

sckott commented 5 years ago

okay, i finally did the limit = 53066 request and i do get the empty result - investigating

sckott commented 5 years ago

the root problem here is that inaturalist at some point changed to a limit of 10,000 records maximum - so with pagination, which we do internally in spocc, a request fails once it reaches past record 10,000. for example, you can't get 200 records starting at page 51, cause 51*200 = 10,200, which is more than 10,000

we need to error better so that the user gets the message, so we'll do that, but not sure what the workaround is when more than 10K records are needed
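
The arithmetic can be sketched in a few lines. This assumes the API refuses any page whose last record index would exceed the cap; the helper names are made up for illustration and are not spocc internals.

```r
# Assumption: a page is refused when page * per_page exceeds the cap.
# Both helpers are hypothetical, not part of spocc.

# TRUE when the whole page lies within the cap
page_is_reachable <- function(page, per_page, cap = 10000) {
  page * per_page <= cap
}

# Number of full pages that can be fetched at a given page size
max_pages <- function(per_page, cap = 10000) {
  cap %/% per_page
}

page_is_reachable(50, 200)  # TRUE:  50 * 200 = 10,000
page_is_reachable(51, 200)  # FALSE: 51 * 200 = 10,200
max_pages(200)              # 50
```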

sckott commented 5 years ago

reinstall - i've made some changes. There isn't a fix for the issue of getting all the results though. But there are some alternatives. Staying within spocc, you can try getting inat data through gbif, e.g.:

iNaturalist limits: they allow at most 10,000; query through GBIF to get more than 10,000

The inat research grade dataset on GBIF https://www.gbif.org/dataset/50c9509d-22c7-4a22-a47d-8c48425ef4a7

x <- occ(query = 'Danaus plexippus', from = 'gbif', limit = 10100, 
   gbifopts = list(datasetKey = "50c9509d-22c7-4a22-a47d-8c48425ef4a7"))
x$gbif

sckott commented 5 years ago

ugh, lat/lon vars changed in the new API ...

keatonwilson commented 5 years ago

Nice. I'll re-install. I just finished a work-around that interacts with the inat api outside of spocc - it iterates through by year, which removes the page-limit issues. Happy to share code if you're at all interested.

A frustrating problem because I'm sure we're not the only group of folks interested in downloading all occurrence data from multiple sources. Thanks again for all of your hard work on this!

sckott commented 5 years ago

nice, that sounds good. by the way, the docs for the new inaturalist API we're using are here: https://api.inaturalist.org/v1/docs/#!/Observations/get_observations

you can do date queries with it like:

x <- occ(query = 'Danaus plexippus', from = 'inat', limit = 10,
  inatopts = list(year = 2010))
x$inat$meta$found
#> [1] 193
x$inat$data$Danaus_plexippus$observed_on_details.year
#> [1] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010

y <- occ(query = 'Danaus plexippus', from = 'inat', limit = 10,
  inatopts = list(year = 2012))
y$inat$meta$found
#> [1] 478
y$inat$data$Danaus_plexippus$observed_on_details.year
#> [1] 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012

the output format for data from iNat has changed in the new API, so the details of drilling down through the data are a bit different, i think
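
Putting the year filter together with the by-year workaround mentioned earlier in the thread, a rough sketch might look like the following. `fetch_by_year()` and `inat_year()` are hypothetical helpers, not spocc functions, and the approach only helps if no single year holds more than 10,000 records.

```r
# fetch_by_year() is a hypothetical helper. The per-year fetcher is passed
# in as an argument, so the splitting logic can be tried without network access.
fetch_by_year <- function(years, fetch_one) {
  pieces <- lapply(years, function(yr) {
    df <- fetch_one(yr)
    # Skip years that return nothing
    if (is.null(df) || nrow(df) == 0) return(NULL)
    df
  })
  pieces <- Filter(Negate(is.null), pieces)
  if (length(pieces) == 0) return(data.frame())
  # Stack the per-year results into one data frame
  do.call(rbind, pieces)
}

# One possible fetcher, assuming spocc is installed
inat_year <- function(yr, query = "Danaus plexippus", limit = 10000) {
  res <- spocc::occ(query = query, from = "inat", limit = limit,
                    inatopts = list(year = yr))
  spocc::occ2df(res)
}

# Usage (not run here; this makes one request series per year):
# monarchs <- fetch_by_year(2008:2019, inat_year)
```

For a species with more than 10,000 records in a single year, the same idea would need a finer split than one year per request.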

keatonwilson commented 5 years ago

If you're interested: code for the inat/gbif combination and cleaning/munging. Not the most elegant, but currently working (still figuring out some bugs on records with really high occurrence numbers).

https://github.com/keatonwilson/insect_migration/blob/master/scripts/get_clean_obs_function.R

sckott commented 5 years ago

nice. Are we all good on this? Anything else on this topic?

keatonwilson commented 5 years ago

All good - it seems like things are limited by the iNat API, so not much to do about it!