idig_search_records returns lexical error when retrieving large number of data using max_items or offset argument

hmorzaria commented 8 years ago

When trying to download all records from a country, in this case Mexico, I specify the total number of records as max_items:

rec_count <- try(idig_count_records(rq=list(country=eachcountry, geopoint=list(type="exists")), fields=c("scientificname", "geopoint")))

rec_count 1306923

Downloading one fewer than that number of records:

df1 <- try(idig_search_records(rq=list(country=eachcountry, geopoint=list(type="exists")), fields=c("scientificname", "geopoint"),max_items = 1306922))

Error in idig_search(rq = rq, fields = fields, max_items = max_items, : Search would return more than 1306922 results. See max_items argument.

Downloading the maximum number of records:

df1 <- try(idig_search_records(rq=list(country=eachcountry, geopoint=list(type="exists")), fields=c("scientificname", "geopoint"),max_items = 1306923))

Error : lexical error: invalid char in json text. [query_phase_execution_exception (right here) ------^

Using offset and limit works up to an offset of 45000, but an offset of 50000 fails:

df1 <- try(idig_search_records(rq=list(country=eachcountry, geopoint=list(type="exists")), fields=c("scientificname", "geopoint"),limit = 5000, offset = 45000))

head(df1)

               scientificname geopoint.lon geopoint.lat
1      echinocereus maritimus    -115.7390     30.04060
2 atlapetes pileatus pileatus    -101.6837     19.51897

df1 <- try(idig_search_records(rq=list(country=eachcountry, geopoint=list(type="exists")), fields=c("scientificname", "geopoint"),limit = 5000, offset = 50000))

Error : lexical error: invalid char in json text. [query_phase_execution_exception (right here) ------^

mjcollin commented 8 years ago

This is a consequence of changes to the Elasticsearch API. Elasticsearch is where the data is stored, our API is layered on top of it, and the ridigbio package uses our API.

The issue that is open on our API is here:

https://github.com/iDigBio/idigbio-search-api/issues/18

The R package throwing a JSON parsing error is not great behavior, but it happens because the error message returned by the API is not JSON. To fix that I'd need to add another layer of checks on the responses, which would make it clearer what is going on and where the issue comes from.
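A minimal sketch of what such a response check could look like, assuming the raw response body is available as a string. The helper name `parse_api_body` is hypothetical and is not part of ridigbio; it only uses `validate()` and `fromJSON()` from jsonlite.

```r
library(jsonlite)

# Hypothetical helper, not part of ridigbio: parse a response body only if
# it is valid JSON; otherwise stop and show the start of the raw server text.
parse_api_body <- function(body_text) {
  if (!jsonlite::validate(body_text)) {
    stop("API returned a non-JSON response: ",
         substr(body_text, 1, 200), call. = FALSE)
  }
  jsonlite::fromJSON(body_text)
}
```

With a check like this, the user would see the server's own message (for example the text starting with `[query_phase_execution_exception`) instead of a lexical error from the JSON parser.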

hmorzaria commented 8 years ago

I saw that the issue "Elasticsearch error when trying to pull windows larger than 10k records" (#18 in the API) is now closed. The fix, increasing the limit to 100k, did not resolve this issue.

In the case of a polygon search, my workaround is to recursively reduce the area being searched until no error is returned.
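A rough sketch of that kind of recursive subdivision, written for a rectangular bounding box rather than an arbitrary polygon. Everything here is an assumption made for illustration: `split_bbox`, the `west`/`east`/`south`/`north` field names, and the `count_fn` argument are hypothetical, and the sketch counts records per tile up front instead of waiting for an error.

```r
# Hypothetical helper: split a bounding box into quadrants until each tile
# holds no more than `cap` records according to `count_fn`.
# `count_fn` takes a bbox list and returns a record count.
split_bbox <- function(bbox, count_fn, cap = 100000, min_size = 0.01) {
  n <- count_fn(bbox)
  if (n == 0) return(list())  # nothing to download in this tile

  # Stop splitting very small tiles; such a tile may still exceed `cap`.
  too_small <- (bbox$east - bbox$west) <= min_size ||
    (bbox$north - bbox$south) <= min_size
  if (n <= cap || too_small) return(list(bbox))

  mid_lon <- (bbox$west + bbox$east) / 2
  mid_lat <- (bbox$south + bbox$north) / 2
  quadrants <- list(
    list(west = bbox$west, east = mid_lon, south = mid_lat, north = bbox$north),
    list(west = mid_lon, east = bbox$east, south = mid_lat, north = bbox$north),
    list(west = bbox$west, east = mid_lon, south = bbox$south, north = mid_lat),
    list(west = mid_lon, east = bbox$east, south = bbox$south, north = mid_lat)
  )

  # Recurse into each quadrant and flatten the results into one list of tiles.
  do.call(c, lapply(quadrants, split_bbox,
                    count_fn = count_fn, cap = cap, min_size = min_size))
}
```

In practice `count_fn` could wrap `idig_count_records()` with a bounding-box `geopoint` query built from the tile, and each returned tile would then be passed to `idig_search_records()`. Records lying exactly on a shared tile edge may come back twice, so the combined result probably needs de-duplicating.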

mjcollin commented 8 years ago

The 100k record limit for both the API and ridigbio (and the Python client as well) is unfortunately going to remain in place, perhaps permanently. Removing it will require our rewriting parts of the API code significantly and we can't commit to a timeline for doing that.

Have you tried working with the download API? https://www.idigbio.org/wiki/index.php/IDigBio_Download_API

There currently is no interface in ridigbio that wraps around the download API but we are in the process of changing the way downloads are generated so adding one will be much easier for us to do in the future.

So your workaround of limiting the spatial extent to something that returns fewer than 100k records is something you will have to keep doing when using R.

I'll leave this open until I push a package update that removes the max_items parameter and issues a human-readable error when there are more than 100k results.
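Until such an update exists, a caller can approximate the same behavior with a pre-flight check on the count. This is only a sketch: `check_result_window` is a hypothetical name, and the default cap of 100000 is taken from the limit discussed in this thread.

```r
# Hypothetical pre-flight check, not part of ridigbio: fail early with a
# readable message when a query would exceed the result window.
check_result_window <- function(n_records, cap = 100000) {
  if (n_records > cap) {
    stop(sprintf(
      "Query matches %.0f records, but at most %.0f can be retrieved. Narrow the query.",
      n_records, cap
    ), call. = FALSE)
  }
  invisible(n_records)
}
```

The value passed in would typically be the result of `idig_count_records()` for the same `rq` that is about to be sent to `idig_search_records()`.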

Looping in @sckott, author of spocc, so he knows about this too.

sckott commented 8 years ago

Thanks for the heads up. I'll have a look in spocc and see if I need to change anything

iDigBio / ridigbio

idig_search_records returns lexical error when retrieving large number of data using max_items or offset argument #33
