danstoner opened this issue 6 years ago
An end-user reported an issue as follows:
I have a weird question about the iDigBio API. When I use the first of the two below links, I get an internal server error or sometimes a bad gateway error. When I use the second, it works. As far as I can tell, the only difference between the two links is that I replaced "Mugil" with "Gorilla". These are both genera for which there are records in iDigBio. Any ideas why this is?
https://search.idigbio.org/v2/search/records/?rq={%22genus%22:%20%22Mugil%22}&limit=100000
https://search.idigbio.org/v2/search/records/?rq={%22genus%22:%20%22Gorilla%22}&limit=100000
Thanks much!
The API has some rough edges and they might need to stay rough for a while. I do not want the API to return 502 Bad Gateway in this case.
We also noticed some funkiness with our backend Elasticsearch service over the past few weeks (brief periods of unavailability) but have not nailed down the cause of that. I don't think that is related to your report but just mention it in case you start getting weird responses (or no responses).
In your reported case, one major difference is that genus Mugil returns many records, whereas I would not expect Gorilla to return very many.
The 500 Internal Server Error comes from a backend error condition bubbling its way out to you. The 502 Bad Gateway comes from the frontend proxy server, which likely means the backend crashed instead of returning the 500 itself.
The response size definitely seems to affect the behavior.
dan@D810:~$ http --headers --timeout 60 'https://search.idigbio.org/v2/search/records/?rq={%22genus%22:%20%22Mugil%22}&limit=100000'
HTTP/1.0 502 Bad Gateway
Cache-Control: no-cache
Connection: close
Content-Type: text/html
dan@D810:~$ http --headers --timeout 60 'https://search.idigbio.org/v2/search/records/?rq={%22genus%22:%20%22Mugil%22}&limit=10000'
HTTP/1.1 500 Internal Server Error
Cache-Control: public, max-age=300
Content-Length: 33
Content-Type: application/json; charset=utf-8
Date: Sun, 01 Apr 2018 14:33:39 GMT
Vary: Accept-Encoding, Origin
dan@D810:~$ http --headers --timeout 60 'https://search.idigbio.org/v2/search/records/?rq={%22genus%22:%20%22Mugil%22}&limit=1000'
HTTP/1.1 200 OK
Cache-Control: public, max-age=300
Content-Length: 91756921
Content-Type: application/json; charset=utf-8
Date: Sun, 01 Apr 2018 14:36:41 GMT
Last-Modified: Wed, 21 Mar 2018 15:57:28 GMT
Vary: Accept-Encoding, Origin
dan@D810:~$ http --headers --timeout 60 'https://search.idigbio.org/v2/search/records/?rq={%22genus%22:%20%22Gorilla%22}&limit=100000'
HTTP/1.1 200 OK
Cache-Control: public, max-age=300
Content-Length: 10586442
Content-Type: application/json; charset=utf-8
Date: Sun, 01 Apr 2018 14:22:56 GMT
Last-Modified: Wed, 21 Mar 2018 15:57:28 GMT
Vary: Accept-Encoding, Origin
My next question is "what are you actually trying to get out of the API?" One way to reduce the size of the result (and improve the chances of success) is to use the "fields" parameter to return only those fields you actually need, rather than every field that we store. If I limit the results to a single field, "uuid", the large query works fine.
dan@D810:~$ http --headers --timeout 60 'https://search.idigbio.org/v2/search/records/?rq={%22genus%22:%20%22Mugil%22}&fields=["uuid"]&limit=100000'
HTTP/1.1 200 OK
Cache-Control: public, max-age=300
Content-Length: 859884
Content-Type: application/json; charset=utf-8
Date: Sun, 01 Apr 2018 14:41:07 GMT
Last-Modified: Wed, 21 Mar 2018 15:57:28 GMT
Vary: Accept-Encoding, Origin
Seems to also work with 5 fields, probably more.
TODO: Add the "fields" tip prominently in the wiki.
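For anyone scripting this, here is a minimal sketch of the same fields-restricted query in Python with requests. The second field name ("scientificname") is just an illustrative addition; only "uuid" appears in the query above.

```python
# Hedged sketch: request only selected fields to keep the response small.
import json
import requests

resp = requests.get(
    "https://search.idigbio.org/v2/search/records/",
    params={
        "rq": json.dumps({"genus": "Mugil"}),
        "fields": json.dumps(["uuid", "scientificname"]),  # second field is illustrative
        "limit": 100000,
    },
    timeout=60,
)
print(resp.status_code, resp.headers.get("Content-Length"))
```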
Here are the server-side logs when running the big query /v2/search/records/?rq={"genus":"Mugil"}&limit=100000:
2018-04-02T13:16:12.882Z - info: 200.144.120.56 - "POST /v2/search/records/ HTTP/1.1" 200 85 - 11.730 ms
2018-04-02T13:16:13.008Z - error: uncaughtException: Invalid string length date=Mon Apr 02 2018 13:16:13 GMT+0000 (America), pid=251, uid=33, gid=33, cwd=/var/www, execPath=/usr/bin/node, version=v6.11.5, argv=[/usr/bin/node, /var/www/index.js], rss=706891776, heapTotal=727814144, heapUsed=589580656, external=4438743, loadavg=[0.2529296875, 0.14453125, 0.09912109375], uptime=868042, trace=[column=34, file=/var/www/node_modules/elasticsearch/src/lib/connectors/http.js, function=null, line=180, method=null, native=false, column=13, file=events.js, function=emitOne, line=96, method=null, native=false, column=7, file=events.js, function=IncomingMessage.emit, line=188, method=emit, native=false, column=18, file=_stream_readable.js, function=readableAddChunk, line=176, method=null, native=false, column=10, file=_stream_readable.js, function=IncomingMessage.Readable.push, line=134, method=push, native=false, column=22, file=_http_common.js, function=HTTPParser.parserOnBody, line=123, method=parserOnBody, native=false, column=20, file=_http_client.js, function=Socket.socketOnData, line=363, method=socketOnData, native=false, column=13, file=events.js, function=emitOne, line=96, method=null, native=false, column=7, file=events.js, function=Socket.emit, line=188, method=emit, native=false, column=18, file=_stream_readable.js, function=readableAddChunk, line=176, method=null, native=false, column=10, file=_stream_readable.js, function=Socket.Readable.push, line=134, method=push, native=false, column=20, file=net.js, function=TCP.onread, line=547, method=onread, native=false], stack=[RangeError: Invalid string length, at IncomingMessage.<anonymous> (/var/www/node_modules/elasticsearch/src/lib/connectors/http.js:180:34), at emitOne (events.js:96:13), at IncomingMessage.emit (events.js:188:7), at readableAddChunk (_stream_readable.js:176:18), at IncomingMessage.Readable.push (_stream_readable.js:134:10), at HTTPParser.parserOnBody (_http_common.js:123:22), at Socket.socketOnData (_http_client.js:363:20), at emitOne (events.js:96:13), at Socket.emit (events.js:188:7), at readableAddChunk (_stream_readable.js:176:18), at Socket.Readable.push (_stream_readable.js:134:10), at TCP.onread (net.js:547:20)]
2018-04-02T13:16:13.051Z - warn: Server(251) died.
This issue has been resolved. It was caused by a nested aggregation in ES.
I looked at this issue yesterday; the query below still returns a 500 error:
https://search.idigbio.org/v2/search/records/?rq={%22genus%22:%20%22Mugil%22}&limit=100000
Confused this with the map issue. It's probably the same issue. I'll take a look today.
@wilsotc @nrejack hi! Today I noticed that the iDigBio API stops serving pages after 100k items.
with
iDigBio returns a 502 error,
however, for a page request prior to that, e.g.,
iDigBio says 200 OK and serves 100 items.
This limit is indicative of some Elasticsearch client/server configuration.
I found a possibly related issue at https://stackoverflow.com/questions/41655913/how-do-i-retrieve-more-than-10000-results-events-in-elastic-search .
btw, restricting the fields doesn't seem to matter,
e.g.,
the iDigBio API still returns a 502 error.
We have a download API that handles >100K records. That being said, I'm currently working on allowing the use of offset to retrieve unlimited records via the search API.
@wilsotc thanks for the update! Hoping to use the search API instead of the download API because the JSON-formatted results work well for my use cases: crawling the iDigBio registries and building static websites.
@wilsotc Would it be easier to post an exhaustive list of uuids like:
mediarecords.tsv:
recordset | mediarecord uuid |
---|---|
[some uuid] | [some media uuid] |
and
records.tsv:
recordset | record uuid |
---|---|
[some uuid] | [some record uuid] |
Alternatively, you can periodically provide a data dump of all JSON data in your Elasticsearch cluster. Then you can just tell folks like me: hey, there's all the data, go build your own index.
Our Elasticsearch cluster is populated by the idb-backend project (https://github.com/iDigBio/idb-backend). Theoretically you could populate an Elasticsearch cluster with it. I'm not sure what its inputs are, but I could find out if you'd like to pursue this.
@wilsotc thanks for the suggestion to use idb-backend as a way to re-use the iDigBio pipeline that transforms DwC-A and related EML into the JSON structures suitable for Elasticsearch. I appreciate that you took the time to respond.
At a quick glance, there's a tight coupling between idb-backend's transformation from DwC-A to JSON, the loading of that JSON into Elasticsearch, and the population of uuids and other recordset metadata into the Postgres database. This tight coupling makes sense in the iDigBio context, because the data processing pipelines are very specific. However, it would make it hard for me to re-use your valuable code base, simply because I would have to install a wide array of dependencies (Elasticsearch, Postgres, etc.) just to transform a DwC-A into its iDigBio XML/JSON representation.
Also, for integration with iDigBio, the uuids generated as part of the ingestion process are needed. These uuids cannot be re-computed or inferred from the datasets, even if I were able to re-use the idb-backend project.
Perhaps a way to work around the traditional limitations of APIs (e.g., page size, page limits) is to publish a catalog of all uuids known to iDigBio along with their associated types (e.g., media, record, recordset) and (optionally) the uuid of their containers. This lightweight list would serve as a catalog of all objects known to iDigBio and would enable a way to iterate through all, or a subset of, the iDigBio objects.
An example uuids.tsv with artificial values:
uuid | type | parent-uuid |
---|---|---|
73b07816-6916-4e40-aa8f-2613d24da4f4 | recordset | |
e80e5fec-bcc3-4dc7-8cfa-cd42c6e302ce | record | 73b07816-6916-4e40-aa8f-2613d24da4f4 |
47a0231a-9685-411f-aa29-b97a3f25ce17 | mediarecord | e80e5fec-bcc3-4dc7-8cfa-cd42c6e302ce |
Alternatively, you can publish a uuid list per record set.
A cheap way to expose these is to run a query against your Postgres database, dump all the uuids into a file, and put that file, along with a README, into a Zenodo publication. These uuids are literally the keys to the vast repository of knowledge that iDigBio keeps, and, as mentioned earlier, they cannot be reproduced. So I figure that having a copy of all of them in a data publication wouldn't hurt and would also provide a starting point for whole-dataset analysis of iDigBio-indexed data.
Curious to hear your thoughts on how to make it easier to access iDigBio's valuable pool of interpreted datasets and the records they contain.
One point of interest is that we have that EXACT table in Postgres:
=> \d uuids
Table "uuids"
Column | Type | Modifiers
---------+-----------------------+------------------------
id | uuid | not null
type | character varying(50) | not null
parent | uuid |
deleted | boolean | not null default false
I think the challenge at the moment would be how to automate the publishing of this information in a repeatable fashion.
@danstoner neat! As far as automation goes . . . how about:
1. query the uuids table in Postgres and dump it to a TSV file
2. bundle the TSV with a short README describing the columns
3. publish the bundle to Zenodo
Steps 1-3 can be automated in a bash script (rough sketch below).
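Here is a rough sketch of steps 1 and 2 in Python (a bash wrapper would just run this and then push the files to Zenodo). The connection settings are placeholders, and it assumes read access to the uuids table shown above; treat it as a starting point rather than a finished tool.

```python
# Rough sketch (steps 1-2): dump the uuids table to TSV and write a
# short README next to it. Connection settings are placeholders.
import csv
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="idigbio", user="readonly")
with conn, conn.cursor(name="uuid_dump") as cur, \
        open("uuids.tsv", "w", newline="") as out:
    cur.execute("SELECT id, type, parent FROM uuids WHERE NOT deleted")
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["uuid", "type", "parent-uuid"])
    for id_, type_, parent in cur:  # server-side cursor streams rows
        writer.writerow([id_, type_, parent or ""])

with open("README", "w") as readme:
    readme.write("uuids.tsv: one row per iDigBio object "
                 "(uuid, type, parent-uuid), dumped from the uuids table.\n")
```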
Step 3 would actually have quite a few decisions that would need to be made (naming, version retention, metadata), especially if you want it to be "findable" somehow.
Does Zenodo have an API for publishing new datasets?
> Step 3 would actually have quite a few decisions that would need to be made (naming, version retention, metadata), especially if you want it to be "findable" somehow.
I agree, and you can start simple at first, with files like:
uuids.tsv
README
or
README <-- instructions on how to use the files
recordsets.tsv <-- list of recordset uuids
01c2ba22-2552-43fe-b639-cd7880efa327.tsv <-- file containing the uuid table for a specific recordset
741a70c7-d4a0-404a-90ef-eb76d98cbfe4.tsv <-- file containing the uuid table for a specific recordset
...
> Does Zenodo have an API for publishing new datasets?
Yep. And I usually start with a manual workflow first.
An alternative to going direct to Zenodo would be to group the uuids by recordset, check them into a GitHub repo, enable the Zenodo integration, and manage the periodic publications using GitHub releases. Grouping by recordset should keep the files small enough to fit within GitHub's quota of <100MB per file.
In cases like this, the technical details of how/what to provide are usually the more straightforward part. The bigger issues are how to do so in a way that makes the most sense, fits best with the rest of the catalogue of services that iDigBio provides, and is sustainable going forward.
This is something that we can bring up and discuss in a broader team context.
Also of note is that it's on our list to begin making periodic dumps of idigbio data available again. Timeline TBD of course but it's something we'd very much like to re-enable.
@roncanepa thanks for chiming in. Whether or not you choose to address this issue is up to you; I understand that you and your iDigBio colleagues have a lot of work on your plates.
It would be helpful if you are clear about your priorities and mark this issue as "do not fix" if that is what you decide. That way, I can move on to investing my time in alternate approaches to making existing infrastructures suitable for whole-dataset analysis.
Given the scope of our work and small team size, it's very difficult to list, rank, or discuss priorities because things often shift. This isn't necessarily a "wontfix" situation, as we've mentioned two things (removing the 100k limit, resuming data dumps) that would alleviate some of these issues. We just can't say when we'll get a chance to work on them and get them into production.
@roncanepa thanks for taking the time to clarify your situation - I'll continue working on non-iDigBio alternatives and am eager to hear when you make progress on ways to do whole dataset analysis of the iDigBio graph.
It turns out the API already supports this. I'm adding a section to the API wiki to cover this. All that is required is that you use a sort order and add a range to the end of your search parameters. The example below only uses a recordset. Just feed the last uuid from the previous page into the "gt" line to get the next page of records.
{
"rq": {
"recordset": "a6eee223-cf3b-4079-8bb2-b77dad8cae9d",
"uuid": {
"type": "range",
"gt": "0000bc6b-e2f1-4482-91d0-0d5cd76acb4b",
"lte": "ffffffff-ffff-ffff-ffff-ffffffffffff"
}
},
"sort": [
{
"uuid": "asc"
}
],
"limit": 100,
"offset": 0
}
You can also use larger limit values than 100.
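To make the loop concrete, here is a minimal Python sketch of that paging pattern. It assumes the response lists hits under "items" with a "uuid" key on each item, so double-check the key names against a real response; it POSTs the same JSON structure as the example above.

```python
# Sketch of uuid-range paging: sort by uuid ascending and move the
# "gt" bound to the last uuid seen after every page.
import requests

SEARCH_URL = "https://search.idigbio.org/v2/search/records/"

def fetch_all(rq, page_size=1000):
    last_uuid = "00000000-0000-0000-0000-000000000000"
    while True:
        body = {
            "rq": dict(rq, uuid={"type": "range",
                                 "gt": last_uuid,
                                 "lte": "ffffffff-ffff-ffff-ffff-ffffffffffff"}),
            "sort": [{"uuid": "asc"}],
            "limit": page_size,
            "offset": 0,
        }
        resp = requests.post(SEARCH_URL, json=body, timeout=60)
        resp.raise_for_status()
        items = resp.json().get("items", [])
        if not items:
            break
        yield from items
        last_uuid = items[-1]["uuid"]  # next page starts after this uuid

# Example: stream every record in one recordset.
for record in fetch_all({"recordset": "a6eee223-cf3b-4079-8bb2-b77dad8cae9d"}):
    print(record["uuid"])
```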
I have added an example on our additional examples page written in Python to retrieve 100,000+ record query results. Let me know if this works for you or not.
https://github.com/iDigBio/idigbio-search-api/wiki/Additional-Examples
@wilsotc much appreciated - am feverishly attempting to confirm this method to work around the 100k limitation. Meanwhile, I was wondering: would you advise injecting this range/sort into each query when starting to request API pages?
Yes, you'll need the range parameter below whatever your search parameters are, as well as the sort order. You will also need to update the "gt" parameter within the range on each subsequent request so that you get the next block of N records.
Try the example code to get a feel for it. If you have any questions, let me know.
Ah I see. Thanks for clarifying. What a neat trick! I'll continue trying to reproduce and implement this.
btw - I just stumbled across Elasticsearch scrolling: https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#scroll-search-results . Is this something you support?
I don't believe we utilize the scroll API internally, and we can't directly expose the internal Elasticsearch REST API as it is also the administrative interface. This facility is probably most appropriate for the download API, as there is some overhead associated with it: it creates a snapshot with a finite life span, though that may not be very burdensome. In a v3 API it will definitely be considered.
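For anyone curious, the raw Elasticsearch scroll flow looks roughly like the sketch below. This is only an illustration of the overhead being discussed, not something the public iDigBio API exposes; the host and index names are placeholders.

```python
# Illustration of the Elasticsearch scroll API: the first request opens
# a scroll context (a point-in-time snapshot), later requests page
# through it, and the context should be cleared when finished.
import requests

ES = "http://localhost:9200"  # placeholder: an internal ES endpoint
INDEX = "records"             # placeholder: an index name

page = requests.post(f"{ES}/{INDEX}/_search", params={"scroll": "1m"},
                     json={"size": 1000, "query": {"match_all": {}}}).json()
scroll_id = page["_scroll_id"]
while page["hits"]["hits"]:
    # ... process page["hits"]["hits"] here ...
    page = requests.post(f"{ES}/_search/scroll",
                         json={"scroll": "1m", "scroll_id": scroll_id}).json()
    scroll_id = page["_scroll_id"]
# Clear the scroll context; it holds server-side resources until it expires.
requests.delete(f"{ES}/_search/scroll", json={"scroll_id": scroll_id})
```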
@wilsotc Thanks for sharing your perspective and insights on the Elasticsearch scroll functionality.
I've played around with your workaround and it seems to work as expected.
It feels like, if I implement the workaround, I am effectively building a facade on top of the iDigBio search API to enable exhaustive streaming of structured data from iDigBio.
Selfishly, I'd hope that the iDigBio engineering team would inject this uuid range/sort trick when handling search requests, to make for a less limited API.
I am going to think about this over the weekend, and hoping to come up with some way to benefit from the insightful research you've done. Thanks for being patient.
In combination with the modified date field as an API parameter and sort key, this could be used to live-sync an external mirror with low overhead for both the source and destination systems. As time allows I will look at this.
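A sync pass could filter on the modification date and still page by uuid, with a request body roughly along these lines. The field name "datemodified" and the "gte" range key are assumptions that would need to be verified against the index.

```python
# Hypothetical incremental-sync query body: records changed since the
# last sync date, paged with the same uuid range/sort trick as above.
# "datemodified" and the "gte" range key are assumptions to verify.
sync_body = {
    "rq": {
        "datemodified": {"type": "range", "gte": "2018-04-01"},
        "uuid": {
            "type": "range",
            "gt": "00000000-0000-0000-0000-000000000000",
            "lte": "ffffffff-ffff-ffff-ffff-ffffffffffff",
        },
    },
    "sort": [{"uuid": "asc"}],
    "limit": 1000,
    "offset": 0,
}
```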
I have added exception handling to the code example after testing. This should prevent the script from failing when, for whatever reason, the search API query doesn't return a valid JSON object. Please take a look and let me know how it works for you when using a recordset as the query.
You might also be interested in the data retrieval example I added to the same documentation page (Examples).
hey @wilsotc - as promised, I spent some time thinking about, and playing around with, your proposed solution. Given the complexity of the workaround (and testing it!) and the many edge cases (e.g., what to do when there's already a uuid range or sort order defined?), I am hesitant to implement this at this point.
Thanks again for proposing your clever workaround for easily accessing large numbers of iDigBio records and ... perhaps I'll change my mind and implement the workaround anyway . . . especially because there's so much great stuff captured in your rich search indexes and image caches.
You can add anything you like to the sort order; as long as uuid is the last field in the sort order, this will work. For additional query values, including ranges, just add them in the rq field.
Depending on the deployment architecture, this may also cause a 502 Bad Gateway from a frontend proxy server, etc.