iDigBio / idigbio-search-api

Server-side code driving iDigBio's search functionality.
GNU General Public License v3.0

large results cause 500 Internal Server Error #32

Open danstoner opened 6 years ago

danstoner commented 6 years ago

Depending on deployment architecture, may also cause 502 Bad Gateway from a frontend proxy server, etc.

danstoner commented 6 years ago

An end-user reported the issue thus:

I have a weird question about the iDigBio API. When I use the first of the two below links, I get an internal server error or sometimes a bad gateway error. When I use the second, it works. As far as I can tell, the only difference between the two links is that I replaced "Mugil" with "Gorilla". These are both genera for which there are records in iDigBio. Any ideas why this is?

https://search.idigbio.org/v2/search/records/?rq={%22genus%22:%20%22Mugil%22}&limit=100000

https://search.idigbio.org/v2/search/records/?rq={%22genus%22:%20%22Gorilla%22}&limit=100000

Thanks much!

danstoner commented 6 years ago

The API has some rough edges and they might need to stay rough for a while. I do not want the API to return 502 Bad Gateway in this case.

We also noticed some funkiness with our backend Elasticsearch service over the past few weeks (brief periods of unavailability) but have not nailed down the cause of that. I don't think that is related to your report but just mention it in case you start getting weird responses (or no responses).

In your reported case, one major difference is that genus Mugil returns many records, whereas I would not expect Gorilla to return very many.

The 500 Internal Server Error comes from a backend error condition bubbling its way out to you. The 502 Bad Gateway comes from the frontend proxy server, which likely means the backend process crashed before it could even issue the 500.

The response size definitely seems to affect the behavior.

dan@D810:~$ http --headers --timeout 60 'https://search.idigbio.org/v2/search/records/?rq={%22genus%22:%20%22Mugil%22}&limit=100000'
HTTP/1.0 502 Bad Gateway
Cache-Control: no-cache
Connection: close
Content-Type: text/html

dan@D810:~$ http --headers --timeout 60 'https://search.idigbio.org/v2/search/records/?rq={%22genus%22:%20%22Mugil%22}&limit=10000'
HTTP/1.1 500 Internal Server Error
Cache-Control: public, max-age=300
Content-Length: 33
Content-Type: application/json; charset=utf-8
Date: Sun, 01 Apr 2018 14:33:39 GMT
Vary: Accept-Encoding, Origin

dan@D810:~$ http --headers --timeout 60 'https://search.idigbio.org/v2/search/records/?rq={%22genus%22:%20%22Mugil%22}&limit=1000'
HTTP/1.1 200 OK
Cache-Control: public, max-age=300
Content-Length: 91756921
Content-Type: application/json; charset=utf-8
Date: Sun, 01 Apr 2018 14:36:41 GMT
Last-Modified: Wed, 21 Mar 2018 15:57:28 GMT
Vary: Accept-Encoding, Origin

dan@D810:~$ http --headers --timeout 60 'https://search.idigbio.org/v2/search/records/?rq={%22genus%22:%20%22Gorilla%22}&limit=100000'
HTTP/1.1 200 OK
Cache-Control: public, max-age=300
Content-Length: 10586442
Content-Type: application/json; charset=utf-8
Date: Sun, 01 Apr 2018 14:22:56 GMT
Last-Modified: Wed, 21 Mar 2018 15:57:28 GMT
Vary: Accept-Encoding, Origin

My next question is "what are you actually trying to get out of the API?" One way to reduce the size of the result (and improve the chances of success) is to use the "fields" parameter to return only those fields you actually need, rather than every field that we store. If I limit the results to the single field "uuid", the large query works OK.

dan@D810:~$ http --headers --timeout 60 'https://search.idigbio.org/v2/search/records/?rq={%22genus%22:%20%22Mugil%22}&fields=["uuid"]&limit=100000'
HTTP/1.1 200 OK
Cache-Control: public, max-age=300
Content-Length: 859884
Content-Type: application/json; charset=utf-8
Date: Sun, 01 Apr 2018 14:41:07 GMT
Last-Modified: Wed, 21 Mar 2018 15:57:28 GMT
Vary: Accept-Encoding, Origin

It also seems to work with 5 fields, and probably more.
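
For anyone scripting this, here is a minimal Python sketch of the same fields-limited request (a sketch, not code from this thread; it assumes the requests library and the usual items/uuid structure of the v2 response):

import requests

# Same query as above: genus Mugil, but only the "uuid" field, which keeps the
# response small enough for the API to assemble.
resp = requests.get(
    "https://search.idigbio.org/v2/search/records/",
    params={
        "rq": '{"genus": "Mugil"}',
        "fields": '["uuid"]',
        "limit": 100000,
    },
    timeout=60,
)
resp.raise_for_status()
uuids = [item["uuid"] for item in resp.json()["items"]]
print(len(uuids), "records")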

danstoner commented 6 years ago

TODO: Add the "fields" tip prominently in the wiki.

danstoner commented 6 years ago

Here are the server-side logs when running the big query /v2/search/records/?rq={"genus":"Mugil"}&limit=100000:


2018-04-02T13:16:12.882Z - info: 200.144.120.56 - "POST /v2/search/records/ HTTP/1.1" 200 85 - 11.730 ms
2018-04-02T13:16:13.008Z - error: uncaughtException: Invalid string length date=Mon Apr 02 2018 13:16:13 GMT+0000 (America), pid=251, uid=33, gid=33, cwd=/var/www, execPath=/usr/bin/node, version=v6.11.5, argv=[/usr/bin/node, /var/www/index.js], rss=706891776, heapTotal=727814144, heapUsed=589580656, external=4438743, loadavg=[0.2529296875, 0.14453125, 0.09912109375], uptime=868042, trace=[column=34, file=/var/www/node_modules/elasticsearch/src/lib/connectors/http.js, function=null, line=180, method=null, native=false, column=13, file=events.js, function=emitOne, line=96, method=null, native=false, column=7, file=events.js, function=IncomingMessage.emit, line=188, method=emit, native=false, column=18, file=_stream_readable.js, function=readableAddChunk, line=176, method=null, native=false, column=10, file=_stream_readable.js, function=IncomingMessage.Readable.push, line=134, method=push, native=false, column=22, file=_http_common.js, function=HTTPParser.parserOnBody, line=123, method=parserOnBody, native=false, column=20, file=_http_client.js, function=Socket.socketOnData, line=363, method=socketOnData, native=false, column=13, file=events.js, function=emitOne, line=96, method=null, native=false, column=7, file=events.js, function=Socket.emit, line=188, method=emit, native=false, column=18, file=_stream_readable.js, function=readableAddChunk, line=176, method=null, native=false, column=10, file=_stream_readable.js, function=Socket.Readable.push, line=134, method=push, native=false, column=20, file=net.js, function=TCP.onread, line=547, method=onread, native=false], stack=[RangeError: Invalid string length,     at IncomingMessage.<anonymous> (/var/www/node_modules/elasticsearch/src/lib/connectors/http.js:180:34),     at emitOne (events.js:96:13),     at IncomingMessage.emit (events.js:188:7),     at readableAddChunk (_stream_readable.js:176:18),     at IncomingMessage.Readable.push (_stream_readable.js:134:10),     at HTTPParser.parserOnBody (_http_common.js:123:22),     at Socket.socketOnData (_http_client.js:363:20),     at emitOne (events.js:96:13),     at Socket.emit (events.js:188:7),     at readableAddChunk (_stream_readable.js:176:18),     at Socket.Readable.push (_stream_readable.js:134:10),     at TCP.onread (net.js:547:20)]
2018-04-02T13:16:13.051Z - warn: Server(251) died.
wilsotc commented 4 years ago

This issue has been resolved. It was caused by a nested aggregation in ES.

nrejac commented 4 years ago

I looked at this issue yesterday; the query below still returns a 500 error:

https://search.idigbio.org/v2/search/records/?rq={%22genus%22:%20%22Mugil%22}&limit=100000

wilsotc commented 4 years ago

I confused this with the map issue. It's probably the same underlying issue. I'll take a look today.

jhpoelen commented 4 years ago

@wilsotc @nrejack hi! Today I noticed that the iDigBio API stops serving pages after 100k items.

with

https://search.idigbio.org/v2/search/records/?rq=%7B%22recordset%22%3A+%22eaa5f19e-ff6f-4d09-8b55-4a6810e77a6c%22%7D&limit=100&offset=100000

iDigBio returns a 502 Bad Gateway error,

however, for a page request just prior to that, e.g.,

https://search.idigbio.org/v2/search/records/?rq=%7B%22recordset%22%3A+%22eaa5f19e-ff6f-4d09-8b55-4a6810e77a6c%22%7D&limit=100&offset=99900

iDigBio returns 200 OK and serves 100 items.

This limit is indicative of some Elasticsearch client/server configuration.

I found a possibly related issue at https://stackoverflow.com/questions/41655913/how-do-i-retrieve-more-than-10000-results-events-in-elastic-search .
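
For context, and purely as an assumption on my side: Elasticsearch caps from+size paging with the index.max_result_window setting (10,000 hits by default), and a deployment that raised it to 100,000 would match the cutoff observed above. Only an operator with access to the internal cluster could inspect or change it, roughly like this (host and index name are placeholders):

import requests

# Hypothetical internal Elasticsearch host and index -- not publicly reachable.
ES = "http://localhost:9200"
INDEX = "idigbio-records"

# Inspect the current from+size paging cap (defaults to 10000).
print(requests.get(f"{ES}/{INDEX}/_settings/index.max_result_window").json())

# Raising it allows deeper offsets, at the cost of more memory per deep page.
requests.put(f"{ES}/{INDEX}/_settings", json={"index": {"max_result_window": 200000}})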

jhpoelen commented 4 years ago

btw, restricting the fields doesn't seem to matter

e.g.,

https://search.idigbio.org/v2/search/records/?rq={%22recordset%22%3A+%22eaa5f19e-ff6f-4d09-8b55-4a6810e77a6c%22}&fields=[%22uuid%22]&limit=100&offset=100000

the iDigBio API still returns a 502 Bad Gateway error.

wilsotc commented 4 years ago

We have a download API that handles >100K records. That being said, I'm currently working on allowing the use of offset to retrieve unlimited records via the search API.

jhpoelen commented 4 years ago

@wilsotc thanks for the update! I'm hoping to use the search API instead of the download API because the JSON-formatted results work well for my use cases: crawling the iDigBio registries and building static websites.

jhpoelen commented 4 years ago

@wilsotc Would it be easier to post an exhaustive list of uuids like:

mediarecords.tsv:

recordset uuid    media record uuid
[some uuid]       [some media uuid]

and

records.tsv:

recordset uuid    record uuid
[some uuid]       [some record uuid]

jhpoelen commented 4 years ago

Alternatively, you could periodically provide a data dump of all the JSON data in your Elasticsearch cluster. Then you can just tell folks like me: hey, there's all the data, go build your own index.

wilsotc commented 4 years ago

Our Elasticsearch cluster is populated by the idb-backend https://github.com/iDigBio/idb-backend project. Theoretically you could populate your own Elasticsearch cluster with it. I'm not sure what its inputs are, but I could find out if you'd like to pursue this.

jhpoelen commented 4 years ago

@wilsotc thanks for the suggestion to use idb-backend as a way to re-use the iDigBio pipeline that transforms dwca and related eml into the JSON structures suitable for Elasticsearch. I appreciate that you took the time to respond.

At a quick glance, there's tight coupling between idb-backend's transformation from dwca -> JSON, the loading of that JSON into Elasticsearch, and the population of uuids and other recordset metadata into the Postgres database. This tight coupling makes sense in the iDigBio context, because the data processing pipelines are very specific. However, it would make it hard for me to re-use your valuable code base, simply because I would have to install a wide array of dependencies (Elasticsearch, Postgres, etc.) just to transform a dwca into its iDigBio xml/json representation.

Also, for integration with iDigBio, the uuids generated as part of the ingestion process are needed. These uuids cannot be re-computed or inferred from the datasets, even if I were able to re-use the idb-backend project.

Perhaps a way to work around the traditional limitations of APIs (e.g., page size, page limits) is to publish a catalog of all uuids known to iDigBio, their associated types (e.g., media, record, recordset), and (optionally) the uuid of their containers. This lightweight list would serve as a catalog of all objects known to iDigBio and would enable a way to iterate through all, or a subset of, the iDigBio objects.

example uuids.tsv with artificial examples -

uuid type parent-uuid
73b07816-6916-4e40-aa8f-2613d24da4f4 recordset
e80e5fec-bcc3-4dc7-8cfa-cd42c6e302ce record 73b07816-6916-4e40-aa8f-2613d24da4f4
47a0231a-9685-411f-aa29-b97a3f25ce17 mediarecord e80e5fec-bcc3-4dc7-8cfa-cd42c6e302ce

Alternatively, you can publish a uuid list per record set.

A cheap way to expose these is to make a query in your Postgres database, dump all the uuids into a file, and put these, along with a readme, into some Zenodo publication. These uuids are literally the key to the vast repository of knowledge that iDigBio keeps, and they cannot be re-produced, as mentioned earlier. So I figure that having a copy of all of them in some data publication wouldn't hurt, and it would also provide a starting point for whole-dataset analysis of iDigBio-indexed data.

Curious to hear your thoughts on how to make it easier to access iDigBio's valuable pool of interpreted datasets and the records they contain.

danstoner commented 4 years ago

One point of interest is that we have that EXACT table in postgres:

=> \d uuids
                   Table "uuids"
 Column  |         Type          |       Modifiers        
---------+-----------------------+------------------------
 id      | uuid                  | not null
 type    | character varying(50) | not null
 parent  | uuid                  | 
 deleted | boolean               | not null default false

I think the challenge at the moment would be how to automate the publishing of this information in a repeatable fashion.

jhpoelen commented 4 years ago

@danstoner neat! As far as automation goes . . . how about:

  1. export table to tsv
  2. sort the tsv
  3. publish the tsv

Steps 1-3 can be automated in a bash script; a rough sketch of steps 1 and 2 follows.
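
As one possible shape for steps 1 and 2 (a sketch only, not tested against the real database; it assumes psycopg2, the uuids table shown above, and placeholder connection details):

import psycopg2

# Placeholder connection string -- the real host/credentials are deployment-specific.
conn = psycopg2.connect("dbname=idb_api_db user=readonly")

with conn.cursor() as cur, open("uuids.tsv", "w") as out:
    # Steps 1 and 2: export the uuids table as TSV, sorted in the query itself
    # so the dump is reproducible.
    cur.copy_expert(
        "COPY (SELECT id, type, parent FROM uuids WHERE NOT deleted ORDER BY id) "
        "TO STDOUT WITH (FORMAT csv, DELIMITER E'\\t', HEADER)",
        out,
    )

# Step 3 (publishing the file, e.g. to Zenodo) stays a separate upload step.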

danstoner commented 4 years ago

Step 3 would actually have quite a few decisions that would need to be made (naming, version retention, metadata), especially if you want it to be "findable" somehow.

Does Zenodo have an API for publishing new datasets?

jhpoelen commented 4 years ago

Step 3 would actually have quite a few decisions that would need to be made (naming, version retention, metadata), especially if you want it to be "findable" somehow.

I agree, and you could start simple at first, with files like:

uuids.tsv
README

or

README <-- instructions on how to use the files
recordsets.tsv <-- list of record set uuids
01c2ba22-2552-43fe-b639-cd7880efa327.tsv <-- uuid table for a specific recordset
741a70c7-d4a0-404a-90ef-eb76d98cbfe4.tsv <-- uuid table for a specific recordset
...

Does Zenodo have an API for publishing new datasets?

Yep. And I usually start with a manual workflow first.

An alternative to publishing directly to Zenodo would be to group the uuids by recordset, check them into a GitHub repo, enable the Zenodo integration, and manage the periodic publications using GitHub releases. Grouping by recordset should keep the files small enough to fit within GitHub's quota of <100MB per file.

roncanepa commented 4 years ago

In cases like this, the technical details of how and what to provide are usually the more straightforward part. The bigger questions are how to do it in a way that makes the most sense, fits best with the rest of the catalogue of services that iDigBio provides, and is sustainable going forward.

This is something that we can bring up and discuss in a broader team context.

Also of note: it's on our list to begin making periodic dumps of iDigBio data available again. Timeline TBD, of course, but it's something we'd very much like to re-enable.

jhpoelen commented 4 years ago

@roncanepa thanks for chiming in. Whether or not you choose to address this issue is up to you; I understand that you and your iDigBio colleagues have a lot of work on your plates.

It would be helpful, though, if you are clear about your priorities and mark this issue as "do not fix" if that is the case. That way, I can move on to investing my time in alternate approaches to making existing infrastructures suited for whole-dataset analysis.

roncanepa commented 4 years ago

Given the scope of our work and our small team size, it's very difficult to list, rank, or discuss priorities because things often shift. This isn't necessarily a "wontfix" situation, as we've mentioned two things (removing the 100k limit, resuming data dumps) that would alleviate some of these issues. We just can't say when we'll get a chance to work on them and get them into production.

jhpoelen commented 4 years ago

@roncanepa thanks for taking the time to clarify your situation - I'll continue working on non-iDigBio alternatives and am eager to hear when you make progress on ways to do whole dataset analysis of the iDigBio graph.

wilsotc commented 4 years ago

It turns out the API already supports this. I'm adding a section to the API wiki to cover this. All that is required is that you use a sort order and add a range to the end of your search parameters. The example below only uses a recordset. Just feed the last uuid from the previous page into the "gt" line to get the next page of records.

{
  "rq": {
    "recordset": "a6eee223-cf3b-4079-8bb2-b77dad8cae9d",
    "uuid": {
      "type": "range",
      "gt": "0000bc6b-e2f1-4482-91d0-0d5cd76acb4b",
      "lte": "ffffffff-ffff-ffff-ffff-ffffffffffff"
    }
  },
  "sort": [
    {
      "uuid": "asc"
    }
  ],
  "limit": 100,
  "offset": 0
}
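
For reference, a minimal Python sketch of this paging pattern (a sketch, not the wiki example; it assumes the requests library, the POST form of the /v2/search/records/ endpoint, and that each returned item carries a top-level "uuid"):

import requests

SEARCH_URL = "https://search.idigbio.org/v2/search/records/"
LIMIT = 100

def fetch_all(recordset):
    # Start the range below any real uuid, then advance "gt" to the last uuid seen.
    last_uuid = "00000000-0000-0000-0000-000000000000"
    while True:
        query = {
            "rq": {
                "recordset": recordset,
                "uuid": {
                    "type": "range",
                    "gt": last_uuid,
                    "lte": "ffffffff-ffff-ffff-ffff-ffffffffffff",
                },
            },
            "sort": [{"uuid": "asc"}],
            "limit": LIMIT,
            "offset": 0,
        }
        items = requests.post(SEARCH_URL, json=query, timeout=60).json().get("items", [])
        if not items:
            break
        yield from items
        last_uuid = items[-1]["uuid"]

# Example: iterate over every record in one recordset, well past the 100k mark.
for record in fetch_all("a6eee223-cf3b-4079-8bb2-b77dad8cae9d"):
    print(record["uuid"])
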
wilsotc commented 4 years ago

You can also use larger limit values than 100.

wilsotc commented 4 years ago

I have added a Python example to our additional examples page that retrieves query results of 100,000+ records. Let me know whether or not it works for you.

https://github.com/iDigBio/idigbio-search-api/wiki/Additional-Examples

jhpoelen commented 4 years ago

@wilsotc much appreciated - I am feverishly attempting to confirm this method as a workaround for the 100k limitation. Meanwhile, I was wondering: would you advise injecting this range/sort into every query when requesting API pages?

wilsotc commented 4 years ago

Yes, you'll need the range parameter below whatever your search parameters are, as well as the sort order. Each subsequent request will also need to update the "gt" parameter within the range so that you get the next block of N records.

Try the example code to get a feel for it. If you have any questions, let me know.

jhpoelen commented 4 years ago

Ah, I see. Thanks for clarifying. What a neat trick! I'll continue trying to reproduce and implement this.

jhpoelen commented 4 years ago

btw - I just stumbled across Elasticsearch scrolling: https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#scroll-search-results . Is this something you support?

wilsotc commented 4 years ago

I don't believe we utilize the scroll API internally, and we can't directly expose the internal Elasticsearch REST API, as it is also the administrative interface. The scroll facility is probably most appropriate for the download API, since there is some overhead associated with it: it creates a snapshot with a finite life span, though that may not be very burdensome. In a v3 API it will definitely be considered.

jhpoelen commented 4 years ago

@wilsotc Thanks for sharing your perspective and insights on the elastic search scroll functionality.

I've played around with your workaround and it seems to work as expected.

It feels like, if I implement the workaround, I am effectively building a facade on top of the iDigBio search API to enable exhaustive streaming of structured data from iDigBio.

Selfishly, I'd hope that the iDigBio engineering team would inject this uuid range/sort trick when handling search requests, to make for a less limited API.

I am going to think about this over the weekend, and hoping to come up with some way to benefit from the insightful research you've done. Thanks for being patient.

wilsotc commented 4 years ago

In combination with the modified date field as an API parameter and sort key, this could be used to live-sync an external mirror with low overhead for both the source and destination systems. As time allows, I will look at this.

wilsotc commented 4 years ago

I have added exception handling to the code example after testing. This should prevent the script from failing when, for whatever reason, the search API query doesn't return a valid JSON object. Please take a look and let me know how it works for you when using a recordset as the query.

You might also be interested in the data retrieval example I added to the same Examples page.

jhpoelen commented 4 years ago

hey @wilsotc - as promised, I spent some time thinking about, and playing around with, your proposed solution. Given the complexity of the workaround (and testing it!) and the many edge cases (e.g., what to do when there's already a uuid range or sort order defined?), I am hesitant to implement this at this point.

Thanks again for proposing your clever workaround for easily accessing large numbers of iDigBio records and ... perhaps I'll change my mind and implement it anyway . . . especially because there's so much great stuff captured in your rich search indexes and image caches.

wilsotc commented 4 years ago

You can add anything you like to the sort order; as long as uuid is the last field in the sort order, this will work. For additional query values, including ranges, just add them to the rq field.
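
For example (a sketch; "datemodified" is my assumption for the modified-date field mentioned earlier, not something confirmed in this thread), a query body combining an extra range in rq with a compound sort that keeps uuid last could look like this in Python:

query = {
    "rq": {
        "recordset": "a6eee223-cf3b-4079-8bb2-b77dad8cae9d",
        # Additional query values, including ranges, go in rq alongside the uuid range.
        "datemodified": {"type": "range", "gte": "2020-01-01"},
        "uuid": {
            "type": "range",
            "gt": "0000bc6b-e2f1-4482-91d0-0d5cd76acb4b",
            "lte": "ffffffff-ffff-ffff-ffff-ffffffffffff",
        },
    },
    # Other sort fields may come first, as long as uuid stays the last sort key.
    "sort": [{"datemodified": "asc"}, {"uuid": "asc"}],
    "limit": 100,
    "offset": 0,
}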