inspirehep / rest-api-doc

Documentation of the INSPIRE REST API
https://inspirehep.net
Creative Commons Attribution Share Alike 4.0 International

Upper bound of 10000 queries means I can't access the entirety of INSPIRE institutions #20

Open smeehan12 opened 2 years ago

smeehan12 commented 2 years ago

I am trying to use the API to collect geographical information on publications worldwide, in order to compare the publication output of institutions located in different regions. To do this, I am making calls to URLs like

https://inspirehep.net/api/institutions?sort=mostrecent&size=1&page=1

which allows me to query the metadata associated with the institutional publication records.

This works well and gives me all the information I need. However, there seems to be an upper limit that prevents me from accessing all of the data, because when I try a call like

https://inspirehep.net/api/institutions?sort=mostrecent&size=10&page=1001

I get a return of

{"status": 400, "message": "Maximum number of 10000 results have been reached."}

Now, I see that a single request can return at most 1000 results, but this upper bound of 10000 is causing issues: it means I can't access the data for the full set of 11791 institutions that have publications in HEP via this API.

Is there some reason why this upper bound exists? Or am I misusing the API?
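
For reference, the arithmetic behind the failing request can be sketched as follows. This is only an illustration: it assumes the cap applies to page * size (the index of the last result a request asks for), which is consistent with the error above but is not confirmed by the API documentation. The names MAX_RESULTS, last_requested_index and within_limit are hypothetical.

```python
# Assumption: the 10000-result cap applies to page * size.
MAX_RESULTS = 10_000          # taken from the error message above
TOTAL_INSTITUTIONS = 11_791   # institution count quoted in this issue


def last_requested_index(page: int, size: int) -> int:
    """Index (1-based) of the last result a paginated request asks for."""
    return page * size


def within_limit(page: int, size: int) -> bool:
    """True if the request stays inside the assumed result cap."""
    return last_requested_index(page, size) <= MAX_RESULTS


# The failing call from this issue: size=10, page=1001.
failing = last_requested_index(page=1001, size=10)

# Records that plain pagination cannot reach under this assumption.
unreachable = TOTAL_INSTITUTIONS - MAX_RESULTS

print(failing, within_limit(1001, 10), unreachable)
```

Under this assumption the request for page 1001 asks for results up to index 10010, just past the cap, and the last 1791 institutions cannot be reached by pagination alone.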

michamos commented 2 years ago

You're doing everything right; this is an unfortunate limitation on our side (we use Elasticsearch as our search engine, and the pagination API we rely on is limited to 10000 results). I hope we can improve this soon by switching to a different pagination mechanism, but in the meantime you can use the following workaround.

Add to the search query (which is empty in your case) an additional filter that guarantees you get fewer than 10000 results back for any single search, then manually step through the values you're filtering on. It's convenient to use a range of control_number values for this, as all records are guaranteed to contain exactly one control_number. For Institutions (which uses the standard ES query_string parser), this would look like

https://inspirehep.net/api/institutions?sort=mostrecent&size=1&page=1&q=control_number%3A[1 TO 1000000]
https://inspirehep.net/api/institutions?sort=mostrecent&size=1&page=1&q=control_number%3A[1000001 TO 2000000]

For Literature, which has a much higher density of records and uses a custom query parser, you'd do something like

https://inspirehep.net/api/literature?sort=mostrecent&size=1&page=1&q=control_number%3A1->10000
https://inspirehep.net/api/literature?sort=mostrecent&size=1&page=1&q=control_number%3A10001->20000
[etc.]
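
The windowing above can be sketched in Python. This is a minimal illustration, not an official client: the helper names range_query and window_urls are hypothetical, the upper bound on control_number has to be supplied by the caller (it is not given in this thread), and size=1000 is the per-request maximum mentioned earlier in this issue. Spaces in the query are percent-encoded as %20.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://inspirehep.net/api"


def range_query(collection: str, start: int, stop: int) -> str:
    """Build the control_number range filter for one window.

    Literature uses the custom parser syntax (start->stop); the other
    collections are assumed to use the ES query_string syntax shown above.
    """
    if collection == "literature":
        return f"control_number:{start}->{stop}"
    return f"control_number:[{start} TO {stop}]"


def window_urls(collection: str, max_control_number: int,
                window: int, size: int = 1000) -> list[str]:
    """Return the first-page URL for each control_number window.

    max_control_number is a caller-supplied assumption; window must be
    small enough that each window matches fewer than 10000 records.
    """
    urls = []
    for start in range(1, max_control_number + 1, window):
        stop = min(start + window - 1, max_control_number)
        params = {
            "sort": "mostrecent",
            "size": size,
            "page": 1,
            "q": range_query(collection, start, stop),
        }
        urls.append(f"{BASE_URL}/{collection}?{urlencode(params, quote_via=quote)}")
    return urls


for url in window_urls("literature", max_control_number=20_000, window=10_000):
    print(url)
```

Each URL is only the first page of its window. If a window matches more than size records, the page parameter still has to be incremented within that window, which stays under the 10000-result limit as long as the window is small enough.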

smeehan12 commented 2 years ago

What you write here works very well. Thank you for adding it to the documentation. I think it will be clear how to circumvent this issue if someone starts using the API for their own project and runs into it.

Please keep up the great work on developing this infrastructure; it is crucial for meta-analyses, and I look forward to sharing the results of our project with you when they come to fruition!

javadebadi commented 2 years ago

Hi @michamos, it would be nice to get the list of all available control_numbers or record ids. If that were possible, the 10000 upper bound would probably not be a serious problem for now. I think that adding the following new routes would be very beneficial to users:

  1. a route to get a list of all author ids in Inspirehep (/api/authors/ids/)
  2. a route to get a list of all institution ids in Inspirehep (/api/institutions/ids)
  3. a route to get a list of all literature ids in Inspirehep (/api/literature/ids)
  4. a route to get a list of all seminar ids in Inspirehep (/api/seminars/ids)
  5. a route to get a list of all job ids in Inspirehep (/api/jobs/ids/)
  6. a route to get a list of all conference ids in Inspirehep (/api/conferences/ids)
  7. a route to get a list of all experiment ids in Inspirehep (/api/experiments/ids)

There could probably be a natural sort order for all of these objects, so that users can get the top 10, 1000, etc. items in the list by specifying a query parameter.