iDigBio / idigbio-search-api

Server-side code driving iDigBio's search functionality.
GNU General Public License v3.0

Add summary endpoint for number of species #27

Open fmichonneau opened 6 years ago

fmichonneau commented 6 years ago

It would be nice to have a summary endpoint (similar to summary/top/records/ and summary/count/records/) that returns the number of species (e.g. distinct scientificname values) for a given query. That would make it possible to answer questions such as "how many species of phylum X are in country Y?"

mjcollin commented 6 years ago

We've talked about this as a "unique values" API endpoint, i.e. "Show me the unique values of this field and their counts". Adding a query to filter the records as you describe above, like "phylum == X and country == Y", would be a good refinement.
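
To make the idea concrete, here is a minimal sketch in plain Python of what "unique values of this field and their counts" means for a filtered set of records. The `unique_values` function and the sample records are hypothetical and only illustrate the semantics; this is not the API's implementation.

```python
from collections import Counter


def unique_values(records, field, query=None):
    """Count the distinct values of `field` among records matching `query`.

    `query` is a dict of exact-match conditions, e.g.
    {"phylum": "X", "country": "Y"}. Hypothetical helper for illustration.
    """
    query = query or {}
    # Filter: keep only records that satisfy every condition in the query.
    matching = (
        r for r in records
        if all(r.get(key) == value for key, value in query.items())
    )
    # Group and count: tally each value of the requested field.
    return Counter(r[field] for r in matching if r.get(field) is not None)


# Made-up sample records, using the placeholder values from the discussion.
records = [
    {"phylum": "X", "country": "Y", "scientificname": "species a"},
    {"phylum": "X", "country": "Y", "scientificname": "species a"},
    {"phylum": "X", "country": "Y", "scientificname": "species b"},
    {"phylum": "X", "country": "Z", "scientificname": "species c"},
    {"phylum": "W", "country": "Y", "scientificname": "species d"},
]

counts = unique_values(records, "scientificname", {"phylum": "X", "country": "Y"})
n_species = len(counts)  # number of distinct scientificname values
```

The number of species requested in the original post is then just the number of keys in the result.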

The difficulty is that Elasticsearch is great at top-style queries that don't rely on collecting 100% of the results and terrible at distinct- and count-type things. We're evaluating how to provide this in a performant manner. @godfoder
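
For reference, Elasticsearch does offer a `cardinality` aggregation, but it returns an approximate distinct count rather than an exact one, which is in line with the concern above. Below is a minimal sketch of such a request body, assuming the index exposes `phylum`, `country`, and `scientificname` as exact-match (keyword) fields; that mapping is an assumption and is not confirmed for the iDigBio index. `distinct_count_body` is a hypothetical helper, and the code only builds the JSON body without sending a request.

```python
import json


def distinct_count_body(filters, field, precision_threshold=3000):
    """Build a search body that asks for an approximate distinct count.

    Hypothetical helper: `filters` is a dict of exact-match conditions and
    `field` is the field whose distinct values should be counted.
    """
    return {
        "size": 0,  # return no hits, only the aggregation result
        "query": {
            "bool": {
                # One term filter per condition; assumes keyword-type fields.
                "filter": [{"term": {k: v}} for k, v in filters.items()]
            }
        },
        "aggs": {
            "distinct_values": {
                "cardinality": {
                    "field": field,
                    # Trades memory for accuracy; counts above this threshold
                    # become less precise.
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


body = distinct_count_body({"phylum": "X", "country": "Y"}, "scientificname")
body_json = json.dumps(body, indent=2)
```

Whether the approximation is acceptable depends on the use case; a species count for a checklist may need the exact answer.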

If you have an immediate research need, these are really easy to do in Spark, and we can talk about how to get the numbers you need off our cluster:

https://github.com/bio-guoda/guoda-examples/blob/master/iDigBio%20Country%20Checklist.ipynb

(Rendering of that notebook seems busted at the moment, but it's typical filter, groupby, count stuff.)