paleobiodb / bug_reports

Description of recent enhancements to the Paleobiology Database and project management

Diversity end-point for species count is not limiting diversity to species (resulting in over-counting) #17

Closed jbryan6 closed 6 years ago

jbryan6 commented 6 years ago

The following API request:

https://paleobiodb.org/data1.2/occs/diversity.txt?base_name=Dinosauria&count=species

returns 187 sampled_in_bin results for the Cenomanian stage. Examining the 586 occurrences here:

https://paleobiodb.org/data1.2/occs/list.txt?base_name=dinosauria&interval=cenomanian

it appears that only 99 distinct species are actually present. The current result set includes all distinct accepted_name values, and does not limit the results to "accepted_rank = species". Consequently, the 187 sampled_in_bin results include distinct genera, families, etc.
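The difference between the two counts can be illustrated with a minimal sketch. It assumes each occurrence has been parsed into a dict keyed by the `accepted_name` and `accepted_rank` fields mentioned above; the helper names and the toy records are hypothetical and are not the data service's own code or data.

```python
def count_all_names(occurrences):
    """Count every distinct accepted_name, whatever its rank (the over-count)."""
    return len({occ["accepted_name"] for occ in occurrences})


def count_species(occurrences):
    """Count distinct accepted_name values only where accepted_rank is species."""
    return len({
        occ["accepted_name"]
        for occ in occurrences
        if occ.get("accepted_rank") == "species"
    })


# Toy records for illustration only, not real query results.
sample = [
    {"accepted_name": "Spinosaurus aegyptiacus", "accepted_rank": "species"},
    {"accepted_name": "Spinosaurus aegyptiacus", "accepted_rank": "species"},
    {"accepted_name": "Spinosaurus", "accepted_rank": "genus"},
    {"accepted_name": "Theropoda", "accepted_rank": "unranked clade"},
]

print(count_all_names(sample))  # counts the genus and clade names as well
print(count_species(sample))    # counts species-rank names only
```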

This problem appears to be limited to the species count. Using the parameter "count=genera" in the diversity endpoint returns the correct count of 106 distinct genera (24 distinct values from accepted_rank = genus, plus 99 distinct values from accepted_rank = species, minus the 17 genus values repeated between the two).
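The genus arithmetic above is an inclusion-exclusion count: names from both ranks are pooled and the overlap is counted once. A rough sketch follows, assuming the genus of a species-rank record can be taken as the first word of its `accepted_name` (a simplification that ignores subgenera and similar cases); the function names and toy records are hypothetical.

```python
def genus_of(occ):
    """Return the genus for a genus- or species-rank record, else None."""
    rank = occ.get("accepted_rank")
    name = occ["accepted_name"]
    if rank == "genus":
        return name
    if rank == "species":
        # Assumption: the first word of the binomial is the genus.
        return name.split()[0]
    return None


def count_genera(occurrences):
    """Count distinct genera across genus-rank and species-rank records."""
    genera = {genus_of(occ) for occ in occurrences}
    genera.discard(None)  # drop records above genus rank
    return len(genera)


# Toy records for illustration only, not real query results.
sample = [
    {"accepted_name": "Allosaurus", "accepted_rank": "genus"},
    {"accepted_name": "Spinosaurus", "accepted_rank": "genus"},
    {"accepted_name": "Spinosaurus aegyptiacus", "accepted_rank": "species"},
    {"accepted_name": "Carcharodontosaurus saharicus", "accepted_rank": "species"},
    {"accepted_name": "Theropoda", "accepted_rank": "unranked clade"},
]

print(count_genera(sample))

# The figures reported in this issue: 24 + 99 - 17
reported_total = 24 + 99 - 17
print(reported_total)
```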

mmcclenn commented 6 years ago

I have updated the diversity code to fix this issue. The update is now available for testing; you can try it as follows:

https://training.paleobiodb.org/data1.2/occs/diversity.txt?base_name=Dinosauria&count=species

If it works correctly, I will push the update to our main server. In order to make this testing easier, I have added a new operation which allows you to check the diversity calculations. This is documented at:

https://training.paleobiodb.org/data1.2/occs/checkdiv

and can be used as follows:

https://training.paleobiodb.org/data1.2/occs/checkdiv.txt?base_name=Dinosauria&count=species&list=cenomanian

https://training.paleobiodb.org/data1.2/occs/checkdiv.txt?base_name=Dinosauria&count=species&diag=cenomanian
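For testing several taxa and intervals, the two request forms above can be generated rather than typed by hand. This is a small sketch that only builds the URLs; it uses the `base_name`, `count`, `list`, and `diag` parameters exactly as they appear in the examples, and the `checkdiv_url` helper itself is hypothetical.

```python
from urllib.parse import urlencode

# Training-server endpoint as given in the example URLs above.
BASE = "https://training.paleobiodb.org/data1.2/occs/checkdiv.txt"


def checkdiv_url(base_name, count, mode, interval):
    """Build a checkdiv URL; mode is 'list' or 'diag', as in the examples."""
    if mode not in ("list", "diag"):
        raise ValueError("mode must be 'list' or 'diag'")
    params = {"base_name": base_name, "count": count, mode: interval}
    return f"{BASE}?{urlencode(params)}"


for mode in ("list", "diag"):
    print(checkdiv_url("Dinosauria", "species", mode, "cenomanian"))
```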

I don't think I have documented it very clearly, but will be happy to explain to any of you what it does.

jbryan6 commented 6 years ago

This fix looks to have corrected the issue. I examined the Cenomanian Dinosauria, Turonian Dinosauria, and Turonian Scleractinia occurrence data sets for testing. All three data sets ticked-and-tied correctly with both genera and species diversity matching on all points. Great work!

The diagnosis field was very helpful, by the way, to see how the data was being put together. Great addition.

The one item that I wasn't sure about was the new "implied_in_bin" response field in the diversity endpoint. It may not matter, but I couldn't clearly understand why certain values were included and others excluded. For example, in this URL:

https://training.paleobiodb.org/data1.2/occs/checkdiv.txt?base_name=Dinosauria&count=species&list=cenomanian

the following value is returned:

"117","Cenomanian","implied","67049","Anomoepus","Anomoepus","583969"

It's not clear to me why this is returned but Allosaurus is not. Allosaurus has two occurrences resolved at the genus level in the interval (1345735, 1345736), but no species-level occurrences in the interval. This would seem to match the definition in the data service documentation. I can't find any clear difference that would explain why Anomoepus is included but Allosaurus is not. It's very possible I'm missing something obvious, but the way the definition is worded I would expect Allosaurus to be an implied species in the bin.
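To make that reading concrete, here is a minimal sketch of the rule as I understand it: a genus is "implied" in a bin when it has genus-level occurrences there but no species-level occurrences of the same genus. This is only my interpretation of the documented definition, not the service's actual logic; the function name, the first-word genus extraction, and the toy records are all assumptions.

```python
def implied_genera(occurrences):
    """Genera seen at genus rank in a bin with no species of that genus there."""
    genus_level = set()
    genera_with_species = set()
    for occ in occurrences:
        rank = occ.get("accepted_rank")
        name = occ["accepted_name"]
        if rank == "genus":
            genus_level.add(name)
        elif rank == "species":
            # Assumption: the first word of the binomial is the genus.
            genera_with_species.add(name.split()[0])
    return genus_level - genera_with_species


# Toy records for one bin, for illustration only.
bin_occurrences = [
    {"accepted_name": "Allosaurus", "accepted_rank": "genus"},
    {"accepted_name": "Allosaurus", "accepted_rank": "genus"},
    {"accepted_name": "Spinosaurus", "accepted_rank": "genus"},
    {"accepted_name": "Spinosaurus aegyptiacus", "accepted_rank": "species"},
]

print(sorted(implied_genera(bin_occurrences)))
```

Under this reading, Allosaurus in the toy bin is implied because no Allosaurus species occurs there, while Spinosaurus is not.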

Let me add that including the implied in bin is great - a very important piece of information that I hadn't even realized I was overlooking until you added this.

mmcclenn commented 6 years ago

@jbryan6, thanks for checking the values of implied_in_bin. I had made a logic error which was mis-counting them. Allosaurus and several others are now counted.

I am still trying to figure out the best rule for computing the implied_in_bin value, and would love to talk to you more about it.

Also, I just added two more output blocks to the occs/list operation: 'timebins' and 'timecompare'. The former shows which time intervals each occurrence is binned into under the current timerule, and the latter shows the bins under each of the four timerules so that you can compare them. This was Mark's idea, and I think it is a good one. Please check this out too and let me know what you think.

jpjenk commented 6 years ago

This should be fully resolved including the new diagnostic route in the return. Has this been pushed to the production server?

mmcclenn commented 6 years ago

Yes, this change is now on the production server.