scan-bugs-org / North-American-Arthropod-Biodiversity

GNU General Public License v3.0

Taxonomy #2

Open seltmann opened 4 years ago

seltmann commented 4 years ago

The Catalog of Life SQLite db is in our shared Dropbox under Biodiversity_NA_2018/Catalog-Of-Life.

View-only link: https://www.dropbox.com/sh/lc8pycmsl5vda54/AABXNzqZaW5VJTpiibbr3Qo5a?dl=0

Right now this is the full COL database. I can also create one that is limited to Arthropoda.

seltmann commented 4 years ago

First tasks include:

  1. Determining if it contains all of the information we need
  2. Determining if it includes a way of filtering based on NA taxa

seltmann commented 4 years ago

Questions for this process include: Are names in SCAN also in COL? How well do they match? And what is missing from COL that is in SCAN?
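One way to frame these three questions is as set operations on normalized name strings. The sketch below is only illustrative: the `normalize` and `compare_names` helpers and the example names are hypothetical, and real matching would also need to handle authorship strings, synonyms, and misspellings.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def compare_names(scan_names, col_names):
    """Return shared names, names only in SCAN, and the percent of SCAN names found in COL."""
    scan = {normalize(n) for n in scan_names}
    col = {normalize(n) for n in col_names}
    shared = scan & col
    missing_from_col = scan - col
    pct_matched = 100.0 * len(shared) / len(scan) if scan else 0.0
    return shared, missing_from_col, pct_matched


# Hypothetical example names, not taken from either database.
scan_example = ["Apis mellifera", "Danaus  plexippus", "Bombus impatiens"]
col_example = ["apis mellifera", "Danaus plexippus", "Vanessa cardui"]
shared, missing, pct = compare_names(scan_example, col_example)
```

Here `shared` answers "are the names in both?", `pct` gives a rough match rate, and `missing` lists what is in SCAN but not in COL.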

seltmann commented 4 years ago

From the November 19 meeting, outlining some preliminary tasks and people taking the lead at looking at taxonomies.

@njdowdy will start looking at GBIF (same as COL)? and add to a SQLite db
@lindsiebug will look at SCAN db after Thanksgiving
@seltmann will start looking at COL Friday

  1. Compare taxonomies from COL, GBIF, ITIS, SCAN. All downloads will be saved as SQLite files in the Dropbox folder (I made a higher-level folder, Taxonomy_Databases)

\Biodiversity_NA_2020\Taxonomy_Databases\Catalog-OF-Life

Nic will download GBIF, Lindsie – SCAN, ITIS - ?, Katja’s student – COL.

[Screenshot 2019-11-19 at 12:25:21 PM]

njdowdy commented 4 years ago

We may just want to use GBIF's backbone taxonomy as it supposedly incorporates many sources, including, but not limited to:

@seltmann @neilcobb

Here is the GBIF Backbone Taxonomy (last updated 2019-09-06), both a complete version and one restricted to arthropods only. https://1drv.ms/u/s!AhKIbI2yCfSSltIfJTUplbOuw4D1PA?e=WmTH0N

I've stored them as CSVs, but they should be easily imported into any database format. These files include synonyms as well as valid names. Unfortunately, AFAIK, GBIF doesn't make this data available in a truly database-friendly way (e.g., it fails basic database normalization principles). I could fix that, but it would take some time and I'm not sure we really need it for this exercise.
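For anyone who wants the CSVs in SQLite alongside the other downloads, a minimal import sketch using only the Python standard library might look like the following. The table name `gbif_backbone`, the `load_csv_into_sqlite` helper, and the column names in the sample are assumptions for illustration; the real export may use different headers, and every column is loaded as text with no normalization.

```python
import csv
import io
import sqlite3


def load_csv_into_sqlite(conn, csv_file, table):
    """Create `table` from the CSV header and insert every row as text."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{col}" TEXT' for col in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


# Tiny in-memory stand-in for the real export; column names are assumed.
sample = io.StringIO(
    "taxonID,scientificName,taxonomicStatus\n"
    "1,Apis mellifera,accepted\n"
    "2,Apis mellifica,synonym\n"
)
conn = sqlite3.connect(":memory:")
load_csv_into_sqlite(conn, sample, "gbif_backbone")
accepted = conn.execute(
    "SELECT COUNT(*) FROM gbif_backbone WHERE taxonomicStatus = 'accepted'"
).fetchone()[0]
```

For a real file, the `io.StringIO` object would be replaced by `open(path, newline="", encoding="utf-8")` and the connection by a path to a `.sqlite` file.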

EDIT: I forgot to mention that I have yet to find a publicly available GBIF data set that would allow us to easily filter taxa by whether they occur in North America or not. We could query georeferenced data by taxon and retain taxa which contain at least one NA record. But then, if there are no records for a given taxon somewhere online, many taxa could be falsely filtered out. Something to discuss.
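The record-based filter described above could be sketched roughly as follows. This assumes occurrence data has already been reduced to (taxon, country code) pairs; the `NA_COUNTRIES` set, the helper name, and the sample records are all hypothetical.

```python
# Assumption: these ISO country codes stand in for "North America".
NA_COUNTRIES = {"US", "CA", "MX"}


def taxa_with_na_records(occurrences):
    """Split taxa into those with at least one NA record and those with none."""
    seen = set()
    in_na = set()
    for taxon, country in occurrences:
        seen.add(taxon)
        if country in NA_COUNTRIES:
            in_na.add(taxon)
    return in_na, seen - in_na


# Hypothetical (taxon, country code) pairs.
records = [
    ("Apis mellifera", "US"),
    ("Apis mellifera", "DE"),
    ("Vanessa cardui", "FR"),
    ("Danaus plexippus", "MX"),
]
kept, dropped = taxa_with_na_records(records)
```

Note that a taxon with no digitized records at all never appears in `records`, so it is silently absent from both sets. That is the false-negative problem raised above.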

neilcobb commented 4 years ago

I just sent the following email to the GBIF helpdesk. I would suggest writing directly to Tim Robertson if the helpdesk does not have an answer. @seltmann @njdowdy

This is not a data provider question, but a group of us are initiating a publication on North American arthropods and we need a GBIF data set that would allow us to easily filter taxa by whether they occur in North America or not. We could query georeferenced data by taxon and retain taxa which contain at least one NA record. But then, if there are no records for a given taxon somewhere online, many taxa could be falsely filtered out. We are not expecting to obtain anything close to the 142,000 estimated species, but it seems like GBIF has the most complete taxonomy available. This will allow us to apply percentage-digitized estimates to a hierarchy of taxonomic levels, from species to orders, found in North America. Is there any way to obtain such a list?

neilcobb commented 4 years ago

@njdowdy @seltmann

GBIF has only 43% overlap with COL

image

seltmann commented 4 years ago

Wow! @neilcobb the plot thickens. I wonder if that's the same percentage for arthropods.

neilcobb commented 4 years ago

Here is a partial answer @seltmann @njdowdy

image

It seems like country is implemented in the GBIF Species API

image

seltmann commented 4 years ago

I looked at COL for geographic information (https://github.com/maridez6399/Taxon-Name-Exploration/blob/master/COL.html). This also includes a complete list of the distinct geographic locations stored in the database.

From my evaluation, we can conclude that COL stores information about geographic location coming from catalogs but does not have a definitive list of geographic information for species.

[Screenshot 2019-11-25 at 4:57:00 PM]

seltmann commented 4 years ago

The basic numbers for COL arthropod taxa:

[Screenshot 2019-11-25 at 5:19:38 PM]

https://github.com/maridez6399/Taxon-Name-Exploration/blob/master/COL-speciesCounts.html

neilcobb commented 4 years ago

@seltmann @njdowdy I filtered for 3 NA countries and arthropods and then:

GBIF specimen-only data shows 176,313 total taxonomic entries (comparable to your results)
GBIF all basisofrecord shows 309,392 total taxonomic entries (more than your results)

I assume they only show accepted names? I forgot to remove extinct taxa. I just made a separate list of them: 15,888 total, so not a lot.

njdowdy commented 4 years ago

@neilcobb @seltmann

I've gone through the GBIF data. These values use 'accepted' names only, exclude subspecies, and 'extant' means not a member of Trilobita nor Merostomata. Quick summary is:

Here are a bunch of other breakdowns that we discussed last meeting: https://1drv.ms/x/s!AhKIbI2yCfSSltIoNEeL6_PKlSCVIA?e=QclzxA

sheet1: count of orders, families, genera, and species by class
sheet2: count of families, genera, and species by order
sheet3: count of genera and species by family
sheet4: count of species by genus
sheet5: count of classes, orders, families, genera, species, subspecies
sheet6: how many unranked classes, orders, families, genera, species, e.g. "How many unranked Lepidoptera have their highest identified rank at the level of Order?" (see "order_count" value for Lepidoptera)
sheet7: list all orders
sheet8: list all families
sheet9: list all groups containing unranked taxa
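A rough sketch of how a sheet1-style breakdown could be reproduced with the filters described above (accepted names only, no subspecies, Trilobita and Merostomata excluded). The field names `class`, `taxonRank`, and `taxonomicStatus` are assumed Darwin Core-style headers, and the rows are made up; this is not the script that produced the spreadsheet.

```python
from collections import defaultdict

EXCLUDED_CLASSES = {"Trilobita", "Merostomata"}
COUNTED_RANKS = ("order", "family", "genus", "species")


def counts_by_class(rows):
    """Count accepted orders, families, genera, and species per class."""
    counts = defaultdict(lambda: dict.fromkeys(COUNTED_RANKS, 0))
    for row in rows:
        if row["taxonomicStatus"] != "accepted":
            continue
        if row["class"] in EXCLUDED_CLASSES:
            continue
        rank = row["taxonRank"]
        if rank in COUNTED_RANKS:  # subspecies and other ranks are skipped
            counts[row["class"]][rank] += 1
    return {cls: dict(c) for cls, c in counts.items()}


# Made-up rows for illustration only.
rows = [
    {"class": "Insecta", "taxonRank": "order", "taxonomicStatus": "accepted"},
    {"class": "Insecta", "taxonRank": "species", "taxonomicStatus": "accepted"},
    {"class": "Insecta", "taxonRank": "species", "taxonomicStatus": "synonym"},
    {"class": "Insecta", "taxonRank": "subspecies", "taxonomicStatus": "accepted"},
    {"class": "Trilobita", "taxonRank": "species", "taxonomicStatus": "accepted"},
]
summary = counts_by_class(rows)
```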

neilcobb commented 4 years ago

@seltmann @njdowdy I just noticed in your GitHub comment how extant is defined, but that does not exclude extinct species in extant classes, orders, families, and genera?

I think using GBIF and filtering for country for extant in NA would be best. It would include all basisofrecord but even if we focus on specimens it seems reasonable to include all sources to get counts of taxa.

I compiled my old list and it might be useful for some families but only 99,000 species and a lot of missing families.

Taxa that are not included are likely rare enough to not impact expected relative numbers.

lindsiebug commented 4 years ago

@seltmann @njdowdy @neilcobb

here are the numbers I got for SCAN from the taxonomy table that Evin downloaded

Total Number of Classes in SCAN is 36
Total Number of Orders in SCAN is 214
Total Number of Families in SCAN is 3,339
Total Number of Genera in SCAN is 103,821
Total Number of Species in SCAN is 1,045,359

I'll update you guys with other information we wanted later.

neilcobb commented 4 years ago

@seltmann @njdowdy @neilcobb @lindsiebug I'm not sure why the number of classes and orders is so high in SCAN. It makes me think that it includes things that may not be arthropods that got in there. But the number of species and genera looks appropriate. So we have most of the taxa around the world but clearly not all of the taxa. I assume these are all accepted names.

lindsiebug commented 4 years ago

@neilcobb @njdowdy @seltmann yes, I put these lists of names in Dropbox for all five categories: ...\Biodiversity_NA_2018\Taxonomy_Databases\SCAN_taxonomy_table\lists_of_taxa

njdowdy commented 4 years ago

@seltmann @neilcobb @lindsiebug This is RE: an email from Neil. Thought I'd just put it here.

To be clear about my decision on extinct taxa: the "extant" flag is not available in the GBIF backbone. That's why I had to be explicit about the lineages I excluded. So, the numbers I gave above are overestimates of extant diversity because they include extinct lineages that are nested within extant groups. I'd need a field that defines extant and extinct taxa to filter on to make it more granular.

I think the obvious path forward is to construct an "extant" field for the GBIF backbone, assign "?" to everything, join the GBIF backbone to the COL backbone, and overwrite the "extant" status based on the COL "extant" annotation. That will leave us with a subset (i.e., the GBIF names that aren't in COL) that don't get a "yes" or "no" annotation. We'd need some other source to handle the annotations of the remaining "?"'s.
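The join-and-overwrite idea could be prototyped in SQLite along these lines. The table names, the `is_extant` column, and the placeholder taxa are all assumptions made for the sketch; the actual COL schema may store this annotation differently, and matching on the bare name string is a simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Every GBIF name starts with an unknown status.
    CREATE TABLE gbif (name TEXT PRIMARY KEY, extant TEXT DEFAULT '?');
    -- Hypothetical COL table with a 1/0 extant annotation.
    CREATE TABLE col (name TEXT PRIMARY KEY, is_extant INTEGER);
    INSERT INTO gbif (name) VALUES ('Taxon a'), ('Taxon b'), ('Taxon c');
    INSERT INTO col VALUES ('Taxon a', 1), ('Taxon b', 0);
    """
)

# Overwrite '?' only where COL has a matching name.
conn.execute(
    """
    UPDATE gbif
    SET extant = (
        SELECT CASE col.is_extant WHEN 1 THEN 'yes' ELSE 'no' END
        FROM col WHERE col.name = gbif.name
    )
    WHERE EXISTS (SELECT 1 FROM col WHERE col.name = gbif.name)
    """
)
conn.commit()

status = dict(conn.execute("SELECT name, extant FROM gbif"))
unresolved = [name for name, flag in status.items() if flag == "?"]
```

The names left in `unresolved` are the GBIF names absent from COL, which would still need another source for their annotation.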

I'll add it to the to-do list.

neilcobb commented 4 years ago

@seltmann @neilcobb @lindsiebug

I realized I was using Fossil Record in basisofrecord and that does not mean it is extinct, what a bonehead.