tdwg / esp

Earth Sciences and Paleobiology Interest Group

iDigBio taxonomic cleaning issues #28

Open hollyel opened 5 years ago

hollyel commented 5 years ago

A thread to document detailed examples of taxonomic data issues created as a result of cleaning/validation algorithms on ingest into iDigBio.

Please include a description of the specific problem found, your query parameters, and, where possible, direct links to the records in question and any relevant data flags.

tkarim commented 5 years ago

Just browsing through my collections dataset, and I found another instance of a specimen that was only identified to subclass in my Specify database that has been backfilled on iDigBio.

To find this record, search for: recordset: d621e959-2633-4ec1-a2a2-5d97cd818b47 catalog number: 83425
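For anyone who wants to reproduce this lookup outside the portal, the same two parameters can be written as a query against the iDigBio search API. This is only a minimal sketch: the endpoint path and the indexed field names `recordset` and `catalognumber` are assumptions that should be checked against the API documentation, and `build_record_query` is a hypothetical helper.

```python
import json
from urllib.parse import urlencode

# Assumed base URL of the iDigBio v2 record search endpoint.
SEARCH_URL = "https://search.idigbio.org/v2/search/records"

def build_record_query(recordset, catalog_number, limit=10):
    """Build a search URL for one recordset/catalog number pair (hypothetical helper)."""
    # "rq" carries the record query as JSON; the field names are assumed indexed terms.
    rq = {"recordset": recordset, "catalognumber": str(catalog_number)}
    return SEARCH_URL + "?" + urlencode({"rq": json.dumps(rq), "limit": limit})

url = build_record_query("d621e959-2633-4ec1-a2a2-5d97cd818b47", 83425)
print(url)
```

The code only builds the URL; fetching it and inspecting the response is left out so that nothing here depends on the service's actual response format.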

amillhouse commented 5 years ago

Some examples of backfilling happening incorrectly:

Ctenodontidae exists as both a fish family and a bivalve family. The genus Clinopistha is a bivalve. But if you search in iDigBio for Genus: Clinopistha and view the Class, Order, and Family columns, the MCZ and OMNH records lack order-level data, so those specimens have had the class backfilled to "sarcopterygii" (fish) based on the family name alone. All other institutions with Clinopistha have it as Bivalvia (or Pelecypoda in USNM data).

Same thing with the bivalve genus Ctenodonta.

Taxon Related Data Flags include: dwc_taxonrank_added dwc_phylum_added dwc_taxonomicstatus_added gbif_genericname_added gbif_taxon_corrected gbif_canonicalname_added dwc_family_added dwc_class_added taxon_match_failed
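The Ctenodontidae case can be illustrated with a small sketch. This is not iDigBio's actual code: the lookup tables and the function `backfill_class` are hypothetical, and the table entries only restate the examples above. The idea is that a homonymous family name cannot be resolved on its own, while the genus can break the tie.

```python
# Hypothetical reference data restating the examples in this comment.
FAMILY_TO_CLASSES = {
    # Homonym: the name is used for a fish family and for a bivalve family.
    "ctenodontidae": {"sarcopterygii", "bivalvia"},
}
GENUS_TO_CLASS = {
    "clinopistha": "bivalvia",
    "ctenodonta": "bivalvia",
}

def backfill_class(family, genus=None):
    """Return a class to backfill, or None when the evidence is ambiguous."""
    candidates = FAMILY_TO_CLASSES.get(family.lower(), set())
    if len(candidates) == 1:
        return next(iter(candidates))
    # Homonymous (or unknown) family: fill only if the genus picks one candidate.
    genus_class = GENUS_TO_CLASS.get(genus.lower()) if genus else None
    if genus_class in candidates:
        return genus_class
    return None  # leave the field empty rather than guess

print(backfill_class("Ctenodontidae", "Clinopistha"))  # bivalvia
print(backfill_class("Ctenodontidae"))                 # None
```

Returning `None` for the ambiguous case is a design choice in this sketch: an empty class field is easier for a data provider to spot and correct than a confidently wrong one.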

amillhouse commented 5 years ago

Forward-filling error (?). In the NMNH database, USNM PAL611803 is IDed as Tapirus sp.

iDigBio Search: Recordset: 6c6f34ed-58a4-4ba2-b9c7-34524f79a349 Catalog Number: PAL611803

Though iDigBio has Tapirus sp., it has also added the Infraspecific Epithet "terrestris," which doesn't exist in our dataset. [image]

Taxon Related Data Flags include: dwc_specificepithet_replaced gbif_canonicalname_added gbif_genericname_added gbif_taxon_corrected dwc_infraspecificepithet_added

amillhouse commented 5 years ago

There is also some sort of backend synonymy updating going on and those updates aren't coming from our internal USNM dataset. This is less specific to iDigBio, and it might be something more to do with our data (@hollyel ?).

Anyway, one of our horses in our EMu database is USNM V13637, Plesippus cf. proversus. In iDigBio, it shows up as Plesippus proversus. I downloaded the dataset from iDigBio and it contained the following data: idigbio plesippus

Out of curiosity, I brought up the record in GBIF...and was surprised to see a third genus, Pliohippus: gbif plesippus-pliohippus screenshot

PBDB has Equus proversus and Plesippus proversus as alternative combinations to Pliohippus proversus.

However, none of the other fossils IDed as Plesippus have been changed/updated, and Plesippus is a well established synonym of Equus. image

So what cleaning/validation algorithm is being applied to Plesippus proversus that isn't being applied to other specimens of Plesippus?

This isn't necessarily a problem per se, or at least not in this specific example. I didn't know our data was going through a synonymy update/algorithm, and I'm not sure what is causing some records to be updated but not others.

Regardless, I'm submitting this as a taxonomy thing that is happening :)

ljwalker commented 5 years ago

Record: https://www.idigbio.org/portal/records/8f4c03e6-cb39-4440-9d92-2daf5f407850 Catalog number: 2533.7

Issue: "ceres" is in the specific epithet field, but we cannot search for it in this field; however, we can search for it in the main search field

LACMIP gave the name "ceres" to iDigBio (unprocessed), but this was changed to "comanche" in the indexed data only

ljwalker commented 5 years ago

From Richard Garand (ACIS): In the record provided by Lindsay referenced in issue #28, ceres is the provided Specific Epithet; the current process has indexed comanche for Specific Epithet. On the Portal Search you would search for Specific Epithet: Comanche (the indexed value). However, the result table shows the raw value. There are a couple of ways we can clarify this behavior.
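The raw/indexed split described above can be checked per record. The sketch below assumes a record shaped like an iDigBio API response, with provider values under `data` and indexed values under `indexTerms`; those key names, the field mapping, and the function `changed_fields` are assumptions, and the sample record only restates the values from this thread.

```python
# Sample record restating this thread's example; the layout is assumed.
record = {
    "data": {"dwc:specificEpithet": "ceres"},       # raw value supplied by LACMIP
    "indexTerms": {"specificepithet": "comanche"},  # value after indexing
}

# Hypothetical mapping from raw Darwin Core terms to indexed field names.
FIELD_MAP = {"dwc:specificEpithet": "specificepithet"}

def changed_fields(rec):
    """List (field, raw, indexed) where the indexed value differs from the raw one."""
    diffs = []
    for raw_key, index_key in FIELD_MAP.items():
        raw = rec["data"].get(raw_key)
        indexed = rec["indexTerms"].get(index_key)
        # Compare case-insensitively; indexed values are assumed to be lowercased.
        if raw is not None and indexed is not None and raw.lower() != indexed.lower():
            diffs.append((raw_key, raw, indexed))
    return diffs

print(changed_fields(record))
```

A report like this, run over a whole recordset, would give providers a list of exactly which names were altered on ingest.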

ljwalker commented 5 years ago

We have classified termites as Order Blattodea: Infraorder Isoptera in EMu. However, some termite species have been reclassified in Order Isoptera whereas others remain unchanged from Order Blattodea.

screen shot 2018-11-29 at 2 00 31 pm

ekrimmel commented 4 years ago

From discussion with Nicholas on 2019-11-20: Maybe we can use dwc:identificationVerificationStatus as a way around what taxonomy gets cleaned, e.g. if the ID is listed as verified then it doesn’t get cleaned. But then this co-opts the real intent of that field. Would also take significant time from iDigBio to implement so they'd want to make sure it's a solution the community supports.
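As a rough illustration of that idea, and not a proposal for how iDigBio would implement it: the sketch assumes records are plain dictionaries keyed by Darwin Core terms, that a status string of "verified" marks a vetted identification, and that `clean_taxonomy` stands in for whatever cleaning step exists. All three are hypothetical.

```python
VERIFIED_VALUES = {"verified"}  # assumed vocabulary; real data may use other strings

def clean_taxonomy(record):
    """Placeholder for the cleaning step; here it only marks the record as cleaned."""
    cleaned = dict(record)
    cleaned["cleaned"] = True
    return cleaned

def maybe_clean(record):
    """Skip cleaning when the provider marks the identification as verified."""
    status = (record.get("dwc:identificationVerificationStatus") or "").strip().lower()
    if status in VERIFIED_VALUES:
        return record  # pass the provider's taxonomy through untouched
    return clean_taxonomy(record)
```

Because the status field has no enforced vocabulary in this sketch, the set of values treated as "verified" would need community agreement, which is the same concern raised above about co-opting the field.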

ekrimmel commented 4 years ago

Another specific example of taxonomic data cleaning gone wrong: AMNH 108574 is identified as "Echinodermata," which iDigBio has changed to the beetle genus "Echinoderma"

Screen Shot 2020-09-04 at 11 43 27

This specimen has the correct taxonomy on GBIF: https://www.gbif.org/occurrence/1324326908

tkarim commented 4 years ago

It looks like for this case Taxon Rank (= Phylum) was reported for the record, but perhaps not taken into account by the taxonomic matching algorithm since all the taxonomy below phylum has been backfilled due to that incorrect match with Genus Echinoderma.
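One way to picture the safeguard implied here is a matcher that respects the reported rank. The sketch is hypothetical: the tiny name index, the `match_name` function, and the fuzzy fallback (Python's standard `difflib`) are stand-ins, not iDigBio's matching algorithm.

```python
import difflib

# Hypothetical name index: (name, rank) pairs restating this example.
NAME_INDEX = [("echinodermata", "phylum"), ("echinoderma", "genus")]

def match_name(name, reported_rank=None):
    """Return the best (name, rank) match, restricted to the reported rank when given."""
    pool = [entry for entry in NAME_INDEX
            if reported_rank is None or entry[1] == reported_rank.lower()]
    names = [entry[0] for entry in pool]
    # Fuzzy match within the allowed pool only; the 0.8 cutoff is an arbitrary choice.
    hits = difflib.get_close_matches(name.lower(), names, n=1, cutoff=0.8)
    if not hits:
        return None
    return next(entry for entry in pool if entry[0] == hits[0])

print(match_name("Echinodermata", "Phylum"))  # ('echinodermata', 'phylum')
```

With the rank filter in place, a name reported at phylum rank is never compared against genus names, so nothing below phylum would be backfilled from a genus-level match.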