hurlbertlab / dietdatabase

Creative Commons Zero v1.0 Universal

deal with taxonomic name matches across multiple kingdoms during name cleaning #146

Open ahhurlbert opened 3 years ago

ahhurlbert commented 3 years ago

Some taxonomic names (especially genera) are valid names that represent distinct entities in different kingdoms.

For example, "Dilophus" is a genus of bibionid fly (Diptera), but also a genus of brown alga within Chromista.

Current taxonomic name cleaning functions sometimes fill in higher taxonomic levels (and Prey_Name_ITIS_ID) of the wrong entity.

Not sure how to catch this automatically. It may just require more vigilance when running the cleaning function: R should prompt the user to select which entity is intended, and it seems that sometimes the wrong entity has been selected.

jhpoelen commented 3 years ago

Hey @ahhurlbert - re: homonym detection - I assume that only a very small minority of the names are homonyms, and making a list of homonyms should be doable with run-of-the-mill taxon matching libraries. With this, a short name list can be compiled for extra careful name review. You probably already thought of this or a similar approach, but I figured I'd mention it anyway.

E.g., matching "Dilophus" using nomer (my tool of choice, but other more R-oriented tools would work too) shows two hits against ITIS:

$ echo -e "\tDilophus" | nomer append globi-globalnames  | grep ITIS
using matcher [globi-globalnames]
    Dilophus    SYNONYM_OF  ITIS:121401 Dilophus    Genus   Animalia | Bilateria | Protostomia | Ecdysozoa | Arthropoda | Hexapoda | Insecta | Pterygota | Neoptera | Holometabola | Diptera | Nematocera | Bibionomorpha | Bibionidae | Dilophus   ITIS:202423 | ITIS:914154 | ITIS:914155 | ITIS:914158 | ITIS:82696 | ITIS:563886 | ITIS:99208 | ITIS:100500 | ITIS:563890 | ITIS:914213 | ITIS:118831 | ITIS:118832 | ITIS:121303 | ITIS:121316 | ITIS:121401   Kingdom | Subkingdom | Infrakingdom | Superphylum | Phylum | Subphylum | Class | Subclass | Infraclass | Superorder | Order | Suborder | Infraorder | Family | Genus    http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=121401    
    Dilophus    SAME_AS ITIS:11181  Dilophus    Genus       Chromista | Chromista | Phaeophyta | Phaeophyceae | Dictyotales | Dictyotaceae | Dilophus   ITIS:630578 | ITIS:590735 | ITIS:660055 | ITIS:10686 | ITIS:11148 | ITIS:11149 | ITIS:11181 Kingdom | Subkingdom | Division | Class | Order | Family | Genus    http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=11181 
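
The homonym list suggested above could be compiled along these lines. This is only a minimal Python sketch, not part of nomer or taxize: the function names `kingdoms_by_name` and `cross_kingdom_homonyms` are hypothetical, the input is assumed to be (name, matched ID, lineage) records already extracted from matcher output, and the lineage is assumed to be a " | "-separated path that starts at the kingdom, as in the two ITIS hits shown here. The "Examplegenus" record is made up for contrast.

```python
from collections import defaultdict


def kingdoms_by_name(matches):
    """Map each matched name to the set of kingdoms it resolves to."""
    kingdoms = defaultdict(set)
    for name, _taxon_id, lineage in matches:
        # Assumption: the first element of the lineage path is the kingdom.
        kingdom = lineage.split("|")[0].strip()
        kingdoms[name].add(kingdom)
    return kingdoms


def cross_kingdom_homonyms(matches):
    """Return only the names that match taxa in more than one kingdom."""
    return {
        name: sorted(found)
        for name, found in kingdoms_by_name(matches).items()
        if len(found) > 1
    }


# The two Dilophus records come from the output above (lineages truncated
# for brevity); the third record is a made-up, unambiguous name.
matches = [
    ("Dilophus", "ITIS:121401", "Animalia | Bilateria | Protostomia"),
    ("Dilophus", "ITIS:11181", "Chromista | Chromista | Phaeophyta"),
    ("Examplegenus", "EXAMPLE:1", "Animalia | Chordata"),
]

for name, found in cross_kingdom_homonyms(matches).items():
    print(f"{name}: review manually, matches in {', '.join(found)}")
```

Names flagged this way would form the short list for extra careful review; everything else can go through the usual cleaning step.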

ahhurlbert commented 3 years ago

Thanks @jhpoelen. Our name cleaning makes use of R's taxize package, which should highlight when there are multiple name hits and force the user to select one.

I think in these cases, the user just happened to select the wrong one, and I can provide a better tutorial and explanation for how a user should go about making that decision on this page.

jhpoelen commented 3 years ago

@ahhurlbert Thanks for sharing details on your methods for taxonomic name cleaning in your long-standing collaborative project (Avian Diet Database) to transcribe bird diets from the literature. I figured you had some practices in place to detect and deal with homonyms: using rOpenSci's taxize and ITIS's taxonomic identifiers.

Over the course of my involvement with GloBI, I've seen many groups transcribe impressive amounts of literature in their own specific way and make the transcribed papers (partly) machine readable.

I wonder what would happen if there were a meeting (or series of meetings) focused on sharing best practices for transcribing (or liberating) species interactions from literature, in the context of a diverse group of contributors (e.g., undergrads, grads, postdocs, PIs, citizen scientists).

fyi @seltmann @jhammock @katjaschulz @jsbarnes @dschigel @qgroom @n8upham @myrmoteras @mdmtrv

Please let me know if you are interested in joining or contributing to a species interaction transcription workshop series highlighting the various past and ongoing projects and the methods used to tease interaction claims from old-fashioned literature.

I'd be happy to facilitate where I can.