globalbioticinteractions / name-alignment-template

align names with known taxonomic resources
https://big-bee-network.github.io/name-alignment-workshop
Creative Commons Zero v1.0 Universal

Add higher taxa authority information as the default. #18

Closed whitfarnum closed 6 months ago

whitfarnum commented 6 months ago

I would like the results to include the alignedAuthority for higher taxonomy. It would allow me to add authors to our local datasets with less work. I currently have to do it in two passes, where I query the higher taxa on their own to extract the author and year.

The current results are

alignedName: Adoretus
alignedAuthority: Dejean, 1833
alignedSubfamilyName: Rutelinae
alignedTribeName: Adoretini

I would like fields such as

alignedSubfamilyNameAuthority: Smith, 1900
alignedTribeNameAuthority: Jones, 1901

This would mean I only need to run Nomer once to get all the information I need.

jhpoelen commented 6 months ago

@whitfarnum thanks for your suggestion; I like your idea very much. Adding authorities for the higher-order taxa should take me about 2-4 hours to implement (pessimistic estimate), including preparing a release, testing, etc. Any other (Java) developer would be able to do this also, perhaps with a bit of a learning curve. Any idea on how to materialize your neat proposed feature?

whitfarnum commented 6 months ago

@jhpoelen I have been implementing this via a hack: I submit species names to get the current higher taxonomy, then isolate the higher taxa and submit just those to Nomer. When I submit the higher taxa, I get their authorities. I then stitch the two result sets together via dictionaries. Everything I do is in Python.
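The two-pass workaround described above could look roughly like the sketch below. It is only an illustration, not the reporter's actual script: the function name `stitch_authorities` is hypothetical, rows are assumed to be plain dictionaries keyed by the field names quoted in this issue, and the authority values are the ones ITIS reports later in this thread.

```python
from typing import Dict, List


def stitch_authorities(
    species_rows: List[Dict[str, str]],
    higher_taxa_rows: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Add <field>Authority entries for each higher-taxon name field.

    Hypothetical helper: assumes both passes yield dictionaries with
    'alignedName' and 'alignedAuthority' keys, as quoted in this issue.
    """
    # Second pass: map each higher-taxon name to its authority.
    authority_by_name = {
        row["alignedName"]: row["alignedAuthority"]
        for row in higher_taxa_rows
        if row.get("alignedAuthority")
    }

    stitched = []
    for row in species_rows:
        merged = dict(row)  # copy so the input rows stay untouched
        for key, value in row.items():
            # Fields such as alignedSubfamilyName or alignedTribeName,
            # but not alignedName itself (it already has alignedAuthority).
            is_higher_taxon = (
                key.startswith("aligned")
                and key.endswith("Name")
                and key != "alignedName"
            )
            if is_higher_taxon:
                merged[key + "Authority"] = authority_by_name.get(value, "")
        stitched.append(merged)
    return stitched


first_pass = [{
    "alignedName": "Adoretus",
    "alignedAuthority": "Dejean, 1833",
    "alignedSubfamilyName": "Rutelinae",
    "alignedTribeName": "Adoretini",
}]
second_pass = [
    {"alignedName": "Rutelinae", "alignedAuthority": "MacLeay, 1819"},
    {"alignedName": "Adoretini", "alignedAuthority": "Burmeister, 1844"},
]
result = stitch_authorities(first_pass, second_pass)
```

Higher taxa that are missing from the second pass get an empty authority rather than raising an error, so a partial second pass still produces usable rows.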

jhpoelen commented 6 months ago

Ok, neat to see that you are being creative and sharing ideas to improve nomer. I can see how adding this nomer feature would save you time.

Do you guys have a budget to support development of open source tools like Nomer? If not, I suggest you look into that, because I can't sustain working pro bono especially when working for fancy institutions like yours. If so, please let me know how you'd like to compensate for my time.

jhpoelen commented 6 months ago

About four hours after you first shared your idea, I was able to come up with the following (working) example, with about 2-3 hours of development/testing/deployment:

echo -e "\tHomo sapiens"\
 | nomer append --include-header ncbi\
 | mlr --itsvlite --oxtab cat

produced the data below (note the populated resolvedPathAuthorships).

@whitfarnum is this what you had in mind?

providedExternalId      
providedName            Homo sapiens
relationName            SAME_AS
resolvedExternalId      NCBI:9606
resolvedName            Homo sapiens
resolvedAuthorship      Linnaeus, 1758
resolvedRank            species
resolvedCommonNames     
resolvedPath            root | cellular organisms | Eukaryota | Opisthokonta | Metazoa | Eumetazoa | Bilateria | Deuterostomia | Chordata | Craniata | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Dipnotetrapodomorpha | Tetrapoda | Amniota | Mammalia | Theria | Eutheria | Boreoeutheria | Euarchontoglires | Primates | Haplorrhini | Simiiformes | Catarrhini | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens
resolvedPathIds         NCBI:1 | NCBI:131567 | NCBI:2759 | NCBI:33154 | NCBI:33208 | NCBI:6072 | NCBI:33213 | NCBI:33511 | NCBI:7711 | NCBI:89593 | NCBI:7742 | NCBI:7776 | NCBI:117570 | NCBI:117571 | NCBI:8287 | NCBI:1338369 | NCBI:32523 | NCBI:32524 | NCBI:40674 | NCBI:32525 | NCBI:9347 | NCBI:1437010 | NCBI:314146 | NCBI:9443 | NCBI:376913 | NCBI:314293 | NCBI:9526 | NCBI:314295 | NCBI:9604 | NCBI:207598 | NCBI:9605 | NCBI:9606
resolvedPathNames       |  | superkingdom | clade | kingdom | clade | clade | clade | phylum | subphylum | clade | clade | clade | clade | superclass | clade | clade | clade | class | clade | clade | clade | superorder | order | suborder | infraorder | parvorder | superfamily | family | subfamily | genus | species
resolvedPathAuthorships |  |  | Cavalier-Smith 1987 |  |  |  |  |  |  | Cuvier, 1812 |  |  |  |  |  |  |  |  | Parker & Haswell, 1897 |  |  |  | Linnaeus, 1758 |  |  |  |  | Gray, 1825 |  | Linnaeus, 1758 | Linnaeus, 1758
resolvedExternalUrl     https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606
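
Since resolvedPath, resolvedPathNames and resolvedPathAuthorships are parallel pipe-delimited lists, they can be lined up position by position. The following is a minimal Python sketch of that idea, not part of Nomer: the helper names are hypothetical, and the record is a four-element excerpt (family through species) of the NCBI output above.

```python
from typing import Dict, List, Tuple


def split_path(value: str) -> List[str]:
    """Split a pipe-delimited path field, keeping empty positions."""
    return [part.strip() for part in value.split("|")]


def path_authorships(record: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (rank, name, authorship) triples in path order.

    Hypothetical helper: assumes the three path fields have the same
    number of pipe-separated positions, as in the output above.
    """
    names = split_path(record["resolvedPath"])
    ranks = split_path(record["resolvedPathNames"])
    authorships = split_path(record["resolvedPathAuthorships"])
    if not (len(names) == len(ranks) == len(authorships)):
        raise ValueError("path fields differ in length")
    return list(zip(ranks, names, authorships))


# Excerpt of the Homo sapiens record shown above (last four positions).
record = {
    "resolvedPath": "Hominidae | Homininae | Homo | Homo sapiens",
    "resolvedPathNames": "family | subfamily | genus | species",
    "resolvedPathAuthorships": "Gray, 1825 |  | Linnaeus, 1758 | Linnaeus, 1758",
}
triples = path_authorships(record)
with_author = [t for t in triples if t[2]]
```

A list of triples is used instead of a dictionary keyed by rank because some ranks, such as clade in the NCBI path, occur more than once.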

whitfarnum commented 6 months ago

@jhpoelen yes that is the information I had in mind.

whitfarnum commented 6 months ago

> working pro bono especially when working for fancy institutions like yours. If so, please let me know how you'd like to compensate for my time.

I will bring this subject up with my supervisor.

jhpoelen commented 6 months ago

Just curious - which Nomer "matcher" or taxonomic resource do you typically use?

jhpoelen commented 6 months ago

Here's an example of associated ITIS results

providedExternalId      
providedName            Adoretus
relationName            HAS_ACCEPTED_NAME
resolvedExternalId      ITIS:187484
resolvedName            Adoretus
resolvedAuthorship      Dejean, 1833
resolvedRank            genus
resolvedCommonNames     
resolvedPath            Animalia | Bilateria | Protostomia | Ecdysozoa | Arthropoda | Hexapoda | Insecta | Pterygota | Neoptera | Holometabola | Coleoptera | Polyphaga | Scarabeiformia | Scarabaeoidea | Scarabaeidae | Rutelinae | Adoretini | Adoretus
resolvedPathIds         ITIS:202423 | ITIS:914154 | ITIS:914155 | ITIS:914158 | ITIS:82696 | ITIS:563886 | ITIS:99208 | ITIS:100500 | ITIS:563890 | ITIS:914213 | ITIS:109216 | ITIS:112747 | ITIS:678302 | ITIS:114486 | ITIS:114493 | ITIS:678509 | ITIS:926256 | ITIS:187484
resolvedPathNames       kingdom | subkingdom | infrakingdom | superphylum | phylum | subphylum | class | subclass | infraclass | superorder | order | suborder | infraorder | superfamily | family | subfamily | tribe | genus
resolvedPathAuthorships ITIS:AUTHORSHIP:0 | ITIS:AUTHORSHIP:0 | ITIS:AUTHORSHIP:0 | ITIS:AUTHORSHIP:0 | ITIS:AUTHORSHIP:0 | ITIS:AUTHORSHIP:0 | ITIS:AUTHORSHIP:0 | ITIS:AUTHORSHIP:0 | ITIS:AUTHORSHIP:0 | ITIS:AUTHORSHIP:0 | Linnaeus, 1758 | Emery, 1886 | Crowson, 1960 | Latreille, 1802 | Latreille, 1802 | MacLeay, 1819 | Burmeister, 1844 | Dejean, 1833
resolvedExternalUrl     http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=187484
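
In this intermediate ITIS output, the upper ranks carry the value ITIS:AUTHORSHIP:0 where an author would be expected. Assuming that value is a placeholder for "no authorship recorded" (the v0.5.8 report later in this thread shows empty values at the same positions), a downstream script could blank it out as sketched below. The function name is hypothetical and the input is a three-element excerpt of the line above.

```python
from typing import List

# Assumption: this value marks a missing authorship in the ITIS output.
ITIS_PLACEHOLDER = "ITIS:AUTHORSHIP:0"


def clean_authorships(value: str, placeholder: str = ITIS_PLACEHOLDER) -> List[str]:
    """Split a pipe-delimited authorship field and blank out placeholders."""
    cleaned = []
    for part in value.split("|"):
        part = part.strip()
        cleaned.append("" if part == placeholder else part)
    return cleaned


# Excerpt: superorder, order and suborder positions of the Adoretus path.
excerpt = "ITIS:AUTHORSHIP:0 | Linnaeus, 1758 | Emery, 1886"
cleaned = clean_authorships(excerpt)
```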

whitfarnum commented 6 months ago

Currently we use Catalogue of Life because they have the scarab database. I am currently curating and inventorying our scarabs. Once we are done with scarabs, the plan is Carabidae, since COL also has the definitive Carabidae catalog at the moment. We are pretty much prioritizing our work to focus on groups that have good online resources so we can leverage that research. I am considering writing up our uses of Nomer as a curation tool if I ever have time. I will contact you about authorship if it happens. I know this is not your envisioned use case, but it has been a huge time saver. It has easily taken months off this process.

jhpoelen commented 6 months ago

> It has easily taken months off this process.

It is great to hear that Nomer saved you quite some time. I have to say that Nomer is what it is today because of folks like yourself: not shy to try out new methods and open to sharing ideas for improvement.

(I am rebuilding the Catalogue of Life index with the latest dev version of Nomer as we speak; stay tuned ...)

jhpoelen commented 6 months ago

@whitfarnum here are the recently built Catalogue of Life results. Please note that the Catalogue of Life version is the one whose origin and content are packaged in the "Nomer Corpus of Taxonomic Resources" [1]. Is this result as you expected?

echo -e "\tAdoretus"\
 | nomer append --include-header col\
 | mlr --itsvlite --oxtab cat

yielded -

providedExternalId      
providedName            Adoretus
relationName            HAS_ACCEPTED_NAME
resolvedExternalId      COL:PCX
resolvedName            Adoretus
resolvedAuthorship      Dejean, 1833
resolvedRank            genus
resolvedCommonNames     
resolvedPath            Biota | Animalia | Arthropoda | Insecta | Coleoptera | Scarabaeoidea | Scarabaeidae | Rutelinae | Adoretini | Adoretina | Adoretus
resolvedPathIds         COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:C2L | COL:SC | COL:6278C | COL:K9Y | COL:KJT | COL:LBJ | COL:PCX
resolvedPathNames       unranked | kingdom | phylum | class | order | superfamily | family | subfamily | tribe | subtribe | genus
resolvedPathAuthorships |  |  |  |  |  | Latreille, 1802 | MacLeay, 1819 | Burmeister, 1844 | Burmeister, 1844 | Dejean, 1833
resolvedExternalUrl     https://www.catalogueoflife.org/data/taxon/PCX
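
For a Python-based workflow, the same command could be driven from a script. The sketch below only wraps the exact command line shown above (`nomer append --include-header col`) and reads the tab-separated output directly, without mlr. The helper names `build_input`, `parse_tsv` and `nomer_append` are hypothetical, and `nomer_append` assumes the `nomer` executable is on the PATH; it is defined here but not called.

```python
import csv
import io
import subprocess
from typing import Dict, Iterable, List


def build_input(names: Iterable[str]) -> str:
    """One line per name: empty external id, a tab, then the name."""
    return "".join(f"\t{name}\n" for name in names)


def parse_tsv(text: str) -> List[Dict[str, str]]:
    """Parse tab-separated output whose first line is a header row."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def nomer_append(names: Iterable[str], matcher: str = "col") -> List[Dict[str, str]]:
    """Run the command from this thread and return one dict per result row.

    Assumes `nomer` is installed and on the PATH; raises
    subprocess.CalledProcessError if the command exits with an error.
    """
    completed = subprocess.run(
        ["nomer", "append", "--include-header", matcher],
        input=build_input(names),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_tsv(completed.stdout)


# Parsing demo on a hand-made two-line sample using column names and
# values from the output above (only a subset of the columns).
sample = (
    "providedExternalId\tprovidedName\tresolvedName\tresolvedAuthorship\n"
    "\tAdoretus\tAdoretus\tDejean, 1833\n"
)
rows = parse_tsv(sample)
```

With Nomer installed, `nomer_append(["Adoretus"])` would be the scripted counterpart of the shell pipeline above.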

References

[1] Poelen, J. H. (ed.). (2024). Nomer Corpus of Taxonomic Resources hash://sha256/d2903d0384a8b8193819b8061c8c4e6fec8cc2f7fe72dc0e91c90c07ba2fe15e hash://md5/70645090fdecba640b50577e2a6f2342 (0.23) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10810821

jhpoelen commented 6 months ago

hey @whitfarnum -

I've released Nomer v0.5.8 with support for the authorship by rank fields.

Please find attached alignment-report.zip as well as the first entry in the report, expressed in xtabs optimized for vertical viewing in the text box below.

Note the various authorship entries by rank -

alignedOrderName            Coleoptera
alignedOrderId              ITIS:109216
alignedOrderAuthorship      Linnaeus, 1758
alignedFamilyName           Scarabaeidae
alignedFamilyId             ITIS:114493
alignedFamilyAuthorship     Latreille, 1802

Including release, testing, communication, etc., this improvement took about 6 hours to complete. Now the big question is: what is the feature worth? I am curious to hear what your supervisor says about the benefit of having this feature/tool versus the additional time spent when not having it.

Please review and let me know if this implements your desired functionality.

alignment-report.zip

providedExternalId          
providedName                Adoretus
parseRelation               SAME_AS
parsedExternalId            
parsedName                  Adoretus
parsedAuthority             
parsedRank                  
parsedCommonNames           
parsedPath                  
parsedPathIds               
parsedPathNames             
parsedPathAuthorships       
parsedNameSource            gbif-parse
parsedNameSourceUrl         https://linker.bio,https://zenodo.org/records/10810821/files,https://zenodo.org/records/10045382/files,https://zenodo.org/records/10037817/files,https://zenodo.org/records/8327611/files
parsedNameSourceAccessedAt  hash://sha256/d2903d0384a8b8193819b8061c8c4e6fec8cc2f7fe72dc0e91c90c07ba2fe15e
alignRelation               HAS_ACCEPTED_NAME
alignedCatalogName          itis
alignedExternalId           http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=187484
alignedName                 Adoretus
alignedAuthorship           Dejean, 1833
alignedRank                 genus
alignedCommonNames          
alignedKingdomName          Animalia
alignedKingdomId            ITIS:202423
alignedKingdomAuthorship    
alignedPhylumName           Arthropoda
alignedPhylumId             ITIS:82696
alignedPhylumAuthorship     
alignedClassName            Insecta
alignedClassId              ITIS:99208
alignedClassAuthorship      
alignedOrderName            Coleoptera
alignedOrderId              ITIS:109216
alignedOrderAuthorship      Linnaeus, 1758
alignedFamilyName           Scarabaeidae
alignedFamilyId             ITIS:114493
alignedFamilyAuthorship     Latreille, 1802
alignedSubfamilyName        Rutelinae
alignedSubfamilyId          ITIS:678509
alignedSubfamilyAuthorship  MacLeay, 1819
alignedTribeName            Adoretini
alignedTribeId              ITIS:926256
alignedTribeAuthorship      Burmeister, 1844
alignedSubtribeName         
alignedSubtribeId           
alignedSubtribeAuthorship   
alignedGenusName            Adoretus
alignedGenusId              ITIS:187484
alignedGenusAuthorship      Dejean, 1833
alignedSubgenusName         
alignedSubgenusId           
alignedSubgenusAuthorship   
alignedSpeciesName          
alignedSpeciesId            
alignedSpeciesAuthorship    
alignedSubspeciesName       
alignedSubspeciesId         
alignedSubspeciesAuthorship 
alignedPath                 Animalia | Bilateria | Protostomia | Ecdysozoa | Arthropoda | Hexapoda | Insecta | Pterygota | Neoptera | Holometabola | Coleoptera | Polyphaga | Scarabeiformia | Scarabaeoidea | Scarabaeidae | Rutelinae | Adoretini | Adoretus
alignedPathIds              ITIS:202423 | ITIS:914154 | ITIS:914155 | ITIS:914158 | ITIS:82696 | ITIS:563886 | ITIS:99208 | ITIS:100500 | ITIS:563890 | ITIS:914213 | ITIS:109216 | ITIS:112747 | ITIS:678302 | ITIS:114486 | ITIS:114493 | ITIS:678509 | ITIS:926256 | ITIS:187484
alignedPathNames            kingdom | subkingdom | infrakingdom | superphylum | phylum | subphylum | class | subclass | infraclass | superorder | order | suborder | infraorder | superfamily | family | subfamily | tribe | genus
alignedPathAuthorships      |  |  |  |  |  |  |  |  |  | Linnaeus, 1758 | Emery, 1886 | Crowson, 1960 | Latreille, 1802 | Latreille, 1802 | MacLeay, 1819 | Burmeister, 1844 | Dejean, 1833
alignedNameSource           itis
alignedNameSourceUrl        https://linker.bio,https://zenodo.org/records/10810821/files,https://zenodo.org/records/10045382/files,https://zenodo.org/records/10037817/files,https://zenodo.org/records/8327611/files
alignedNameSourceAccessedAt hash://sha256/d2903d0384a8b8193819b8061c8c4e6fec8cc2f7fe72dc0e91c90c07ba2fe15e
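
With the per-rank fields in this report, the second Nomer pass from the original request should no longer be needed. As a rough illustration, the sketch below collects every non-empty aligned<Rank>Authorship value from one report row; the function name is hypothetical, and the row is a hand-copied subset of the fields above, assuming each report row is available as a dictionary.

```python
import re
from typing import Dict, Tuple

# Matches alignedFamilyAuthorship, alignedTribeAuthorship, ... but not
# alignedAuthorship (no rank) or alignedPathAuthorships (plural).
RANK_AUTHORSHIP = re.compile(r"^aligned([A-Z][a-z]+)Authorship$")


def authorships_by_rank(row: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Map lower-cased rank to (name, authorship) for populated ranks."""
    result = {}
    for key, authorship in row.items():
        match = RANK_AUTHORSHIP.match(key)
        if match and authorship.strip():
            rank = match.group(1)
            name = row.get(f"aligned{rank}Name", "")
            result[rank.lower()] = (name, authorship.strip())
    return result


# Subset of the alignment report entry shown above.
row = {
    "alignedAuthorship": "Dejean, 1833",
    "alignedKingdomName": "Animalia",
    "alignedKingdomAuthorship": "",
    "alignedFamilyName": "Scarabaeidae",
    "alignedFamilyAuthorship": "Latreille, 1802",
    "alignedSubfamilyName": "Rutelinae",
    "alignedSubfamilyAuthorship": "MacLeay, 1819",
    "alignedTribeName": "Adoretini",
    "alignedTribeAuthorship": "Burmeister, 1844",
}
by_rank = authorships_by_rank(row)
```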

jhpoelen commented 6 months ago

@whitfarnum let me know when you get a chance to confirm that the newly released Nomer has the functionality you were hoping for.

Apologies for my banter on things related to funding of Nomer activities. Many folks have been generous in the past, and I am simply hoping to open the door to funding while continuously improving our tools and keeping them openly accessible. Thanks for understanding my earlier opportunistic statements.

jhpoelen commented 6 months ago

@whitfarnum let me know if you have additional notes or questions; otherwise, I'll consider this issue closed.