gnames / gnfinder

GNfinder finds scientific names in UTF8 texts, PDF files, MS Word/Excel documents, URLs etc.
MIT License

Consider parsing "Untergattung" #126

Closed Archilegt closed 2 years ago

Archilegt commented 2 years ago

Searching in BHL’s full text for “Untergattung” retrieves 8675 publications and searching for “Untergatt.” retrieves 541 publications [22.02.2022].

https://www.biodiversitylibrary.org/search?stype=F&searchTerm=Untergattung#/titles
https://www.biodiversitylibrary.org/search?stype=F&searchTerm=Untergatt.#/titles

I don't know how to visualize total hits in the corpus.

Archilegt commented 2 years ago

Related:
Names of subgenera don't get parsed if subgen. is included in the scientific name value gnames/gnparser#232
recognizing "species group" or "species complex" suffixes as indicators of infrageneric groupings gnames/gnparser#55

Synergic with: Use "mihi" to enhance scientific name finding and parsing gnames/gnparser#230

Archilegt commented 2 years ago

Example: Julus (Parastenophyllum) Verhoeff, 1899 [original name]
https://myriatrix.myspecies.info/myriatrix/julus-parastenophyllum

Original string: Gatt. Julus, Untergatt. Parastenophyllum mihi
Source: https://www.biodiversitylibrary.org/page/15115029

Remarks: Name strings “Julus” and “Parastenophyllum” are recognized. The styling of the subgenus name in the paper is really bad when compared to that of subgenus Julus (Leptoiulus) on page 199.

Suggested recognition:
Gatt. acts as a starter #optional
Untergatt. acts as a starter and/or connector #could be read and used to generate a field subgenus: Parastenophyllum
mihi acts as terminator #recommended

Suggested result: Recognized name to be shown in "Scientific Names on this Page" box: Julus (Parastenophyllum)

Original string: Gatt. Julus, Untergatt. Parastenophyllum mihi #similar to comment.
Normalization to canonical form:
short version: Parastenophyllum
full version: Julus (Parastenophyllum) #Parentheses are important here as per article 6.1 of the ZooCode.
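
To make the suggestion concrete, here is a minimal sketch in Python of the starter / connector / terminator idea. It is not how gnfinder or gnparser work internally; the pattern, the function name `parse_untergattung`, and the output keys are all hypothetical and only illustrate the proposed normalization.

```python
import re

# Hypothetical pattern (an assumption, not gnfinder's implementation):
#   "Gatt." / "Gattung"            optional starter
#   "Untergatt." / "Untergattung"  connector introducing the subgenus
#   "mihi"                         optional terminator
PATTERN = re.compile(
    r"(?:Gatt(?:ung|\.)\s+)?"
    r"(?P<genus>[A-Z][a-z]+),?\s+"
    r"Untergatt(?:ung|\.)\s+"
    r"(?P<subgenus>[A-Z][a-z]+)"
    r"(?:\s+(?P<terminator>mihi))?"
)


def parse_untergattung(text):
    """Return genus, subgenus and both canonical forms, or None if no match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    genus = match.group("genus")
    subgenus = match.group("subgenus")
    return {
        "genus": genus,
        "subgenus": subgenus,
        "short": subgenus,
        # Parentheses around the subgenus, as in the suggested full version.
        "full": f"{genus} ({subgenus})",
    }


result = parse_untergattung("Gatt. Julus, Untergatt. Parastenophyllum mihi")
print(result["full"])  # Julus (Parastenophyllum)
```

A real implementation would also need to handle OCR noise and other spellings, which this sketch ignores.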

If this "German issue" is implemented, we can definitely include it in the Verhoeff paper GNA module.

dimus commented 2 years ago

I would say this is also closer to the gnfinder realm. I will move this issue there.

dimus commented 2 years ago

I ran the search for Untergattung through the whole BHL corpus and found that the word occurs quite rarely and is quite often not connected to an immediately adjacent scientific name. A check for the word would significantly decrease the efficiency of the search. Such minor improvements, accumulating over time, would slow gnfinder down to a halt and make it useless for large data processing.

In the case of mihi: we would check for it only if we already know that something is a scientific name, so it won't change performance significantly.
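
A rough Python sketch of why the mihi check is cheap: the extra comparison runs only for tokens that were already accepted as names, not for every word in the text. The names `KNOWN_NAMES`, `is_name` and `find_names_with_mihi` are hypothetical stand-ins for illustration, not gnfinder code.

```python
# Hypothetical stand-in for the finder's own name detection (an assumption).
KNOWN_NAMES = {"Julus", "Parastenophyllum"}


def is_name(token):
    return token.strip(".,;") in KNOWN_NAMES


def find_names_with_mihi(tokens):
    """Return (name, followed_by_mihi) pairs and the number of mihi checks made."""
    results = []
    checks = 0
    for i, token in enumerate(tokens):
        if not is_name(token):
            continue  # non-names never trigger the extra check
        checks += 1
        followed = i + 1 < len(tokens) and tokens[i + 1].rstrip(".,;") == "mihi"
        results.append((token.strip(".,;"), followed))
    return results, checks


tokens = "Gatt. Julus, Untergatt. Parastenophyllum mihi".split()
names, checks = find_names_with_mihi(tokens)
print(names)   # [('Julus', False), ('Parastenophyllum', True)]
print(checks)  # 2
```

Five tokens are scanned, but the mihi comparison happens only twice, once per detected name.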

dimus commented 2 years ago

Anchor words like Untergattung will be important for NLP analysis to weed out false positives when a scientific name is ambiguous, like Cancer or America.
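
As an illustration of the anchor-word idea, the sketch below accepts an ambiguous candidate only when an anchor word appears within a small window around it. The word lists, the window size, and the function name `is_likely_name` are assumptions made up for this example; a real NLP step would be trained on data rather than hard-coded.

```python
# Hypothetical word lists for illustration only.
ANCHORS = {"gattung", "gatt", "untergattung", "untergatt", "genus", "subgenus"}
AMBIGUOUS = {"Cancer", "America"}


def is_likely_name(tokens, index, window=2):
    """Accept an ambiguous candidate only if an anchor word is nearby."""
    word = tokens[index].strip(".,;:()")
    if word not in AMBIGUOUS:
        # Assumed to have been accepted by the finder already.
        return True
    start = max(0, index - window)
    context = tokens[start:index] + tokens[index + 1:index + 1 + window]
    return any(t.strip(".,;:()").lower() in ANCHORS for t in context)


print(is_likely_name("Die Gattung Cancer umfasst".split(), 2))   # True
print(is_likely_name("He moved to America in 1899".split(), 3))  # False
```

Because the anchor lookup runs only for candidates that are already flagged as ambiguous, it does not add a per-word check across the whole corpus.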