gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Use "mihi" to enhance scientific name finding and parsing #230

Closed Archilegt closed 2 years ago

Archilegt commented 2 years ago

The Latin word "mihi" was used by authors when proposing new scientific names, with the meaning of "me". The word could be used as a marker for "scientific name ends here", and could enhance scientific name finding if coupled to "search for a scientific name in the 1, 2, or 3 words preceding it".

The word could also be used for adding "interpreted authorship" (author + date) to scientific name instances if coupled to the metadata of the publication (book, article) where the scientific name instance is matched, therefore potentially helping to disambiguate homonyms.

A quick glance at the occurrence of the word in BHL: https://www.biodiversitylibrary.org/search?searchTerm=mihi&stype=F#/titles

Maybe it would be worth trying at least the "scientific name ends here" suggestion? :)
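A minimal sketch of the "scientific name ends here" idea, in Python. It is not how gnfinder or gnparser work; the function name `candidates_before_mihi`, the three-word look-back window, and the rule "start at the nearest capitalised word" are all assumptions made for illustration. Hyphenated forms such as `edulis-mihi` are deliberately not treated as markers.

```python
import re

# Standalone "mihi" only; "edulis-mihi" is not matched (assumption of this sketch).
MIHI = re.compile(r"(?<![\w-])mihi(?![\w-])", re.IGNORECASE)


def candidates_before_mihi(text, max_words=3):
    """Return name candidates that end right before each standalone 'mihi'."""
    results = []
    for match in MIHI.finditer(text):
        words = text[:match.start()].split()
        window = words[-max_words:]
        # Walk backwards and start the candidate at the nearest
        # capitalised word, taken here as a hypothetical genus.
        for i in range(len(window) - 1, -1, -1):
            if window[i][:1].isupper():
                results.append(" ".join(window[i:]))
                break
    return results


print(candidates_before_mihi("4. Lithobius (Polybothrus) caesar mihi."))
print(candidates_before_mihi("Hypomyces chrysospermus f. edulis-mihi K. Bitner 1953"))
```

A real finder would still need to verify each candidate against a name index, since a capitalised word within three words of "mihi" is only weak evidence of a genus.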

dimus commented 2 years ago

Searching the gnverifier database returned 20 names with mihi:

Anisochaeta kiwi mihi Blakemore 2012
Aeolesthes inhirsutus mihi
Bruchus nongoniermani Mihi,
Anisochaeta kiwi mihi
Anisochaeta kiwi mihi Blakemore, 2013
Chyphononyx simulator mihi
Chimila tinguana mihi
Cobosidea mihi
Lithobius leostygis mihi
Conferva geminata var. mihi
Eucyclops serrulatus mihi
Hypomyces chrysospermus f. edulis-mihi K. Bitner 1953
Conferva geminata var. mihi Schwabe
Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966
Lithobius (Polybothrus) leostygis subsp. mihi
Quexua alinella mihi
Lithobius (Polyrbothrus) caesar subsp. mihi
Odonthophagus var. c mihi
Scutella agassizi mihi
Trochus patholatus mihi

dimus commented 2 years ago

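It looks like the word mihi has several meanings. Each record below gives the input name, the matched name, the data source and, where available, the classification path:
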
Conferva geminata var. mihi
Conferva geminata var. mihi Schwabe
AlgaeBase
Eukaryota unassigned phylum|Eukaryota unassigned class|Eukaryota unassigned order||Conferva|Conferva geminata mihi

Conferva geminata var. mihi Schwabe
Conferva geminata var. mihi Schwabe
AlgaeBase
Eukaryota unassigned phylum|Eukaryota unassigned class|Eukaryota unassigned order|Conferva|Conferva geminata mihi

Eucyclops serrulatus mihi
Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966
Catalogue of Life
Biota|Animalia|Arthropoda|Hexanauplia|Copepoda|Neocopepoda|Podoplea|Cyclopoida|Cyclopida|Cyclopidae|Eucyclops|Eucyclops serrulatus serrulatus|Eucyclops serrulatus mihi

Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966
Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966
Catalogue of Life
Biota|Animalia|Arthropoda|Hexanauplia|Copepoda|Neocopepoda|Podoplea|Cyclopoida|Cyclopida|Cyclopidae|Eucyclops|Eucyclops serrulatus serrulatus|Eucyclops serrulatus mihi

Aeolesthes inhirsutus mihi
Aeolesthes inhirsutus mihi
EOL

Chyphononyx simulator mihi
Chyphononyx simulator mihi
EOL

Chimila tinguana mihi
Chimila tinguana mihi
EOL,

Quexua alinella mihi
Quexua alinella mihi
EOL

Cobosidea mihi
Cobosidea mihi
ION

Odonthophagus var. c mihi
Odonthophagus
ION

Scutella agassizi mihi
Scutella agassizi mihi
ION

Trochus patholatus mihi
Trochus patholatus mihi
ION

Lithobius leostygis mihi
Lithobius (Polybothrus) leostygis subsp. mihi
Plazi

Lithobius (Polybothrus) leostygis subsp. mihi
Lithobius (Polybothrus) leostygis subsp. mihi
Plazi

Lithobius (Polyrbothrus) caesar subsp. mihi
Lithobius (Polyrbothrus) caesar subsp. mihi
Plazi

Hypomyces chrysospermus f. edulis-mihi K. Bitner 1953
Hypomyces chrysospermus f. edulis-mihi K. Bitner 1953
Union 4
|Cellular life|Eukaryota|Opisthokonts|Fungi|Fungi|Ascomycota|Sordariomycetes|Hypocreales|Hypocreaceae|Hypomyces|Hypomyces chrysospermus edulis-mihi

Anisochaeta kiwi mihi Blakemore 2012
Anisochaeta kiwi mihi Blakemore, 2013
WoRMS
Biota|Animalia|Annelida|Clitellata|Oligochaeta|Crassiclitellata|Megascolecida|Megascolecidae|Anisochaeta|Anisochaeta kiwi|Anisochaeta kiwi mihi

Anisochaeta kiwi mihi
Anisochaeta kiwi mihi Blakemore 2013
WoRMS
Biota|Animalia|Annelida|Clitellata|Oligochaeta|Crassiclitellata|Megascolecida|Megascolecidae|Anisochaeta|Anisochaeta kiwi|Anisochaeta kiwi mihi

Anisochaeta kiwi mihi Blakemore, 2013
Anisochaeta kiwi mihi Blakemore, 2013
WoRMS
Biota|Animalia|Annelida|Clitellata|Oligochaeta|Crassiclitellata|Megascolecida|Megascolecidae|Anisochaeta|Anisochaeta kiwi|Anisochaeta kiwi mihi

Bruchus nongoniermani Mihi
Bruchus nongoniermani Mihi
uBio NameBank
Bruchus nongoniermani
dimus commented 2 years ago

I don't worry about Union, uBio, ION, and EOL, as they are not human-curated, but AlgaeBase, CoL and WoRMS seem to have names with legitimate use of mihi as an epithet. So the parser should treat at least these names as exceptions to the rule.

Archilegt commented 2 years ago

Many thanks, Dima! Good to know that if "mihi" is applied, it may give "false positives" in a very small subset of names, compared to the "true positives" for which it does represent a terminal element.

Name deduplication: I believe that for the sake of counting potentially affected names, the 20 name instances that you found can be deduplicated down to 15, as follows:

Deduplicated list of names:

  1. Aeolesthes inhirsutus mihi
  2. Anisochaeta kiwi mihi Blakemore 2012
  3. Bruchus nongoniermani Mihi,
  4. Chimila tinguana mihi
  5. Chyphononyx simulator mihi
  6. Cobosidea mihi
  7. Conferva geminata var. mihi Schwabe
  8. Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966
  9. Hypomyces chrysospermus f. edulis-mihi K. Bitner 1953
  10. Lithobius leostygis mihi and Lithobius (Polybothrus) leostygis subsp. mihi
  11. Lithobius (Polyrbothrus) caesar subsp. mihi
  12. Odonthophagus var. c mihi
  13. Quexua alinella mihi
  14. Scutella agassizi mihi
  15. Trochus patholatus mihi
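The count of 15 can be reproduced with a rough normalisation, sketched below in Python. The key function is an assumption made for this illustration, not the procedure used in the thread: it keeps the text up to the first "mihi", drops a parenthesised subgenus and the rank markers `subsp.`, `var.` and `f.`, and lowercases the rest.

```python
import re

NAMES = [
    "Anisochaeta kiwi mihi Blakemore 2012",
    "Aeolesthes inhirsutus mihi",
    "Bruchus nongoniermani Mihi,",
    "Anisochaeta kiwi mihi",
    "Anisochaeta kiwi mihi Blakemore, 2013",
    "Chyphononyx simulator mihi",
    "Chimila tinguana mihi",
    "Cobosidea mihi",
    "Lithobius leostygis mihi",
    "Conferva geminata var. mihi",
    "Eucyclops serrulatus mihi",
    "Hypomyces chrysospermus f. edulis-mihi K. Bitner 1953",
    "Conferva geminata var. mihi Schwabe",
    "Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966",
    "Lithobius (Polybothrus) leostygis subsp. mihi",
    "Quexua alinella mihi",
    "Lithobius (Polyrbothrus) caesar subsp. mihi",
    "Odonthophagus var. c mihi",
    "Scutella agassizi mihi",
    "Trochus patholatus mihi",
]


def dedup_key(name):
    """Hypothetical grouping key: text up to 'mihi', minus subgenus and ranks."""
    match = re.search(r"mihi", name, re.IGNORECASE)
    head = name[:match.end()] if match else name
    head = re.sub(r"\([^)]*\)", " ", head)          # drop "(Subgenus)"
    head = re.sub(r"\b(subsp|var|f)\.", " ", head)  # drop rank markers
    return " ".join(head.lower().split())


unique_keys = {dedup_key(name) for name in NAMES}
print(len(NAMES), "instances,", len(unique_keys), "after deduplication")
```

Dropping authorship and rank is only safe for counting; it would be too aggressive for reconciling names in a database.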
Archilegt commented 2 years ago

Names by Plazi:

Scientific name: Lithobius (Polyrbothrus) caesar mihi https://tb.plazi.org/GgServer/html/299583C14F747A72E86065049FDE3C22 A misspelling for Polybothrus, plus a digitization artifact which should not have included "mihi". Published string is spelled and styled correctly, as "4. Lithobius (Polybothrus) caesar mihi." See https://www.biodiversitylibrary.org/page/13294205

Scientific name: Lithobius (Polybothrus) leostygis subsp. mihi https://tb.plazi.org/GgServer/html/CCEB9C62C87766E980DD858BC13468C8 A digitization artifact which should not have included "mihi". Published string is styled correctly, as "1. Lithobius (Polyhothrus) leostygis mihi". See https://www.biodiversitylibrary.org/page/13294201

Scientific name: Lithobius leostygis mihi This instance points to the one above and I could not find a URL for it.

Result: The three (two when deduplicated) scientific name instances contributed by Plazi are false-positive digitization artifacts, including a misspelling.

Deduplicated list of names v.2 (Plazi names cleared):

  1. Aeolesthes inhirsutus mihi
  2. Anisochaeta kiwi mihi Blakemore 2012
  3. Bruchus nongoniermani Mihi,
  4. Chimila tinguana mihi
  5. Chyphononyx simulator mihi
  6. Cobosidea mihi
  7. Conferva geminata var. mihi Schwabe
  8. Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966
  9. Hypomyces chrysospermus f. edulis-mihi K. Bitner 1953
  10. Odonthophagus var. c mihi
  11. Quexua alinella mihi
  12. Scutella agassizi mihi
  13. Trochus patholatus mihi
Archilegt commented 2 years ago

Anomaly: The name "Odonthophagus var. c mihi", coming from ION, has so many anomalies that it seems irrelevant to GNA for name finding.

Source anomalies: The generic name is given as both "Onthophagus" (https://www.biodiversitylibrary.org/page/8222096) and "Odonthophagus" (https://www.biodiversitylibrary.org/page/8221999) in the "Enumeratio Insectorum Norvegicorum. Fasciculus ii.", which ION points to. Additionally, it is not a scientific name in itself, i.e., it is the name of a variety designated by a single letter.

Digitization anomalies: The name was digitized with the genus "Odonthophagus" instead of "Onthophagus". The name does not include the specific epithet, supposedly "fracticornis", to which "var. c" is to be ascribed. The "mihi" seems to be a false positive added by the recorder, as it is not a text string in the referred publication.

Overall, the name can be considered a false positive for mihi and can be deleted from the list.

Deduplicated list of names v.3:

  1. Aeolesthes inhirsutus mihi
  2. Anisochaeta kiwi mihi Blakemore 2012 (true positive, a regrettable alternative to mihiensis)
  3. Bruchus nongoniermani Mihi,
  4. Chimila tinguana mihi
  5. Chyphononyx simulator mihi
  6. Cobosidea mihi
  7. Conferva geminata var. mihi Schwabe (AlgaeBase, taxon?)
  8. Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966
  9. Hypomyces chrysospermus f. edulis-mihi K. Bitner 1953 (Union 4, Fungi: Ascomycota)
  10. Quexua alinella mihi
  11. Scutella agassizi mihi
  12. Trochus patholatus mihi

@dimus, could someone check the "algal" and fungal names for you, so that we can know if they are true or false positives? A copy of the original publication would be desirable.

dimus commented 2 years ago

The word mihi occurs 192254 times in BHL.

Conferva geminata var. mihi Schwabe: https://verifier.globalnames.org/?capitalize=on&format=html&names=Conferva+geminata+var.+mihi+Schwabe https://www.algaebase.org/search/species/detail/?species_id=93703

edulis-mihi is not a problem, so I do not worry about it.

Archilegt commented 2 years ago

"Conferva geminata var. mihi Schwabe" may be hard to match. The combination is uncurated in AlgaeBase and there is no guarantee that it is an original combination. There are no recorded references for that combination. The original combination may be Oscillatoria geminata Schwabe. When searched for that combination and author, AlgaeBase returns "Oscillatoria geminata Schwabe ex Gomont 1892" (https://www.algaebase.org/search/species/detail/?species_id=51094), which is also not the original treatment. The original treatment for Oscillatoria geminata Schwabe can be found at: Linnaea 11 (1), year 1837 Page 118: https://www.biodiversitylibrary.org/page/35312749 Tab. 1, Fig. 7: https://www.biodiversitylibrary.org/page/35313360

Confirming whether these are two combinations of the same name and whether the "mihi" is an artifact would require consulting with specialists familiar with the historical literature on Conferva and Oscillatoria. However, that is likely the case, as the author matches and there are currently combinations under both genera for a few species.

dimus commented 2 years ago

So my understanding is that really we have only these known exceptions for the parsing rule:

Anisochaeta kiwi mihi Blakemore 2012 
Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966
Archilegt commented 2 years ago

Aeolesthes inhirsutus mihi seems to be another false positive. The name string "Aeolesthes inhirsutus subsp. mihi M.Matsushita, 1932" is deleted from GBIF (https://www.gbif.org/species/8885942). The string may have reached GBIF via JBIF (Japan). See entry for holotype of "Aeolesthes inhirsutus subsp. mihi M.Matsushita, 1932" at https://www.gbif.jp/gbif_search/detail?id=1_sehu-cole_urn:catalog:SEHU:COLE:0000000191

Archilegt commented 2 years ago

About "Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966" Dussart, Bernard; François Graf; and Roger Husson. 1966. Les Crustacés du réservoir de la Fontaine des Suisses à Dijon. International Journal of Speleology, 2: 269-281. http://dx.doi.org/10.5038/1827-806X.2.3.2

The "author" is only Dussart, as he is solely responsible for Copepoda in that publication. The name string "Eucyclops serrulatus var. mihi" is apparently styled correctly (pages 270 and 278). However, this is a printing artifact which became a database artifact. Dussart stated on pp. 270-271 (translated): "The differences existing between these two forms are not sufficient to give a name to the variety with the spine of P5 slender. I need only mention its existence...". Also, as per the first edition of the International Code of Zoological Nomenclature (1961), "Article 15. Names published after 1960. — After 1960, a new name proposed conditionally, or one proposed explicitly as the name of a "variety" or "form" [Art. 45e], is not available." (https://www.biodiversitylibrary.org/page/34584570). This further points to an unnamed form by Dussart (1966), the "mihi" in this case also being a false positive that does not need to be added to the exceptions, at least from the nomenclatural point of view.

dimus commented 2 years ago

Hmm, it looks like the situation is even more interesting with mihi:

https://www.biodiversitylibrary.org/item/181042#page/535/mode/1up

Characium obovatum mihi. b. var. longipes mihi

I wonder if a better approach to mihi is to ignore it, instead of considering it the end of a name. But for gnfinder the use of mihi as a name terminator word might work.

dimus commented 2 years ago

Thank you @Archilegt for the interesting information about Eucyclops serrulatus mihi, I'll pass it along to the CoL guys. Do I understand correctly that in zoology old names with var. or f. are sometimes promoted to subspecies rank? I would still add Eucyclops serrulatus mihi as an exception, because the parser is not a nomenclatural authority and deals with data on a lexical level.

Archilegt commented 2 years ago

Hi @dimus I reported the issue with E. s. mihi to T. Chad Walter (https://www.marinespecies.org/copepoda/index.php) on 13.vi.2022 but I did not receive a reply. Maybe the COL will be able to reach him or someone else. Thanks!

Archilegt commented 2 years ago

Hi @dimus The case of Characium obovatum mihi. b. var. longipes mihi (https://www.biodiversitylibrary.org/page/47100016) is interesting. There you don't have one name but two. The string would be parsed by a human reader as:

Tab. VII. Fig. 3. Characium obovatum mihi
Fig. 3b. Characium obovatum var. longipes mihi

where "b" is not part of the name but the explanation of an illustration (https://www.biodiversitylibrary.org/page/47100082). The two mihi are indeed to be parsed as terminators, but the first one could also be recognized as a connector. Detecting and reconstructing two names and recognizing a "b" as a figure indication might be too much to ask from a parser and could be left to a layer of annotations. For less complex strings (e.g., without the "b") that contain two mihi, of the form

Genus specificEpithet mihi [var., f.] subspecificEpithet mihi

the parsing would be:

if 2 mihi, 
parse mihi 1, 
connect specificEpithet to subspecificEpithet, 
terminate before mihi 2
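
The rule above can be sketched in Python as follows. This is an illustration only, not gnparser's grammar; the function names are hypothetical, and the sketch assumes whitespace-separated tokens and does not handle names where mihi is a genuine epithet.

```python
def is_mihi(token):
    """True for 'mihi', 'mihi.' or 'Mihi,' (trailing punctuation ignored)."""
    return token.rstrip(".,").lower() == "mihi"


def parse_with_mihi(name_string):
    """Skip the first 'mihi' as a connector and stop before the last one used.

    With one 'mihi' the name ends right before it; with two, the first is
    dropped and the name ends before the second.
    """
    tokens = name_string.split()
    positions = [i for i, token in enumerate(tokens) if is_mihi(token)]
    if not positions:
        return " ".join(tokens)
    stop = positions[1] if len(positions) >= 2 else positions[0]
    kept = [t for i, t in enumerate(tokens[:stop]) if i not in positions]
    return " ".join(kept)


print(parse_with_mihi("Characium obovatum mihi. var. longipes mihi"))
print(parse_with_mihi("Scutella agassizi mihi"))
```

Because everything after the terminating mihi is discarded, a string such as "Anisochaeta kiwi mihi Blakemore 2012" would lose both the epithet and the authorship, which is why known exceptions would need separate handling.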

"...in zoology old names with var. or f. sometimes are promoted to subspecies rank?" Yes, you are correct. The ZooCode has Article 45, "The species group", which includes Article 45.5, "Infrasubspecific names". The references therein will guide you to other articles.

Archilegt commented 2 years ago

Hi @dimus Shall we keep this issue open for some preliminary reporting on improved parsing? Or shall we do that via email or GoogleDocs? It would be great to have some stats on the actual improvement of the parser! :D

dimus commented 2 years ago

I do not yet have b. var. as a possible rank (I am not yet sure whether it is common enough to justify adding it to the parser). Characium obovatum mihi. var. longipes mihi is now parsed as Characium obovatum var. longipes:

https://github.com/gnames/gnparser/blob/master/testdata/test_data.md#names-with-mihi

I think it is reasonable enough to close the ticket for now, especially because the parser does not deal with names as they occur in biological texts, and it is extremely rare to have mihi in prepared lists of names.

If more concerns about mihi appear, we can make a new ticket and link it with this one.

Archilegt commented 2 years ago

Dima, please note that b. var. is not a rank: b refers to figure 3b, and var. is a rank.

dimus commented 2 years ago

Ah thank you for spotting it @Archilegt!

"Dima, please note that b. var. is not a rank. b refers to figure 3b; var. is a rank."

Making gnfinder ticket about it https://github.com/gnames/gnfinder/issues/125

Archilegt commented 2 years ago

Ok. If the parsing of Characium obovatum mihi. var. longipes mihi is now Characium obovatum var. longipes, we can mention it as a special case of limitation of the parser, in which one string representing two names (one species, one subspecies) is parsed only to the subspecific name. We don't have to solve all the parsing problems in this round. ;-)

dimus commented 2 years ago

@Archilegt, do you think it makes better sense to parse Characium obovatum mihi. var. longipes mihi as Characium obovatum with var. longipes mihi as an unparseable tail? The parser does assume that a string must have only one name.

I tend to think about this string as an indication of implicit authorship in two places, kind of similar to Aus bus L. cus K.

Archilegt commented 2 years ago

"do you think it makes better sense to parse Characium obovatum mihi. var. longipes mihi as Characium obovatum with var. longipes mihi as an unparseable tail? The parser does assume that a string must have only one name." No, I think that when choosing between two name strings, one should aim at retrieving the longest and most informative string along with the shortest unparseable tail. As it is now.

"I tend to think about this string as an indication of implicit authorship in two places, kind of similar to Aus bus L. cus K." Yes, that would be the case for Characium obovatum mihi. var. longipes mihi. However, here we have Fig 3. Characium obovatum mihi. b. var. longipes mihi In an ideal world, the parser would:

  1. Execute a first parsing, with Fig. #langEn or Abb. #langDE followed by Arabic or Roman numerals ranking higher than scientificName. If Fig. or Abb. and numerals are detected, parse accordingly and wrap the whole string or substrings as explanationOfFigure
  2. Execute a second parsing for #ordered letters, where #a can be omitted, scoring letters higher if they are #letters enclosed by periods. Wrap resulting explanationOfSubfigure.
  3. Trigger name detection within each explanationOfSubfigure, with allowed values for single words specificEpithet and subspecificEpithet. Increase posterior score for explanationOfSubfigure wrappers if mihi terminators or authorName co-occur with periods of #ordered letters.
  4. Trigger name reconnection for explanationOfSubfigure values b to z if single word values specificEpithet and subspecificEpithet exist. Match subspecificEpithet to nearest anterior specificEpithet, match both to nearest anterior genus in order to assemble scientificName.

Example for Fig 3. Characium obovatum mihi. b. var. longipes mihi:

  1. <explanationOfFigure>Fig 3. Characium obovatum mihi. b. var. longipes mihi</explanationOfFigure> #langEn #numeralArabic

  2. <explanationOfFigure>Fig 3. <explanationOfsubfigure>Characium obovatum mihi.</explanationOfsubfigure> #aOmmitted #wrapperScore = 0.25 <explanationOfsubfigure>b. var. longipes mihi</explanationOfsubfigure> #bFirstLetter #wrapperScore = 0.25 </explanationOfFigure>

  3. <explanationOfFigure>Fig 3. <explanationOfsubfigure>Characium obovatum mihi.</explanationOfsubfigure> #mihi #wrapperPostScore = 0.50 <explanationOfsubfigure>b. var. longipes mihi</explanationOfsubfigure> #mihi #wrapperPostScore = 0.50 #subspecificEpithet = true </explanationOfFigure>

  4. <explanationOfFigure>Fig 3. <explanationOfsubfigure>Characium obovatum mihi.</explanationOfsubfigure> #scientificName = Characium obovatum <explanationOfsubfigure>b. var. longipes mihi</explanationOfsubfigure> #bFirstLetter #scientificNameAssembled = Characium obovatum var. longipes </explanationOfFigure>

Does it make sense?
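
A rough Python sketch of the four steps, under simplifying assumptions: no scoring, only "Fig" or "Abb" followed by an Arabic or Roman numeral, subfigure labels limited to a single letter b to z after a period, and a first subfigure whose "a" is omitted. The function `names_from_caption` and its regular expressions are hypothetical and are not part of gnparser or gnfinder. A rank marker "f." directly after a period would be misread as a subfigure label, which is one reason real scoring would be needed.

```python
import re

FIGURE = re.compile(r"^(?:Fig|Abb)\.?\s*(\d+|[IVXLC]+)\.\s*(.*)$")   # step 1
SUBFIGURE = re.compile(r"(?<=\.)\s+([b-z])\.\s+")                    # step 2
MIHI_END = re.compile(r"\s+mihi\.?\s*$", re.IGNORECASE)              # step 3
RANKS = ("var.", "f.", "subsp.")


def names_from_caption(caption):
    """Map subfigure labels to assembled names; {} if this is not a caption."""
    figure = FIGURE.match(caption.strip())
    if figure is None:
        return {}
    parts = SUBFIGURE.split(figure.group(2))
    # The first chunk has no letter, so it is labelled "a" (omitted in print).
    labelled = [("a", parts[0])] + list(zip(parts[1::2], parts[2::2]))
    names = {}
    genus = species = None
    for label, text in labelled:
        text = MIHI_END.sub("", text.strip())  # treat mihi as a terminator
        words = text.split()
        if not words:
            continue
        if words[0] in RANKS and genus and species:
            # Step 4: reconnect to the nearest preceding genus and species.
            names[label] = f"{genus} {species} {text}"
        elif words[0][0].islower() and genus:
            species = words[0]
            names[label] = f"{genus} {text}"
        else:
            genus = words[0]
            species = words[1] if len(words) > 1 else None
            names[label] = text
    return names


print(names_from_caption("Fig 3. Characium obovatum mihi. b. var. longipes mihi"))
```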

dimus commented 2 years ago

I think what you describe is more of a job for gnfinder, because gnparser is designed to work with lists of already processed scientific names, such as personal checklists, databases, or already extracted names. Adding constraints on what gnparser can do helps decrease the number of false positives.

Let's say Characium obovatum mihi. b. var. longipes mihi is in a database. The parser would return:

http://parser.globalnames.org/?format=html&names=Characium+obovatum+mihi.+b.+var.+longipes+mihi&with_details=on

with the lowest parsing quality, 4, and 2 warnings: unparsed tail and ignored annotation. This would allow a database or checklist curator to detect the problem, look at it, and fix it by hand:

{
  "parsed": true,
  "quality": 4,
  "qualityWarnings": [
    {
      "quality": 4,
      "warning": "Unparsed tail"
    },
    {
      "quality": 3,
      "warning": "Ignored annotation `mihi`"
    }
  ],
  "verbatim": "Characium obovatum mihi. b. var. longipes mihi",
  "normalized": "Characium obovatum",
  "canonical": {
    "stemmed": "Characium obouat",
    "simple": "Characium obovatum",
    "full": "Characium obovatum"
  },
  "cardinality": 2,
  "tail": " b. var. longipes mihi",
  "details": {
    "species": {
      "genus": "Characium",
      "species": "obovatum"
    }
  },
  "words": [
    {
      "verbatim": "Characium",
      "normalized": "Characium",
      "wordType": "GENUS",
      "start": 0,
      "end": 9
    },
    {
      "verbatim": "obovatum",
      "normalized": "obovatum",
      "wordType": "SPECIES",
      "start": 10,
      "end": 18
    }
  ],
  "id": "e65f7279-c3f1-5719-9058-a3c024719fde",
  "parserVersion": "v1.6.7"
}
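
A curator-side check over such output could look like the Python sketch below. It uses only field names visible in the JSON above (`qualityWarnings`, `warning`, `tail`, `verbatim`); the helper `review_reasons` and the decision to flag every record with a warning or a non-empty tail are assumptions made for this illustration.

```python
import json

# Trimmed copy of the parser output shown above.
record = json.loads("""
{
  "parsed": true,
  "quality": 4,
  "qualityWarnings": [
    {"quality": 4, "warning": "Unparsed tail"},
    {"quality": 3, "warning": "Ignored annotation `mihi`"}
  ],
  "verbatim": "Characium obovatum mihi. b. var. longipes mihi",
  "normalized": "Characium obovatum",
  "tail": " b. var. longipes mihi"
}
""")


def review_reasons(parsed):
    """Collect the reasons a curator might want to inspect this record."""
    reasons = [w["warning"] for w in parsed.get("qualityWarnings", [])]
    tail = parsed.get("tail", "").strip()
    if tail:
        reasons.append(f"tail: {tail!r}")
    return reasons


reasons = review_reasons(record)
if reasons:
    print(record["verbatim"], "->", "; ".join(reasons))
```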