gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Improve parsing of complex author strings #184

Closed by KatjaSchulz 3 years ago

KatjaSchulz commented 3 years ago

Some names with complex author strings don’t get parsed properly, resulting in part of the author string being interpreted as an epithet. I have seen this happen if the following declarations are included in the author string: non|nec|fide|vide|ms

Examples:

names string > gnparser FullCanonical
Eulima excellens Verkrüzen fide Paetel, 1887 > Eulima excellens fide
Amathia tricornis Busk ms in Chimonides, 1987 > Amathia tricornis ms
Crisia eburneodenticulata Smitt ms in Busk, 1875 > Crisia eburneodenticulata ms
Procamallanus (Spirocamallanus) soodi Lakshmi & Kumari, 2001 nec (Gupta & Masood, 1988) > Procamallanus soodi nec
Membranipora minuscula Canu, 1911 non Hincks, 1882 > Membranipora minuscula non
Hornera radians Defrance, 1821 non (Lamarck, 1816) > Hornera radians non
Hornera verrucosa Reuss, 1851 non Reuss, 1848 > Hornera verrucosa non
Crisina excavata (d'Orbigny, 1853) non (d'Orbigny, 1853) > Crisina excavata non
Proboscina subechinata Canu & Bassler, 1920 non d'Orbigny, 1853 > Proboscina subechinata non
Diaperoecia rugosa Canu & Bassler, 1920 non Osburn, 1940 > Diaperoecia rugosa non
Plagioecia parvipora (Canu & Bassler, 1929) non Canu, 1922 > Plagioecia parvipora non
Diastopora papyracea (d'Orbigny, 1853) non d'Orbigny, 1851 > Diastopora papyracea non
Mesenteripora foliacea (d'Orbigny, 1852) non (Lamouroux, 1821) > Mesenteripora foliacea non
Crisisina carinata (Römer, 1840) non (Reuss, 1846) > Crisisina carinata non
Berenicea undata Canu & Bassler, 1920 non Canu, 1931 > Berenicea undata non
Berenicea stipata Canu & Bassler, 1920 non Canu, 1917 > Berenicea stipata non
Multicrescis mamillosa Canu & Bassler, 1926 non (Römer, 1840) > Multicrescis mamillosa non
Calloporella lamellaris (Bekker, 1921) non (Modzalevskaya, 1955) > Calloporella lamellaris non
Homotrypa similis Foord, 1883 non Caley, 1936 > Homotrypa similis non
Monticulipora affinis Počta, 1902 non (Ulrich, 1890) > Monticulipora affinis non
Stenopora permiana Yang, 1958 non (Bassler, 1929) > Stenopora permiana non
Stenopora meekana (Girty, 1907) non Ulrich, 1890 > Stenopora meekana non
Meliceritites transversa Canu & Bassler, 1926 non (d'Orbigny, 1852) > Meliceritites transversa non
Antedon longicirra (AH Clark, 1912) non Carpenter, 1888 > Antedon longicirra non
Porina reussi Meneghini in De Amicis, 1885 vide Neviani (1900) > Porina reussi vide

As far as I know, "non", "nec", "ms", "fide", and "vide" are not legitimate epithets for any species or subspecies. Catalogue of Life has a few ciliates with “non” as the infraspecific epithet, but the GSD that provides these names has all kinds of other data quality problems, so I think these epithets are probably also artifacts of similar parsing errors in the past.
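
To make the failure pattern concrete, here is a minimal Python sketch that finds the first of the markers listed above in a name string and splits the string around it. The helper `split_annotation` is hypothetical and is not how gnparser works internally; it also assumes that the markers should be matched case-insensitively and only as whole, whitespace-delimited words.

```python
import re

# Hypothetical helper, not part of gnparser. The marker list is taken from
# the report above (non|nec|fide|vide|ms); case-insensitive matching is an
# assumption.
ANNOTATION = re.compile(r"\s+(non|nec|fide|vide|ms)\s+", re.IGNORECASE)


def split_annotation(name_string):
    """Split a name string at the first annotation marker, if there is one.

    Returns (head, marker, tail); marker is None when nothing matched.
    """
    match = ANNOTATION.search(name_string)
    if match is None:
        return name_string, None, ""
    head = name_string[:match.start()]
    tail = name_string[match.end():]
    return head, match.group(1).lower(), tail


for name in (
    "Membranipora minuscula Canu, 1911 non Hincks, 1882",
    "Amathia tricornis Busk ms in Chimonides, 1987",
):
    print(split_annotation(name))
```

A check like this would also flag the Catalogue of Life ciliates mentioned above, so it is only useful as a screening step, not as a parsing rule.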

dimus commented 3 years ago

@KatjaSchulz, do I understand correctly that 'Aus bus Beck in Ken', 'Aus bus Beck ms in Ken', and 'Aus bus Beck ex Ken' are all variants of the same name?

KatjaSchulz commented 3 years ago

Yes, I think these can be variants of the same name, albeit with slight differences in meaning. They all indicate that the author of the name (Beck) is not the author of the work in which the name was published. The ms (or sometimes also MS) apparently indicates that the name originated in an unpublished manuscript, e.g., here's the Chimonides, 1987 reference for "Amathia tricornis Busk ms in Chimonides, 1987": https://www.biodiversitylibrary.org/page/2301924
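
As an illustration only, the three variants discussed here can be told apart by the connector between the two author parts. The function `split_connector` below is a hypothetical name, not a gnparser API, and the sketch assumes the connector appears as a whole word surrounded by whitespace ("ms" is matched case-insensitively because "MS" also occurs).

```python
import re

# Hypothetical helper, not part of gnparser. "ms in" is listed first so the
# longer connector is preferred over plain "in" at the same position.
CONNECTOR = re.compile(r"\s+(ms\s+in|in|ex)\s+", re.IGNORECASE)


def split_connector(name_string):
    """Return (before, connector, after), or (name_string, None, "")."""
    match = CONNECTOR.search(name_string)
    if match is None:
        return name_string, None, ""
    # Collapse inner whitespace and case so "MS  in" is reported as "ms in".
    connector = " ".join(match.group(1).lower().split())
    return name_string[:match.start()], connector, name_string[match.end():]


for name in ("Aus bus Beck in Ken", "Aus bus Beck ms in Ken", "Aus bus Beck ex Ken"):
    print(split_connector(name))
```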

dimus commented 3 years ago

What do you think about treating it this way, @KatjaSchulz? It is similar to how we currently handle "in" and "ex":

{
  "parsed": true,
  "quality": 2,
  "qualityWarnings": [
    {
      "quality": 2,
      "warning": "Ex authors are not required"
    }
  ],
  "verbatim": "Amathia tricornis Busk ms in Chimonides, 1987",
  "normalized": "Amathia tricornis Busk ex Chimonides 1987",
  "canonical": {
    "stemmed": "Amathia tricorn",
    "simple": "Amathia tricornis",
    "full": "Amathia tricornis"
  },
  "cardinality": 2,
  "authorship": {
    "verbatim": "Busk ms in Chimonides, 1987",
    "normalized": "Busk ex Chimonides 1987",
    "authors": [
      "Busk"
    ],
    "originalAuth": {
      "authors": [
        "Busk"
      ],
      "exAuthors": {
        "authors": [
          "Chimonides"
        ],
        "year": {
          "year": "1987"
        }
      }
    }
  }
}
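
For anyone consuming output shaped like the record above, the following Python sketch shows one way to pull out the canonical form and the authorship parts. The field names are copied from that record; the `summarize` function and the trimmed sample are illustrative assumptions, and real output may contain fields not handled here.

```python
import json


def summarize(parsed):
    """Collect the canonical name and authorship parts of one parsed record."""
    original = parsed.get("authorship", {}).get("originalAuth", {})
    ex = original.get("exAuthors", {})
    return {
        "canonical": parsed["canonical"]["full"],
        "authors": original.get("authors", []),
        "ex_authors": ex.get("authors", []),
        "ex_year": ex.get("year", {}).get("year"),
    }


# Trimmed copy of the record shown above; only the fields used here are kept.
record = json.loads("""
{
  "canonical": {"full": "Amathia tricornis"},
  "authorship": {
    "originalAuth": {
      "authors": ["Busk"],
      "exAuthors": {"authors": ["Chimonides"], "year": {"year": "1987"}}
    }
  }
}
""")

print(summarize(record))
```

The point of the sketch is that the canonical form no longer carries "ms": the marker is absorbed into the authorship, and Chimonides, 1987 ends up under `exAuthors`.
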
KatjaSchulz commented 3 years ago

Looks perfect, thanks!