gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Authors are interpreted as subgenera #265

Open KatjaSchulz opened 5 months ago

KatjaSchulz commented 5 months ago

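These are all valid/accepted names from the current version of the Catalogue of Life. For each name, the simple and full canonical forms returned by gnparser are shown; in every case the parenthesized author was parsed as a subgenus.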

Plant genera:

Nassella (Trin.) É.Desv. – simple: Trin. – full: Nassella subgen. Trin.
Dacrycarpus (Endl.) de Laub. – simple: Endl. – full: Dacrycarpus subgen. Endl.
Lysiphyllum (Benth.) de Wit – simple: Benth. – full: Lysiphyllum subgen. Benth.
Tricholemma (Röser) Röser – simple: Roeser – full: Tricholemma subgen. Roeser
Isogonium (Kützing) de Bary – simple: Kuetzing – full: Isogonium subgen. Kuetzing
Euptilota (Kützing) Kützing, 1849 – simple: Kuetzing – full: Euptilota subgen. Kuetzing
Setiechinopsis (Backeb.) de Haas – simple: Backeb. – full: Setiechinopsis subgen. Backeb.

Chromista genera:

Cyclotella (Kützing) de Brebisson – simple: Kuetzing – full: Cyclotella subgen. Kuetzing
Tabularia (Kützing) Williams & Round – simple: Kuetzing – full: Tabularia subgen. Kuetzing
Cyrtolophosis (Schew.) – simple: Schew. – full: Cyrtolophosis subgen. Schew.
Pyrocystis (Schütt) Lemmermann, 1899 – simple: Schuett – full: Pyrocystis subgen. Schuett

Chromista families:

Anaulaceae (Schütt) Lemmermann – simple: Schuett – full: Anaulaceae subgen. Schuett
Triceratiaceae (Schütt) Lemmermann – simple: Schuett – full: Triceratiaceae subgen. Schuett
Pyxillaceae (Schütt) Simonsen – simple: Schuett – full: Pyxillaceae subgen. Schuett
Pyrocystaceae (Schütt) Lemmermann, 1899 – simple: Schuett – full: Pyrocystaceae subgen. Schuett
Aulacodiscaceae (Schütt) Lemmermann – simple: Schuett – full: Aulacodiscaceae subgen. Schuett
Stictodiscaceae (Schütt) Simonsen – simple: Schuett – full: Stictodiscaceae subgen. Schuett
Lauderiaceae (Schütt) Lemmermann – simple: Schuett – full: Lauderiaceae subgen. Schuett

Protozoa family:

Cyrtolophosidiidae (Schew.) – simple: Schew. – full: Cyrtolophosidiidae subgen. Schew.

dimus commented 3 months ago

Thank you @KatjaSchulz for catching this. I am not sure yet how to fix it, because in many cases Aus (Bus) does mean Aus subgen. Bus.

Do names like this occur specifically in botanical names?

KatjaSchulz commented 3 months ago

Yes, this is a tricky one. All the examples I found were taxa under the botanical code, except for the Cyrtolophosidiidae (Schew.) example which is a really weird one that has since been removed from COL.

One approach to fix this could be a blacklist of strings that can never be interpreted as subgenus names. I think it's pretty safe to put the author strings above on that list. But after digging some more, I also found this name: Sigmoidotropis (Piper) A.Delgado. I don't think there are any subgenera named Piper, but I don't know if I would be comfortable putting that name on the blacklist.

Another approach would be to add processing of rank information to gnparser. I usually have that information for most names I am trying to parse, and I use it to double-check the gnparser results. I realize that would probably be quite a bit of work to implement.
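The rank-based double-check described above could look roughly like the following. This is a minimal sketch, not a gnparser feature: it assumes the rank of each name is known from the source data and that the full canonical form has already been obtained from gnparser. The function name `looks_misparsed` and the set of ranks are hypothetical.

```python
# Ranks at which a "subgen." element in the parsed result is suspicious.
# This set is an assumption for illustration, not part of gnparser.
RANKS_WITHOUT_SUBGENUS = {"genus", "family"}


def looks_misparsed(full_canonical: str, rank: str) -> bool:
    """Return True when a genus- or family-rank name was parsed with a subgenus."""
    has_subgenus = " subgen. " in f" {full_canonical} "
    return rank.lower() in RANKS_WITHOUT_SUBGENUS and has_subgenus


# (full canonical form from the parser, rank known from the data set)
rows = [
    ("Nassella subgen. Trin.", "genus"),
    ("Anaulaceae subgen. Schuett", "family"),
    ("Nassella", "genus"),
]
flagged = [name for name, rank in rows if looks_misparsed(name, rank)]
print(flagged)  # ['Nassella subgen. Trin.', 'Anaulaceae subgen. Schuett']
```

A check like this only flags suspicious results for review; it does not recover the correct authorship.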

Anyway, here are a few more names I found in the COL 2024 annual archive:

Plant genera:

Hexaphylla (Klokov) P.Caputo & Del Guacchio – simple: Klokov – full: Hexaphylla subgen. Klokov
Parogonum (Haraldson) Desjardins & J. P. Bailey – simple: Haraldson – full: Parogonum subgen. Haraldson
Ericetorum (Jermy) Li Bing Zhang & X. M. Zhou – simple: Jermy – full: Ericetorum subgen. Jermy
Archidasyphyllum (Cabrera) P. L. Ferreira, Saavedra & Groppo – simple: Cabrera – full: Archidasyphyllum subgen. Cabrera
Lamyropsis (Kharadze) Dittrich – simple: Kharadze – full: Lamyropsis subgen. Kharadze
Sigmoidotropis (Piper) A.Delgado – simple: Piper – full: Sigmoidotropis subgen. Piper
Moquiniastrum (Cabrera) G. Sancho – simple: Cabrera – full: Moquiniastrum subgen. Cabrera

Chromista genera:

Hormosira (Endlichter) Meneghini, 1838 – simple: Endlichter – full: Hormosira subgen. Endlichter
Syracolithus (Kamptner) Deflandre in Grassé, 1952 – simple: Kamptner – full: Syracolithus subgen. Kamptner

dimus commented 3 months ago

I do have a list of botanical genus authors (https://github.com/gnames/gnparser/blob/master/io/dict/data/genera_auth_icn.txt). For binomials and trinomials, if the text in parentheses after the genus matches an author on that list and is not ambiguous, I treat it as authorship. I can extend this rule to uninomials as well.

This is pretty close to your suggestion @KatjaSchulz, as I understood it
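To make the rule concrete, here is a rough sketch of how an author list might be applied to uninomials. It is not the gnparser implementation (gnparser is written in Go and uses its own grammar). The small `GENUS_AUTHORS` set stands in for the real author list, the `AMBIGUOUS` set and the choice to put Piper in it are assumptions for illustration, and `classify_parenthetical` is a hypothetical name.

```python
import re

# Tiny stand-in for the botanical genus-author list (illustrative only).
GENUS_AUTHORS = {"Trin.", "Endl.", "Benth.", "Cabrera", "Piper"}
# Author strings that might also be genus-group names (assumed for illustration).
AMBIGUOUS = {"Piper"}

# A capitalized uninomial followed by a single parenthesized element.
UNINOMIAL_WITH_PARENS = re.compile(r"^(?P<name>[A-Z][a-z]+) \((?P<inner>[^()]+)\)")


def classify_parenthetical(name_string: str) -> str:
    """Decide how to read the parenthesized text after a uninomial."""
    match = UNINOMIAL_WITH_PARENS.match(name_string)
    if match is None:
        return "no parenthetical"
    inner = match.group("inner")
    if inner in GENUS_AUTHORS and inner not in AMBIGUOUS:
        return "authorship"
    # Unknown or ambiguous strings are left to the default interpretation.
    return "undecided"


print(classify_parenthetical("Nassella (Trin.) É.Desv."))          # authorship
print(classify_parenthetical("Sigmoidotropis (Piper) A.Delgado"))  # undecided
print(classify_parenthetical("Aus (Bus)"))                         # undecided
```

Keeping a separate set for ambiguous strings means a name such as Sigmoidotropis (Piper) A.Delgado is never silently forced into either reading.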

dimus commented 1 month ago

@KatjaSchulz, would implementing #267 help with your use case? If all names are known to be botanical, there would be no ambiguity in parsing such names.

KatjaSchulz commented 1 month ago

Yes, I think so. Since I am usually running comprehensive data sets through gnparser, it would be a little bit more work to separate names by code, but it would be feasible. There may be lingering problems with some microorganisms, but I think those would be negligible. Thanks!

dimus commented 2 weeks ago

Oops, I did not mean to close this one, reopening...

Some plant names are now recognized, some still have problems, and Chromista authors are not recognized yet.

There is a new option: code. It lets you force names to be parsed by ICN rules:

https://parser.globalnames.org/api/Hormosira%20(Endlichter)%20Meneghini,%201838?code=bot

https://parser.globalnames.org/?code=botanical&format=html&names=Syracolithus+%28Kamptner%29+Deflandre+in+Grass%C3%A9%2C+1952&with_details=on

Supported values: bact, bacterial, ICNP, bot, botanical, ICN, cult, cultivar, ICNCP, zoo, zoological, ICZN.
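For scripting, a request URL in the same shape as the first example above can be assembled as follows. This is only a sketch: the base URL and the `code` query parameter are copied from the examples in this thread, the helper name `parse_url` is hypothetical, and treating the supported values as case-sensitive is an assumption. The code only builds the URL and makes no network request.

```python
from urllib.parse import quote, urlencode

# Base URL as it appears in the example above.
BASE = "https://parser.globalnames.org/api/"

# Values listed in this thread; exact-match comparison is an assumption.
SUPPORTED_CODES = {
    "bact", "bacterial", "ICNP",
    "bot", "botanical", "ICN",
    "cult", "cultivar", "ICNCP",
    "zoo", "zoological", "ICZN",
}


def parse_url(name: str, code: str) -> str:
    """Build a parser URL that forces the given nomenclatural code."""
    if code not in SUPPORTED_CODES:
        raise ValueError(f"unsupported code: {code!r}")
    # Keep parentheses and commas readable, percent-encode everything else.
    return BASE + quote(name, safe="(),") + "?" + urlencode({"code": code})


print(parse_url("Hormosira (Endlichter) Meneghini, 1838", "bot"))
# https://parser.globalnames.org/api/Hormosira%20(Endlichter)%20Meneghini,%201838?code=bot
```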