gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Cannot parse some unusual epithets #199

dimus closed 2 years ago

dimus commented 2 years ago

From #191 by @KatjaSchulz: Names with strange epithets that end up in an unparsed tail:

Seleuca chûjôi Voss, 1957
Peperomia non-alata Trel.
Hyacinthoides non-scripta (L.) Chouard ex Rothm.
Monocelis non-scripta Curini-Galletti, 2014
Macromitrium st.-johnii E. B. Bartram, 1940

abubelinha commented 2 years ago

From https://github.com/gnames/gnparser/issues/191#issuecomment-912966149

Names with seemingly innocuous epithets that nevertheless end up in an unparsed tail:

Profusulinella оblопgа Potievskaya, 1964
Tetrataxis toгosus Postoyalko, 1975
Anomalina badkhyseпsis Kuryleva, 1973
Bigenerina iпfrapaleogenica Suleymanov, 1963
Carpelimus (Trogophloeus) rougemoпti Gildenkov, 2014

All of these contain Cyrillic characters that are not supposed to appear in scientific names (п, г), so they are not parsed correctly.

gnparser silently replaces "fake spaces" (mostly non-breaking space symbols coming from copy-pasted web pages, I guess: very difficult to track them). They are turned into normal spaces in the parsed results (normalized, canonicals, authorship and so on). No warnings. Great job.
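The whitespace handling described above can be illustrated with a minimal sketch. This is not gnparser's actual code; the function name `normalize_spaces` is hypothetical, and the sketch assumes that "fake spaces" means characters in Unicode category Zs (space separators), such as the no-break space U+00A0.

```python
import unicodedata


def normalize_spaces(text: str) -> str:
    """Replace every Unicode space separator (category Zs) with a plain space."""
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch
        for ch in text
    )


# U+00A0 (no-break space) between genus and epithet becomes U+0020.
name = "Anthyllis\u00a0pe\u00f1alarensis"
print(normalize_spaces(name) == "Anthyllis pe\u00f1alarensis")  # True
```

Zero-width characters such as U+200B belong to category Cf, so this sketch leaves them untouched; they would need separate handling.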

Those Cyrillic п and г characters closely resemble the Latin n and r. They are also very difficult to detect when looking at an Excel file, for example. I guess they were typed because the writer's keyboard lacked the Latin characters he/she was trying to reproduce. Would it be possible for gnparser to automatically detect them, replace them with their Latin look-alikes, and raise the quality warning "Non-standard characters in canonical"?
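To make the request concrete, here is a rough sketch of the detect-and-replace idea. It is only an illustration, not gnparser code: `find_cyrillic`, `replace_lookalikes` and the look-alike table are hypothetical. The thread names only п and г; the entries for о and а are an added assumption. The sketch is meant to run on a single name word (genus or epithet), not on the authorship, where Cyrillic could be legitimate.

```python
import unicodedata

# Hypothetical look-alike table. Only the first two pairs come from this
# thread; the last two are assumed additions.
CYRILLIC_LOOKALIKES = {
    "\u043f": "n",  # Cyrillic pe
    "\u0433": "r",  # Cyrillic ghe
    "\u043e": "o",  # Cyrillic o (assumed)
    "\u0430": "a",  # Cyrillic a (assumed)
}


def find_cyrillic(word):
    """Return (index, character) for every Cyrillic character in word."""
    return [
        (i, ch)
        for i, ch in enumerate(word)
        if unicodedata.name(ch, "").startswith("CYRILLIC")
    ]


def replace_lookalikes(word):
    """Return (fixed_word, warnings); unknown Cyrillic letters are kept."""
    if not find_cyrillic(word):
        return word, []
    fixed = "".join(CYRILLIC_LOOKALIKES.get(ch, ch) for ch in word)
    return fixed, ["Non-standard characters in canonical"]


# "to\u0433osus" is the epithet of the Tetrataxis example above.
print(replace_lookalikes("to\u0433osus"))
# ('torosus', ['Non-standard characters in canonical'])
```

A Cyrillic letter that is missing from the table still triggers the warning but is left in place, so the caller can decide whether to reject the name.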

Yes, I know those cyrillic characters could be part of the authorship. But the same is already happening to some other characters which are valid in authorships but not in Latin/latinized names, and gnparser is correctly detecting, replacing and warning. Much like it already happens to ñ (parsing "Anthyllis peñalarensis Vázquez, Piñón & Montañés", returns species:"penalarensis" and canonical "Anthyllis penalarensis", but authorship "Vázquez, Piñón & Montañés", and this quality warning: "Non-standard characters in canonical").

A similar approach could be used for the chûjôi epithet above: change úùûü into u, óòôöø into o, and so on for all a, e and i variants.
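One way to sketch this folding is Unicode decomposition: split each accented letter into its base letter plus combining marks, then drop the marks. This is an illustration of the proposal, not how gnparser works; `fold_diacritics` and the small extra table are hypothetical names. Note that it folds ö to o, which is the suggestion made here rather than the behaviour reported for gnparser.

```python
import unicodedata

# Letters without a canonical decomposition need an explicit entry
# (assumed table, not taken from gnparser).
NO_DECOMPOSITION = {
    "\u00f8": "o",   # o with stroke
    "\u0153": "oe",  # oe ligature
}


def fold_diacritics(word: str) -> str:
    """Strip combining marks after NFD decomposition, then map leftovers."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return "".join(NO_DECOMPOSITION.get(ch, ch) for ch in stripped)


print(fold_diacritics("ch\u00fbj\u00f4i"))   # chujoi
print(fold_diacritics("pe\u00f1alarensis"))  # penalarensis
```

Because the input is decomposed first, the function gives the same result whether the accents arrive precomposed or as separate combining marks.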

abubelinha commented 2 years ago

Some vowels could be trickier than others: I see gnparser is already changing ö into oe (instead of o). Perhaps that would make sense if we were dealing with phonetics, but that is not the case for gnparser.

I believe replacing ö with a simple o makes more sense. It would be a much more conservative approach, making no assumptions about the intended pronunciation, because the same symbol (ö, for example) can be used for different sounds in different languages.

In my opinion, for gnparser it should be just a matter of replacing symbols which are invalid in Latin words. So why not use the closest Latin symbol from which they were derived ("ö" is the Latin letter "o" modified with an umlaut or diaeresis)?

Otherwise, gnparser should do the same with ñ, which never sounds like n (it sounds more like ny in Spanish, similar to the Portuguese nh combination). So I think it would be much better to keep things simple and use n and o in both cases.

This leads me to the question of double-vowel characters. These are not frequent, but I have seen them in some scientific names, for example Hierochlœ instead of Hierochloe. I think gnparser already parses them correctly, splitting the symbol into two separate Latin vowels.

dimus commented 2 years ago

Those Cyrillic characters closely resemble the Latin n and r. They are also very difficult to detect when looking at an Excel file, for example. Would it be possible for gnparser to automatically detect them, replace them with their Latin look-alikes, and raise the quality warning "Non-standard characters in canonical"?

I found that it is wise to stop at some point when trying to "fix" apparent problems in a name. For example, GNparser does not try to parse names with capitalized epithets or with lowercase genera (unless explicitly required by a setting). Otherwise the number of false positives becomes too big, and really bad names get into datasets.

It is a judgement call what to parse and what not to parse, and often it is just intuition. A few diacritics used to be legal in scientific names, so these are "allowed" to a degree, but Cyrillic letters were never allowed, so I consider them to be an unparseable mistake.

dimus commented 2 years ago

Seleuca chûjôi Voss, 1957

I think this name belongs to the unparseable category.

dimus commented 2 years ago

This leads me to the question of double-vowel characters.

Sadly, there is no good solution. Some letters are easy, like œ, as almost everybody (and the codes) normalizes them the same way, to oe. Other diacritics are very tricky:

  1. The rules about their transliteration are confusing and depend on the country of the name's author as well as the year when the name was published.
  2. People transliterate these names differently in different datasets, so sometimes ö is transliterated to o and sometimes to oe for the same name-string.

abubelinha commented 2 years ago

Sorry, I was editing my original comments while you answered, and finished and posted later on.

In summary, for the ö case, I think o is a much more conservative approach than oe (which looks like a Germanic phonetic replacement, but gnparser does not do that in other cases like ñ, which is replaced by n even though it sounds more like ny in Spanish).

New comment now:

As there could be different opinions about this, I wonder whether a future version of gnparser could accept an array of replacements (i.e. a config file, or something we can post through the API) so that we can force it to turn ó/ò/ô/ø/ö into o (instead of oe), п/ñ into n, г into r, and so on (a user choice to override the defaults).
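The kind of override meant here could look roughly like the sketch below. Everything in it is hypothetical: gnparser has no such option as far as this thread shows, and `DEFAULT_REPLACEMENTS`, `transliterate` and the `overrides` parameter are names invented for illustration. The defaults only mirror the behaviour reported in this thread (ö to oe, ñ to n, œ to oe).

```python
import unicodedata

# Assumed defaults, mirroring what this thread reports.
DEFAULT_REPLACEMENTS = {
    "\u00f6": "oe",  # o with diaeresis
    "\u00f1": "n",   # n with tilde
    "\u0153": "oe",  # oe ligature
}


def transliterate(word, overrides=None):
    """Apply the defaults, letting user-supplied entries win on conflict."""
    table = {**DEFAULT_REPLACEMENTS, **(overrides or {})}
    # Compose first so that "o" + combining diaeresis matches the table key.
    composed = unicodedata.normalize("NFC", word)
    return composed.translate(str.maketrans(table))


# User-supplied replacements, e.g. loaded from a config file.
user_config = {"\u00f6": "o", "\u043f": "n", "\u0433": "r"}

# "m\u00f6lleri" is a made-up epithet used only for illustration.
print(transliterate("m\u00f6lleri"))                 # moelleri
print(transliterate("m\u00f6lleri", user_config))    # molleri
print(transliterate("rougemo\u043fti", user_config))  # rougemonti
```

Keeping the defaults and the overrides in separate tables means a user who supplies nothing gets unchanged behaviour.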

Perhaps the Cyrillic character issue (keyboard-originated / OCR-originated / spell-checker-originated?) could be frequent in some scenarios, and it would be good to let gnparser correct it when we know it is happening. Spell checkers have the side effect of capitalizing the first letter of some words (after "subsp." or "var."), and depending on the spell checker's language, they could be the origin of some of the accented characters in Latin names.

abubelinha commented 2 years ago

Regarding some of the strange epithets which cannot be parsed, I have just found that "Lychnis flos-cuculi L." is correctly parsed. So the problem with the non-scripta and st.-johnii epithets seems to be the non or st. substrings, rather than the - symbol.