Figure out how to treat diacritics better

dimus commented 2 years ago

@abubelinha raised the following in #199:

In summary, for the ö case, I think o is a much more conservative approach than oe (which looks like a Germanic phonetic replacement, but gnparser does not do that in other cases like ñ, which is replaced by n even though it sounds more like ny in Spanish).

New comment now:

As there could be different opinions about this, I wonder if in a future version it could be possible to feed gnparser with an array of replacements (i.e. a config file, or something we can post through the api) so we can force it to turn ó/ò/ô/ø/ö into o (instead of oe), п/ñ into n, г into r, and so on (a user choice to override defaults).
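To make the request concrete, here is a minimal sketch of a user-supplied replacement table layered over defaults. It is not gnparser's API: `DEFAULT_REPLACEMENTS`, `transliterate`, and the `overrides` parameter are hypothetical names, and the defaults only loosely mirror the behaviour described in this thread.

```python
import unicodedata

# Hypothetical defaults, loosely mirroring the behaviour described in this thread.
DEFAULT_REPLACEMENTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ñ": "n", "ø": "o"}


def transliterate(name, overrides=None):
    """Replace characters one by one; user overrides win over the defaults."""
    table = {**DEFAULT_REPLACEMENTS, **(overrides or {})}
    # Compose "o" + combining diaeresis into a single "ö" before the lookup.
    composed = unicodedata.normalize("NFC", name)
    return "".join(table.get(ch, ch) for ch in composed)


# A user choice that forces the conservative mapping and fixes Cyrillic look-alikes.
user_overrides = {"ö": "o", "ó": "o", "ò": "o", "ô": "o", "п": "n", "г": "r"}

print(transliterate("Aus bös"))                  # Aus boes
print(transliterate("Aus bös", user_overrides))  # Aus bos
```

The sketch handles lowercase characters only; a real table would also need uppercase entries and a decision about whether the overrides apply to the normalized form, the canonical forms, or both.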

Perhaps the Cyrillic characters issue (keyboard-originated / OCR-originated / spell-checker-originated?) could be frequent in some scenarios, and it would be good to let gnparser correct this when we know it's happening. Spell checkers have the side effect of capitalizing the first letter of some words (after "subsp." or "var."); and depending on the spell checker's language, they could be the origin of some of the accented characters in Latin names.

dimus commented 2 years ago

ICN and ICZN treat diacritics differently; on top of that, people transliterate them inconsistently from case to case. So we can have several lexical variants of the same name:

1. Aus bös
2. Aus boes
3. Aus bos

While 1 and 2 will get the same canonical form "Aus boes", the 3rd one will get "Aus bos".

For long names this is still not a huge problem, as the names will match fuzzily, but for short names fuzzy matching is not used, to avoid false positives.

Proposed idea:

Keep Canonical.Full and Canonical.Simple the same as now.
When generating Canonical.Stem, transliterate all "oe" to "o", and do the same for all other German diacritics.

Positive outcome: All 3 cases from the example above will match

Negative outcome: We might create a significant number of false positives.
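A rough sketch of the proposed stem rule, to show both outcomes. It assumes the stem is derived from an already-parsed canonical string and ignores everything else the real stemmer does (suffix removal, v to u, and so on); `stem_key` is a hypothetical helper, not a gnparser function.

```python
# Umlauts and their "e" digraphs both collapse to the bare vowel.
UMLAUTS = {"ä": "a", "ö": "o", "ü": "u"}
DIGRAPHS = {"ae": "a", "oe": "o", "ue": "u"}


def stem_key(canonical):
    """Collapse German umlauts and their digraph spellings for matching only."""
    key = canonical
    for src, dst in {**UMLAUTS, **DIGRAPHS}.items():
        key = key.replace(src, dst)
    return key


for variant in ("Aus bös", "Aus boes", "Aus bos"):
    print(variant, "->", stem_key(variant))  # all three give "Aus bos"

# The same rule also rewrites names that have nothing to do with German:
print(stem_key("Quercus alba"))   # Qurcus alba
print(stem_key("Aus caeruleus"))  # Aus caruleus
```

The last two lines illustrate where false positives would come from: Latin "ae", "oe", and "ue" sequences are collapsed as well, so unrelated stems can end up identical.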

tobymarsden commented 2 years ago

@dimus A name demonstrating an issue that I'm seeing is Leptochloöpsis virgata.

Currently the output is (incorrectly for ICN):

  "verbatim": "Leptochloöpsis virgata",
  "normalized": "Leptochlooepsis virgata",
  "canonical": {
    "stemmed": "Leptochlooepsis uirgat",
    "simple": "Leptochlooepsis virgata",
    "full": "Leptochlooepsis virgata"
  },

With the new --diaereses option enabled, this is the output:

  "verbatim": "Leptochloöpsis virgata",
  "normalized": "Leptochloöpsis virgata",
  "canonical": {
    "stemmed": "Leptochloopsis uirgat",
    "simple": "Leptochloöpsis virgata",
    "full": "Leptochloöpsis virgata"
  },

(Note the transliteration of the ö in stemmed.)

While I think your proposed idea is an improvement, I wonder if ä, ö, ü should always be transliterated to a, o, u when they come after a vowel, everywhere (not just in stemmed). Otherwise there's no way to correctly parse e.g. Leptochloöpsis virgata without choosing to preserve the diaereses. I confess I don't know the implications of that...
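The "after a vowel" idea can be sketched as a simple heuristic: a marked vowel that directly follows another vowel is treated as a diaeresis (mark dropped), anything else as a German umlaut ("e" inserted). This is only an illustration of the suggestion, not what gnparser does; `transliterate_marks` and the vowel set are assumptions.

```python
BASE = {"ä": "a", "ö": "o", "ü": "u"}
VOWELS = set("aeiouyäöü")


def transliterate_marks(text):
    """Drop the mark after a vowel (diaeresis), otherwise expand to vowel + 'e'."""
    out = []
    prev = ""
    for ch in text:
        if ch in BASE:
            if prev.lower() in VOWELS:
                out.append(BASE[ch])        # diaeresis: chloö -> chloo
            else:
                out.append(BASE[ch] + "e")  # umlaut: mü -> mue
        else:
            out.append(ch)
        prev = ch
    return "".join(out)


print(transliterate_marks("Leptochloöpsis virgata"))         # Leptochloopsis virgata
print(transliterate_marks("Ortygospiza atricollis mülleri"))  # Ortygospiza atricollis muelleri
```

The heuristic only looks at the preceding character and only handles lowercase marks, so any German name with an umlaut directly after a vowel would be misread as a diaeresis.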

Archilegt commented 1 year ago

I don't know if parsing is meant to match the Codes. It is mentioned above that names are treated differently among the Codes. As per the ZooCode, Article 32.5.2:

32.5.2. A name published with a diacritic or other mark, ligature, apostrophe, or hyphen, or a
species-group name published as separate words of which any is an abbreviation, is to be corrected.

32.5.2.1. In the case of a diacritic or other mark, the mark concerned is deleted, except that in a
name published before 1985 and based upon a German word, the umlaut sign is deleted from a vowel
and the letter "e" is to be inserted after that vowel (if there is any doubt that the name is based upon a
German word, it is to be so treated).

Examples. nuñezi is corrected to nunezi, and mjøbergi to mjobergi, but mülleri (published before
1985) is corrected to muelleri.

Forcing ñ into n is ZooCode-compliant. Go for it.

For German umlauts, the parser would have to:

read [article metadata] year,
read [article metadata] language,
if year < 1985 and language = DE,
then ä = ae, ö = oe, ü = ue, 
if year < 1985 and language = non-DE,
then ä = a, ö = o, ü = u, ó/ò/ô/ø/ö = o,
if year >= 1985 and language = any,
then ä = a, ö = o, ü = u, ó/ò/ô/ø/ö = o
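The rules above translate fairly directly into code. The sketch below assumes the caller can supply a publication year and a language code, which (as noted later in the thread) the parser normally does not have; `correct_epithet` and both tables are hypothetical names, and ñ is included only because of the nuñezi example in Article 32.5.2.1.

```python
UMLAUT_BASE = {"ä": "a", "ö": "o", "ü": "u"}
# Other marks are simply deleted, regardless of year or language.
OTHER_MARKS = {"ó": "o", "ò": "o", "ô": "o", "ø": "o", "ñ": "n"}


def correct_epithet(epithet, year, language):
    """Apply the year/language rules listed above to a lowercase epithet."""
    german_rule = year < 1985 and language.upper() == "DE"
    out = []
    for ch in epithet:
        if ch in UMLAUT_BASE:
            # Pre-1985 German names get vowel + "e"; everything else loses the mark.
            out.append(UMLAUT_BASE[ch] + ("e" if german_rule else ""))
        else:
            out.append(OTHER_MARKS.get(ch, ch))
    return "".join(out)


print(correct_epithet("mülleri", 1900, "DE"))   # muelleri
print(correct_epithet("mülleri", 1990, "DE"))   # mulleri
print(correct_epithet("mjøbergi", 1900, "SV"))  # mjobergi
```

Note that the Code text speaks of a name "based upon a German word", which is not necessarily the same as the language of the article; the `language` argument here follows the rules as listed in this comment.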

Archilegt commented 1 year ago

If this "German issue" is fixed, we can definitely include it in the Verhoeff paper GNA module.

dimus commented 1 year ago

German issue is "fixed" to the best of our abilities, for example:

http://parser.globalnames.org/?format=html&names=Ortygospiza+atricollis+m%C3%BClleri&with_details=on

GNparser treats names with ü, ö, ä as German names published before 1985. Because names reach the parser without any context, this is the best we could come up with.
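For anyone who wants to try other names, the query string of the link above can be rebuilt with the Python standard library. The sketch only constructs the URL and makes no request; the parameter names are the ones visible in the link, and `parser_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "http://parser.globalnames.org/"


def parser_url(name):
    """Build the web-parser link for one name, percent-encoding non-ASCII as UTF-8."""
    query = urlencode({"format": "html", "names": name, "with_details": "on"})
    return f"{BASE_URL}?{query}"


print(parser_url("Ortygospiza atricollis mülleri"))
# http://parser.globalnames.org/?format=html&names=Ortygospiza+atricollis+m%C3%BClleri&with_details=on
```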

Archilegt commented 1 year ago

One question: Are "wordType" values open to changes? For example: genus to genericName, species to specificEpithet, infraspecies to infraspecificEpithet.

dimus commented 1 year ago

One question: Are "wordType" values open to changes? For example: genus to genericName, species to specificEpithet, infraspecies to infraspecificEpithet.

I decided on shorter names because it saves a bit of bandwidth, and I considered that genus, species and infraspecies would be enough to explain the intention of the field. I guess for a real taxonomist these terms do sound kind of weird.

I can change the terms, @Archilegt, however it would create a backward incompatibility. I did ask a few taxonomists (when I was developing the first version of the parser in 2008) if the shortened values bothered them, and got the answer that it was not a biggie for them. So the values have stayed as they are since then, but maybe your suggestion is better. Can you tell me your motivation for the change?

Archilegt commented 1 year ago

@dimus, great to read about the background! The main motivation for the change is for all of us to speak the same language. In a way, the Codes of Nomenclature are biodiversity informatics standards, and terms and definitions contained in the Codes are being adopted by other standards like DarwinCore. As DC becomes more widely used and understood, it creates a larger community speaking "the language". Any software that reuses the same language would benefit from being better understood by the community. In general, the less "mapping" we need from software to (human or machine) user, the better. :)

dimus commented 1 year ago

@Archilegt I think it is a valid motivation for this change, and the clarity is worth a little more bandwidth. It would create a compatibility problem for people though, and would probably require v2.x.x of the parser.

That means people who use the v1 API would no longer receive improvements automatically. It would also mean we have to keep several API versions running on our side "in perpetuity".

So I will make an issue from your suggestion, mark it with the 'v2' tag, and see if other issues requiring backward-incompatible changes tip the balance so that v2 needs to become a reality.

gnames / gnparser

Figure out how to treat diacritics better #201