gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

"ß" not recognized as "Non-standard characters in canonical" but left unparsed instead #90

Closed by dimus 3 years ago

dimus commented 3 years ago

created by @LocoDelAssembly at https://gitlab.com/gogna/gnparser/-/issues/90

Not sure if this is intended?

Here is the example:

$ echo Dreyfusia nüßlini | gnparser -f pretty

{
  "parsed": true,
  "quality": 3,
  "qualityWarnings": [
    [3,"Unparsed tail"]
  ],
  "verbatim": "Dreyfusia nüßlini",
  "normalized": "Dreyfusia",
  "cardinality": 1,
  "canonicalName": {
    "full": "Dreyfusia",
    "simple": "Dreyfusia",
    "stem": "Dreyfusia"
  },
  "details": [
    {
      "uninomial": {
        "value": "Dreyfusia"
      }
    }
  ],
  "positions": [
    ["uninomial",0,9]
  ],
  "surrogate": false,
  "virus": false,
  "hybrid": false,
  "bacteria": false,
  "unparsedTail": " nüßlini",
  "nameStringId": "27679e50-c41b-5a3d-b619-d378d503be8c",
  "parserVersion": "v0.14.1"
}

Expected:

{
  "parsed": true,
  "quality": 2,
  "qualityWarnings": [
    [2,"Non-standard characters in canonical"]
  ],
  "verbatim": "Dreyfusia nüßlini",
  "normalized": "Dreyfusia nueslini", // (*)
  "cardinality": 2,
  "canonicalName": {
    "full": "Dreyfusia nueslini",
    "simple": "Dreyfusia nueslini",
    "stem": "Dreyfusia nueslin"
  },
  "details": [
    {
      "genus": {
        "value": "Dreyfusia"
      },
      "specificEpithet": {
        "value": "nueslini"
      }
    }
  ],
  "positions": [
    ["genus",0,9],
    ["specificEpithet",10,17]
  ],
  "surrogate": false,
  "virus": false,
  "hybrid": false,
  "bacteria": false,
  "nameStringId": "ddf71a85-2e40-503f-8066-329ef77ac95a",
  "parserVersion": "v0.14.1"
}

(*) Not sure if "s" is the correct replacement for "ß". More info: https://en.wikipedia.org/wiki/%C3%9F
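A side note on the "positions" field in the expected output: the span 10 to 17 for "nüßlini" only works out if offsets are counted in characters (runes) of the verbatim string rather than in bytes, since "ü" and "ß" each take two bytes in UTF-8. The Go sketch below illustrates that difference. The helper name runeSpan is hypothetical and is not part of gnparser; how gnparser actually computes positions is not shown in this thread.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// runeSpan is a hypothetical helper (not gnparser API). It returns the
// rune-based start and end offsets of the first occurrence of word in
// verbatim, or (-1, -1) if word does not occur.
func runeSpan(verbatim, word string) (int, int) {
	byteStart := strings.Index(verbatim, word)
	if byteStart < 0 {
		return -1, -1
	}
	// Convert the byte offset into a rune offset.
	start := utf8.RuneCountInString(verbatim[:byteStart])
	return start, start + utf8.RuneCountInString(word)
}

func main() {
	start, end := runeSpan("Dreyfusia nüßlini", "nüßlini")
	fmt.Println(start, end)     // 10 17
	fmt.Println(len("nüßlini")) // 9 (bytes, not characters)
}
```

Counting bytes instead would put the end of the epithet at 19, which would not match the expected output above.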

dimus commented 3 years ago

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/43

mentioned in commit 2e0be2772f4f54974411f2a0aed63ae2276e473b

dimus commented 3 years ago

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/44

Yes, my understanding is that "ß" is usually transliterated as "ss".

dimus commented 3 years ago

created by @typophyllum at https://gitlab.com/gogna/gnparser/-/issues/45

The Aphid Species File has "nuesslini" as the correct spelling:

http://aphid.speciesfile.org/Common/basic/Taxa.aspx?TaxonNameID=1159392

dimus commented 3 years ago

created by @LocoDelAssembly at https://gitlab.com/gogna/gnparser/-/issues/46

After discussing this with @typophyllum (Holger Braun): "ss" is likely the more correct replacement. With extra context it could sometimes be "s", and "ü" could also be "ue".
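The replacements discussed in this thread ("ß" to "ss", "ü" to "ue") can be sketched in Go roughly as follows. This is only an illustration under stated assumptions: the function name transliterate and the two-entry mapping are hypothetical, it is not the code from the commit mentioned above, and it ignores the context-dependent cases where "s" might be preferred.

```go
package main

import (
	"fmt"
	"strings"
)

// nonStandard is a hypothetical mapping limited to the two characters
// discussed in this issue; it is not gnparser's actual table.
var nonStandard = strings.NewReplacer(
	"ß", "ss",
	"ü", "ue",
)

// transliterate (hypothetical name) returns the name with the characters
// above replaced, and reports whether anything changed. A caller could use
// the flag to attach a warning such as
// "Non-standard characters in canonical".
func transliterate(name string) (string, bool) {
	out := nonStandard.Replace(name)
	return out, out != name
}

func main() {
	out, changed := transliterate("Dreyfusia nüßlini")
	fmt.Println(out, changed) // Dreyfusia nuesslini true
}
```

With this mapping the result agrees with the "nuesslini" spelling given by the Aphid Species File.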