gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Author strings with y in their name are not parsed #93

Closed dimus closed 3 years ago

dimus commented 3 years ago

created by @gdower at https://gitlab.com/gogna/gnparser/-/issues/93

In the scientific name Struthiopteris fallax (Lange) S.Molino, Gabriel y Galán & Wasowicz, the "y Galán & Wasowicz" component becomes an unparsed tail. I realize this might not be easy to fix, because a fix might produce a lot of false matches in BHL. We were able to resolve the issue on our end, so no worries if it can't be fixed.

https://parser.globalnames.org/?q=Struthiopteris+fallax+%28Lange%29+S.Molino%2C+Gabriel+y+Gal%C3%A1n+%26+Wasowicz

{
  "parsed": true,
  "quality": 3,
  "qualityWarnings": [
    [3,"Unparsed tail"]
  ],
  "verbatim": "Struthiopteris fallax (Lange) S.Molino, Gabriel y Galán \u0026 Wasowicz",
  "normalized": "Struthiopteris fallax (Lange) S. Molino \u0026 Gabriel",
  "cardinality": 2,
  "canonicalName": {
    "full": "Struthiopteris fallax",
    "simple": "Struthiopteris fallax",
    "stem": "Struthiopteris fallax"
  },
  "authorship": "(Lange) S. Molino \u0026 Gabriel",
  "details": [
    {
      "genus": {
        "value": "Struthiopteris"
      },
      "specificEpithet": {
        "value": "fallax",
        "authorship": {
          "value": "(Lange) S. Molino \u0026 Gabriel",
          "basionymAuthorship": {
            "authors": [
              "Lange"
            ]
          },
          "combinationAuthorship": {
            "authors": [
              "S. Molino",
              "Gabriel"
            ]
          }
        }
      }
    }
  ],
  "positions": [
    ["genus",0,14],
    ["specificEpithet",15,21],
    ["authorWord",23,28],
    ["authorWord",30,32],
    ["authorWord",32,38],
    ["authorWord",40,47]
  ],
  "surrogate": false,
  "virus": false,
  "hybrid": false,
  "bacteria": false,
  "unparsedTail": " y Galán \u0026 Wasowicz",
  "nameStringId": "ac36333c-ad8f-5389-abe4-fe1bea5c7a92",
  "parserVersion": "v0.14.1"
}
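
For anyone who needs to spot this case on the consumer side, the output above can be checked for an unparsed tail that begins with a standalone "y". This is only a rough sketch: it reads the `unparsedTail` field from a trimmed copy of the JSON shown above, and `has_y_tail` is a hypothetical helper name, not part of gnparser.

```python
import json

# Trimmed copy of the parser output above; only the fields used here are kept.
raw = r'''
{
  "parsed": true,
  "quality": 3,
  "qualityWarnings": [[3, "Unparsed tail"]],
  "authorship": "(Lange) S. Molino \u0026 Gabriel",
  "unparsedTail": " y Galán \u0026 Wasowicz"
}
'''


def has_y_tail(result):
    """Return True if the unparsed tail starts with a standalone 'y' word."""
    tail = result.get("unparsedTail") or ""
    words = tail.split()
    return bool(words) and words[0] == "y"


result = json.loads(raw)
print(has_y_tail(result))
```

Names flagged this way can then be reviewed or post-processed separately, which is one possible way to work around the problem downstream.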

Re: https://github.com/CatalogueOfLife/testing/issues/2

dimus commented 3 years ago

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/43

It looks like this pattern, if not common, is at least not unique:

Caloptenopsis crassiusculus (Martínez y Fernández-Castillo, 1896)
Caluromytrema martindelcampoi Lamothe y Pineda, 1989
Capillaria xochimilcensis Caballero y Zerecero, 1943
Carabus (Tanaocarabus) hendrichsi Bolvar y Pieltain, Rotger & Coronado-G 1967
Didymosella acutirostris Faura y Sans & Canu 1917
Dufourea fuenti Dusmet y Alonso, 1935

So it has to be solved in a general way.
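
To illustrate what a general rule could look like, here is a small sketch that treats a lowercase "y" as part of a compound surname when it sits between two capitalised words. It is not how gnparser works internally; `compound_surnames` is a hypothetical helper, and the rule is an assumption that would still need checking against false matches.

```python
def compound_surnames(name_string):
    """Find 'Surname y Surname' pairs joined by a standalone lowercase 'y'."""
    words = name_string.split()
    found = []
    for i in range(1, len(words) - 1):
        if words[i] != "y":
            continue
        left = words[i - 1].lstrip("(")
        right = words[i + 1].rstrip("),")
        # A comma or bracket right before 'y' suggests two separate elements.
        if left.endswith((",", ")")):
            continue
        if left[:1].isupper() and right[:1].isupper():
            found.append(f"{left} y {right}")
    return found


examples = [
    "Caluromytrema martindelcampoi Lamothe y Pineda, 1989",
    "Didymosella acutirostris Faura y Sans & Canu 1917",
    "Struthiopteris fallax (Lange) S.Molino, Gabriel y Galán & Wasowicz",
]
for example in examples:
    print(compound_surnames(example))
```

Checking only the first letter keeps accented and hyphenated surnames such as Fernández-Castillo working without a hand-written character class.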

dimus commented 3 years ago

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/44

I think it is fixable. I suspect that y is a rare prefix, so maybe I need to start a list of verbatim authors.
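
A minimal sketch of the verbatim-list idea, assuming the list is a plain set of exact author strings: an author containing " y " is accepted only if it appears in the list, which limits false matches. The names `VERBATIM_AUTHORS` and `split_team` are hypothetical, and the entries are taken from the examples in this thread.

```python
import re

# Hypothetical starter list of authors whose names contain a standalone 'y'.
VERBATIM_AUTHORS = {
    "Gabriel y Galán",
    "Faura y Sans",
    "Lamothe y Pineda",
}


def split_team(team):
    """Split an author team on ',' and '&'; return (accepted, rejected)."""
    parts = [p for p in re.split(r"\s*[,&]\s*", team.strip()) if p]
    accepted, rejected = [], []
    for part in parts:
        # Authors containing ' y ' must be listed verbatim to be accepted.
        if " y " in part and part not in VERBATIM_AUTHORS:
            rejected.append(part)
        else:
            accepted.append(part)
    return accepted, rejected


print(split_team("S.Molino, Gabriel y Galán & Wasowicz"))
print(split_team("Smith y Jones & Brown"))
```

The trade-off is that every new compound surname has to be added to the list by hand, in exchange for not misreading stray "y" tokens in noisy text.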