gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Hybrid character sometimes missing #174

Closed (tobymarsden closed this issue 3 years ago)

tobymarsden commented 3 years ago

When parsing Magnolia x soulangeana, the words section of the details looks like this:

  "words": [
    {
      "verbatim": "Magnolia",
      "normalized": "Magnolia",
      "wordType": "GENUS",
      "start": 0,
      "end": 8
    },
    {
      "verbatim": "×",
      "normalized": "×",
      "wordType": "HYBRID_CHAR",
      "start": 9,
      "end": 10
    },
    {
      "verbatim": "soulangeana",
      "normalized": "soulangeana",
      "wordType": "SPECIES",
      "start": 11,
      "end": 22
    }
  ],

(I wonder whether verbatim should be x rather than ×, in the same way that a verbatim subsp is kept as typed while its normalized form is subsp., but that's a nitpick.)

However, when parsing Magnolia denudata x Magnolia liliiflora, the output is:

  "words": [
    {
      "verbatim": "Magnolia",
      "normalized": "Magnolia",
      "wordType": "GENUS",
      "start": 0,
      "end": 8
    },
    {
      "verbatim": "denudata",
      "normalized": "denudata",
      "wordType": "SPECIES",
      "start": 9,
      "end": 17
    },
    {
      "verbatim": "",
      "normalized": "",
      "wordType": "HYBRID_CHAR",
      "start": 18,
      "end": 19
    },
    {
      "verbatim": "Magnolia",
      "normalized": "Magnolia",
      "wordType": "GENUS",
      "start": 20,
      "end": 28
    },
    {
      "verbatim": "liliiflora",
      "normalized": "liliiflora",
      "wordType": "SPECIES",
      "start": 29,
      "end": 39
    }
  ]

i.e. the HYBRID_CHAR word has empty verbatim and normalized properties.
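A quick way to spot this symptom in parsed output is to look for words whose offsets cover at least one character while verbatim is empty. The following is only a minimal sketch: the `Word` struct and the `emptyVerbatim` helper are hypothetical stand-ins that mirror the JSON fields above, not gnparser's own types or API.

```go
package main

import "fmt"

// Word mirrors the fields of one entry in the "words" array shown above.
// Hypothetical stand-in for illustration, not gnparser's own type.
type Word struct {
	Verbatim string
	WordType string
	Start    int
	End      int
}

// emptyVerbatim returns the indices of words that span at least one
// character (End > Start) but carry an empty Verbatim value.
func emptyVerbatim(words []Word) []int {
	var bad []int
	for i, w := range words {
		if w.Verbatim == "" && w.End > w.Start {
			bad = append(bad, i)
		}
	}
	return bad
}

func main() {
	// Values copied from the output for "Magnolia denudata x Magnolia liliiflora".
	words := []Word{
		{"Magnolia", "GENUS", 0, 8},
		{"denudata", "SPECIES", 9, 17},
		{"", "HYBRID_CHAR", 18, 19},
		{"Magnolia", "GENUS", 20, 28},
		{"liliiflora", "SPECIES", 29, 39},
	}
	fmt.Println(emptyVerbatim(words)) // prints [2]
}
```

Index 2 is the HYBRID_CHAR entry, which is the only word flagged here.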

The same applies to names like × Sorbopyrus auricularis.

Is this a bug, and if so, would you consider a PR?

dimus commented 3 years ago

Definitely a bug, and yes, a PR would be fantastic if you are up to it.

tobymarsden commented 3 years ago

@dimus PR at https://github.com/gnames/gnparser/pull/175

dimus commented 3 years ago

I see that preprocessing adds to the problem, because it replaces every hybrid character with ×. I will need to think a bit about how to reorganize the code to get the correct verbatim.
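To illustrate why such a substitution hides the original character, here is a rough sketch. It is not gnparser's code: `preprocess` and `sliceRunes` are hypothetical helpers, and it assumes the start and end offsets count runes in the input string. Because one rune replaces one rune, the same offsets applied to the untouched input still recover what the user typed.

```go
package main

import (
	"fmt"
	"strings"
)

// preprocess is a hypothetical stand-in for the substitution step: it
// rewrites a standalone "x" or "X" between spaces to the "×" sign.
func preprocess(s string) string {
	r := strings.NewReplacer(" x ", " × ", " X ", " × ")
	return r.Replace(s)
}

// sliceRunes returns the part of s between start and end, counted in
// runes rather than bytes. Out-of-range offsets yield an empty string.
func sliceRunes(s string, start, end int) string {
	rs := []rune(s)
	if start < 0 || end > len(rs) || start > end {
		return ""
	}
	return string(rs[start:end])
}

func main() {
	input := "Magnolia denudata x Magnolia liliiflora"
	cleaned := preprocess(input)

	// Slicing the preprocessed text yields the substituted sign.
	fmt.Printf("%q\n", sliceRunes(cleaned, 18, 19)) // prints "×"

	// Slicing the original input yields the character as typed.
	fmt.Printf("%q\n", sliceRunes(input, 18, 19)) // prints "x"
}
```

Under these assumptions, taking verbatim from the original input rather than from the preprocessed text would keep x and × distinct.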

dimus commented 3 years ago

The problem was largely caused by technical debt: an unnecessary legacy struct, parser.wordNode, had been shoehorned into parsed.Word. I removed the legacy struct. I also added test_data_cultivars.md to tools/gentest.go to simplify test generation when many changes are introduced. I am adding a section on how to use the tool to CONTRIBUTING.md.