Support graft-chimeras

tobymarsden commented 2 years ago

I'm trying to get gnparser to parse all names in Kew's Plants of the World Online.

I ran into parsing failures on graft-chimera names, e.g.

+ Crataegomespilus
Cytisus purpureus + Laburnum anagyroides
Crataegus + Mespilus

This PR parses these names successfully without affecting any existing test cases, e.g.

{
  "parsed": true,
  "quality": 2,
  "qualityWarnings": [
    {
      "quality": 2,
      "warning": "Named graft-chimera"
    }
  ],
  "verbatim": "+ Crataegomespilus",
  "normalized": "+ Crataegomespilus",
  "canonical": {
    "stemmed": "Crataegomespilus",
    "simple": "Crataegomespilus",
    "full": "+ Crataegomespilus"
  },
  "cardinality": 1,
  "hybrid": "NAMED_GRAFT_CHIMERA",
  "details": {
    "uninomial": {
      "uninomial": "Crataegomespilus"
    }
  },
  "words": [
    {
      "verbatim": "+",
      "normalized": "+",
      "wordType": "GRAFT_CHIMERA_CHAR",
      "start": 0,
      "end": 1
    },
    {
      "verbatim": "Crataegomespilus",
      "normalized": "Crataegomespilus",
      "wordType": "UNINOMIAL",
      "start": 2,
      "end": 18
    }
  ],
  "id": "408e8fc7-fa27-53a6-9eff-37cb779724e4",
  "parserVersion": "test_version"
}

and

{
  "parsed": true,
  "quality": 2,
  "qualityWarnings": [
    {
      "quality": 2,
      "warning": "Graft-chimera formula"
    }
  ],
  "verbatim": "Cytisus purpureus + Laburnum anagyroides",
  "normalized": "Cytisus purpureus + Laburnum anagyroides",
  "canonical": {
    "stemmed": "Cytisus purpure + Laburnum anagyroid",
    "simple": "Cytisus purpureus + Laburnum anagyroides",
    "full": "Cytisus purpureus + Laburnum anagyroides"
  },
  "cardinality": 0,
  "hybrid": "GRAFT_CHIMERA_FORMULA",
  "details": {
    "graftChimeraFormula": [
      {
        "species": {
          "genus": "Cytisus",
          "species": "purpureus"
        }
      },
      {
        "species": {
          "genus": "Laburnum",
          "species": "anagyroides"
        }
      }
    ]
  },
  "words": [
    {
      "verbatim": "Cytisus",
      "normalized": "Cytisus",
      "wordType": "GENUS",
      "start": 0,
      "end": 7
    },
    {
      "verbatim": "purpureus",
      "normalized": "purpureus",
      "wordType": "SPECIES",
      "start": 8,
      "end": 17
    },
    {
      "verbatim": "+",
      "normalized": "+",
      "wordType": "GRAFT_CHIMERA_CHAR",
      "start": 18,
      "end": 19
    },
    {
      "verbatim": "Laburnum",
      "normalized": "Laburnum",
      "wordType": "GENUS",
      "start": 20,
      "end": 28
    },
    {
      "verbatim": "anagyroides",
      "normalized": "anagyroides",
      "wordType": "SPECIES",
      "start": 29,
      "end": 40
    }
  ],
  "id": "a8f8ace8-ba1a-5371-b9d5-73efce81d52c",
  "parserVersion": "test_version"
}
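
The start and end values in "words" read as half-open character offsets into "verbatim". A minimal check of that reading, using only the values from the formula example above (treating them as Python-style slice bounds is an assumption, and `offsets_match` is a hypothetical helper, not part of gnparser):

```python
# Word offsets copied from the graft-chimera formula example above.
verbatim = "Cytisus purpureus + Laburnum anagyroides"
words = [
    ("Cytisus", 0, 7),
    ("purpureus", 8, 17),
    ("+", 18, 19),
    ("Laburnum", 20, 28),
    ("anagyroides", 29, 40),
]


def offsets_match(text, spans):
    """Return True if every (word, start, end) span slices back to its word."""
    return all(text[start:end] == word for word, start, end in spans)


print(offsets_match(verbatim, words))
```

If the offsets were inclusive or byte-based on non-ASCII input, this check would fail, so it is only a sanity test for this ASCII example.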

I've reused the hybrid flag to make the JSON output easier to consume. Graft-chimeras aren't true botanical hybrids, but using the term in its broadest sense seems reasonable, given that the field is a string value that carries the more specific kind anyway.
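
As a rough illustration of what reusing the flag means for a consumer, here is a sketch that branches on the "hybrid" string. The two graft-chimera values come from the output above. The function name, the returned labels, and the assumption that "hybrid" is absent for ordinary names are hypothetical.

```python
import json

# Values of "hybrid" shown in the example output of this PR.
GRAFT_CHIMERA_VALUES = {"NAMED_GRAFT_CHIMERA", "GRAFT_CHIMERA_FORMULA"}


def classify(parsed_json: str) -> str:
    """Classify one parsed-name JSON record by its "hybrid" field.

    Returns "graft-chimera", "hybrid", or "plain" (hypothetical labels).
    Assumes the field is missing or empty for non-hybrid names.
    """
    record = json.loads(parsed_json)
    hybrid = record.get("hybrid")
    if not hybrid:
        return "plain"
    if hybrid in GRAFT_CHIMERA_VALUES:
        return "graft-chimera"
    # Any other non-empty value is treated as a true botanical hybrid.
    return "hybrid"


print(classify('{"verbatim": "+ Crataegomespilus", "hybrid": "NAMED_GRAFT_CHIMERA"}'))
```

A consumer that only tests whether "hybrid" is present keeps working unchanged, while one that cares about the distinction can compare the string.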

I had to adjust the stemmer, and I've added some stemmer-specific tests to cover the change.
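
For orientation, the stemmed form in the formula example keeps the "+" and the genus names and stems only the epithets. A toy sketch of that shape follows. `toy_stem` is a stand-in invented for illustration and is not gnparser's stemmer, and both function names are hypothetical.

```python
def toy_stem(epithet: str) -> str:
    """Stand-in stemmer: strips a few Latin endings. NOT gnparser's algorithm."""
    for suffix in ("us", "es", "um", "a"):
        if epithet.endswith(suffix) and len(epithet) > len(suffix) + 2:
            return epithet[: -len(suffix)]
    return epithet


def stem_formula(canonical: str) -> str:
    """Stem each epithet in a 'Genus epithet + Genus epithet' formula.

    The " + " separator and the genus of each operand are left untouched.
    """
    parts = []
    for operand in canonical.split(" + "):
        genus, *epithets = operand.split()
        parts.append(" ".join([genus, *map(toy_stem, epithets)]))
    return " + ".join(parts)


print(stem_formula("Cytisus purpureus + Laburnum anagyroides"))
```

With this toy suffix list the result matches the "stemmed" value shown in the example output, but that agreement holds only for this input.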

The PR duplicates much of the HybridFormula code because the syntax is so close. I have another branch that refactors things to reuse the HybridFormula objects, but it gave no performance benefit and the code is harder to follow (for me, anyway). If you prefer that approach, though, I can submit a PR from that branch instead.