gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Support graft-chimeras #194

Closed tobymarsden closed 2 years ago

tobymarsden commented 2 years ago

I'm trying to get gnparser to parse all names in Kew's Plants of the World Online.

I bumped into a parsing failure when dealing with graft-chimeras, e.g.

+ Crataegomespilus
Cytisus purpureus + Laburnum anagyroides
Crataegus + Mespilus

This PR parses these names successfully without any impact on existing test cases, e.g.

{
  "parsed": true,
  "quality": 2,
  "qualityWarnings": [
    {
      "quality": 2,
      "warning": "Named graft-chimera"
    }
  ],
  "verbatim": "+ Crataegomespilus",
  "normalized": "+ Crataegomespilus",
  "canonical": {
    "stemmed": "Crataegomespilus",
    "simple": "Crataegomespilus",
    "full": "+ Crataegomespilus"
  },
  "cardinality": 1,
  "hybrid": "NAMED_GRAFT_CHIMERA",
  "details": {
    "uninomial": {
      "uninomial": "Crataegomespilus"
    }
  },
  "words": [
    {
      "verbatim": "+",
      "normalized": "+",
      "wordType": "GRAFT_CHIMERA_CHAR",
      "start": 0,
      "end": 1
    },
    {
      "verbatim": "Crataegomespilus",
      "normalized": "Crataegomespilus",
      "wordType": "UNINOMIAL",
      "start": 2,
      "end": 18
    }
  ],
  "id": "408e8fc7-fa27-53a6-9eff-37cb779724e4",
  "parserVersion": "test_version"
}

and

{
  "parsed": true,
  "quality": 2,
  "qualityWarnings": [
    {
      "quality": 2,
      "warning": "Graft-chimera formula"
    }
  ],
  "verbatim": "Cytisus purpureus + Laburnum anagyroides",
  "normalized": "Cytisus purpureus + Laburnum anagyroides",
  "canonical": {
    "stemmed": "Cytisus purpure + Laburnum anagyroid",
    "simple": "Cytisus purpureus + Laburnum anagyroides",
    "full": "Cytisus purpureus + Laburnum anagyroides"
  },
  "cardinality": 0,
  "hybrid": "GRAFT_CHIMERA_FORMULA",
  "details": {
    "graftChimeraFormula": [
      {
        "species": {
          "genus": "Cytisus",
          "species": "purpureus"
        }
      },
      {
        "species": {
          "genus": "Laburnum",
          "species": "anagyroides"
        }
      }
    ]
  },
  "words": [
    {
      "verbatim": "Cytisus",
      "normalized": "Cytisus",
      "wordType": "GENUS",
      "start": 0,
      "end": 7
    },
    {
      "verbatim": "purpureus",
      "normalized": "purpureus",
      "wordType": "SPECIES",
      "start": 8,
      "end": 17
    },
    {
      "verbatim": "+",
      "normalized": "+",
      "wordType": "GRAFT_CHIMERA_CHAR",
      "start": 18,
      "end": 19
    },
    {
      "verbatim": "Laburnum",
      "normalized": "Laburnum",
      "wordType": "GENUS",
      "start": 20,
      "end": 28
    },
    {
      "verbatim": "anagyroides",
      "normalized": "anagyroides",
      "wordType": "SPECIES",
      "start": 29,
      "end": 40
    }
  ],
  "id": "a8f8ace8-ba1a-5371-b9d5-73efce81d52c",
  "parserVersion": "test_version"
}
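The `words` entries carry `start` and `end` offsets into `verbatim`. As a minimal sketch, assuming those offsets are zero-based, end-exclusive character positions (consistent with the two outputs above, though the thread doesn't state it), slicing should recover each word. The `words_from_offsets` helper is hypothetical and not part of gnparser:

```python
import json

# Trimmed copy of the second output above; only the fields used here are kept.
OUTPUT = """
{
  "verbatim": "Cytisus purpureus + Laburnum anagyroides",
  "words": [
    {"verbatim": "Cytisus", "wordType": "GENUS", "start": 0, "end": 7},
    {"verbatim": "purpureus", "wordType": "SPECIES", "start": 8, "end": 17},
    {"verbatim": "+", "wordType": "GRAFT_CHIMERA_CHAR", "start": 18, "end": 19},
    {"verbatim": "Laburnum", "wordType": "GENUS", "start": 20, "end": 28},
    {"verbatim": "anagyroides", "wordType": "SPECIES", "start": 29, "end": 40}
  ]
}
"""


def words_from_offsets(parsed):
    """Slice each word out of the verbatim name using its start/end offsets."""
    name = parsed["verbatim"]
    return [(w["wordType"], name[w["start"]:w["end"]]) for w in parsed["words"]]


parsed = json.loads(OUTPUT)
sliced = words_from_offsets(parsed)
for (word_type, text), word in zip(sliced, parsed["words"]):
    # Assumption: offsets are zero-based and end-exclusive.
    assert text == word["verbatim"]
    print(f"{word_type:<20}{text}")
```

The same check holds for the "+ Crataegomespilus" output, where the "+" occupies offsets 0 to 1 and the uninomial 2 to 18.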

I've reused the hybrid flag to make consumption of the JSON output easier; notwithstanding that these aren't true botanical hybrids, it seems reasonable to use the term in the broadest sense given that it's a string value with more details anyway.
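To illustrate what that means for a consumer, here is a rough sketch of branching on the `hybrid` string. The records are trimmed from the outputs above; the `describe` function and the `GRAFT_CHIMERA_KINDS` set are hypothetical names for this example, and other values of `hybrid` are simply treated as "other" since the thread doesn't list them:

```python
import json

# Hypothetical grouping for this sketch; only these two values appear in the thread.
GRAFT_CHIMERA_KINDS = {"NAMED_GRAFT_CHIMERA", "GRAFT_CHIMERA_FORMULA"}


def describe(record):
    """Summarise one parsed-name record by inspecting its `hybrid` string."""
    kind = record.get("hybrid")
    if kind not in GRAFT_CHIMERA_KINDS:
        # Field absent, or some other hybrid value not covered here.
        return "other"
    if kind == "GRAFT_CHIMERA_FORMULA":
        parts = record["details"]["graftChimeraFormula"]
        return f"graft-chimera formula with {len(parts)} components"
    return "named graft-chimera: " + record["canonical"]["simple"]


named = json.loads("""
{"hybrid": "NAMED_GRAFT_CHIMERA",
 "canonical": {"simple": "Crataegomespilus", "full": "+ Crataegomespilus"}}
""")
formula = json.loads("""
{"hybrid": "GRAFT_CHIMERA_FORMULA",
 "details": {"graftChimeraFormula": [
   {"species": {"genus": "Cytisus", "species": "purpureus"}},
   {"species": {"genus": "Laburnum", "species": "anagyroides"}}]}}
""")

print(describe(named))
print(describe(formula))
print(describe({"canonical": {"simple": "Cytisus purpureus"}}))
```

A consumer that already switches on `hybrid` only needs two extra cases, which is the convenience the single field is meant to provide.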

I had to adjust the stemmer, and I added some stemmer-specific tests to cover the change.

The PR duplicates much of the HybridFormula code as the syntax is so close; I've another branch which refactors things to reuse the HybridFormula objects, but there was no performance benefit and the code is harder to follow (for me, anyway). If you prefer that approach, though, I can submit a PR from that branch instead.

dimus commented 2 years ago

> I've reused the hybrid flag to make consumption of the JSON output easier; notwithstanding that these aren't true botanical hybrids, it seems reasonable to use the term in the broadest sense given that it's a string value with more details anyway.

I think it is OK, because gnparser in general uses very 'broad' semantics in other parts; for example, the virus flag includes everything that is not cellular. I think v1 of GNparser is about practicality and covering its domain, and v2 might become more scientifically accurate in its definitions.

tobymarsden commented 2 years ago

@dimus Awesome, thanks for looking at this!

dimus commented 2 years ago

@tobymarsden I asked around and looked at the nomenclatural codes. It seems that graft-chimeras fall entirely within the realm of the cultivar code, so it would be logical to parse them only when the cultivar flag is on. Can you make this change in your PR so that they are 'visible' only if the cultivar flag is used? I think their tests should also be in the cultivar test file.

dimus commented 2 years ago

I think that if people go through names that are supposed to be in an ICN context, the parser should break on graft-chimera names.

tobymarsden commented 2 years ago

@dimus Makes perfect sense. I'll try to find some time this week to make the changes to the PR.

dimus commented 2 years ago

sounds great @tobymarsden

tobymarsden commented 2 years ago

@dimus The graft-chimera support is now contingent on the -C flag, and parsing breaks on graft chimeras without it.

I've updated the tests so the parsed graft-chimeras are in the cultivars file, and the main test file shows "parsed":false for these names.
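The resulting behaviour can be pictured with a toy stand-in. This is not gnparser's grammar or API: `graft_chimera_result` is a hypothetical function that only looks for a "+" sign, and the `cultivars` argument stands in for the -C flag. The returned keys mirror the `parsed` and `hybrid` fields from the JSON shown earlier:

```python
def graft_chimera_result(name, cultivars=False):
    """Toy model of the flag gating described above (not the real parser)."""
    stripped = name.strip()
    named = stripped.startswith("+")          # e.g. "+ Crataegomespilus"
    formula = not named and " + " in stripped  # e.g. "Crataegus + Mespilus"
    if not (named or formula):
        return None  # not a graft-chimera; outside the scope of this sketch
    if not cultivars:
        # Without the cultivar flag, parsing breaks on graft-chimeras.
        return {"parsed": False}
    kind = "NAMED_GRAFT_CHIMERA" if named else "GRAFT_CHIMERA_FORMULA"
    return {"parsed": True, "hybrid": kind}


for name in ("+ Crataegomespilus", "Crataegus + Mespilus"):
    print(name, graft_chimera_result(name), graft_chimera_result(name, cultivars=True))
```

Splitting the expectations this way matches the test layout described above: the same input appears in both test files, with the outcome depending only on the flag.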

dimus commented 2 years ago

@tobymarsden perfect! Trying it now...

dimus commented 2 years ago

It all looks good to me, @tobymarsden, great work, merging...