gnames / gnfinder

GNfinder finds scientific names in UTF8 texts, PDF files, MS Word/Excel documents, URLs etc.
MIT License

Wrong output when querying Catalogue of Life. #48

Closed oolonek closed 3 years ago

oolonek commented 4 years ago

Observed problem

It looks like the wrong output is returned when matching against the Catalogue of Life.

Example

Example on the entry Solanum tuberosum.

Expected output

According to the Catalogue of Life web page, the expected output is Solanum tuberosum (this is indeed the accepted name). See the Catalogue of Life output for the 'Solanum tuberosum' query.

Actual output

However, the output of GNFinder is Solanum etuberosum (see below).

{
      "type": "Binomial",
      "verbatim": "(Solanum tuberosum),",
      "name": "Solanum tuberosum",
      "odds": 815712.9591463371,
      "odds_details": {
        "Name": {
          "abbr": {
            "false": 0.8679430877999654
          },
          "uniEnd3": {
            "num": 23.597054331788886
          },
          "spLen": {
            "9": 10.081642477424332
          },
          "spDict": {
            "WhiteSpecies": 5628.6125203841275
          },
          "spEnd3": {
            "sum": 172.68753397354592
          },
          "PriorOdds": {
            "true": 0.1
          },
          "uniLen": {
            "7": 0.8324223452545503
          },
          "uniDict": {
            "GreyGenus": 1.4684399638974754
          }
        }
      },
      "start": 29336,
      "end": 29356,
      "annotation": "",
      "verification": {
        "dataSourceId": 1,
        "dataSourceTitle": "Catalogue of Life",
        "taxonId": "6208dd5855b41dfa4f99a4a2d0a55854",
        "matchedName": "Solanum tuberosum Bert. ex Walp.",
        "currentName": "Solanum etuberosum Lindl.",
        "isSynonym": true,
        "classificationPath": "Plantae|Tracheophyta|Magnoliopsida|Solanales|Solanaceae|Solanum|Solanum etuberosum",
        "dataSourcesNum": 27,
        "dataSourceQuality": "HasCuratedSources",
        "matchType": "ExactCanonicalMatch",
        "preferredResults": [
          {
            "dataSourceId": 11,
            "dataSourceTitle": "GBIF Backbone Taxonomy",
            "nameId": "4261d820-2c48-5313-a720-1d90bedc0c6a",
            "name": "Solanum tuberosum Bertero",
            "taxonId": "8555981"
          }
        ],
        "retries": 1
      }
}
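
A quick way to spot records like this one in gnfinder output is to compare the detected name with the verification's currentName. The sketch below is only an illustration: it assumes the record has been loaded into Python with the field names shown in the output above, and the helper `resolves_elsewhere` is hypothetical, not part of gnfinder.

```python
import json

# Trimmed copy of the record above; only the fields the check needs.
record = json.loads("""
{
  "name": "Solanum tuberosum",
  "verification": {
    "matchedName": "Solanum tuberosum Bert. ex Walp.",
    "currentName": "Solanum etuberosum Lindl.",
    "isSynonym": true
  }
}
""")


def resolves_elsewhere(rec):
    """Return True when the match is a synonym whose current name
    does not start with the detected canonical name."""
    ver = rec.get("verification") or {}
    current = ver.get("currentName", "")
    return bool(ver.get("isSynonym")) and not current.startswith(rec["name"])


print(resolves_elsewhere(record))
```

A flagged record is not necessarily wrong (genuine synonyms are flagged too); it only marks names worth checking by hand.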

Possible explanation

It appears that, when matching against the Catalogue of Life, GNFinder returns the first row of the result list. @adafede observed that Catalogue of Life entries are ordered first by rank and then alphabetically (in this case Solanum tuberosum Bert. ex Walp. > Solanum tuberosum L. > Solanum tuberosum Poepp. ex Walp.).

Expected behaviour of GNFinder

In these cases, first filter by Name status = Accepted name and then return the corresponding output. How could this be done? Is it doable on the GNFinder side, or should it be taken care of at Catalogue of Life?

Many thanks

Note that this behaviour is observed for a large number of entries. Another example: a query for Pisonia grandis (accepted name) returns Pisonia umbellifera.

{
      "type": "Binomial",
      "verbatim": "Pisonia grandis",
      "name": "Pisonia grandis",
      "odds": 10804591456.75299,
      "odds_details": {
        "Name": {
          "spLen": {
            "7": 4.425451988126818
          },
          "spDict": {
            "WhiteSpecies": 5628.6125203841275
          },
          "spEnd3": {
            "dis": 105.1141511143323
          },
          "PriorOdds": {
            "true": 0.1
          },
          "uniLen": {
            "7": 0.8324223452545503
          },
          "uniDict": {
            "WhiteGenus": 20194.430603370172
          },
          "abbr": {
            "false": 0.8679430877999654
          },
          "uniEnd3": {
            "nia": 2.8282746815509467
          }
        }
      },
      "start": 23593,
      "end": 23608,
      "annotation": "",
      "verification": {
        "dataSourceId": 1,
        "dataSourceTitle": "Catalogue of Life",
        "taxonId": "c4b0b41c2961b29ea3b447b6b903ad68",
        "matchedName": "Pisonia grandis A.Cunn. ex Hook. fil.",
        "currentName": "Pisonia umbellifera (J. \u0026 G. Forst.) Seem.",
        "isSynonym": true,
        "classificationPath": "Plantae|Tracheophyta|Magnoliopsida|Caryophyllales|Nyctaginaceae|Pisonia|Pisonia umbellifera",
        "dataSourcesNum": 18,
        "dataSourceQuality": "HasCuratedSources",
        "matchType": "ExactCanonicalMatch",
        "preferredResults": [
          {
            "dataSourceId": 11,
            "dataSourceTitle": "GBIF Backbone Taxonomy",
            "nameId": "c331072e-c786-5d82-bc3c-c4ff938d6250",
            "name": "Pisonia grandis A.Cunn.",
            "taxonId": "8638411"
          }
        ],
        "retries": 1
      }
}

dimus commented 4 years ago

Hm, I wonder if the problem comes from the format of the DWCA file generated by the CoL team. I will try to figure out with them what is happening.

dimus commented 4 years ago

I am looking at the Catalogue of Life result for Solanum tuberosum, and it shows that Solanum tuberosum Bert. ex Walp. is an ambiguous synonym of the currently accepted name Solanum etuberosum Lindl. This result seems to correspond to the output you got from gnfinder. Am I missing something?

Adafede commented 4 years ago

http://www.catalogueoflife.org/annual-checklist/2019/search/all/key/solanum+tuberosum/fossil/1/match/1

Look at the first three results (in alphabetical order): you do get Bert. first, but the entry you want is the second one. The one everyone wants from gnfinder is Solanum tuberosum L. (accepted).

oolonek commented 4 years ago

Link here http://www.catalogueoflife.org/annual-checklist/2019/search/all/key/solanum+tuberosum/fossil/1/match/1

dimus commented 4 years ago

So we need a better way of handling homonyms. Maybe we need to return all matches to the canonical form Solanum tuberosum?

Adafede commented 4 years ago

Yes, exactly. Or, even better: if there is an accepted name (like Solanum tuberosum L.), return it; if not, return the ambiguous synonyms, and do not return them in alphabetical order, which leads to inconsistency.

Thank you very much for your help!

dimus commented 4 years ago

Bad news -- the person who wrote verification service left the project
Good news -- we do need a rewrite of his service.
Bad news -- it will take a few months.
Good news -- we started the rewrite already.

So I think in the new code the behavior should be:

In this particular case it would mean:

Solanum tuberosum L.

Solanum tuberosum Bert. ex Walp.

Solanum tuberosum Poepp. ex Walp.
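
The ordering above could be produced by a sort key that puts accepted names ahead of synonyms and only then falls back to alphabetical order. This is a minimal sketch, not the actual verifier code: the `Candidate` type and the `status` strings are made up for illustration, and the status of the Poepp. ex Walp. entry is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    status: str  # hypothetical labels: "accepted" or "synonym"


candidates = [
    Candidate("Solanum tuberosum Bert. ex Walp.", "synonym"),
    Candidate("Solanum tuberosum L.", "accepted"),
    # Status assumed for illustration; not stated in this thread.
    Candidate("Solanum tuberosum Poepp. ex Walp.", "synonym"),
]


def rank(cands):
    """Accepted names first, then alphabetical within each group."""
    return sorted(cands, key=lambda c: (c.status != "accepted", c.name))


for c in rank(candidates):
    print(c.status, c.name)
```

Sorting on the tuple keeps the result deterministic even when several homonyms share the same status.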

Adafede commented 4 years ago

Yes, I think it would be ideal. Do you agree @oolonek ?

oolonek commented 4 years ago

Yes. Actually, I think that the goal of the script is to return the Accepted Name, and there should be only one. So I don't see why other homonyms should be returned?

Expected behaviour

So, to summarize: when no botanist names are specified in the input, just return the one and only Accepted Name.
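
Put as code, the rule could look roughly like this. It is only a sketch of the proposal, with hypothetical names (`pick`, the `(name, status)` tuples) and an assumed status for the Poepp. ex Walp. entry: an input that carries authorship is matched on the full name string, while a bare canonical name resolves to the single accepted name.

```python
CANONICAL = "Solanum tuberosum"

# (full name, status); statuses are illustrative labels, and the one
# for the Poepp. ex Walp. entry is assumed.
CANDIDATES = [
    ("Solanum tuberosum Bert. ex Walp.", "synonym"),
    ("Solanum tuberosum L.", "accepted"),
    ("Solanum tuberosum Poepp. ex Walp.", "synonym"),
]


def pick(query, candidates=CANDIDATES, canonical=CANONICAL):
    """Return the full name to report for `query`, or None."""
    query = " ".join(query.split())  # normalise whitespace
    if query != canonical:
        # Authorship given: require an exact full-name match.
        for name, _status in candidates:
            if name == query:
                return name
        return None
    # No authorship: return the accepted name only if it is unique.
    accepted = [name for name, status in candidates if status == "accepted"]
    return accepted[0] if len(accepted) == 1 else None


print(pick("Solanum tuberosum"))
print(pick("Solanum tuberosum Bert. ex Walp."))
```

Returning None when there is no unique accepted name leaves the ambiguous case to the caller instead of silently picking the first row.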

oolonek commented 4 years ago

Bad news -- the person who wrote verification service left the project
Good news -- we do need a rewrite of his service.
Bad news -- it will take a few months.
Good news -- we started the rewrite already.

emotionalrollercoaster

Happy to have your support here !

oolonek commented 4 years ago

For info, here are the outputs of the following two gnfinder commands (with and without the 3-token detection):

Commands

gnfinder find -c -l eng -s "1,11"

and

gnfinder find -c -l eng -s "1,11" -t 3

Input

input is a potato 1 = Solanum tuberosum L.
input is a potato 2 = Solanum tuberosum Bert. ex Walp.
input is a potato 3 = Solanum tuberosum Poepp. ex Walp.
input is a classical plain ol' potato = Solanum tuberosum

Outputs

Giving respectively:

gnfound_potato.txt gnfound_t3_potato.txt

Observations:

The -t argument does allow catching the botanists' initials; however, this doesn't change the outputs.

dimus commented 4 years ago

Connected this issue to https://github.com/gnames/gnames/issues/20

dimus commented 3 years ago

Fixed in v0.12.0 with incorporation of https://verifier.globalnames.org as the verification service:

❯ echo "Solanum tuberosum" |gnfinder -v -f pretty
{
  "metadata": {
    "date": "2021-04-25T20:33:01.625541101-05:00",
    "gnfinderVersion": "v0.11.1-21-gf30f9d0",
    "withBayes": true,
    "withVerification": true,
    "tokensAround": 0,
    "language": "eng",
    "detectLanguage": false,
    "totalWords": 2,
    "totalCandidates": 1,
    "totalNames": 1
  },
  "names": [
    {
      "cardinality": 2,
      "verbatim": "Solanum tuberosum",
      "name": "Solanum tuberosum",
      "oddsLog10": 10.1178674169311,
      "start": 0,
      "end": 17,
      "annotationNomenType": "NO_ANNOT",
      "verification": {
        "inputId": "d70bae59-b6df-5d17-9306-1bda02ede69c",
        "input": "Solanum tuberosum",
        "matchType": "Exact",
        "bestResult": {
          "dataSourceId": 1,
          "dataSourceTitleShort": "Catalogue of Life",
          "curation": "Curated",
          "recordId": "2777000",
          "localId": "cedff92af8e897332c00c5434f3a6528",
          "outlink": "http://www.catalogueoflife.org/annual-checklist/2019/details/species/id/cedff92af8e897332c00c5434f3a6528",
          "entryDate": "2020-06-15",
          "matchedName": "Solanum tuberosum L.",
          "matchedCardinality": 2,
          "matchedCanonicalSimple": "Solanum tuberosum",
          "matchedCanonicalFull": "Solanum tuberosum",
          "currentRecordId": "2777000",
          "currentName": "Solanum tuberosum L.",
          "currentCardinality": 2,
          "currentCanonicalSimple": "Solanum tuberosum",
          "currentCanonicalFull": "Solanum tuberosum",
          "isSynonym": false,
          "classificationPath": "Plantae|Tracheophyta|Magnoliopsida|Solanales|Solanaceae|Solanum|Solanum tuberosum",
          "classificationRanks": "kingdom|phylum|class|order|family|genus|species",
          "classificationIds": "3939764|3942634|3942724|3943023|3943027|4187115|2777000",
          "editDistance": 0,
          "stemEditDistance": 0,
          "matchType": "Exact"
        },
        "dataSourcesNum": 29,
        "curation": "Curated"
      }
    }
  ]
}
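
For anyone scripting against the new output, the relevant fields now sit under verification.bestResult. Here is a small sketch, assuming the JSON above has been saved or piped into Python; only a trimmed copy is embedded, and `summarize` is a hypothetical helper rather than anything shipped with gnfinder.

```python
import json

# Trimmed copy of the output above.
output = json.loads("""
{
  "names": [
    {
      "name": "Solanum tuberosum",
      "verification": {
        "matchType": "Exact",
        "bestResult": {
          "dataSourceTitleShort": "Catalogue of Life",
          "matchedName": "Solanum tuberosum L.",
          "currentName": "Solanum tuberosum L.",
          "isSynonym": false
        }
      }
    }
  ]
}
""")


def summarize(doc):
    """Yield (detected name, current name, is synonym) per verified name."""
    for item in doc.get("names", []):
        best = (item.get("verification") or {}).get("bestResult")
        if best:
            yield item["name"], best["currentName"], best["isSynonym"]


for row in summarize(output):
    print(row)
```

Names without a bestResult are skipped, so unverified detections do not raise a KeyError.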