gnames / gnfinder

GNfinder finds scientific names in UTF8 texts, PDF files, MS Word/Excel documents, URLs etc.
MIT License

Document false positive Pithopus inermis #128

Open Archilegt opened 2 years ago

Archilegt commented 2 years ago

Document the false positive Pithopus inermis on page https://www.biodiversitylibrary.org/page/663902. The name does not occur on that page. If we can figure out what went wrong, maybe we can fix it.

Archilegt commented 2 years ago

Maybe "petiolis inermibus" or a spelling variant is producing the false positive.

dimus commented 2 years ago

I think the related output from gnfinder is this one:

    {
      "cardinality": 2,
      "verbatim": "Petrolus inermis,",
      "name": "Petrolus inermis",
      "oddsLog10": 11.983664170973137,
      "oddsDetails": [
        {
          "feature": "spDict: inSpecies",
          "odds": 8904.045433955427
        },
        {
          "feature": "uniDict: inGenus",
          "odds": 2976.794090112943
        },
        {
          "feature": "uniEnd3: lus",
          "odds": 570.6314549737272
        },
        {
          "feature": "spEnd3: mis",
          "odds": 210.6946910672223
        },
        {
          "feature": "spLen: 7",
          "odds": 3.6025724692203513
        },
        {
          "feature": "uniLen: 8",
          "odds": 0.9606164921956841
        },
        {
          "feature": "abbr: false",
          "odds": 0.8732848865715452
        },
        {
          "feature": "priorOdds: true",
          "odds": 0.1
        }
      ],
      "start": 143,
      "end": 160,
      "annotationNomenType": "NO_ANNOT",
      "verification": {
        "id": "0dbc49e2-b393-5d52-a0be-2b09ce6231fa",
        "name": "Petrolus inermis",
        "cardinality": 2,
        "matchType": "PartialExact",
        "bestResult": {
          "dataSourceId": 181,
          "dataSourceTitleShort": "IRMNG",
          "curation": "Curated",
          "recordId": "urn:lsid:irmng.org:taxname:1391559",
          "entryDate": "2022-06-10",
          "sortScore": 8.67908829458864,
          "matchedName": "Petrolus Rafinesque, 1815",
          "matchedCardinality": 1,
          "matchedCanonicalSimple": "Petrolus",
          "matchedCanonicalFull": "Petrolus",
          "currentRecordId": "urn:lsid:irmng.org:taxname:1391559",
          "currentName": "Petrolus Rafinesque, 1815",
          "currentCardinality": 1,
          "currentCanonicalSimple": "Petrolus",
          "currentCanonicalFull": "Petrolus",
          "isSynonym": false,
          "classificationPath": "Biota|Animalia|Chordata|Vertebrata|Reptilia|Reptilia|Reptilia|Petrolus",
          "classificationRanks": "|Kingdom|Phylum|Subphylum|Class|Order|Family|Genus",
          "classificationIds": "urn:lsid:irmng.org:taxname:1|urn:lsid:irmng.org:taxname:2|urn:lsid:irmng.org:taxname:148|urn:lsid:irmng.org:taxname:11905117|urn:lsid:irmng.org:taxname:1448|urn:lsid:irmng.org:taxname:10544|urn:lsid:irmng.org:taxname:100138|urn:lsid:irmng.org:taxname:1391559",
          "editDistance": 0,
          "stemEditDistance": 0,
          "matchType": "PartialExact",
          "scoreDetails": {
            "cardinalityScore": 0,
            "infraSpecificRankScore": 0,
            "fuzzyLessScore": 1,
            "curatedDataScore": 0.6666667,
            "authorMatchScore": 0.14285715,
            "acceptedNameScore": 1,
            "parsingQualityScore": 1
          }
        },

So it looks like Pithopus inermis is not returned by gnfinder.
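
The numbers in the output above suggest how the overall score is built: adding up log10 of each entry in "oddsDetails" reproduces the reported "oddsLog10". The Python sketch below checks that arithmetic. It assumes a naive-Bayes-style multiplication of per-feature odds, which is inferred from the figures in this output rather than taken from gnfinder's source; the variable and function names are made up for the illustration.

```python
import math

# Values copied verbatim from the "oddsDetails" array in the output above.
odds_details = {
    "spDict: inSpecies": 8904.045433955427,
    "uniDict: inGenus": 2976.794090112943,
    "uniEnd3: lus": 570.6314549737272,
    "spEnd3: mis": 210.6946910672223,
    "spLen: 7": 3.6025724692203513,
    "uniLen: 8": 0.9606164921956841,
    "abbr: false": 0.8732848865715452,
    "priorOdds: true": 0.1,
}

# "oddsLog10" as reported in the same output.
reported_odds_log10 = 11.983664170973137


def combined_odds_log10(odds):
    """Sum log10 of each odds value (equivalent to log10 of their product)."""
    return sum(math.log10(value) for value in odds)


computed = combined_odds_log10(odds_details.values())
print(f"computed: {computed:.6f}  reported: {reported_odds_log10:.6f}")

# Rank features by how strongly they push towards "this is a name".
for feature, value in sorted(odds_details.items(), key=lambda kv: -kv[1]):
    print(f"{feature:20s} log10(odds) = {math.log10(value):+.3f}")
```

Read this way, the two dictionary features ("spDict: inSpecies" and "uniDict: inGenus") contribute most of the score, which fits a misread word that happens to match a real genus name being accepted with high confidence.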

@mlichtenberg and @cajunjoel, can you help find out how this false positive appeared in BHL?

mlichtenberg commented 2 years ago

It was old data left over from a previous name-finding algorithm. I re-ran that page through the latest version of GNFinder (1.0.0) and the data now reflects the GNFinder output shown in the previous comment (https://www.biodiversitylibrary.org/page/663902).

dimus commented 2 years ago

@mlichtenberg, @cajunjoel, given the imminent release of bhlindex v1.0.0, maybe we should plan to run it against the whole of BHL in October and get rid of the outdated inaccuracies left by the old algorithms?

Archilegt commented 2 years ago

The recognition of Petrolus is as expected for the "Petiolus inermis" sentence in line 5, whose underlying uncorrected OCR reads "Petrolus inermis". There is one less false positive for a centipede name! ;) I will leave the issue open in case you wish to continue working on it.
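
To illustrate how small the OCR slip is, here is a minimal Python sketch that computes the edit distance between the printed word and the OCR reading. The `levenshtein` helper is written for this example only and is not part of gnfinder or BHL tooling.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


printed = "Petiolus"   # word as printed on the page
ocr_text = "Petrolus"  # uncorrected OCR reading

distance = levenshtein(printed, ocr_text)
print(f"{printed!r} -> {ocr_text!r}: edit distance {distance}")
```

A single substituted character ("i" read as "r") is enough to turn the printed word into a string that exactly matches the genus Petrolus.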