Closed oolonek closed 3 years ago
Hm, I wonder if the problem comes from the format of the DWCA file generated by the CoL team. I will try to figure out what is happening with them.
I am looking at the Catalogue of Life result for resolving Solanum tuberosum, and it shows that Solanum tuberosum Bert. ex Walp. is an ambiguous synonym of the currently accepted name Solanum etuberosum Lindl. This result seems to correspond to the output you got from gnfinder. Am I missing something?
Look at the first three results (in alphabetical order): you do get Bert. first, but the entry you want is the second one. The one everyone wants from gnfinder is Solanum tuberosum L. (accepted).
So we need a better way of handling homonyms. Maybe we need to return all matches to the canonical form Solanum tuberosum?
Yes, exactly. Or, even better: if there is an accepted name, return it (like Solanum tuberosum L.); if not, return the ambiguous synonyms, but not in alphabetical order, which leads to inconsistency.
Thank you very much for your help!
Bad news -- the person who wrote the verification service left the project. Good news -- we do need a rewrite of his service. Bad news -- it will take a few months. Good news -- we started the rewrite already.
So I think in the new code the behavior should be:
In this particular case it would mean:
Solanum tuberosum L.
Solanum tuberosum Bert. ex Walp.
Solanum tuberosum Poepp. ex Walp.
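The ordering shown in the three lines above (accepted name first, remaining homonyms after it in a stable order) could be sketched roughly as follows. This is only an illustration, not the verifier's actual code: the `Candidate` type and `rank_homonyms` function are hypothetical names, and the status given to the Poepp. ex Walp. record is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one homonym returned by a data source."""
    name: str       # full name string, including authorship
    accepted: bool  # True if the source marks it as the accepted name


def rank_homonyms(candidates):
    """Put accepted names first; break ties alphabetically so the
    order is deterministic from one run to the next."""
    return sorted(candidates, key=lambda c: (not c.accepted, c.name))


candidates = [
    Candidate("Solanum tuberosum Bert. ex Walp.", accepted=False),
    Candidate("Solanum tuberosum L.", accepted=True),
    # Status assumed for illustration only.
    Candidate("Solanum tuberosum Poepp. ex Walp.", accepted=False),
]

ranked = rank_homonyms(candidates)
for c in ranked:
    print(c.name)
```

Sorting on the pair `(not accepted, name)` keeps the result independent of the row order in the source, which is the inconsistency raised earlier in this thread.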
Yes, I think it would be ideal. Do you agree @oolonek ?
Yes. Actually, I think the goal of the script is to return the accepted name, and there should be only one. So I don't see why other homonyms should be returned?
So, to summarize: when no botanist names are specified in the input, just return the one and only accepted name.
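A minimal sketch of that rule, assuming the candidates are available as simple (full name, is accepted) pairs. The function `pick_name` and the data layout are hypothetical and only illustrate the idea; they are not part of gnfinder.

```python
def pick_name(query, candidates):
    """Return a single name for the query, or None if undecidable.

    candidates: list of (full_name, is_accepted) tuples (hypothetical layout).
    """
    query = query.strip()
    # If the input already carries authorship, an exact full-name match wins.
    for name, _ in candidates:
        if name == query:
            return name
    # Otherwise answer only when exactly one accepted name exists.
    accepted = [name for name, is_accepted in candidates if is_accepted]
    return accepted[0] if len(accepted) == 1 else None


candidates = [
    ("Solanum tuberosum Bert. ex Walp.", False),
    ("Solanum tuberosum L.", True),
    ("Solanum tuberosum Poepp. ex Walp.", False),  # status assumed
]

print(pick_name("Solanum tuberosum", candidates))
print(pick_name("Solanum tuberosum Bert. ex Walp.", candidates))
```

Returning `None` when there are zero or several accepted names makes the ambiguous case explicit instead of silently picking the first row.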
Happy to have your support here !
For info, here are the outputs of the following two gnfinder commands (with and without the 3-token detection):
gnfinder find -c -l eng -s "1,11"
and
gnfinder find -c -l eng -s "1,11" -t 3
input is a potato 1 = Solanum tuberosum L.
input is a potato 2 = Solanum tuberosum Bert. ex Walp.
input is a potato 3 = Solanum tuberosum Poepp. ex Walp.
input is a classical plain ol' potato = Solanum tuberosum
Giving respectively:
gnfound_potato.txt gnfound_t3_potato.txt
The -t argument does allow catching the botanists' initials; however, this doesn't change the outputs.
Connected this issue to https://github.com/gnames/gnames/issues/20
Fixed in v0.12.0 with incorporation of https://verifier.globalnames.org as the verification service:
❯ echo "Solanum tuberosum" |gnfinder -v -f pretty
{
  "metadata": {
    "date": "2021-04-25T20:33:01.625541101-05:00",
    "gnfinderVersion": "v0.11.1-21-gf30f9d0",
    "withBayes": true,
    "withVerification": true,
    "tokensAround": 0,
    "language": "eng",
    "detectLanguage": false,
    "totalWords": 2,
    "totalCandidates": 1,
    "totalNames": 1
  },
  "names": [
    {
      "cardinality": 2,
      "verbatim": "Solanum tuberosum",
      "name": "Solanum tuberosum",
      "oddsLog10": 10.1178674169311,
      "start": 0,
      "end": 17,
      "annotationNomenType": "NO_ANNOT",
      "verification": {
        "inputId": "d70bae59-b6df-5d17-9306-1bda02ede69c",
        "input": "Solanum tuberosum",
        "matchType": "Exact",
        "bestResult": {
          "dataSourceId": 1,
          "dataSourceTitleShort": "Catalogue of Life",
          "curation": "Curated",
          "recordId": "2777000",
          "localId": "cedff92af8e897332c00c5434f3a6528",
          "outlink": "http://www.catalogueoflife.org/annual-checklist/2019/details/species/id/cedff92af8e897332c00c5434f3a6528",
          "entryDate": "2020-06-15",
          "matchedName": "Solanum tuberosum L.",
          "matchedCardinality": 2,
          "matchedCanonicalSimple": "Solanum tuberosum",
          "matchedCanonicalFull": "Solanum tuberosum",
          "currentRecordId": "2777000",
          "currentName": "Solanum tuberosum L.",
          "currentCardinality": 2,
          "currentCanonicalSimple": "Solanum tuberosum",
          "currentCanonicalFull": "Solanum tuberosum",
          "isSynonym": false,
          "classificationPath": "Plantae|Tracheophyta|Magnoliopsida|Solanales|Solanaceae|Solanum|Solanum tuberosum",
          "classificationRanks": "kingdom|phylum|class|order|family|genus|species",
          "classificationIds": "3939764|3942634|3942724|3943023|3943027|4187115|2777000",
          "editDistance": 0,
          "stemEditDistance": 0,
          "matchType": "Exact"
        },
        "dataSourcesNum": 29,
        "curation": "Curated"
      }
    }
  ]
}
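For anyone who wants to check this result programmatically, the sketch below reads the fields of interest from output shaped like the JSON above. The embedded sample is an abbreviated copy of that output, and `summarize` is a hypothetical helper, not part of gnfinder.

```python
import json

# Abbreviated copy of the output shown above; only the fields used here.
sample = """
{
  "names": [
    {
      "name": "Solanum tuberosum",
      "verification": {
        "bestResult": {
          "dataSourceTitleShort": "Catalogue of Life",
          "matchedName": "Solanum tuberosum L.",
          "currentName": "Solanum tuberosum L.",
          "isSynonym": false
        }
      }
    }
  ]
}
"""


def summarize(report):
    """Return (name, matchedName, currentName, isSynonym) for each found name."""
    rows = []
    for found in report.get("names", []):
        best = found.get("verification", {}).get("bestResult")
        if best is None:
            continue  # name was found but has no verification result
        rows.append((found["name"], best["matchedName"],
                     best["currentName"], best["isSynonym"]))
    return rows


rows = summarize(json.loads(sample))
for row in rows:
    print(row)
```

A result where `isSynonym` is false and `matchedName` equals `currentName` is the case this issue asked for: the match is itself the accepted name.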
Observed problem
It looks like the wrong output is returned when matching against the Catalogue of Life.
Example
Example on the entry Solanum tuberosum.
Expected output
The expected output, according to the Catalogue of Life web page, is Solanum tuberosum (this is indeed the accepted name). See the Catalogue of Life output for the 'Solanum tuberosum' query.
Actual output
However, the output of GNFinder is Solanum etuberosum (see below).
Possible explanation
It appears that, when matching against Catalogue of Life, the returned entry is simply the first row. @adafede observed that the Catalogue of Life entries are ordered first by rank and then alphabetically (in this case Solanum tuberosum Bert. ex Walp. > Solanum tuberosum L. > Solanum tuberosum Poepp. ex Walp.).
Expected behaviour of GNFinder
In these cases, first filter by Name status = Accepted name and then return the corresponding output. How could this be done? Is it doable on the GNFinder side, or should it be taken care of at Catalogue of Life?
Many thanks
Note that this behaviour is observed for a large number of entries. Another example: the query Pisonia grandis (accepted name) returns Pisonia umbellifera.