gnames / gnfinder

GNfinder finds scientific names in UTF8 texts, PDF files, MS Word/Excel documents, URLs etc.
MIT License

inconsistency with previous results obtained with the same version #61

Closed Adafede closed 3 years ago

Adafede commented 3 years ago

Hi again,

Always me coming with problems!

I was quite surprised: I re-ran (by mistake) some jobs I had already run and saw that the results had changed heavily, whereas the entries in GBIF (data source 11) do not seem to have changed.

A few weeks ago, I ran (as an example):

 echo "Taxus L." | gnfinder find -c -s 11  

and obtained

6|7707728|194|640|4863|2684902  kingdom|phylum|class|order|family|genus  Plantae|Tracheophyta|Pinopsida|Pinales|Taxaceae|Taxus

now, I only get

"preferredResults": [
  {
    "dataSourceId": 11,
    "dataSourceTitle": "GBIF Backbone Taxonomy",
    "taxonId": "3240108",
    "matchedName": "Taxus Geoffroy \u0026 Cuvier, 1795",
    "matchedCardinality": 1,
    "matchedCanonicalSimple": "Taxus",
    "matchedCanonicalFull": "Taxus",
    "currentName": "Meles Brisson, 1762",
    "currentCardinality": 1,
    "currentCanonicalSimple": "Meles",
    "currentCanonicalFull": "Meles",
    "isSynonym": true,
    "classificationPath": "Animalia|Chordata|Mammalia|Carnivora|Mustelidae|Meles",
    "classificationRank": "kingdom|phylum|class|order|family|genus",
    "classificationIds": "1|44|359|732|5307|2433867",
    "matchType": "ExactCanonicalMatch"
  }
],

whereas, when I run

echo "Taxus L." | gnverify -s 11 -f pretty

I do get the right result:

"preferredResults": [
  {
    "dataSourceId": 11,
    "dataSourceTitleShort": "GBIF Backbone Taxonomy",
    "curation": "AutoCurated",
    "recordId": "2684902",
    "entryDate": "2020-05-29",
    "matchedName": "Taxus L.",
    "matchedCardinality": 1,
    "matchedCanonicalSimple": "Taxus",
    "matchedCanonicalFull": "Taxus",
    "currentRecordId": "2684902",
    "currentName": "Taxus L.",
    "currentCardinality": 1,
    "currentCanonicalSimple": "Taxus",
    "currentCanonicalFull": "Taxus",
    "isSynonym": false,
    "classificationPath": "Plantae|Tracheophyta|Pinopsida|Pinales|Taxaceae|Taxus",
    "classificationRanks": "kingdom|phylum|class|order|family|genus",
    "editDistance": 0,
    "stemEditDistance": 0,
    "matchType": "Exact"
  }
],
"dataSourcesNum": 27,
"curation": "Curated"

This is somewhat problematic, and I loved having the taxon IDs from gnfinder, which gnverify does not provide...

Your work is amazing, thanks for it, it allows great things! :)

dimus commented 3 years ago

@Adafede currently gnfinder and gnverify use different services. For now gnfinder stays on the old Scala-based system, https://index.globalnames.org/, while gnverify (Go version) uses the new Go-based service with the API described at https://app.swaggerhub.com/apis-docs/dimus/gnames/1.0.0. This new service is also about 10 times more performant.

The Scala code is aging fast, which was the reason to develop a new code base for name resolution/reconciliation. The algorithms are similar but not identical. In your example gnverify preferred "Taxus L." because the authorship-matching algorithm found a matching authorship (L.), which pushed that result higher.
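To make the idea concrete, here is a toy sketch of how an authorship match can push one candidate above another. It is not gnverify's actual algorithm: the function names and the one-point-per-match scoring are assumptions invented for this illustration, and the naive name splitting only works for single-word names such as the "Taxus" example.

```python
# Toy illustration only: the scoring and the function names are invented
# for this example and are not taken from gnverify.

def split_name(name_string):
    """Split 'Taxus L.' into canonical 'Taxus' and authorship 'L.'.

    Naive: assumes a single-word (uninomial) name followed by authorship.
    """
    canonical, _, authorship = name_string.partition(" ")
    return canonical, authorship


def score(query, candidate):
    """1 point for a canonical match, 1 more if the authorship also matches."""
    q_canonical, q_author = split_name(query)
    c_canonical, c_author = split_name(candidate)
    points = 0
    if q_canonical == c_canonical:
        points += 1
        if q_author and q_author == c_author:
            points += 1
    return points


def best_match(query, candidates):
    """Return the highest-scoring candidate (the first one wins ties)."""
    return max(candidates, key=lambda c: score(query, c))


candidates = ["Taxus Geoffroy & Cuvier, 1795", "Taxus L."]
print(best_match("Taxus L.", candidates))
```

Without an authorship in the query, both candidates tie on the canonical form, which is roughly the situation where the animal genus can come out on top.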

TaxonID is still there under the name RecordID. I have not yet decided whether this is a good change. The reason for the change is that TaxonID suggests that the returned result has taxonomic significance, while in reality the result can be taxonomic, nomenclatural, or lexical, depending on the underlying data source. So I decided to rename the field to RecordID. However, it might be confusing, because people are accustomed to the DarwinCore TaxonID term (which is also semantically misused).
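For scripts that have to cope with both outputs during the transition, a small helper can read the identifier under either field name. This is only a sketch: the keys `taxonId` and `recordId` are the ones visible in the two outputs quoted in this thread, while the helper name `record_id` and the sample dictionaries are made up for illustration.

```python
def record_id(result):
    """Return the identifier of one result, whichever field name it uses.

    Prefers the newer 'recordId' key and falls back to the older 'taxonId'.
    Returns None if neither key is present.
    """
    if "recordId" in result:
        return result["recordId"]
    return result.get("taxonId")


# Trimmed-down versions of the two results quoted earlier in the thread.
old_style = {"dataSourceId": 11, "taxonId": "3240108"}
new_style = {"dataSourceId": 11, "recordId": "2684902"}

print(record_id(old_style))
print(record_id(new_style))
```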

I am planning to move gnfinder to the new system in the first half of January. That should fix https://github.com/gnames/gnfinder/issues/48 for cases such as:

echo "Pisonia grandis" | gnverify -f pretty

Adafede commented 3 years ago

Yes, I saw all your improvements, great! I also found the record ID, but the taxon IDs we had before included all parents, like the ranks and names; now it seems there is only one ID. I don't know if that is clear enough... I completely agree with your remarks about semantics!

dimus commented 3 years ago

Yes, @Adafede, I understand what you are saying: before, there was a classification path, a rank "path", and an ID "path". No one ever told me they used the ID path, so I decided to remove it and see if someone would complain. You did complain, so I will add it back :D
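For readers wondering how the three "paths" fit together, the pipe-delimited strings line up position by position. Below is a minimal sketch that zips them into one record per rank. The function name `classification` is hypothetical; the sample values are copied from the old gnfinder output quoted at the top of this issue. The ID path is treated as optional, since the newer output shown here does not include it.

```python
def classification(path, ranks, ids=None):
    """Combine pipe-delimited name, rank, and (optional) ID paths.

    Returns a list of dicts ordered from the highest rank down.
    Raises ValueError if the paths have different lengths.
    """
    names = path.split("|")
    rank_list = ranks.split("|")
    id_list = ids.split("|") if ids else [None] * len(names)
    if not (len(names) == len(rank_list) == len(id_list)):
        raise ValueError("classification paths differ in length")
    return [
        {"rank": rank, "name": name, "id": id_}
        for rank, name, id_ in zip(rank_list, names, id_list)
    ]


lineage = classification(
    "Animalia|Chordata|Mammalia|Carnivora|Mustelidae|Meles",
    "kingdom|phylum|class|order|family|genus",
    "1|44|359|732|5307|2433867",
)
for level in lineage:
    print(level["rank"], level["name"], level["id"])
```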

dimus commented 3 years ago

https://github.com/gnames/gnames/issues/61

dimus commented 3 years ago

I think we can close it now after v0.12.0 is out.

Adafede commented 3 years ago

Indeed, wonderful, thank you!