ropensci / taxize

A taxonomic toolbelt for R
https://docs.ropensci.org/taxize

score in tnrs output #609

Closed kamapu closed 7 years ago

kamapu commented 7 years ago

I know it may be just a small detail, but when I submit the following query, I apparently get no match in the database for Guizotia schultzii:

sp_list <- tnrs(query=c("Bidens pilosa","Guizotia schultzii","Lactuca kenyensis"),
        source="iPlant_TNRS")
sp_list

In such a case I would expect an empty cell in the column matchedname (as there is for acceptedname) and a value of 0 in the column score. The latter could be very helpful when deciding whether to accept or reject the suggested names using the score as the criterion.
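A minimal sketch of the score-based filtering described above. It does not call the web service: the data frame is a hand-built stand-in that assumes columns named submittedname, matchedname, acceptedname and score, and the values (including the 0.95 threshold) are illustrative only.

```r
# Hypothetical stand-in for a tnrs() result; the values are made up for illustration.
sp_list <- data.frame(
  submittedname = c("Bidens pilosa", "Guizotia schultzii", "Lactuca kenyensis"),
  matchedname   = c("Bidens pilosa", "Guizotia schultzii", "Lactuca kenyensis"),
  acceptedname  = c("Bidens pilosa", "", "Lactuca kenyensis"),
  score         = c("1", "1", "0.9"),
  stringsAsFactors = FALSE
)

# Scores may arrive as character strings, so convert before comparing.
sp_list$score <- as.numeric(sp_list$score)

# Assumed acceptance rule: a high score AND a non-empty accepted name.
min_score <- 0.95
sp_list$keep <- sp_list$score >= min_score &
  !is.na(sp_list$acceptedname) & nzchar(sp_list$acceptedname)

sp_list[, c("submittedname", "score", "keep")]
```

Checking the accepted name alongside the score matters here, because a name can match perfectly and still come back without an accepted name.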

sckott commented 7 years ago

thanks for this issue @kamapu

We don't control the web service behind the tnrs() function, but I can pass along the feedback to the maintainers. I've thought about taking it over, but it's php, which I don't know

here's the JSON behind that request http://taxosaurus.org/retrieve/e7c69a0a08604652e390c28b328405f0 can see that there's a missing name in the acceptedName slot - thus the empty slot in the data.frame returned

Looks like different output when using the iplant tnrs website http://tnrs.iplantcollaborative.org/TNRSapp.html , i get

Name matched:      Guizotia schultzii
Name matched source(s):    GCC
Name matched rank:     species
Name score:    1.00
Author matched:    
Author score:      
Overall score:     1.00
Family matched:    
Family score:      
Name matched accepted family:      Asteraceae
Genus matched:     Guizotia
Genus score:       1.00
Specific epithet matched:      schultzii
Specific epithet score:    1.00
Infraspecific rank :       
Infraspecific epithet matched:     
Infraspecific epithet score:       
Infraspecific rank 2:      
Infraspecific epithet 2 matched:       
Infraspecific epithet 2 score:     
Annotations:       
Unmatched terms:       
Taxoxnomic status:     Illegitimate
Accepted name:     Guizotia scabra
Name matched source(s):    GCC
Accepted Name author:      (Vis.) Chiov.
Accepted Name Species:     Guizotia scabra
Accepted Name Family:      Asteraceae
Warnings:   Ambiguous match

I'll ask maintainer about this

kamapu commented 7 years ago

Thank you again. I'll wait for the outcome of the discussion. By the way, in the displayed summary there is a typo in Taxoxnomic status.

sckott commented 7 years ago

in the displayed summary there is a typo in Taxoxnomic status

what summary?

kamapu commented 7 years ago

Sorry, I was talking about the popup list, the one attached at the end of your previous message:

Name matched:      Guizotia schultzii
Name matched source(s):    GCC
Name matched rank:     species
Name score:    1.00
Author matched:    
Author score:      
Overall score:     1.00
Family matched:    
Family score:      
Name matched accepted family:      Asteraceae
Genus matched:     Guizotia
Genus score:       1.00
Specific epithet matched:      schultzii
Specific epithet score:    1.00
Infraspecific rank :       
Infraspecific epithet matched:     
Infraspecific epithet score:       
Infraspecific rank 2:      
Infraspecific epithet 2 matched:       
Infraspecific epithet 2 score:     
Annotations:       
Unmatched terms:       
Taxoxnomic status:     Illegitimate
Accepted name:     Guizotia scabra
Name matched source(s):    GCC
Accepted Name author:      (Vis.) Chiov.
Accepted Name Species:     Guizotia scabra
Accepted Name Family:      Asteraceae
Warnings:   Ambiguous match

sckott commented 7 years ago

ah, that's from their website, not from taxize

sckott commented 7 years ago

email to maintainer sent

nmatasci commented 7 years ago

There are at least two issues that came into play. I'll try to unpack them:

Difference between the tnrs website and the API calls via taxosaurus.org

As Scott pointed out, when one searches for Guizotia schultzii via the tnrs web interface, the result differs from the taxosaurus call. Namely, the web interface finds an accepted synonym, whereas the taxosaurus API call returns an empty accepted name. This discrepancy arises because the taxosaurus API queries only the Tropicos list, not all the sources. The match retrieved via the web interface is from GCC.

Score vs. acceptance status

The tnrs follows a two-step process: name scrubbing followed by taxonomic resolution. In the first step, the input string is parsed into its components (family, genus, species, subspecies, authors, etc.) and each component is then matched to a list of known strings within the appropriate category (aka "known names"). For every component, a score based on string similarity (Levenshtein distance) is reported. The component scores are then summarized into a single "Overall score" that indicates how close the submitted name is to a "known name" (matched name). The taxonomic status of the matching name has no impact on the score. Only at that point does the tnrs run the second stage and query various taxonomic sources to assess the taxonomic status of a "matched name" and, if appropriate, return whatever name the source considers to be accepted. In this case, you are seeing a score of 1 for the match, indicating that Guizotia schultzii is the best match in the database. However, because the taxonomic status is "No opinion" (according to that particular Tropicos snapshot), the "Accepted name" field is empty.
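The two stages can be sketched roughly as follows. This is not the TNRS implementation: it scores the whole string instead of parsed components, the normalization (1 minus distance over the longer length) is an assumption, and known_names, status_table and resolve_name are hypothetical names with toy data. Only adist() is a real base R function (package utils).

```r
# Similarity in [0, 1] derived from the Levenshtein distance.
# The normalization used here is an assumption, not the TNRS formula.
similarity <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]
  1 - d / max(nchar(a), nchar(b))
}

# Hypothetical reference data: known names and one source's status table.
known_names <- c("Guizotia schultzii", "Guizotia scabra", "Bidens pilosa")
status_table <- data.frame(
  name     = c("Guizotia schultzii", "Guizotia scabra", "Bidens pilosa"),
  status   = c("No opinion", "Accepted", "Accepted"),
  accepted = c("", "Guizotia scabra", "Bidens pilosa"),
  stringsAsFactors = FALSE
)

resolve_name <- function(query) {
  # Stage 1: name matching; the score ignores taxonomic status.
  scores <- vapply(known_names, function(k) similarity(query, k), numeric(1))
  best <- names(scores)[which.max(scores)]
  # Stage 2: look up the taxonomic status of the matched name.
  row <- status_table[status_table$name == best, ]
  data.frame(submitted = query, matched = best, score = max(scores),
             status = row$status, accepted = row$accepted,
             stringsAsFactors = FALSE)
}

resolve_name("Guizotia schultzii")  # exact match, yet no accepted name
resolve_name("Guizotia shultzii")   # misspelled query, lower score
```

With this toy table, the first call reproduces the situation in this issue: a score of 1 together with an empty accepted name, because the score is computed before the status lookup and independently of it.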

Versioning

The TNRS relies on imported versions of the underlying databases, and unfortunately those tend to get (badly) out of sync with the corresponding live versions, which makes it very difficult to track down the source of the problem.

I'm painfully aware that this is far from an ideal situation and honestly, the TNRS could use a rewrite from the ground up in order to address these and other issues. Unfortunately, I don't have the resources to do so.

sckott commented 7 years ago

@nmatasci thanks for your input!

The match retrieved via the web interface is from GCC

what is GCC?


@kamapu does the answer above clear things up for you?

nmatasci commented 7 years ago

It's one of the curated sources the TNRS uses to resolve valid names: the Global Compositae Checklist. One of the ideas behind the TNRS was to let users rank its sources according to their (subjective) preference, so that "lower quality" sources are only used to resolve names that are not found in the better sources. That said, we had to use a default ranking, and we prioritized manually curated, clade-specific lists.
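A rough sketch of that ranked-source idea, under stated assumptions: resolve_ranked and the sources list are hypothetical, and the lookup tables are toy data that mirror only the Guizotia schultzii case from this thread. It is not how the TNRS stores or queries its sources.

```r
# Hypothetical sources, listed from highest to lowest priority.
# Each maps a matched name to the name that source considers accepted.
sources <- list(
  GCC      = c("Guizotia schultzii" = "Guizotia scabra"),
  Tropicos = c("Guizotia schultzii" = "", "Zea mays" = "Zea mays")
)

resolve_ranked <- function(name, sources) {
  for (src in names(sources)) {
    if (name %in% names(sources[[src]])) {
      # The first source that knows the name wins; lower-ranked ones are skipped.
      return(list(source = src, accepted = unname(sources[[src]][name])))
    }
  }
  # No source knows the name.
  list(source = NA_character_, accepted = NA_character_)
}

resolve_ranked("Guizotia schultzii", sources)       # answered by GCC
resolve_ranked("Zea mays", sources)                 # falls through to Tropicos
resolve_ranked("Guizotia schultzii", rev(sources))  # Tropicos first: empty accepted name
```

Reversing the ranking in the last call shows how the same query can return an empty accepted name, which is the difference between the web interface and the API call discussed above.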

sckott commented 7 years ago

thanks for clarification @nmatasci

@kamapu does the answer above clear things up for you?