ojalaquellueva / TNRSapi

API wrapper for TNRS batch application

Long specific epithets prevent genera from being returned #4

Closed bmaitner closed 2 years ago

bmaitner commented 3 years ago

Hey Brad,

I noticed something weird when trying to clean some RMBL data. If the specific epithet is too long, otherwise matchable genera fail to match, even when the genus itself is a perfect match. See the following cases; in each pair, the first call performs as expected and the second does not:

TNRS::TNRS(taxonomic_names = "Viola nutt",matches = "all")
TNRS::TNRS(taxonomic_names = "Viola nuttalium",matches = "all")

TNRS::TNRS(taxonomic_names = "Epidendrum boylei",matches = "all")
TNRS::TNRS(taxonomic_names = "Epidendrum boyleisgreat",matches = "all")

ojalaquellueva commented 3 years ago

@bmaitner There are two issues here. The first is a result of how the TNRS filters partial matches. Relative to the (most likely) correct species "Viola nuttalium", "Viola" has an Overall Score of 0.50. As this is below the default match threshold of 0.53, the match "Viola" is rejected and the result "[No match found]" is displayed. If you lower the match threshold to 0.50 or below, "Viola" is returned as the best match.

To allow the behavior you are expecting, I would need to apply the match threshold sequentially: first to the full name, then, if no match found, to the immediate parent, and so on, ignoring name components of taxa at lower ranks. Perhaps this could be the default behavior, and the current behavior optional (e.g., filtering="strict").
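
To make the proposal concrete, here is a minimal Python sketch of sequential threshold filtering. It is not the TNRS implementation: `sequential_match`, `similarity`, and the toy genus-only reference list are hypothetical, and `difflib.SequenceMatcher` stands in for the TNRS Overall Score, which is computed differently. Only the 0.53 default threshold comes from the discussion above.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in score in [0, 1]. NOT the TNRS Overall Score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sequential_match(submitted, candidates, threshold=0.53):
    """Apply the threshold to the full name first, then to each parent in turn.

    Returns (matched_name, score, query_used), or None if nothing passes.
    """
    parts = submitted.split()
    # Start with every name component, then drop the lowest-rank one each round.
    for n in range(len(parts), 0, -1):
        query = " ".join(parts[:n])
        # Compare only against reference names with the same number of components.
        pool = [c for c in candidates if len(c.split()) == n]
        hits = [(similarity(query, c), c) for c in pool]
        hits = [(s, c) for s, c in hits if s >= threshold]
        if hits:
            score, name = max(hits)
            return name, score, query
    return None


# Toy reference list containing genera only (an assumption for illustration).
reference = ["Viola", "Epidendrum"]
print(sequential_match("Viola nuttalium", reference))
# ('Viola', 1.0, 'Viola')
```

Because the genus is scored on its own once the full name fails, the length of the unmatched epithet no longer drags the genus score below the threshold.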

Thoughts? (Second issue coming up...)

ojalaquellueva commented 3 years ago

@bmaitner A second issue is more puzzling. Why does "Viola" match "Viola nuttallii" (Levenshtein distance=10) at Overall Score=0.50 whereas "Viola nuttalium" (Levenshtein distance=3) does not, even if you lower the match threshold to zero? Of course, Levenshtein distance is only one component of the Overall Score, but still, LD=3 is a way closer match than LD=10. Will post this as a separate issue.
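
For reference, the two distances quoted above can be reproduced with a plain Levenshtein implementation. This is a sketch of the textbook dynamic-programming algorithm, not the TNRS code; the function name `levenshtein` is hypothetical, and the TNRS may normalize names (case, whitespace) before comparing.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(levenshtein("Viola", "Viola nuttallii"))            # 10
print(levenshtein("Viola nuttalium", "Viola nuttallii"))  # 3
```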

ojalaquellueva commented 3 years ago

Duplicate of #5

ojalaquellueva commented 3 years ago

Duplicate of #6

bmaitner commented 3 years ago

@ojalaquellueva I think the solution you mention (first fit to whole name, then to immediate parent) sounds like a reasonable one.

ojalaquellueva commented 2 years ago

Closed as duplicate of #7