ropensci / taxize

A taxonomic toolbelt for R
https://docs.ropensci.org/taxize

Error in finding resolved plant names, differences between rPlant and taxize? #227

Closed bw4sz closed 10 years ago

bw4sz commented 10 years ago

Comparison of rPlant and taxize in resolving plant names from Maquipucuna, Ecuador

I really love the tools in taxize and am very eager to try the phylomatic options. In my workflow, I first used rPlant's ResolveNames function because of its nice fuzzy matching of misspelled species names (lots of field assistants!). taxize doesn't seem to have the same level of fuzzy matching, so I decided on a two-step process.

First I use rPlant's ResolveNames, then I pass the rPlant output to taxize to check whether there have been any taxonomic changes. I'm finding that many names that rPlant resolves, and that are corroborated online as good species names, are not found by taxize.

As a quick reference, here are the EOL pages for five sample species: http://eol.org/pages/1118538/overview http://eol.org/pages/8767588/overview http://eol.org/pages/1118438/overview http://eol.org/pages/8767461/overview (although, ironically, I can see that the images for this species on EOL are of the wrong genus) http://eol.org/pages/1107632/overview

Also works from another source: http://www.theplantlist.org/tpl/record/kew-84010

require(rPlant)
## Loading required package: rPlant
## Loading required package: rjson
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: seqinr
## Loading required package: knitcitations
## Loading required package: bibtex
## 
## Attaching package: 'knitcitations'
## 
## The following object is masked from 'package:utils':
## 
##     cite
require(taxize)
## Loading required package: taxize
## 
## 
## New to taxize? Tutorial at http://ropensci.org/tutorials/taxize_tutorial.html 
## API key names have changed. Use tropicosApiKey, eolApiKey, ubioApiKey, and pmApiKey in your .Rprofile file. 
## Use suppressPackageStartupMessages() to suppress these startup messages in the future

# Create some test names, some that match, some that disagree, etc. I can
# push a file to github if you would like to see the full naming lists (has
# some ugly NA species).

nam <- c("Heliconia_virginalis", "Guzmania_amplectens", "Macleania_bullata", 
    "psamisia_ulbrichiana", "faramae calyptrata")  # In this list, the first three are good species, the last two are intentionally misspelled ('psammisia', 'faramea')

# First rplant
rPlant_N <- ResolveNames(nam)
## [1] "Heliconia_virginalis"  "Guzmania_amplectens"   "Macleania_bullata"    
## [4] "Psammisia_ulbrichiana" "Faramea_calyptrata"
CompareNames(nam, rPlant_N)
## 
## [1] "psamisia_ulbrichiana  was changed to  Psammisia_ulbrichiana "
## 
## [1] "faramae calyptrata  was changed to  Faramea_calyptrata "
## [1] "2 taxa changed names according to TNRS"
# That was a success.

# Taxize to reflect current taxonomy.
uids <- get_uid(rPlant_N)
## 
## Retrieving data for taxon 'Heliconia_virginalis'
## 
## Not found. Consider checking the spelling or alternate classification
## 
## Retrieving data for taxon 'Guzmania_amplectens'
## 
## Not found. Consider checking the spelling or alternate classification
## 
## Retrieving data for taxon 'Macleania_bullata'
## 
## 
## Retrieving data for taxon 'Psammisia_ulbrichiana'
## 
## 
## Retrieving data for taxon 'Faramea_calyptrata'
## 
## Not found. Consider checking the spelling or alternate classification
classification(uids)
## [[1]]
## [1] NA
## 
## [[2]]
## [1] NA
## 
## [[3]]
##        ScientificName         Rank     UID
## 1  cellular organisms      no rank  131567
## 2           Eukaryota superkingdom    2759
## 3       Viridiplantae      kingdom   33090
## 4        Streptophyta       phylum   35493
## 5      Streptophytina      no rank  131221
## 6         Embryophyta      no rank    3193
## 7        Tracheophyta      no rank   58023
## 8       Euphyllophyta      no rank   78536
## 9       Spermatophyta      no rank   58024
## 10      Magnoliophyta      no rank    3398
## 11    Mesangiospermae      no rank 1437183
## 12     eudicotyledons      no rank   71240
## 13       Pentapetalae      no rank 1437201
## 14           asterids     subclass   71274
## 15           Ericales        order   41945
## 16          Ericaceae       family    4345
## 17      Vaccinioideae    subfamily  217037
## 18         Vaccinieae        tribe  217062
## 19          Macleania        genus   57528
## 20  Macleania bullata      species   57529
## 
## [[4]]
##           ScientificName         Rank     UID
## 1     cellular organisms      no rank  131567
## 2              Eukaryota superkingdom    2759
## 3          Viridiplantae      kingdom   33090
## 4           Streptophyta       phylum   35493
## 5         Streptophytina      no rank  131221
## 6            Embryophyta      no rank    3193
## 7           Tracheophyta      no rank   58023
## 8          Euphyllophyta      no rank   78536
## 9          Spermatophyta      no rank   58024
## 10         Magnoliophyta      no rank    3398
## 11       Mesangiospermae      no rank 1437183
## 12        eudicotyledons      no rank   71240
## 13          Pentapetalae      no rank 1437201
## 14              asterids     subclass   71274
## 15              Ericales        order   41945
## 16             Ericaceae       family    4345
## 17         Vaccinioideae    subfamily  217037
## 18            Vaccinieae        tribe  217062
## 19             Psammisia        genus  180722
## 20 Psammisia ulbrichiana      species  249297
## 
## [[5]]
## [1] NA
# Three of the five names ([[1]], [[2]], and [[5]]) come back as NA

My ultimate goal is a species list of accepted names that I can feed into phylomatic_tree. I'm aiming for a genus-level phylogeny, but that's the next question.

As always, I appreciate your work, insight, and leadership. The rOpenSci community is an incredible project.

sckott commented 10 years ago

Hi @bw4sz. Thanks for getting in touch. It's especially nice to have bug reports/questions on GitHub, where the software is developed.

I'll take a look and get back to you asap.

The phylomatic_tree function is undergoing changes; I'll try to get it fixed up soon.

bw4sz commented 10 years ago

Thanks Scott. Sorry, it was totally rude of me to leave just a handle; it slipped my mind.

Ben Weinstein, Dept. of Ecology and Evolution, Stony Brook University

Thanks for your quick response. I'm playing around with it right now, seeing if I can force it to take genus names, since there is little hope that sequences exist for these relatively rare tropical plants.

sckott commented 10 years ago

No worries. Handle is just fine with me!

bw4sz commented 10 years ago

If it helps, here is an added twist: the names are found by the global names resolver function (gnr_resolve) if spelled correctly.

gnr_resolve("Guzmania jaramilloi")
##        submitted_name        matched_name data_source_title score
## 1 Guzmania jaramilloi Guzmania jaramilloi          Freebase 0.988
## 2 Guzmania jaramilloi Guzmania jaramilloi               EOL 0.988
## 3 Guzmania jaramilloi Guzmania jaramilloi     uBio NameBank 0.988
gnr_resolve("Guzmania_jaramilloi")
## Error: arguments imply differing number of rows: 1, 0

get_uid("Guzmania jaramilloi")
## 
## Retrieving data for taxon 'Guzmania jaramilloi'
## 
## Not found. Consider checking the spelling or alternate classification
## [1] NA
## attr(,"match")
## [1] "not found"
## attr(,"class")
## [1] "uid"
get_uid("Guzmania_jaramilloi")
## 
## Retrieving data for taxon 'Guzmania_jaramilloi'
## 
## Not found. Consider checking the spelling or alternate classification
## [1] NA
## attr(,"match")
## [1] "not found"
## attr(,"class")
## [1] "uid"

Thanks!

sckott commented 10 years ago

Regarding gnr_resolve and get_uid: get_uid only searches NCBI, but with get_ids you can search many sources at once, like:

get_ids("Guzmania jaramilloi", db=c('ncbi','itis','tropicos','eol','col'))
$ncbi
Guzmania jaramilloi 
                 NA 
attr(,"match")
[1] "not found"
attr(,"class")
[1] "uid"

$itis
Guzmania jaramilloi 
                 NA 
attr(,"match")
[1] "found"
attr(,"class")
[1] "tsn"

$tropicos
Guzmania jaramilloi 
            4303343 
attr(,"class")
[1] "tpsid"

$eol
Guzmania jaramilloi 
         "24936055" 
attr(,"class")
[1] "eolid"

$col
Guzmania jaramilloi 
          "9746606" 
attr(,"class")
[1] "colid"

attr(,"class")
[1] "ids"

You can see that tropicos, eol, and col do have that species, so you could then do:

id = get_ids("Guzmania jaramilloi", db='col')
classification(id)
$col
$col$`9746606`
          name    rank
1      Plantae Kingdom
2 Tracheophyta  Phylum
3   Liliopsida   Class
4       Poales   Order
5 Bromeliaceae  Family
6     Guzmania   Genus

attr(,"db")
[1] "col"

Make sense?
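
To make the "pick a source that has the species" step concrete, here is a minimal sketch in base R. It does not call get_ids; the `ids` list is a hand-built stand-in for the output printed above, and the `preferred` ordering is only an assumption for illustration.

```r
# Mock of the IDs printed above, one entry per database (NA = not found).
# Built by hand from the output shown, not from a live get_ids() call.
ids <- list(ncbi = NA, itis = NA, tropicos = 4303343,
            eol = "24936055", col = "9746606")

# Names of the databases that returned an identifier
found <- names(ids)[!vapply(ids, function(x) all(is.na(x)), logical(1))]
print(found)

# Hypothetical preference order; take the first preferred source with a hit
preferred <- c("ncbi", "itis", "col", "tropicos", "eol")
best <- intersect(preferred, found)[1]
print(best)
```

With the values above, `best` is the database whose ID you would then pass on to classification, as in the col example.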

sckott commented 10 years ago

You should remove the underscores from your names, e.g.:

mynames <- c('name_one', 'name_two')
gsub('_', ' ', mynames)
[1] "name one" "name two"
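
Applied to the names from this issue, the same idea could be wrapped in a small helper. This is only a sketch: `clean_names` is a hypothetical function, not part of taxize, and it tidies formatting only. It does not correct misspellings, which still need a resolver.

```r
# Hypothetical helper (not part of taxize): tidy raw field names before lookup.
clean_names <- function(x) {
  x <- gsub("_", " ", x, fixed = TRUE)          # underscores -> spaces
  x <- gsub("[[:space:]]+", " ", trimws(x))     # collapse repeated whitespace
  # capitalise the first letter (the genus), leave the rest untouched
  paste0(toupper(substring(x, 1, 1)), substring(x, 2))
}

nam <- c("Heliconia_virginalis", "Guzmania_amplectens", "Macleania_bullata",
         "psamisia_ulbrichiana", "faramae calyptrata")
cleaned <- clean_names(nam)
print(cleaned)
```

The misspelled entries come out as "Psamisia ulbrichiana" and "Faramae calyptrata": formatted consistently, but still misspelled.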

sckott commented 10 years ago

Note that you can use the tnrs function in taxize to do the same thing as ResolveNames in rPlant, like:

tnrs(gsub("_", " ", nam), source_="iPlant_TNRS")

The result:

         submittedname          acceptedname    sourceid score           matchedname           annotations
1 Heliconia virginalis  Heliconia virginalis iPlant_TNRS     1  Heliconia virginalis Abalo & G. Morales L.
2  Guzmania amplectens   Guzmania amplectens iPlant_TNRS     1   Guzmania amplectens              L.B. Sm.
4    Macleania bullata     Macleania bullata iPlant_TNRS     1     Macleania bullata                   Yeo
3 psamisia ulbrichiana Psammisia ulbrichiana iPlant_TNRS  0.98 Psammisia ulbrichiana               Hoerold
5   faramae calyptrata    Faramea calyptrata iPlant_TNRS  0.98    Faramea calyptrata           C.M. Taylor
                                    uri
1 http://www.tropicos.org/Name/21500105
2  http://www.tropicos.org/Name/4303223
4 http://www.tropicos.org/Name/12303377
3 http://www.tropicos.org/Name/12300394
5 http://www.tropicos.org/Name/50151859
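
If you want the equivalent of rPlant's CompareNames on this result, a rough sketch in base R might look like the following. The data frame here is copied by hand from the two relevant columns of the output above rather than produced by a live tnrs call.

```r
# Submitted and accepted names copied by hand from the tnrs() output above
res <- data.frame(
  submittedname = c("Heliconia virginalis", "Guzmania amplectens",
                    "Macleania bullata", "psamisia ulbrichiana",
                    "faramae calyptrata"),
  acceptedname  = c("Heliconia virginalis", "Guzmania amplectens",
                    "Macleania bullata", "Psammisia ulbrichiana",
                    "Faramea calyptrata"),
  stringsAsFactors = FALSE
)

# Rows where the resolver returned a different name than was submitted
changed <- res[res$submittedname != res$acceptedname, ]
print(changed)
```

Only the two intentionally misspelled names end up in `changed`.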

sckott commented 10 years ago

Here's an example of a genus-level phylogeny. Note that using the source tree "smith2011" doesn't work, at least in this case.

taxa <- c("Poa", "Phlox", "Helianthus")
tree <- phylomatic_tree(taxa=taxa, storedtree='R20120829', get='POST')
plot(tree)
Phylogenetic tree with 3 tips and 2 internal nodes.

Tip labels:
[1] "poa"        "phlox"      "helianthus"
Node labels:
[1] "poales_to_asterales"   "ericales_to_asterales"

Rooted; no branch lengths.

[Screenshot of the plotted tree]
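
For the species in this issue, the genus list could be derived from the resolved names with base R, as sketched below. The resulting vector could then be passed as `taxa` in a call like the one above; whether the stored tree actually contains each genus is not something this sketch checks.

```r
# Accepted species names as resolved earlier in this thread
species <- c("Heliconia virginalis", "Guzmania amplectens", "Macleania bullata",
             "Psammisia ulbrichiana", "Faramea calyptrata")

# Keep only the first word (the genus) and drop duplicates
genera <- unique(sub(" .*$", "", species))
print(genera)
```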

sckott commented 10 years ago

Hey @bw4sz, did you have any further questions? If not, I'll close this issue.

sckott commented 10 years ago

closing