taxize_nexml() identify higher-level taxonomic names automatically?

cboettig commented 8 years ago

Hey @sckott ,

taxize_nexml() does a nice job of getting metadata when the labels are good species names. Would it be possible to extend this to handle names that are higher-order taxonomy? For example, I think this NeXML file gives OTU labels as families: https://github.com/ropensci/RNeXML/blob/master/inst/examples/geospiza.xml

e.g.

library(RNeXML)
nex <- nexml_read("https://raw.githubusercontent.com/ropensci/RNeXML/master/inst/examples/geospiza.xml")
taxize_nexml(nex)

sckott commented 8 years ago

That file has mostly specific epithets within Geospiza, plus two genera outside of Geospiza (AFAIK): Pinaroloxias and Platyspiza.

Ideally, before searching, we'd have the fullest name the data allows, e.g., Geospiza magnirostris instead of just magnirostris. As far as I can see, though, the name Geospiza doesn't appear anywhere in that file, other than in the file name.
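If the genus were known from somewhere else (the file name, or the user), building the fuller names before searching could look roughly like the sketch below. expand_epithets() is a hypothetical helper, not part of RNeXML or taxize, and it assumes that a label starting with a lower-case letter is a bare epithet while a capitalized label is already a genus or higher name.

```r
# Hypothetical helper (not in RNeXML or taxize): prepend a genus to labels
# that look like bare specific epithets.
# Assumption: lower-case first letter = epithet; capitalized = genus or higher,
# which is left untouched.
expand_epithets <- function(labels, genus) {
  is_epithet <- grepl("^[a-z]", labels)
  ifelse(is_epithet, paste(genus, labels), labels)
}

# Labels taken from the names mentioned in this thread.
labels <- c("magnirostris", "Pinaroloxias", "Platyspiza")
full_names <- expand_epithets(labels, genus = "Geospiza")
print(full_names)
```

The capitalization heuristic is only a guess at the rank; it would mislabel any file that writes genus names in lower case.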

Ranks would be nice to have to make searches faster, but then we'd need the user to specify that in their nex file

We can search for higher taxonomic names, but epithets on their own usually don't work out too well. Not sure what to do in that case.

cboettig commented 8 years ago

Thanks! right, looks like the data just isn't precise enough in this case then.

Um, more generally, do higher taxonomic names work? In a similar vein, I'm wondering if we should modify the function to return the two additional meta blocks that @rvosa's tool produces here: one specifying whether the name is a Species or some other rank, and one specifying what the species is an rdfs:subClassOf.

@rvosa the value of knowing taxonRank is pretty intuitive, but what's a use case where you would also want the subClassOf? I suppose a user could always determine both of these pieces of information from the taxon identifier directly, though I could see that it would often be more convenient to avoid having to make another query.

sckott commented 8 years ago

Higher taxonomic names should work, yes.

Searching for a higher taxonomic name should return rank as well. Getting parent might be another request though.

library(taxize)
(res <- get_uid("Platyspiza"))
#> [1] "48887"
#> attr(,"class")
#> [1] "uid"
#> attr(,"match")
#> [1] "found"
#> attr(,"uri")
#> [1] "http://www.ncbi.nlm.nih.gov/taxonomy/48887"

classification(res)
#> $`48887`
#>                    name         rank      id
#> 1    cellular organisms      no rank  131567
...
#> 28        Passeriformes        order    9126
#> 29          Passeroidea  superfamily  175121
#> 30         Fringillidae       family    9133
#> 31          Emberizinae    subfamily   62155
#> 32           Platyspiza        genus   48887
#> 
#> attr(,"class")
#> [1] "classification"
#> attr(,"db")
#> [1] "ncbi"

# for the family Fringillidae, assuming that's what you want
as.uid(9133)
#> [1] "9133"
#> attr(,"class")
#> [1] "uid"
#> attr(,"match")
#> [1] "found"
#> attr(,"uri")
#> [1] "http://www.ncbi.nlm.nih.gov/taxonomy/9133"

So we can get the info that way, though presumably with some smarter version of it.
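One possible "smarter version" is to read both the rank and the parent off the lineage table in one go. The sketch below works offline on a data frame that copies the last five rows of the classification() output above; rank_and_parent() is a hypothetical helper, and it assumes the table is ordered from root to tip, so the last row is the queried taxon and the row before it is its immediate parent.

```r
# Tail of the NCBI lineage for Platyspiza, copied from the output above.
lineage <- data.frame(
  name = c("Passeriformes", "Passeroidea", "Fringillidae", "Emberizinae", "Platyspiza"),
  rank = c("order", "superfamily", "family", "subfamily", "genus"),
  id   = c("9126", "175121", "9133", "62155", "48887"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: assumes rows run from root to tip, so the last row is
# the taxon itself and the second-to-last row is its immediate parent.
rank_and_parent <- function(lineage) {
  n <- nrow(lineage)
  stopifnot(n >= 2)
  list(
    name        = lineage$name[n],
    rank        = lineage$rank[n],
    parent_name = lineage$name[n - 1],
    parent_id   = lineage$id[n - 1]
  )
}

info <- rank_and_parent(lineage)
str(info)
```

With a live query, the same helper could presumably be given classification(res)[[1]] in place of the hand-built data frame.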

I realized just now that get_uid() and related functions could return rank (if available), meaning one more piece of data available and one less API call for users who want rank info.
