ropensci / bold

Interface to the Bold Systems barcode webservice
https://docs.ropensci.org/bold

Add parentnames to output of bold_identify() #36

Closed dougwyu closed 7 years ago

dougwyu commented 7 years ago

Hi there,

When I use bold_identify, I get the lowest-level taxonomic identification for that sequence (taxonomicidentification field), but it would be very useful if we could also get the parentnames for that identification. The bold APIs do provide this information if I use 3 different bold package commands (see below), but I would then have to do some programming in R (not my strength) to insert the parentnames into the bold_identify output table. It seems to me that this would be better done within the bold package, if you fancy it.

also, maybe i have missed something, but i think it would be nicer to get parentnames from a taxid, not a taxonomicidentification field, given the (small) possibility for ambiguity.

thanks, doug

library(bold)
library(plyr)
testseq <- list(eb4909 = "GAATAAATAATATAAGATTTTGATTACTCCCTCCTTCTTTATTtttATTAATTTTAAGAAATTTTATTGGAACGGGTGTAGGAACCGGATGAACTTTATATCCTCCTTTATCATCTATTGTTGGACATGATTCACCTTCTGTAGATTTAGGAATTttttCTATCCATATTGCTGGAATTTCCTCAATTATAGGATCAATTAATTTTATTGTTACTATTTTAAATATACacacaAaaaCTCATTCACTAAATTTTCTTCCTTTATTCACATGATCAATTTTAATTACAGCAATTCTTCTTCTGTTATCATTACCAGTTCTTGCAGGAGCAATTACTATACTTCTTACAGATCGAAATCTTAATACATCTTtttttGATCCCGCAGGTGGgggggATCCAATTTTATACCAACACTTATTTT")
boldoutput_public <- bold_identify(testseq, db="COX1_SPECIES_PUBLIC")
boldoutput_public.df <- ldply(boldoutput_public, data.frame)
boldoutput_public_tax_name <- bold_tax_name(name=boldoutput_public.df[3,6]) # for a particular identification (third row)
boldoutput_public_tax_name.parents <- bold_tax_id(id=boldoutput_public_tax_name$taxid, includeTree = TRUE)

sckott commented 7 years ago

So you want parentnames included in the output for the bold_identify function? Or, do you want to have a separate function to get parentnames for some subset (or all) of the output from bold_identify?

> also, maybe i have missed something, but i think it would be nicer to get parentnames from a taxid, not a taxonomicidentification field, given the (small) possibility for ambiguity.

I don't follow this. You want to get parentnames using a taxonomic ID rather than taxonomic name? But isn't that what bold already allows you to do?

dougwyu commented 7 years ago

Yes, the convenient thing would be to have parentnames as additional columns in the bold_identify output. In general, i would expect that the taxonomic position of a genus_species would not be obvious to me (e.g. what is the Class/Order/Family of Allomerus octospinosus?). Also, with parentname columns, i would be able to sort output tables by higher ranks (e.g. Insecta, Hymenoptera), which is quite useful.

The reason that I request using taxid, not taxonomicidentification is that sometimes genus names are used in different kingdoms (famously, Anura is a plant and a frog).
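To make the ambiguity concrete, here is a minimal sketch in base R. It assumes a toy lookup table: the two taxid values are invented for illustration and are not real BOLD identifiers. It only shows why a lookup by name can match more than one record while a lookup by taxid cannot.

```r
# Toy lookup table. The taxids (101, 202) are invented for illustration
# and are NOT real BOLD taxids.
taxa <- data.frame(
  taxid        = c(101, 202),
  taxon        = c("Anura", "Anura"),
  tax_division = c("Animals", "Plants"),
  stringsAsFactors = FALSE
)

# Lookup by name: two candidate rows, so the parent lineage is ambiguous.
by_name <- taxa[taxa$taxon == "Anura", ]
nrow(by_name)

# Lookup by taxid: exactly one row, so the lineage is unambiguous.
by_id <- taxa[taxa$taxid == 101, ]
nrow(by_id)
```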

Thanks for replying so quickly. I don't seem to be able to set up an email notification that you have replied on github. I'll look around.

sckott commented 7 years ago

> The reason that I request using taxid, not taxonomicidentification is that sometimes genus names are used in different kingdoms (famously, Anura is a plant and a frog).

not sure we're on the same page here. Is this line of discussion about the bold_tax_id function? If so, that does accept a taxonomic ID, which is what you want, correct? If not that function, which one are you talking about?

sckott commented 7 years ago

for email notifications, perhaps go to this page https://github.com/settings/notifications

sckott commented 7 years ago

@dougwyu i started a new fxn. reinstall like devtools::install_github("ropensci/bold")

see bold_identify_parents() and its examples

let me know what you think

dougwyu commented 7 years ago

Fantastic and thank you!

I successfully ran the new command and was initially confused by all the additional rows, but i see what you've done: each ID is effectively its own little dataframe.

I am thinking that it might be more useful for the output to be wider, such that each of the original hits remains one line. Here is an image of what I'm thinking. It maintains most of the newly added information but allows one to filter, sort, and tally the output more easily.

[Image: screen shot 2017-01-05 at 08 47 31]

Perhaps the number of returned taxids differs per sequence(?), but it seems fine to settle on a fixed set of taxonomic ranks: phylum, class, order, family, subfamily, genus, species.
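As a rough illustration of why one row per hit is convenient, here is a small base R sketch. The `hits` table is toy data: the `class` and `order` column names and the similarity values are assumptions made for the example, not actual bold_identify output.

```r
# Toy wide-format table, one row per hit. Similarity values are invented,
# and the rank column names (class, order) are assumed for illustration.
hits <- data.frame(
  taxonomicidentification = c("Paratergatis longimanus",
                              "Allomerus octospinosus",
                              "Apis mellifera"),
  similarity = c(0.98, 0.95, 0.91),
  class      = c("Malacostraca", "Insecta", "Insecta"),
  order      = c("Decapoda", "Hymenoptera", "Hymenoptera"),
  stringsAsFactors = FALSE
)

# Sort by higher ranks first, then by descending similarity.
sorted <- hits[order(hits$class, hits$order, -hits$similarity), ]

# Filter to a single order.
hymenoptera <- hits[hits$order == "Hymenoptera", ]

# Tally the hits per order.
tally <- table(hits$order)
```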


ps this is what I ran:

library(bold)
library(plyr)
testseq <- list(eb4909 = "GAATAAATAATATAAGATTTTGATTACTCCCTCCTTCTTTATTtttATTAATTTTAAGAAATTTTATTGGAACGGGTGTAGGAACCGGATGAACTTTATATCCTCCTTTATCATCTATTGTTGGACATGATTCACCTTCTGTAGATTTAGGAATTttttCTATCCATATTGCTGGAATTTCCTCAATTATAGGATCAATTAATTTTATTGTTACTATTTTAAATATACacacaAaaaCTCATTCACTAAATTTTCTTCCTTTATTCACATGATCAATTTTAATTACAGCAATTCTTCTTCTGTTATCATTACCAGTTCTTGCAGGAGCAATTACTATACTTCTTACAGATCGAAATCTTAATACATCTTtttttGATCCCGCAGGTGGgggggATCCAATTTTATACCAACACTTATTTT")
boldoutput_public <- bold_identify(testseq, db="COX1_SPECIES_PUBLIC")
boldoutput_public_parents <- bold_identify_parents(boldoutput_public)
boldoutput_public_parents.df <- ldply(boldoutput_public_parents, data.frame)

sckott commented 7 years ago

I thought about using wide format, but thought it made more sense to give back the results as I did with repeated rows for each record.

Data.frame for parents is like

$`Paratergatis longimanus`
   taxid                   taxon  tax_rank tax_division parentid   parentname     taxonrep
1     20              Arthropoda    phylum      Animals        1         <NA>   Arthropoda
2     69            Malacostraca     class      Animals       20   Arthropoda Malacostraca
3    336                Decapoda     order      Animals       69 Malacostraca     Decapoda
4   1541               Xanthidae    family      Animals      336     Decapoda    Xanthidae
5 305321               Zosiminae subfamily      Animals     1541    Xanthidae         <NA>
6 322442            Paratergatis     genus      Animals   305321    Zosiminae         <NA>
7 503362 Paratergatis longimanus   species      Animals   322442 Paratergatis         <NA>

In your example above you use just two of those columns. When I was thinking about wide format, I thought it was way too many columns to add if we used all of them, but I guess if it's just two columns it's more palatable to add them.
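For illustration only, the long-to-wide reshaping being discussed could be sketched in base R roughly as follows. `parents_to_wide()` is a hypothetical helper and the `_taxid` column suffix is an assumption; neither is claimed to match what the bold package does internally. The input reuses the taxid, taxon and tax_rank columns from the data.frame shown above.

```r
# Parent records for one hit, copied from the example output above
# (only the three columns needed for reshaping).
parents <- data.frame(
  taxid    = c(20, 69, 336, 1541, 305321, 322442, 503362),
  taxon    = c("Arthropoda", "Malacostraca", "Decapoda", "Xanthidae",
               "Zosiminae", "Paratergatis", "Paratergatis longimanus"),
  tax_rank = c("phylum", "class", "order", "family",
               "subfamily", "genus", "species"),
  stringsAsFactors = FALSE
)

# Fixed set of ranks, so every hit yields the same columns.
ranks <- c("phylum", "class", "order", "family",
           "subfamily", "genus", "species")

# Hypothetical helper (not part of the bold package): collapse the long
# parents table into a single row with one name column and one taxid
# column per rank. Ranks missing from the input become NA.
parents_to_wide <- function(df, ranks) {
  idx <- match(ranks, df$tax_rank)
  names_row <- setNames(as.list(df$taxon[idx]), ranks)
  ids_row   <- setNames(as.list(df$taxid[idx]), paste0(ranks, "_taxid"))
  data.frame(c(names_row, ids_row), stringsAsFactors = FALSE)
}

wide <- parents_to_wide(parents, ranks)
```

A single-row result like `wide` could then be column-bound to the matching bold_identify row, which keeps each original hit on one line.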

sckott commented 7 years ago

@dougwyu try it again after reinstalling, see new parameter wide

dougwyu commented 7 years ago

That works great! I have tried with one sequence and with 5 sequences. Exactly what I need (and I suspect many others). Thanks very much Scott.

sckott commented 7 years ago

great