Closed dougwyu closed 5 years ago
thanks, i'll take a look and get back to you
so there's no way to filter searching by taxonomic names, which we do inside bold_identify_parents, with any other attributes, e.g. Kingdom/Phylum, to help eliminate this problem where there are duplicate names in different major groups
The only thing I think can be done is to filter after getting the data back, e.g.
x <- "ACTTTATATTTTATTTTTGGGGCATGAGCGGGTATAGTAGGAACTTCCCTAAGAATTTTAATTCGAGCAGAGCTTGGACACCCAGGAGCATTAATTGGTAATGATCAAATTTATAATGTAATTGTTACCGCTCATGCTTTTGTTATAATTTTTTTTATAGTAATACCAATTATAATTGGAGGATTTGGAAATTGATTAGTACCCCTAATATTAGGGGCTCCTGATATAGCTTTTCCTCGAATAAATAATATAAGTTTTTGATTATTGCCCCCTTCTCTTACTCTTCTTTTAGCAAGTAGTTTAATTGAAAACGGGGCTGGAACAGGTTGAACAGTATATCCCCCGCTATCAGCAGGGATTGCTCATGCCGGAGCTTCAGTTGATTTAGCTATTTTTTCTCTTCATTTAGCAGGAGTTTCTTCAATTTTAGGAGCTGTAAATTTTATTACTACAGTAATTAATATACGATCAACAGGAATTACTTTTGATCGTATACCTTTATTTGTTTGAGCTGTAATTATTACTGCTGTTTTATTATTATTATCTCTCCCAGTTTTAGCAGGAGCTATTACTATACTATTAACAGATCGAAATTTTAATACATCATTTTTTGATCCTGCAGGAGGAGGAGACCCTATTTTATATCAACACTTA"
res <- bold_identify(x)[[1]]
Then if you use the wide option you get back a data.frame that has the same number of rows as the output of bold_identify
out <- bold_identify_parents(res, wide = TRUE)[[1]]
identical(NROW(res), NROW(out))
And we can then filter by major group
out[out$phylum == "Magnoliophyta", ]
thoughts?
i tried this, in this case, filtering by != "Magnoliophyta" (because my seq is an arthropod), ends up filtering out one of the best matches (similarity == 1, Ormosia, id == GMGLM411-13).
the solution for now will have to be something more complicated like filter(phylum != "Magnoliophyta" & similarity < 0.97), copying the arthropod parental names "to the left" in any rows that remain, and then deleting the extra columns
thanks for looking into this.
doug
Sorry about not having a better solution to this.
Would you mind sharing a brief example of this use case that I can put into the documentation examples for these functions?
Example where bold_identify_parents returns an incorrect parental taxon. Ormosia is a genus name in both plant and insect taxonomy.
library(dplyr) # attached for %>% and desc(), which are used unqualified below
testseq <- list(ormosia = "ACTTTATATTTTATTTTTGGGGCATGAGCGGGTATAGTAGGAACTTCCCTAAGAATTTTAATTCGAGCAGAGCTTGGACACCCAGGAGCATTAATTGGTAATGATCAAATTTATAATGTAATTGTTACCGCTCATGCTTTTGTTATAATTTTTTTTATAGTAATACCAATTATAATTGGAGGATTTGGAAATTGATTAGTACCCCTAATATTAGGGGCTCCTGATATAGCTTTTCCTCGAATAAATAATATAAGTTTTTGATTATTGCCCCCTTCTCTTACTCTTCTTTTAGCAAGTAGTTTAATTGAAAACGGGGCTGGAACAGGTTGAACAGTATATCCCCCGCTATCAGCAGGGATTGCTCATGCCGGAGCTTCAGTTGATTTAGCTATTTTTTCTCTTCATTTAGCAGGAGTTTCTTCAATTTTAGGAGCTGTAAATTTTATTACTACAGTAATTAATATACGATCAACAGGAATTACTTTTGATCGTATACCTTTATTTGTTTGAGCTGTAATTATTACTGCTGTTTTATTATTATTATCTCTCCCAGTTTTAGCAGGAGCTATTACTATACTATTAACAGATCGAAATTTTAATACATCATTTTTTGATCCTGCAGGAGGAGGAGACCCTATTTTATATCAACACTTA")
boldoutput_public <- bold_identify(testseq, db="COX1")
boldoutput_public_parents <- bold_identify_parents(boldoutput_public, wide = TRUE)
boldoutput_public_parents.df <- purrr::map_dfr(boldoutput_public_parents, data.frame) %>% dplyr::arrange(desc(similarity))
# map_dfr is equivalent to plyr::ldply, binding all list element outputs into a data frame.
View(boldoutput_public_parents.df)
# top row is a similarity = 100% hit for this OTU, but the bold_identify_parents returned both a plant and an insect set of parents. the correct set is insecta.
boldoutput_public_parents2.df <- dplyr::filter(boldoutput_public_parents.df, phylum == "Arthropoda")
# filtering out non-Arthropoda parents inadvertently filters out the top hit for the Ormosia.
Maybe there can be a function to switch parents? or to favour one higher taxon when calling bold_identify_parents()?
thanks much @dougwyu for this
what do you mean by "switch parents"?
"switch parents"
since the correct parent of an ambiguously named genus cannot be chosen automatically, maybe the user can be given a simple function to choose which set of parent taxa is correct for any given row. Thus, in my example, i would run a command to switch from the plant parents to the insect parents of Ormosia.
basically, i see this as swapping sets of columns within a user-chosen row.
sorry for the delay on this.
i would run a command to switch from the plant parents to the insect parents of Ormosia
do you already have a function/some code you run to do this?
sorry, i haven't written anything.
@dougwyu bold_identify_parents just gets parent information and joins it to the input data. So I don't see it as a good place to filter by certain names/ranks/ids.
Doesn't dplyr::filter(boldoutput_public_parents.df, phylum == "Arthropoda") do what one would want by filtering out plants from animals?
filtering out non-Arthropoda parents inadvertently filters out the top hit for the Ormosia.
but I assume the major use case is that you know you are working with animals or plants, and so that is fine?
@sckott the top hit (GMGLM411-13) returns two sets of columns after this command:
boldoutput_public_parents.df <- purrr::map_dfr(boldoutput_public_parents, data.frame) %>% dplyr::arrange(desc(similarity))
the first set of columns is Ormosia's parents as a plant (Magnoliophyta), and the second set of columns is Ormosia's parents as an arthropod (starting with column phylum_id.1)
so a filter() command loses the entire row, meaning that the correct parent information is also lost.
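A rough sketch of the "switch parents" idea, on a toy data frame rather than real BOLD output. The helper name switch_parents, the toy rows (including the made-up id HYPOTHETICAL-01), and the exact column names (phylum.1, class.1) are assumptions for illustration only; nothing here is part of the bold package, and the real wide output has more columns.

```r
# Toy stand-in for the wide output. The layout is an assumption modelled on
# the description above: a second set of parent columns suffixed with ".1".
hits <- data.frame(
  ID = c("GMGLM411-13", "HYPOTHETICAL-01"),
  similarity = c(1, 0.95),
  phylum = c("Magnoliophyta", "Arthropoda"),
  class = c("Magnoliopsida", "Insecta"),
  phylum.1 = c("Arthropoda", NA),
  class.1 = c("Insecta", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: where the first parent set is not in the wanted phylum
# but the second set is, copy the suffixed columns "to the left", then drop
# the suffixed columns.
switch_parents <- function(df, phylum_wanted, suffix = ".1") {
  alt_cols <- names(df)[endsWith(names(df), suffix)]
  base_cols <- substr(alt_cols, 1, nchar(alt_cols) - nchar(suffix))
  alt_phylum <- df[[paste0("phylum", suffix)]]
  # which() returns row positions and silently drops NA comparisons
  swap <- which(
    !is.na(alt_phylum) & alt_phylum == phylum_wanted &
      (is.na(df$phylum) | df$phylum != phylum_wanted)
  )
  if (length(swap) > 0) {
    df[swap, base_cols] <- df[swap, alt_cols]
  }
  df[, setdiff(names(df), alt_cols), drop = FALSE]
}

# Filtering first loses the top hit; switching first keeps it.
naive <- hits[hits$phylum == "Arthropoda", ]
fixed <- switch_parents(hits, "Arthropoda")
kept <- fixed[fixed$phylum == "Arthropoda", ]
```

Here naive keeps only the second row, while kept retains GMGLM411-13 with its arthropod parents in the main columns.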
Sorry for the long wait on this. @dougwyu
can you reinstall with remotes::install_github("ropensci/bold"), restart R and try again? this should help:
bold_identify_parents(x, tax_division = "Animals")
let me know
Works! thanks very much, Scott.
glad it works
When i bold_identify() and bold_identify_parents() the following sequence, i get two sets of parents, in two sets of columns. The reason is that the genus name for this sequence (Ormosia) is used in both plants and insects. This makes the identification ambiguous (although I know in this case that the sequence is from an insect). Not sure how to fix this except that the plant and insect Ormosia id numbers are different. thus, i'm not sure how bold_identify_parents could get this wrong?