ropensci / bold

Interface to the Bold Systems barcode webservice
https://docs.ropensci.org/bold

dual set of parents with genus Ormosia #50

Closed dougwyu closed 5 years ago

dougwyu commented 6 years ago

When I run bold_identify() and then bold_identify_parents() on the following sequence, I get two sets of parents, in two sets of columns. The reason is that the genus name for this sequence (Ormosia) is used in both plants and insects, which makes the identification ambiguous (although I know in this case that the sequence is from an insect). I'm not sure how to fix this, but the plant and insect Ormosia ID numbers are different, so I don't see how bold_identify_parents() could get this wrong.

GMGLM411_13 ACTTTATATTTTATTTTTGGGGCATGAGCGGGTATAGTAGGAACTTCCCTAAGAATTTTAATTCGAGCAGAGCTTGGACACCCAGGAGCATTAATTGGTAATGATCAAATTTATAATGTAATTGTTACCGCTCATGCTTTTGTTATAATTTTTTTTATAGTAATACCAATTATAATTGGAGGATTTGGAAATTGATTAGTACCCCTAATATTAGGGGCTCCTGATATAGCTTTTCCTCGAATAAATAATATAAGTTTTTGATTATTGCCCCCTTCTCTTACTCTTCTTTTAGCAAGTAGTTTAATTGAAAACGGGGCTGGAACAGGTTGAACAGTATATCCCCCGCTATCAGCAGGGATTGCTCATGCCGGAGCTTCAGTTGATTTAGCTATTTTTTCTCTTCATTTAGCAGGAGTTTCTTCAATTTTAGGAGCTGTAAATTTTATTACTACAGTAATTAATATACGATCAACAGGAATTACTTTTGATCGTATACCTTTATTTGTTTGAGCTGTAATTATTACTGCTGTTTTATTATTATTATCTCTCCCAGTTTTAGCAGGAGCTATTACTATACTATTAACAGATCGAAATTTTAATACATCATTTTTTGATCCTGCAGGAGGAGGAGACCCTATTTTATATCAACACTTA

sckott commented 6 years ago

thanks, i'll take a look and get back to you

sckott commented 6 years ago

So the taxonomic name search that we do inside bold_identify_parents can't be filtered by any other attributes, e.g. kingdom/phylum, which would help eliminate this problem where there are duplicate names in different major groups.

The only thing I think can be done is to filter after getting the data back, e.g.

x <- "ACTTTATATTTTATTTTTGGGGCATGAGCGGGTATAGTAGGAACTTCCCTAAGAATTTTAATTCGAGCAGAGCTTGGACACCCAGGAGCATTAATTGGTAATGATCAAATTTATAATGTAATTGTTACCGCTCATGCTTTTGTTATAATTTTTTTTATAGTAATACCAATTATAATTGGAGGATTTGGAAATTGATTAGTACCCCTAATATTAGGGGCTCCTGATATAGCTTTTCCTCGAATAAATAATATAAGTTTTTGATTATTGCCCCCTTCTCTTACTCTTCTTTTAGCAAGTAGTTTAATTGAAAACGGGGCTGGAACAGGTTGAACAGTATATCCCCCGCTATCAGCAGGGATTGCTCATGCCGGAGCTTCAGTTGATTTAGCTATTTTTTCTCTTCATTTAGCAGGAGTTTCTTCAATTTTAGGAGCTGTAAATTTTATTACTACAGTAATTAATATACGATCAACAGGAATTACTTTTGATCGTATACCTTTATTTGTTTGAGCTGTAATTATTACTGCTGTTTTATTATTATTATCTCTCCCAGTTTTAGCAGGAGCTATTACTATACTATTAACAGATCGAAATTTTAATACATCATTTTTTGATCCTGCAGGAGGAGGAGACCCTATTTTATATCAACACTTA"
res <- bold_identify(x)[[1]]

Then if you use the wide option you get back a data.frame with the same number of rows as the output of bold_identify

out <- bold_identify_parents(res, wide = TRUE)[[1]]
identical(NROW(res), NROW(out))

And we can then filter by major group

out[out$phylum == "Magnoliophyta", ]

thoughts?

dougwyu commented 6 years ago

I tried this. In this case, filtering by != "Magnoliophyta" (because my seq is an arthropod) ends up filtering out one of the best matches (similarity == 1, Ormosia, id == GMGLM411-13).

The solution for now will have to be something more complicated, like filter(phylum != "Magnoliophyta" & similarity < 0.97), copying the arthropod parental names "to the left" in any rows that remain, and then deleting the extra columns.

thanks for looking into this.

doug

sckott commented 6 years ago

Sorry about not having a better solution to this.

Would you mind sharing a brief example of this use case that I can put into the documentation examples for these functions?

dougwyu commented 6 years ago

Example where bold_identify_parents returns an incorrect parental taxon. Ormosia is a genus name in both plant and insect taxonomy.

testseq <- list(ormosia = "ACTTTATATTTTATTTTTGGGGCATGAGCGGGTATAGTAGGAACTTCCCTAAGAATTTTAATTCGAGCAGAGCTTGGACACCCAGGAGCATTAATTGGTAATGATCAAATTTATAATGTAATTGTTACCGCTCATGCTTTTGTTATAATTTTTTTTATAGTAATACCAATTATAATTGGAGGATTTGGAAATTGATTAGTACCCCTAATATTAGGGGCTCCTGATATAGCTTTTCCTCGAATAAATAATATAAGTTTTTGATTATTGCCCCCTTCTCTTACTCTTCTTTTAGCAAGTAGTTTAATTGAAAACGGGGCTGGAACAGGTTGAACAGTATATCCCCCGCTATCAGCAGGGATTGCTCATGCCGGAGCTTCAGTTGATTTAGCTATTTTTTCTCTTCATTTAGCAGGAGTTTCTTCAATTTTAGGAGCTGTAAATTTTATTACTACAGTAATTAATATACGATCAACAGGAATTACTTTTGATCGTATACCTTTATTTGTTTGAGCTGTAATTATTACTGCTGTTTTATTATTATTATCTCTCCCAGTTTTAGCAGGAGCTATTACTATACTATTAACAGATCGAAATTTTAATACATCATTTTTTGATCCTGCAGGAGGAGGAGACCCTATTTTATATCAACACTTA")

boldoutput_public <- bold_identify(testseq, db="COX1")

boldoutput_public_parents <- bold_identify_parents(boldoutput_public, wide = TRUE)

library(dplyr) # attaches %>%, arrange() and desc()

boldoutput_public_parents.df <- purrr::map_dfr(boldoutput_public_parents, data.frame) %>% dplyr::arrange(desc(similarity)) # map_dfr is equivalent to plyr::ldply, binding all list element outputs into a data frame.

View(boldoutput_public_parents.df) # top row is a similarity = 100% hit for this OTU, but bold_identify_parents returned both a plant and an insect set of parents. The correct set is Insecta.

boldoutput_public_parents2.df <- dplyr::filter(boldoutput_public_parents.df, phylum == "Arthropoda") # filtering out non-Arthropoda parents inadvertently filters out the top hit for the Ormosia.

Maybe there can be a function to switch parents? or to favour one higher taxon when calling bold_identify_parents()?

sckott commented 6 years ago

thanks much @dougwyu for this

What do you mean by "switch parents"?

dougwyu commented 6 years ago

"switch parents"

since the correct parent of an ambiguously named genus cannot be chosen automatically, maybe the user can be given a simple function to choose which set of parent taxa is correct for any given row. Thus, in my example, i would run a command to switch from the plant parents to the insect parents of Ormosia.

basically, i see this as swapping sets of columns within a user-chosen row.

sckott commented 6 years ago

sorry for the delay on this.

"i would run a command to switch from the plant parents to the insect parents of Ormosia"

do you already have a function/some code you run to do this?

dougwyu commented 6 years ago

sorry, i haven't written anything.

sckott commented 6 years ago

@dougwyu bold_identify_parents just gets parent information and joins it to the input data. So I don't see it as a good place to filter by certain names/ranks/ids.

Doesn't dplyr::filter(boldoutput_public_parents.df, phylum == "Arthropoda") do what one would want by filtering out plants from animals?

"filtering out non-Arthropoda parents inadvertently filters out the top hit for the Ormosia."

but I assume the major use case is that you know you are working with animals or plants, and so that is fine?

dougwyu commented 6 years ago

@sckott the top hit (GMGLM411-13) returns two sets of columns after this command:

boldoutput_public_parents.df <- purrr::map_dfr(boldoutput_public_parents, data.frame) %>% dplyr::arrange(desc(similarity))

the first set of columns is Ormosia's parents as a plant (Magnoliophyta), and the second set of columns is Ormosia's parents as an arthropod (starting with column phylum_id.1)

so a filter() command loses the entire row, meaning that the correct parent information is also lost.
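To make the lost row concrete, here is a minimal base R sketch on a made-up two-row table. The layout (a first set of rank columns plus a second set with a ".1" suffix) is assumed from the description above, the second ID is invented, and switch_parents() is a hypothetical helper, not part of bold.

```r
# Toy stand-in for the wide output after data.frame() has de-duplicated the
# column names. Assumed layout; real output has more ranks and columns.
hits <- data.frame(
  ID         = c("GMGLM411-13", "HYPOTHETICAL-01"),  # second ID is invented
  similarity = c(1, 0.98),
  phylum     = c("Magnoliophyta", "Arthropoda"),
  class      = c("Magnoliopsida", "Insecta"),
  phylum.1   = c("Arthropoda", NA),
  class.1    = c("Insecta", NA),
  stringsAsFactors = FALSE
)

# A plain filter on the first set of columns drops the top hit.
filtered <- hits[hits$phylum == "Arthropoda", ]

# Hypothetical helper: in rows where the second set of parents matches the
# wanted phylum, copy that set over the first one, then drop the ".1" columns.
switch_parents <- function(df, wanted_phylum, ranks = c("phylum", "class")) {
  alt  <- paste0(ranks, ".1")
  swap <- !is.na(df[["phylum.1"]]) & df[["phylum.1"]] == wanted_phylum
  if (any(swap)) {
    df[swap, ranks] <- df[swap, alt]
  }
  df[, setdiff(names(df), alt), drop = FALSE]
}

# Swapping first and filtering afterwards keeps the top hit.
fixed <- switch_parents(hits, "Arthropoda")
fixed <- fixed[fixed$phylum == "Arthropoda", ]
```

In this toy case, filtered no longer contains GMGLM411-13, while fixed keeps it with the arthropod parents in the first set of columns.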

sckott commented 5 years ago

Sorry for the long wait on this. @dougwyu

can you reinstall remotes::install_github("ropensci/bold"), restart R and try again? this should help:

bold_identify_parents(x, tax_division = "Animals")
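For reference, a sketch of how that call might slot into the workflow from earlier in this thread. The wrapper name identify_animal_parents() is made up for illustration, and combining tax_division with wide = TRUE is an assumption rather than something confirmed above. Running it needs the bold package and network access.

```r
# Hypothetical wrapper around the two bold calls used in this thread.
# seqs: a named list of sequences, like `testseq` above.
identify_animal_parents <- function(seqs, db = "COX1") {
  hits <- bold::bold_identify(seqs, db = db)
  # tax_division restricts the parent lookup to one major group, so a plant
  # genus with the same name should no longer be joined in.
  bold::bold_identify_parents(hits, wide = TRUE, tax_division = "Animals")
}
```

Calling identify_animal_parents(testseq) should then return one data.frame per input sequence.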

let me know

dougwyu commented 5 years ago

Works! thanks very much, Scott.

sckott commented 5 years ago

glad it works