ropensci / bold

Interface to the Bold Systems barcode webservice
https://docs.ropensci.org/bold

bold_seqspec function gives a mixed set of markers when using marker argument #49

Closed LunaSare closed 5 years ago

LunaSare commented 6 years ago
Session Info

```r
Session info --------------------------------------------------------------
 setting  value
 version  R version 3.4.1 (2017-06-30)
 system   x86_64, darwin15.6.0
 ui       AQUA
 language (EN)
 collate  en_US.UTF-8
 tz       America/New_York
 date     2017-10-16

Packages ------------------------------------------------------------------
 package       * version  date       source
 ade4            1.7-6    2017-03-23 CRAN (R 3.4.0)
 ape           * 4.1      2017-02-14 CRAN (R 3.4.0)
 base          * 3.4.1    2017-07-07 local
 BiocGenerics  * 0.22.1   2017-10-07 Bioconductor
 BiocInstaller * 1.26.1   2017-09-01 Bioconductor
 bold          * 0.5.0    2017-07-21 CRAN (R 3.4.1)
 colorspace      1.3-2    2016-12-14 CRAN (R 3.4.0)
 compiler        3.4.1    2017-07-07 local
 crayon          1.3.2    2016-06-28 CRAN (R 3.4.0)
 crul            0.3.8    2017-06-15 CRAN (R 3.4.0)
 curl            2.8.1    2017-07-21 CRAN (R 3.4.1)
 datasets      * 3.4.1    2017-07-07 local
 datelife      * 0.2.13   2017-10-13 Github (phylotastic/datelife@ae43b8f)
 devtools        1.13.3   2017-08-02 CRAN (R 3.4.1)
 digest          0.6.12   2017-01-27 CRAN (R 3.4.0)
 fastmatch       1.1-0    2017-01-28 CRAN (R 3.4.0)
 git2r           0.19.0   2017-07-19 CRAN (R 3.4.1)
 graphics      * 3.4.1    2017-07-07 local
 grDevices     * 3.4.1    2017-07-07 local
 grid            3.4.1    2017-07-07 local
 httr            1.3.1    2017-08-20 cran (@1.3.1)
 igraph          1.1.2    2017-07-21 CRAN (R 3.4.1)
 ips             0.0-7    2014-11-10 CRAN (R 3.4.0)
 IRanges       * 2.10.5   2017-10-08 Bioconductor
 jsonlite        1.5      2017-06-01 CRAN (R 3.4.0)
 lattice         0.20-35  2017-03-25 CRAN (R 3.4.1)
 magrittr        1.5      2014-11-22 CRAN (R 3.4.0)
 Matrix          1.2-10   2017-05-03 CRAN (R 3.4.1)
 memoise         1.1.0    2017-04-21 CRAN (R 3.4.0)
 methods       * 3.4.1    2017-07-07 local
 nlme            3.1-131  2017-02-06 CRAN (R 3.4.1)
 parallel      * 3.4.1    2017-07-07 local
 phangorn        2.2.0    2017-04-03 CRAN (R 3.4.0)
 pkgconfig       2.0.1    2017-03-21 CRAN (R 3.4.0)
 plyr            1.8.4    2016-06-08 CRAN (R 3.4.0)
 quadprog        1.5-5    2013-04-17 CRAN (R 3.4.0)
 R6              2.2.2    2017-06-17 CRAN (R 3.4.0)
 Rcpp            0.12.12  2017-07-15 CRAN (R 3.4.1)
 reshape         0.8.6    2016-10-21 CRAN (R 3.4.0)
 S4Vectors     * 0.14.7   2017-10-08 Bioconductor
 seqinr        * 3.4-5    2017-08-01 CRAN (R 3.4.1)
 stats         * 3.4.1    2017-07-07 local
 stats4        * 3.4.1    2017-07-07 local
 stringi         1.1.5    2017-04-07 CRAN (R 3.4.0)
 stringr         1.2.0    2017-02-18 CRAN (R 3.4.0)
 testthat      * 1.0.2    2016-04-23 CRAN (R 3.4.0)
 tools           3.4.1    2017-07-07 local
 triebeard       0.3.0    2016-08-04 CRAN (R 3.4.0)
 urltools        1.6.0    2016-10-17 CRAN (R 3.4.0)
 utils         * 3.4.1    2017-07-07 local
 withr           2.0.0    2017-07-28 CRAN (R 3.4.1)
 XML           * 3.98-1.9 2017-06-19 CRAN (R 3.4.1)
 xml2            1.1.1    2017-01-24 CRAN (R 3.4.0)
```

Hi! I've been using bold::bold_seqspec() to search for plant and fungal markers. The marker argument does not seem to work as expected: a query for a single marker returns records with several different markers:

library(bold) res <- bold_seqspec(taxon="Arabidopsis", marker="rbcL") res$markercode

[1] "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" "rbcL" [27] "rbcL" "rbcL" "matK" "rbcL" "matK" "rbcL" "rbcL" "rbcL" "matK" "rbcL" "rbcL" "matK" "rbcL" "matK" "matK" "rbcL"

Searching for these sequences with BLAST shows that they do correspond to the gene given in $markercode: the sequences at which(res$markercode=="rbcL") match rbcL in BLAST, and those at which(res$markercode=="matK") match matK.

With ITS2 we get a wide mixture of different markers:

    res2 <- bold_seqspec(taxon="Arabidopsis", marker=c("ITS2"))
    res2$markercode

     [1] "ITS2" "rbcLa" "rbcLa" "ITS2" "ITS2" "rbcLa" "rbcLa" "COI-5P" "ITS2" "rbcLa" "ITS2" "rbcLa" "ITS2" "ITS2" "ITS2" "ITS2" "ITS2" "ITS2" "ITS2" "ITS2"
    [21] "ITS2" "ITS2" "ITS2" "ITS2" "ITS2" "ITS2" "ITS2" "ITS2" "ITS2" "ITS2" "ITS2" "ITS2" "ITS2" "ITS2" "rbcLa" "matK" "ITS2" "rbcLa" "ITS2" "matK"
    [41] "rbcLa" "ITS2" "matK" "ITS2" "rbcLa" "ITS2" "matK" "rbcLa" "rbcLa" "ITS2"

    res3 <- bold_seqspec(taxon="Arabidopsis", marker=c("matK")) # the same problem
    res3$markercode

     [1] "rbcLa" "matK" "matK" "rbcLa" "rbcLa" "matK" "matK" "rbcLa" "matK" "rbcLa" "matK" "matK" "matK" "matK" "matK" "matK" "matK" "matK" "matK" "matK" "matK" "matK" "matK"
    [24] "matK" "matK" "matK" "matK" "matK" "matK" "rbcL" "matK" "matK" "matK" "rbcL" "rbcLa" "ITS2" "matK" "ITS2" "rbcLa" "matK" "matK" "rbcLa" "ITS2" "matK" "rbcLa" "ITS2"
    [47] "matK" "rbcL" "rbcL" "matK" "matK" "rbcL" "rbcL" "matK" "rbcLa" "matK"

sckott commented 6 years ago

Thanks @LunaSare, will have a look in the morning.

sckott commented 6 years ago

I've asked the BOLD folks about this.

sckott commented 6 years ago

@LunaSare, here is the response from BOLD:

> I have looked into this on our API and it appears to be acting as it should. To explain, the search parameters for this API are record based. These are specimen records with multiple markers sequenced, and thus the API returns all sequences for the records where at least one of the markers matches the search criteria. The Process ID is displayed in the first field in the FASTA header which indicates which sequences are related to each other as they are associated with the same specimen.

Does that make sense?
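
Given that the search is record based, one workaround is to filter the returned data frame on the client side. The sketch below is only an illustration: it uses a small mock data frame in place of a live bold_seqspec() call, the process IDs and sequences are invented, and keep_marker() is a hypothetical helper, not part of the bold package. It assumes the result has the processid and markercode columns seen in bold_seqspec() output.

```r
# Mock of a bold_seqspec() result. Values are illustrative, not real BOLD data.
res <- data.frame(
  processid   = c("PROC001-17", "PROC001-17", "PROC002-17", "PROC003-17", "PROC003-17"),
  markercode  = c("rbcL", "matK", "rbcL", "rbcL", "ITS2"),
  nucleotides = c("ATGTCACCAC", "ATGGAGAAAT", "ATGTCACCAA", "ATGTCACCAG", "TCGAAACCTG"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not in the bold package): keep only the rows whose
# markercode exactly matches one of the requested markers.
keep_marker <- function(x, marker) {
  x[!is.na(x$markercode) & x$markercode %in% marker, , drop = FALSE]
}

rbcl_only <- keep_marker(res, "rbcL")

# Markers sequenced for each specimen record. Records with more than one
# marker are the reason extra markers show up in a single-marker query.
markers_by_record <- lapply(split(res$markercode, res$processid), unique)
```

With a live result the same helper could be applied as keep_marker(res, "rbcL"), where res comes from bold_seqspec(taxon="Arabidopsis", marker="rbcL"). The match is exact, so codes such as "rbcL" and "rbcLa" are treated as different markers; pass both if both are wanted.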