ropensci / bold

Interface to the Bold Systems barcode webservice
https://docs.ropensci.org/bold

Error in bold_seq (str_split: subscript out of bounds) #52

Closed bastianegeter closed 5 years ago

bastianegeter commented 5 years ago
Session Info

```r
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252  LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] bold_0.8.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.19      rstudioapi_0.8    xml2_1.2.0        magrittr_1.5      usethis_1.4.0     devtools_2.0.1    pkgload_1.0.1
 [8] R6_2.3.0          rlang_0.1.6       stringr_1.2.0     plyr_1.8.4        tools_3.3.2       pkgbuild_1.0.2    sessioninfo_1.1.0
[15] cli_1.0.1         withr_2.1.2       remotes_2.0.1     assertthat_0.2.0  digest_0.6.18     rprojroot_1.3-2   httpcode_0.2.0
[22] crayon_1.3.4      processx_3.2.0    callr_3.0.0       base64enc_0.1-3   fs_1.2.6          ps_1.2.0          triebeard_0.3.0
[29] crul_0.6.0        curl_3.2          testthat_2.0.1    memoise_1.1.0     glue_1.3.0        stringi_1.1.6     urltools_1.7.1
[36] desc_1.2.0        backports_1.1.2   prettyunits_1.0.2 reshape_0.8.8     jsonlite_1.5
```

Hi, I am enjoying playing with this package. Thanks for developing it. When trying this command

bold_arth<-bold_seq(taxon = "Arthropoda")

I get the error

Error in str_split(str_replace(temp[[1]], "\n", "<<<"), "<<<")[[1]][[2]] : 
  subscript out of bounds

This seems to be a different issue to this resolved issue regarding long vectors with bold_seqspec. I see Arthropoda contains a lot of sequences, but it would still be great to be able to download everything in one go.

Any help would be appreciated.

sckott commented 5 years ago

thanks for the report @bastianegeter - having a look

sckott commented 5 years ago

The problem is that their server is timing out, and they don't signal the error in the HTTP response, so our normal HTTP response processing can't tell that anything went wrong (that's something I can ask them to fix).

So we end up with some fasta data, but towards the end there's a bunch of html text that's essentially an error web page.

Timing out here basically means their server has some default setting, or some setting they've set, to timeout after a certain amount of time, which is related to the large size of the request.

If they do fix this on their side, that would probably solve this, but we can't count on that to happen.

Short of them fixing on their side, is there any clear way you could chop up your request into smaller chunks?
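
The mixed response described above (FASTA records followed by an HTML error page) could be handled roughly as in the sketch below. This is only an illustration, not the code in the bold package: `trim_partial_fasta` is a hypothetical helper, and it assumes that a `<` character never appears in a valid FASTA header or sequence, so the first `<` marks the start of the error page. Because the record just before the cut may be incomplete, the sketch conservatively drops it.

```r
# Hypothetical helper, not part of the bold package: keep only the FASTA
# records that precede an HTML error page in a raw response body.
trim_partial_fasta <- function(txt) {
  # Position of the first "<"; -1 when the body contains none.
  # Assumption: "<" never occurs inside a valid FASTA header or sequence.
  html_start <- regexpr("<", txt, fixed = TRUE)[1]
  truncated <- html_start > 0
  if (truncated) {
    txt <- substr(txt, 1, html_start - 1)
  }
  # Split into records at each ">" header marker and drop empty pieces.
  records <- strsplit(txt, ">", fixed = TRUE)[[1]]
  records <- trimws(records[nzchar(trimws(records))])
  # The record just before the cut may be incomplete, so drop it.
  if (truncated && length(records) > 0) {
    records <- records[-length(records)]
  }
  # Restore the ">" marker (guarding against the zero-record case).
  if (length(records) > 0) {
    records <- paste0(">", records)
  }
  list(records = records, truncated = truncated)
}
```

The `truncated` flag lets the caller decide whether to warn or to retry with a smaller request.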

bastianegeter commented 5 years ago

Thanks for looking into this. I tested creating a vector of subgroups of Arthropoda and running a loop to return a list for each. This almost works, but the subgroup Insecta also fails (same error), which means doing something similar for Insecta and combining all the lists into a fasta afterwards. Something like:

#list of Arthropoda - Insecta excluded, because it failed
#Tested
Arthropoda<-c("Arachnida", "Branchiopoda", "Cephalocarida", "Chilopoda", "Collembola","Diplopoda","Diplura","Entognatha","Hexanauplia","Ichthyostraca","Malacostraca","Merostomata","Oligostraca_class_incertae_sedis","Ostracoda","Pauropoda","Pentastomida","Protura","Pycnogonida","Remipedia","Symphyla")

fileList<-list()
for (x in Arthropoda){
  fileList[[x]]<-bold_seq(taxon = x)
}
#do similar for Insecta, followed by combining all lists into fasta, need to first check whether subgroups within Insecta also fail
#Not tested
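
For the "combine all lists into a fasta" step, a minimal sketch might look like the following. It assumes each record is a list with `id`, `name`, `gene` and `sequence` elements and that the per-taxon results are held in a list with one element per taxon, as in the loop above. The function names and the header layout (`>id|name|gene`) are illustrative choices, not something the bold package prescribes.

```r
# Hypothetical helpers, not part of the bold package.

# Turn a list of records (each with id, name, gene, sequence) into FASTA
# entries, one string per record, with an assumed ">id|name|gene" header.
records_to_fasta <- function(records) {
  entries <- vapply(records, function(r) {
    header <- paste0(">", r$id, "|", r$name, "|", r$gene)
    paste(header, r$sequence, sep = "\n")
  }, character(1))
  unname(entries)
}

# The per-taxon list holds one list of records per taxon, so flatten it by
# one level before formatting.
combine_to_fasta <- function(file_list) {
  all_records <- unlist(file_list, recursive = FALSE, use.names = FALSE)
  records_to_fasta(all_records)
}

# Possible usage (file name is an example):
# writeLines(combine_to_fasta(fileList), "arthropoda.fasta")
```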

With some effort I expect this to work, but subgroup names might change every so often, requiring a check of BOLD taxonomy every time. It would be nice if there was a function (or bold_seq option) that could return a vector of daughter taxa of the input taxon, perform bold_seq on each element of that vector, for any failed elements look up the daughter taxa of that element and perform bold_seq on each element of that new vector. The output might be a list of lists that could be combined into a single fasta.

sckott commented 5 years ago

Ideally BOLD would allow us to get children. Unfortunately, BOLD doesn't provide a way to get child taxa given a taxon name or id, but we can use taxize::children with other taxonomic data sources to get children. I probably don't want to import taxize here as taxize imports bold. But can give examples at least. I kinda feel like a function would hide too much. That is, I think it makes sense for users to get a vector of taxa themselves and then pass to bold_seq so they know what taxa are being queried, which ones are failing, etc.

i'll experiment with some egs
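
The "try a taxon, fall back to its daughter taxa on failure" idea from earlier in the thread could be sketched along these lines. Nothing here is part of the bold package: `fetch_with_fallback` is a hypothetical name, and both the download function and the child lookup are passed in as arguments, so the user stays in control of which taxa are queried. In practice `fetch` might be `function(x) bold::bold_seq(taxon = x)` and `get_children` a small wrapper around `taxize::children()` that extracts the child names, but those wirings are assumptions and are not shown.

```r
# Hypothetical sketch, not part of the bold package.
# fetch:        function(taxon) returning a list of records
# get_children: function(taxon) returning a character vector of child taxa
# max_depth:    guard against descending indefinitely
fetch_with_fallback <- function(taxon, fetch, get_children, max_depth = 5) {
  # Treat both errors and warnings (such as a timeout warning) as failure.
  result <- tryCatch(fetch(taxon),
                     error = function(e) NULL,
                     warning = function(w) NULL)
  if (!is.null(result)) {
    return(stats::setNames(list(result), taxon))
  }
  # On failure, look up child taxa unless the depth limit is reached.
  children <- if (max_depth > 0) get_children(taxon) else character(0)
  if (length(children) == 0) {
    warning("could not retrieve ", taxon, call. = FALSE)
    return(list())
  }
  # Recurse into each child and flatten into one list keyed by taxon name.
  pieces <- lapply(children, fetch_with_fallback, fetch = fetch,
                   get_children = get_children, max_depth = max_depth - 1)
  do.call(c, c(list(list()), pieces))
}
```

The result is a named list whose names show exactly which taxa were downloaded in full. Taxa that failed and had no children to fall back on are reported with a warning.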

sckott commented 5 years ago

@bastianegeter can you reinstall remotes::install_github("ropensci/bold") and try again? it warns now when there is a server timeout and points to documentation for help - and trims out the sequences before the error message and gives those back, so it's a partial set of sequences.
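
When looping over many taxa, it may help to record which calls returned partial output instead of relying on warnings printed at the end. A minimal sketch, assuming that any warning raised by the download function means the result may be incomplete (`fetch_flagged` is a hypothetical wrapper, and `fetch` stands in for something like `function(x) bold::bold_seq(taxon = x)`):

```r
# Hypothetical wrapper: run fetch(taxon), note whether it raised a warning,
# and still return whatever (possibly partial) output it produced.
fetch_flagged <- function(taxon, fetch) {
  warned <- FALSE
  result <- withCallingHandlers(
    fetch(taxon),
    warning = function(w) {
      warned <<- TRUE                 # remember that a warning occurred
      invokeRestart("muffleWarning")  # continue so the output is kept
    }
  )
  list(taxon = taxon, partial = warned, records = result)
}
```

Unlike `tryCatch()`, `withCallingHandlers()` lets the call finish after the warning, so the partial set of sequences is not thrown away.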

bastianegeter commented 5 years ago

Works as expected

> bold_seq(taxon = "Arthropoda")->Animals_from_bold_arth 
Warning message:
In bold_seq(taxon = "Arthropoda") :
  the request timed out, see 'If a request times out'
returning partial output

> length(Animals_from_bold_arth)
[1] 5043

just a note that the partial output appears to be based on $id, so it is scattered across many taxa

> str(head(Animals_from_bold_arth))
List of 6
 $ :List of 4
  ..$ id      : chr "ABLCV291-09"
  ..$ name    : chr "Lepidoptera"
  ..$ gene    : chr "ABLCV291-09"
  ..$ sequence: chr "AACTTTATATTTTATTTTTGGCATTTGATCCGGATTAATTGGAACTTCTTTAAGTTTATTAATTCGAGCTGAATTAGGAACTCCTGGGTCTCTTATTGGAGATGATCAAATTTATAATACTATTGTA"| __truncated__
 $ :List of 4
  ..$ id      : chr "ACGAZ1178-12"
  ..$ name    : chr "Hymenoptera"
  ..$ gene    : chr "ACGAZ1178-12"
  ..$ sequence: chr "--AATNNTATATTTTCTATTTGGTTTANGNTCAGGAATATTAGGATTTTCAATAAGTTTAATTATTCGATTAGAATTAGGAACTCCAAAAATATTAATTGGTAATGATCAAATTTATAATAGAATTG"| __truncated__
 $ :List of 4
  ..$ id      : chr "AGAKJ1078-17"
  ..$ name    : chr "Malloewia abdominalis"
  ..$ gene    : chr "AGAKJ1078-17"
  ..$ sequence: chr "AACATTATATTTTATATTTGGAGCATGAGCTGGAATAGTCGGAACATCATTAAGAATTTTAATTCGAGCTGAATTAGGACACCCAGGAGCTCTAATTGGAGATGATCAAATTTATAATGTAATTGTT"| __truncated__
 $ :List of 4
  ..$ id      : chr "ASIMA240-09"
  ..$ name    : chr "Camponotus MG028"
  ..$ gene    : chr "ASIMA240-09"
  ..$ sequence: chr "---ATTTTATATTTTATTTTTGCAATTTGATCAGGACTAATTGGTTCTTCAATAAGAATAATTATCCGATTAGAATTAGGATCCCCTAATTCATTAATCCTCAATGACCAAACTTTTAACTCCATTG"| __truncated__
 $ :List of 4
  ..$ id      : chr "BBLOE319-11"
  ..$ name    : chr "Chionodes gilvomaculella"
  ..$ gene    : chr "BBLOE319-11"
  ..$ sequence: chr "AACTTTATATTTTATTTTTGGTATTTGAGCAGGTATAGTAGGAACATCACTAAGACTTCTAATTCGAGCAGAATTAGGTAATCCAGGATCTCTAATTGGCGATGATCAAATTTATAATACTATTGTA"| __truncated__
 $ :List of 4
  ..$ id      : chr "BBLPA389-10"
  ..$ name    : chr "Pararctia yarrowii"
  ..$ gene    : chr "BBLPA389-10"
  ..$ sequence: chr "AACATTATATTTTATTTTTGGAGTTTGAGCAGGAATAGTAGGATCTTCTTTAAGATTATTAATTCGAGCTGAATTAGGAAATCCTGGCTCTTTTATTGGAGATGATCAAATTTATAATACTATTGTA"| __truncated__

sckott commented 5 years ago

just a note that the partial output appears to be based on $id, so it is scattered across many taxa

not sure I understand the issue. can you clarify?

bastianegeter commented 5 years ago

Not really an issue. I guess I'm just saying that the partial retrieval is drawn from many different places in the taxonomy, rather than running through each daughter taxon in turn until the request times out. I had the thought that if entries were being retrieved taxon by taxon, then one could identify which taxa had been completely downloaded, and therefore which ones would still need to be downloaded.
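
To see how a partial result is spread across taxa, one could tally the `name` field of the returned records. This is only a rough sketch: `count_by_name` is a hypothetical helper, and since `name` mixes ranks (orders such as "Lepidoptera" alongside species names), the counts are indicative at best.

```r
# Hypothetical helper: count records per value of the `name` field,
# most frequent first.
count_by_name <- function(records) {
  taxa <- vapply(records, function(r) r$name, character(1))
  sort(table(taxa), decreasing = TRUE)
}

# Possible usage on the object from this thread:
# head(count_by_name(Animals_from_bold_arth))
```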

sckott commented 5 years ago

I see. That's too bad. Must be how they have it setup on their end.

sckott commented 5 years ago

Do you think you now have a workflow that is workable? Or do we need to make more changes?

sckott commented 5 years ago

assuming this is good to go