Closed bastianegeter closed 5 years ago
thanks for the report @bastianegeter - having a look
The problem is that their server is timing out, and the HTTP response they send back doesn't indicate an error, so our normal HTTP response processing never sees one (that's something I can ask them to fix).
So we end up with some fasta data, but towards the end there's a bunch of html text that's essentially an error web page.
Timing out here basically means their server has a timeout setting, either a default or one they've configured, that cuts the response off after a certain amount of time. Your request hits it because it is so large.
If they do fix this on their side, that would probably solve this, but we can't count on that to happen.
Short of them fixing on their side, is there any clear way you could chop up your request into smaller chunks?
Thanks for looking into this. I tested creating a vector of subgroups of Arthropoda and running a loop to return a list for each. This almost works, but the subgroup Insecta also fails (same error), which means doing something similar for Insecta and combining all the lists into a fasta afterwards. Something like:
#list of Arthropoda - Insecta excluded, because it failed
#Tested
Arthropoda<-c("Arachnida", "Branchiopoda", "Cephalocarida", "Chilopoda", "Collembola","Diplopoda","Diplura","Entognatha","Hexanauplia","Ichthyostraca","Malacostraca","Merostomata","Oligostraca_class_incertae_sedis","Ostracoda","Pauropoda","Pentastomida","Protura","Pycnogonida","Remipedia","Symphyla")
fileList<-list()
for (x in Arthropoda){
fileList[[x]]<-bold_seq(taxon = x)
}
#do similar for Insecta, followed by combining all lists into fasta, need to first check whether subgroups within Insecta also fail
#Not tested
With some effort I expect this to work, but subgroup names might change every so often, requiring a check of BOLD taxonomy every time. It would be nice if there was a function (or bold_seq option) that could return a vector of daughter taxa of the input taxon, perform bold_seq on each element of that vector, and, for any failed elements, look up the daughter taxa of that element and perform bold_seq on each element of that new vector. The output might be a list of lists that could be combined into a single fasta.
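The "combine all the lists into a fasta" step could look roughly like the sketch below. It assumes each record has the id, name and sequence fields that bold_seq returns elsewhere in this thread. The helper name write_bold_fasta and the ">id|name" header layout are made up for illustration and are not part of the bold package.

```r
# Hypothetical helper (not part of bold): flatten a list of per-taxon
# bold_seq() results and write them to a single FASTA file.
write_bold_fasta <- function(results, path) {
  # Flatten one level: list of taxa -> list of records
  records <- unlist(results, recursive = FALSE, use.names = FALSE)
  if (length(records) == 0) {
    writeLines(character(0), path)
    return(invisible(0L))
  }
  # Drop records that were returned for more than one query
  ids <- vapply(records, function(r) r$id, character(1))
  records <- records[!duplicated(ids)]
  # Header layout ">id|name" is an arbitrary choice for this sketch
  headers <- vapply(records, function(r) paste0(">", r$id, "|", r$name),
                    character(1))
  seqs <- vapply(records, function(r) r$sequence, character(1))
  writeLines(paste(headers, seqs, sep = "\n"), path)
  invisible(length(records))
}

# Usage with the loop above (needs network access, so left commented out):
# write_bold_fasta(fileList, "arthropoda.fasta")
```

The function returns the number of records written, which makes it easy to check how many duplicates were dropped.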
Ideally BOLD would allow us to get children. Unfortunately, BOLD doesn't provide a way to get child taxa given a taxon name or id, but we can use taxize::children with other taxonomic data sources to get children. I probably don't want to import taxize here as taxize imports bold, but I can give examples at least. I kinda feel like a function would hide too much. That is, I think it makes sense for users to get a vector of taxa themselves and then pass it to bold_seq so they know what taxa are being queried, which ones are failing, etc.
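For anyone who does want the "descend into daughter taxa on failure" behaviour, here is a minimal user-side sketch. It is not part of bold. The name fetch_with_descent and its arguments are hypothetical. The caller supplies fetch (for example a wrapper around bold_seq) and get_children (a function returning a character vector of child taxon names, built on whatever taxonomic source you trust). The sketch treats an error, or a warning containing "timed out", as a failed query.

```r
# Hypothetical helper: query a taxon; if the query errors or warns about a
# timeout, query each child taxon instead, down to max_depth levels.
fetch_with_descent <- function(taxon, fetch, get_children, max_depth = 3) {
  failed <- FALSE
  records <- tryCatch(
    withCallingHandlers(
      fetch(taxon),
      warning = function(w) {
        # Assumes the timeout warning text contains "timed out"
        if (grepl("timed out", conditionMessage(w), fixed = TRUE)) {
          failed <<- TRUE
          invokeRestart("muffleWarning")
        }
      }
    ),
    error = function(e) {
      failed <<- TRUE
      list()
    }
  )
  kids <- if (failed && max_depth > 0) get_children(taxon) else character(0)
  if (!failed || length(kids) == 0) {
    # Keep whatever came back, flagged as complete or not
    attr(records, "complete") <- !failed
    return(stats::setNames(list(records), taxon))
  }
  # Failed with children available: discard the partial set and descend
  do.call(c, lapply(kids, fetch_with_descent, fetch = fetch,
                    get_children = get_children, max_depth = max_depth - 1))
}

# Possible usage (needs network; my_children is a user-written function):
# out <- fetch_with_descent("Arthropoda",
#                           fetch = function(x) bold::bold_seq(taxon = x),
#                           get_children = my_children)
```

The result is a list named by the taxa that were actually queried, so you can still see what was requested and which elements are flagged incomplete.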
i'll experiment with some egs
@bastianegeter can you reinstall with remotes::install_github("ropensci/bold") and try again? It now warns when there is a server timeout and points to documentation for help. It also keeps the sequences that arrived before the error message and gives those back, so you get a partial set of sequences.
Works as expected
> bold_seq(taxon = "Arthropoda")->Animals_from_bold_arth
Warning message:
In bold_seq(taxon = "Arthropoda") :
the request timed out, see 'If a request times out'
returning partial output
> length(Animals_from_bold_arth)
[1] 5043
> str(head(Animals_from_bold_arth))
List of 6
$ :List of 4
..$ id : chr "ABLCV291-09"
..$ name : chr "Lepidoptera"
..$ gene : chr "ABLCV291-09"
..$ sequence: chr "AACTTTATATTTTATTTTTGGCATTTGATCCGGATTAATTGGAACTTCTTTAAGTTTATTAATTCGAGCTGAATTAGGAACTCCTGGGTCTCTTATTGGAGATGATCAAATTTATAATACTATTGTA"| __truncated__
$ :List of 4
..$ id : chr "ACGAZ1178-12"
..$ name : chr "Hymenoptera"
..$ gene : chr "ACGAZ1178-12"
..$ sequence: chr "--AATNNTATATTTTCTATTTGGTTTANGNTCAGGAATATTAGGATTTTCAATAAGTTTAATTATTCGATTAGAATTAGGAACTCCAAAAATATTAATTGGTAATGATCAAATTTATAATAGAATTG"| __truncated__
$ :List of 4
..$ id : chr "AGAKJ1078-17"
..$ name : chr "Malloewia abdominalis"
..$ gene : chr "AGAKJ1078-17"
..$ sequence: chr "AACATTATATTTTATATTTGGAGCATGAGCTGGAATAGTCGGAACATCATTAAGAATTTTAATTCGAGCTGAATTAGGACACCCAGGAGCTCTAATTGGAGATGATCAAATTTATAATGTAATTGTT"| __truncated__
$ :List of 4
..$ id : chr "ASIMA240-09"
..$ name : chr "Camponotus MG028"
..$ gene : chr "ASIMA240-09"
..$ sequence: chr "---ATTTTATATTTTATTTTTGCAATTTGATCAGGACTAATTGGTTCTTCAATAAGAATAATTATCCGATTAGAATTAGGATCCCCTAATTCATTAATCCTCAATGACCAAACTTTTAACTCCATTG"| __truncated__
$ :List of 4
..$ id : chr "BBLOE319-11"
..$ name : chr "Chionodes gilvomaculella"
..$ gene : chr "BBLOE319-11"
..$ sequence: chr "AACTTTATATTTTATTTTTGGTATTTGAGCAGGTATAGTAGGAACATCACTAAGACTTCTAATTCGAGCAGAATTAGGTAATCCAGGATCTCTAATTGGCGATGATCAAATTTATAATACTATTGTA"| __truncated__
$ :List of 4
..$ id : chr "BBLPA389-10"
..$ name : chr "Pararctia yarrowii"
..$ gene : chr "BBLPA389-10"
..$ sequence: chr "AACATTATATTTTATTTTTGGAGTTTGAGCAGGAATAGTAGGATCTTCTTTAAGATTATTAATTCGAGCTGAATTAGGAAATCCTGGCTCTTTTATTGGAGATGATCAAATTTATAATACTATTGTA"| __truncated__
just a note that the partial output appears to be based on $id, so it is scattered across many taxa
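A quick way to see this scatter is to tabulate the name field of the returned records. A small sketch, assuming the record structure shown in the str() output above; count_by_name is a made-up helper name:

```r
# Hypothetical helper: count how many records were returned per taxon name
count_by_name <- function(records) {
  nms <- vapply(records, function(r) r$name, character(1))
  sort(table(nms), decreasing = TRUE)
}

# e.g. head(count_by_name(Animals_from_bold_arth))
```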
not sure I understand the issue. can you clarify?
Not really an issue. I guess I'm just saying that the partial retrieval is drawn from many different locations in the taxonomy, rather than running through each daughter taxon in turn until the request times out. I had the thought that if entries were being retrieved taxon by taxon, then one could identify which taxa had been completely downloaded, and so which ones would still need to be downloaded.
I see. That's too bad. Must be how they have it setup on their end.
Do you think you now have a workflow that is workable? Or do we need to make more changes?
assuming this is good to go
Session Info
```r
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] bold_0.8.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.19      rstudioapi_0.8    xml2_1.2.0        magrittr_1.5      usethis_1.4.0     devtools_2.0.1    pkgload_1.0.1
 [8] R6_2.3.0          rlang_0.1.6       stringr_1.2.0     plyr_1.8.4        tools_3.3.2       pkgbuild_1.0.2    sessioninfo_1.1.0
[15] cli_1.0.1         withr_2.1.2       remotes_2.0.1     assertthat_0.2.0  digest_0.6.18     rprojroot_1.3-2   httpcode_0.2.0
[22] crayon_1.3.4      processx_3.2.0    callr_3.0.0       base64enc_0.1-3   fs_1.2.6          ps_1.2.0          triebeard_0.3.0
[29] crul_0.6.0        curl_3.2          testthat_2.0.1    memoise_1.1.0     glue_1.3.0        stringi_1.1.6     urltools_1.7.1
[36] desc_1.2.0        backports_1.1.2   prettyunits_1.0.2 reshape_0.8.8     jsonlite_1.5
```
Hi, I am enjoying playing with this package. Thanks for developing it. When trying this command
bold_arth<-bold_seq(taxon = "Arthropoda")
I get the error
This seems to be a different issue to this resolved issue regarding long vectors with bold_seqspec. I see Arthropoda contains a lot of sequences, but it would still be great to be able to download everything in one go.
Any help would be appreciated.