ropensci / bold

Interface to the Bold Systems barcode webservice
https://docs.ropensci.org/bold

bold_seqspec() times out for small result set #76

Closed pieterprovoost closed 9 months ago

pieterprovoost commented 3 years ago

bold_seqspec() times out while fetching COI sequences for a family that has only a handful of records. This happens only when marker is specified; without marker, the call returns NA.

bold_seqspec("Acanthaspidiidae", marker = "COI-5P", verbose = TRUE)
* Found bundle for host v4.boldsystems.org: 0x7f8e42cba7c0 [can multiplex]
* Re-using existing connection! (#17) with host v4.boldsystems.org
* Connected to v4.boldsystems.org (131.104.63.11) port 443 (#17)
* Using Stream ID: 7 (easy handle 0x7f8e4e59be00)
> GET /index.php/API_Public/combined?taxon=Acanthaspidiidae&marker=COI-5P&combined_download=tsv HTTP/2
Host: v4.boldsystems.org
User-Agent: libcurl/7.64.1 r-curl/4.3 crul/1.1.0
Accept-Encoding: gzip, deflate
Accept: application/json, text/xml, application/xml, */*

< HTTP/2 200 
< server: nginx
< date: Tue, 02 Mar 2021 23:32:30 GMT
< content-type: application/x-download
< x-powered-by: PHP/5.3.15
< content-disposition: attachment; filename=bold_data.txt
* Added cookie https="on" for domain v4.boldsystems.org, path /, expire 1614731557
< set-cookie: https=on;Path=/;Max-Age=3600;httponly;SameSite=Lax
< 
* Connection #17 to host v4.boldsystems.org left intact
Error in bold_seqspec("Acanthaspidiidae", marker = c("COI-5P"), verbose = TRUE) : 
  BOLD servers returned an error - we're not sure what happened
 try a smaller query - or open an issue and we'll try to help

The taxon name exists in the system:

bold_tax_name("Acanthaspidiidae")
   taxid            taxon tax_rank tax_division parentid parentname specimenrecords
1 175050 Acanthaspidiidae   family     Animalia      330    Isopoda              19
          representitive_image.image representitive_image.apectratio            input
1 NOISO/ZMBN_106486_1+1493040994.jpg                           1.499 Acanthaspidiidae

Upon closer inspection it seems that the API returns a few thousand unrelated records before appending a bunch of HTML containing the error message HTTP_Request2_MessageException: Request timed out due to default_socket_timeout php.ini setting.
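A minimal sketch of how that trailing error could be detected, assuming the raw response body is available as a single string (bold functions take a `response = TRUE` argument that returns the HTTP response object instead of parsed data, which is one possible way to get it). `has_bold_timeout()` is a hypothetical helper, not part of the bold package, and the strings below are illustrative rather than real API output.

```r
# Hypothetical helper (not part of bold): TRUE when the response text
# contains the PHP timeout message quoted above.
has_bold_timeout <- function(body) {
  grepl("HTTP_Request2_MessageException", body, fixed = TRUE)
}

# Illustrative bodies only, not real API output.
ok_body  <- "processid\tsampleid\nABC123\tS1\n"
bad_body <- paste0(
  ok_body,
  "<html>HTTP_Request2_MessageException: Request timed out</html>"
)

has_bold_timeout(ok_body)
has_bold_timeout(bad_body)
```

A check like this only flags the truncated download; it does not recover the missing records.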

I realize this is an API issue and not a package issue, but I was hoping someone knows a workaround.

Session Info

```r
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] glue_1.4.2 dplyr_1.0.4 stringr_1.4.0 rvest_0.3.6 xml2_1.3.2 taxize_0.9.99 bold_1.1.0

loaded via a namespace (and not attached):
 [1] httr_1.4.2 tidyr_1.1.2 jsonlite_1.7.2 foreach_1.5.1 shiny_1.6.0
 [6] assertthat_0.2.1 triebeard_0.3.0 urltools_1.7.3 selectr_0.4-2 yaml_2.2.1
[11] pillar_1.5.0 lattice_0.20-41 uuid_0.1-4 digest_0.6.27 promises_1.2.0.1
[16] colorspace_2.0-0 htmltools_0.5.1.1 httpuv_1.5.5 plyr_1.8.6 pkgconfig_2.0.3
[21] httpcode_0.3.0 bookdown_0.21 purrr_0.3.4 xtable_1.8-4 scales_1.1.1
[26] later_1.1.0.1 mapedit_0.6.0 tibble_3.1.0 generics_0.1.0 ggplot2_3.3.3
[31] ellipsis_0.3.1 cli_2.3.1 magrittr_2.0.1 crayon_1.4.1 mime_0.10
[36] evaluate_0.14 fansi_0.4.2 nlme_3.1-152 class_7.3-18 tools_4.0.2
[41] caspr_0.0.1 data.table_1.14.0 lifecycle_1.0.0 munsell_0.5.0 compiler_4.0.2
[46] e1071_1.7-4 rlang_0.4.10 classInt_0.4-3 units_0.7-0 grid_4.0.2
[51] conditionz_0.1.0 rstudioapi_0.13 iterators_1.0.13 robis_2.4.0 htmlwidgets_1.5.3
[56] crosstalk_1.1.1 rmarkdown_2.7 gtable_0.3.0 codetools_0.2-18 DBI_1.1.1
[61] reshape_0.8.8 curl_4.3 R6_2.5.0 zoo_1.8-8 knitr_1.31
[66] fastmap_1.1.0 utf8_1.1.4 KernSmooth_2.23-18 ape_5.4-1 stringi_1.5.3
[71] rmdformats_1.0.1 parallel_4.0.2 crul_1.1.0 Rcpp_1.0.6 vctrs_0.3.6
[76] sf_0.9-7 leaflet_2.0.4.1 tidyselect_1.1.0 xfun_0.21
```

sckott commented 3 years ago

Thanks for the issue, @pieterprovoost. This function isn't working for me with any request. It's possible this is a temporary problem, but it's hard to know. BOLD has an unreliable backend setup, and they do not respond to questions at all, AFAICT, so we're left to deal with it on our end by ourselves.

Previously, when this has happened, it was with a query that returns a lot of results, and I've told users to split the query into smaller ones: https://github.com/ropensci/bold/issues/29
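For reference, splitting a large query could look roughly like the sketch below. `fetch_in_chunks()` and `fake_fetch()` are hypothetical names, not part of bold; the fetch function is passed in as an argument so the chunking logic can be tried without touching the BOLD servers. It assumes every successful chunk returns a data frame with the same columns.

```r
# Hypothetical helper: call `fetch_fun` on small groups of taxa and
# row-bind the data frames that come back.
fetch_in_chunks <- function(taxa, fetch_fun, chunk_size = 5) {
  # Group positions 1..n into consecutive chunks of `chunk_size`.
  groups <- split(taxa, ceiling(seq_along(taxa) / chunk_size))
  results <- lapply(groups, function(g) {
    tryCatch(
      fetch_fun(g),
      error = function(e) {
        warning("Chunk failed: ", paste(g, collapse = ", "))
        NULL
      }
    )
  })
  # Drop failed chunks and non-data-frame results such as NA.
  results <- Filter(is.data.frame, unname(results))
  if (length(results) == 0) return(NULL)
  do.call(rbind, results)
}

# Stand-in for a real request, for illustration only.
fake_fetch <- function(taxa) data.frame(taxon = taxa, stringsAsFactors = FALSE)
chunked <- fetch_in_chunks(c("a", "b", "c"), fake_fetch, chunk_size = 2)
```

With the real package, `fetch_fun` could be something like `function(g) bold::bold_seqspec(taxon = g, marker = "COI-5P")`. That does not help with the query in this issue, which is already small.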

However, if the example query should not give a lot of results, then I'm not sure what else to try. They do have data dumps at http://v4.boldsystems.org/index.php/datarelease, but it seems the latest one is from 2015.

salix-d commented 2 years ago

This happens because there are no public records in the database.

The bold_tax_name() function only returns taxonomic records, so all it says is that there are 19 specimens from this family in their taxonomy browser.

If you use bold_stats(), which checks the records in their public database:

bold_stats("Acanthaspidiidae")

you get:

$total_records
[1] 0

$records_with_species_name
[1] 0

$bins
$bins$count
[1] 0

$countries
$countries$count
[1] 0

$depositories
$depositories$count
[1] 0

$order
$order$count
[1] 0

$family
$family$count
[1] 0

$genus
$genus$count
[1] 0

$species
$species$count
[1] 0

That's the reason it returns NA without the marker.

The problem when the marker is specified arises because their API treats the missing taxon as if no taxon had been given, so it searches only by marker and tries to return ALL the COI-5P records, hence the few thousand unrelated records before it crashes.

I think a solution could be to check first that the queried taxa have public records. It could either be a warning in the docs or be integrated into the function as a (possibly optional) pre-check to avoid these situations.
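A rough sketch of what such a pre-check might look like, relying on the `total_records` field that bold_stats() returns in the output above. `seqspec_if_public()` is a hypothetical wrapper, not an existing bold function; the stats and fetch functions are arguments so the logic can be exercised with stubs, and the defaults are only evaluated when no stub is supplied.

```r
# Hypothetical wrapper (not part of bold): only request sequences when
# the public database reports at least one record for the taxon.
seqspec_if_public <- function(taxon, marker = NULL,
                              stats_fun = bold::bold_stats,
                              fetch_fun = bold::bold_seqspec) {
  n_public <- stats_fun(taxon)$total_records
  if (is.null(n_public) || is.na(n_public) || n_public == 0) {
    warning("No public records for '", paste(taxon, collapse = ", "),
            "'; skipping the sequence request.")
    return(NA)
  }
  fetch_fun(taxon = taxon, marker = marker)
}

# Stubs for illustration only; they mimic the shape of the real output.
empty_stats <- function(taxon) list(total_records = 0)
some_stats  <- function(taxon) list(total_records = 19)
fake_fetch  <- function(taxon, marker = NULL) {
  data.frame(taxon = taxon, marker = marker, stringsAsFactors = FALSE)
}

skipped <- suppressWarnings(
  seqspec_if_public("Acanthaspidiidae", "COI-5P", empty_stats, fake_fetch)
)
fetched <- seqspec_if_public("Acanthaspidiidae", "COI-5P", some_stats, fake_fetch)
```

The extra bold_stats() call costs one more request per query, which is why making the check optional seems reasonable.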

salix-d commented 9 months ago

Solved with the warning in the docs for now. I might add the pre-check function later (I will open a new issue for that).