ropensci / bold

Interface to the Bold Systems barcode webservice
https://docs.ropensci.org/bold

Trouble downloading specimen + sequence data from BOLD with bold_seqspec() in case of either improper quoting or missing fields #105

Open mdogniez opened 2 months ago

mdogniez commented 2 months ago

Hi,

I'm starting to work on a DNA metabarcoding project and I was following the amazing tutorial from Devon O'rourke to build my COI reference library from BOLD (https://forum.qiime2.org/t/building-a-coi-database-from-bold-references/16129) for the QIIME2 pipeline.

When downloading data with the bold_seqspec() function, my progress was halted several times by the following two errors:

> other_acti_list <- lapply(other_acti_names, bold_seqspec)
Warning: Found and resolved improper quoting out-of-sample. First healed line 202543: <<HEEN006-18    AI-1812.1   8741765     AI-1812 University of Colorado, Boulder     BOLD:ADM2338    18  Chordata    77  Actinopterygii  243 Cypriniformes   775028  Botiidae    86731   Botiinae    106096  Botia                   Jake Lowenstein                         Richard and Jake                "Salween River, Thai/Myanmar Border" Aqua Imports, Boulder, CO michaeltuccinardi@gmail.com          Adult                                                               Thailand    Mae Hong Son    Thai/Myanmar Border     Salween River   3312717|3312718|3312716 http://www.boldsystems.org/pics/HEEN/EBIO_4460_2018_IMG_24362018-05-07+1526073900.J>>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
> other_acti_list <- lapply(other_acti_names, bold_seqspec)
Warning: Stopped early on line 213843. Expected 80 fields but found 1. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<<style type="text/css">>>

My initial (and rudimentary) solution was to manually exclude the problematic sequence records, as most of them are not relevant for my study anyway. However, as I progressed through the different phyla in the BOLD database, I realised that these errors were far too frequent to keep handling them manually.

Would there be a way to get past these errors, so that I can proceed with an automatic download of all my sequences?

Thanks in advance!

PS: sorry if it's a naive question, I'm very new to this topic, and to bioinformatics in general.

Session Info

```r
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=French_Belgium.utf8 LC_CTYPE=French_Belgium.utf8 LC_MONETARY=French_Belgium.utf8
[4] LC_NUMERIC=C LC_TIME=French_Belgium.utf8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] readxl_1.4.3 tibble_3.2.1 tidyr_1.3.0 refdb_0.1.1 dplyr_1.1.3 taxize_0.9.100 bold_1.3.0

loaded via a namespace (and not attached):
 [1] pkgload_1.3.3 jsonlite_1.8.7 foreach_1.5.2 shiny_1.7.5 triebeard_0.4.1 urltools_1.7.3
 [7] cellranger_1.1.0 remotes_2.4.2.1 yaml_2.3.7 ggrepel_0.9.5 sessioninfo_1.2.2 pillar_1.9.0
[13] lattice_0.20-45 glue_1.6.2 uuid_1.1-1 digest_0.6.31 promises_1.2.0.1 colorspace_2.1-0
[19] cowplot_1.1.3 htmltools_0.5.4 httpuv_1.6.11 pkgconfig_2.0.3 devtools_2.4.5 ggspatial_1.1.9
[25] httpcode_0.3.0 purrr_1.0.1 xtable_1.8-4 scales_1.2.1 processx_3.8.2 later_1.3.1
[31] proxy_0.4-27 generics_0.1.3 ggplot2_3.4.3 usethis_2.2.2 ellipsis_0.3.2 cachem_1.0.6
[37] withr_2.5.0 cli_3.6.0 magrittr_2.0.3 crayon_1.5.2 mime_0.12 ps_1.7.2
[43] memoise_2.0.1 evaluate_0.21 fs_1.5.2 fansi_1.0.4 nlme_3.1-160 xml2_1.3.5
[49] class_7.3-20 pkgbuild_1.4.2 profvis_0.3.8 prettyunits_1.2.0 tools_4.2.2 data.table_1.14.8
[55] lifecycle_1.0.3 stringr_1.5.0 munsell_0.5.0 callr_3.7.3 compiler_4.2.2 e1071_1.7-13
[61] rlang_1.1.1 classInt_0.4-10 units_0.8-4 grid_4.2.2 conditionz_0.1.0 iterators_1.0.14
[67] rstudioapi_0.15.0 htmlwidgets_1.6.2 miniUI_0.1.1.1 rmarkdown_2.25 gtable_0.3.4 codetools_0.2-18
[73] DBI_1.1.3 curl_4.3.3 R6_2.5.1 zoo_1.8-12 knitr_1.44 fastmap_1.1.1
[79] utf8_1.2.3 KernSmooth_2.23-20 ape_5.7-1 stringi_1.7.8 parallel_4.2.2 crul_1.4.0
[85] Rcpp_1.0.11 vctrs_0.6.2 sf_1.0-16 urlchecker_1.0.1 tidyselect_1.2.0 xfun_0.40
[91] coda_0.19-4
```

salix-d commented 2 months ago

Hi!

Could you give me a couple of the taxa that had problematic records so I can replicate the errors and try to figure it out?
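
One possible way to find the taxa to report is sketched below, assuming each element of `other_acti_names` can be passed to `bold_seqspec()` on its own. The helpers `fetch_each()` and `problem_taxa()` are hypothetical names for illustration and are not part of the bold package. The sketch runs one download per taxon, records any warning or error message instead of stopping, and then lists the taxa that raised one. It does not repair the affected records.

```r
# Run `fetch` on each taxon separately, recording warnings and errors
# instead of letting them interrupt the whole download.
# NOTE: `fetch_each` is a hypothetical helper, not part of the bold package.
fetch_each <- function(taxa, fetch) {
  results <- lapply(taxa, function(taxon) {
    warnings_seen <- character(0)
    data <- withCallingHandlers(
      tryCatch(
        fetch(taxon),
        error = function(e) {
          # Keep going after an error; store its message for the report.
          structure(list(message = conditionMessage(e)), class = "fetch_error")
        }
      ),
      warning = function(w) {
        # Record the warning text, then let the download carry on.
        warnings_seen <<- c(warnings_seen, conditionMessage(w))
        invokeRestart("muffleWarning")
      }
    )
    failed <- inherits(data, "fetch_error")
    list(
      taxon = taxon,
      data = if (failed) NULL else data,
      error = if (failed) data$message else NA_character_,
      warnings = warnings_seen
    )
  })
  names(results) <- taxa
  results
}

# Names of the taxa whose download raised a warning or an error.
# NOTE: `problem_taxa` is also a hypothetical helper.
problem_taxa <- function(results) {
  flagged <- vapply(
    results,
    function(r) !is.na(r$error) || length(r$warnings) > 0,
    logical(1)
  )
  names(results)[flagged]
}

# Usage with the real function (needs the bold package and network access):
# res <- fetch_each(other_acti_names, bold::bold_seqspec)
# problem_taxa(res)
```

The `warnings` and `error` entries of each result hold the message text, so they can be pasted into the issue alongside the taxon name.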

On Wed, May 8, 2024, 10:51 a.m. mdogniez @.***> wrote:
