ropensci / bold

Interface to the Bold Systems barcode webservice
https://docs.ropensci.org/bold

long vectors, round 2 #61

Closed devonorourke closed 4 years ago

devonorourke commented 5 years ago

Hi Scott, I'm trying and failing to download the entire insect dataset with bold_seqspec. The insectNames vector was scraped directly from the BOLD website; this was my first hack at a programmatic way to pull the insect order names from it. Folks who know what they are doing could probably do this better!

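library(bold)      # provides bold_seqspec
library(taxize)
library(stringr)
library(rvest)
library(tidyverse)

# Scrape the Insecta taxon page (taxid=82) and grab the text of each column block
boldurl <- read_html("http://v4.boldsystems.org/index.php/Taxbrowser_Taxonpage?taxid=82")
boldtext <- boldurl %>% html_nodes("div.col-md-6") %>%  html_text()
# The 7th block holds the order list; skip the first 70 characters before the names
tmptext <- substr(boldtext[7], start=71, stop=nchar(boldtext[7]))
# Drop the record counts, then split on "[" and strip the leftover "]"
tmptext2 <- gsub('[[:digit:]]+', '', tmptext)
tmptext3 <- unlist(strsplit(tmptext2, '\\['))
tmptext4 <- gsub('\\]', '', tmptext3)
# Trim whitespace and discard empty entries
tmptext5 <- str_trim(tmptext4)
insectNames <- tmptext5[tmptext5 != ""]
rm(list=ls(pattern = "tmptext"))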

With that insectNames vector, I then run the bold_seqspec call:

Insects_list <- lapply(insectNames, bold_seqspec)

But unfortunately, it generates this error:

Error in paste0(rawToChar(out$content, multiple = TRUE), collapse = "") : 
  result would exceed 2^31-1 bytes
In addition: Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string

Next idea up for me is to break up that list into smaller bits, probably one for Lepidopterans, one for Coleopterans, and one for the rest. Thanks for any insights you might offer!
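
For anyone trying the same thing, here is a rough sketch of one way to find out which taxa are too large before deciding how to split them. It wraps each request in tryCatch so a single oversized response does not abort the whole run. The helper name fetch_by_taxon and its fetch argument are made up for illustration and are not part of the bold package; the only assumption about bold_seqspec is that it takes a taxon name as its first argument, as in the lapply call above.

```r
# Hypothetical helper (not part of bold): fetch each taxon on its own and
# record failures as NULL instead of stopping at the first error.
fetch_by_taxon <- function(taxa, fetch = bold::bold_seqspec) {
  results <- lapply(taxa, function(taxon) {
    tryCatch(
      fetch(taxon),
      error = function(e) {
        # Report which taxon failed and why, then carry on with the rest
        message("Failed for ", taxon, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  names(results) <- taxa
  results
}

# Usage (needs network access and the bold package):
# Insects_list <- fetch_by_taxon(insectNames)
# too_big <- names(Filter(is.null, Insects_list))
```

Taxa that come back as NULL are the ones that probably need to be requested in smaller pieces.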

devonorourke commented 5 years ago

Quick follow up: I ended up breaking up the data into six groups total and there wasn't any further issue with memory. The groups were:

  1. All non-insect arthropods
  2. Diptera
  3. Lepidoptera
  4. Hymenoptera
  5. Coleoptera
  6. remaining insects

sckott commented 5 years ago

Thanks for this. I think it's a bug related to rawToChar; the string coming from out$content must be very large.
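
For context, R limits a single character string to 2^31-1 bytes, so collapsing a larger raw response into one string fails with the error shown above. Below is a minimal sketch of one possible workaround, not necessarily how the package does or will handle it: write the raw bytes to a temporary file and parse from disk. The helper read_raw_tsv is hypothetical, the tab-separated layout is an assumption about the response, and quote = "" is set on the guess that stray quote characters caused the "EOF within quoted string" warning. Depending on the R version, writeBin itself may need to be called in chunks for bodies over 2^31-1 bytes.

```r
# Hypothetical helper: parse a raw response body without first turning it
# into one giant character string.
read_raw_tsv <- function(raw_body) {
  tmp <- tempfile(fileext = ".tsv")
  on.exit(unlink(tmp), add = TRUE)
  writeBin(raw_body, tmp)  # raw bytes go straight to disk
  # quote = "" treats quote characters as ordinary text
  read.delim(tmp, stringsAsFactors = FALSE, quote = "")
}

# Small stand-in for out$content (assumed to be tab-separated text)
raw_body <- charToRaw("processid\tnucleotides\nABC123\tACGT\n")
df <- read_raw_tsv(raw_body)
```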

devonorourke commented 5 years ago

Yep. It was all Insect records.
