ropensci / bold

Interface to the Bold Systems barcode webservice
https://docs.ropensci.org/bold

a rare case of a missing Insect Order #60

Closed devonorourke closed 4 years ago

devonorourke commented 5 years ago

I stumbled across this quirk while following the Readme example for pulling all the arthropod data at once from BOLD. I'm guessing this is a rare thing, or perhaps a non thing and I'm just screwing something up, but if not, it seemed worth mentioning:

library("taxize")
library("bold")

x <- downstream("Arthropoda", db = "ncbi", downto = "class")
x.nms <- x$Arthropoda$childtaxa_name
x.checks <- bold_tax_name(x.nms)

Great, all the Classes are there. So far so good.

But because the class Insecta has roughly 89% of all records, I thought I'd remove it from the subsequent lapply(x.nms, bold_seqspec) call and pull the insects separately. So the next step was to generate a list of all insect orders:

y <- downstream("Insecta", db = "ncbi", downto = "order")
y.nms <- y$Insecta$childtaxa_name
y.checks <- bold_tax_name(y.nms)
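For reference, the split described above might look roughly like the sketch below. The class names are hypothetical stand-ins for `x.nms`, and the `bold_seqspec` calls are left commented out because they need network access.

```r
# Hypothetical class names standing in for x.nms from the earlier call
x.nms <- c("Arachnida", "Insecta", "Malacostraca")

# Everything except Insecta can go through one lapply() call
non.insect.nms <- setdiff(x.nms, "Insecta")

# Not run here (needs network access and the bold package):
# non.insect.recs <- lapply(non.insect.nms, bold::bold_seqspec)
# insect.recs     <- lapply(y.nms, bold::bold_seqspec)
```

The insect orders in `y.nms` would then be fetched one at a time, which keeps each request to BOLD smaller.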

Having spent more time staring at Insect Order names than I care to admit, I noticed that one was missing: Psocodea. In the y.checks object you'll notice that 'Psocoptera' is actually the one that is listed as missing, and it's because that name isn't used in the BOLD database but is used in NCBI. The BOLD list of all Insect Orders (here) lists Psocodea as having 42380 records, so it's not a trivial issue. Especially for those bark lice lovers out there... which apparently include the bats I study! If you try a search for Psocoptera it'll come up empty in BOLD.

I think this is one of those weird instances where the superorder 'Psocodea' is used in BOLD instead of the order... so taking names from NCBI may sometimes miss what we're looking for in BOLD.
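One possible stopgap is a small hand-maintained map from NCBI names to BOLD names. The sketch below is only an illustration: the data frame is a hypothetical stand-in for `y.checks`, the `input` and `taxid` column names are assumed to match what `bold_tax_name` returns, the IDs are placeholders, and `to_bold_names` is a made-up helper.

```r
# Hypothetical stand-in for the output of bold_tax_name(y.nms);
# column names are assumed and the taxid values are placeholders.
y.checks <- data.frame(
  input = c("Coleoptera", "Psocoptera", "Diptera"),
  taxid = c(1L, NA, 2L),
  stringsAsFactors = FALSE
)

# Names that BOLD could not resolve
missing.nms <- y.checks$input[is.na(y.checks$taxid)]

# Hand-maintained NCBI -> BOLD name map (only the case from this issue)
ncbi.to.bold <- c(Psocoptera = "Psocodea")

# Swap in the BOLD name where one is known, keep the original otherwise
to_bold_names <- function(nms, map) {
  hit <- nms %in% names(map)
  nms[hit] <- unname(map[nms[hit]])
  nms
}

y.nms.bold <- to_bold_names(y.checks$input, ncbi.to.bold)
```

A map like this only covers mismatches someone has already spotted, so it doesn't fix the underlying taxonomy difference.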

Thanks for the consideration!

sckott commented 5 years ago

thanks for the report @devonorourke

not sure what the answer is off the top of my head. I'll poke around and see what I can find.

It'd be great if there was a way to implement taxize::downstream for BOLD, but as far as I can remember, I don't think they have a way to get children of a taxon, which is the basis for making downstream work
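To illustrate why a children lookup is the missing piece, here is a rough sketch of a downstream-style walk built on top of one. It is not how taxize implements `downstream`; the toy taxonomy, `toy_children`, and `downstream_sketch` are all hypothetical.

```r
# Toy taxonomy (hypothetical): each entry lists the children of one taxon
toy <- list(
  Arthropoda = data.frame(name = c("Insecta", "Arachnida"), rank = "class",
                          stringsAsFactors = FALSE),
  Insecta    = data.frame(name = c("Psocodea", "Diptera"), rank = "order",
                          stringsAsFactors = FALSE),
  Arachnida  = data.frame(name = "Araneae", rank = "order",
                          stringsAsFactors = FALSE)
)

# Children lookup: returns a zero-row data frame for taxa with no entry
toy_children <- function(name) {
  kids <- toy[[name]]
  if (is.null(kids)) {
    return(data.frame(name = character(0), rank = character(0),
                      stringsAsFactors = FALSE))
  }
  kids
}

# Walk down from `name`, collecting every taxon found at rank `downto`
downstream_sketch <- function(name, downto, get_children) {
  kids <- get_children(name)
  if (nrow(kids) == 0) return(kids)
  hit  <- kids[kids$rank == downto, , drop = FALSE]
  rest <- kids[kids$rank != downto, , drop = FALSE]
  deeper <- lapply(rest$name, downstream_sketch,
                   downto = downto, get_children = get_children)
  do.call(rbind, c(list(hit), deeper))
}

orders <- downstream_sketch("Arthropoda", "order", toy_children)
```

Swapping `toy_children` for a function that queries a real data source is the part that needs a children endpoint.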

sckott commented 4 years ago

it seems like BOLD may follow Catalogue of Life taxonomy - I'm trying to get an answer on this

sckott commented 4 years ago

They're definitely not getting back to me.

They do appear to have children on each of their taxon pages, so we can scrape the names, BUT scraping is super fragile, so I'm somewhat reluctant to put this code in a package. this should work as is:

bold_children_one <- function(id) {  
  x <- crul::HttpClient$new(paste0("https://v4.boldsystems.org/index.php/Taxbrowser_Taxonpage?taxid=", id))
  res <- x$get()
  res$raise_for_status()
  html <- xml2::read_html(res$parse("UTF-8"))
  nodes <- xml2::xml_find_all(html, '//div[@class = "row"]//div[@class = "ibox float-e-margins"]//ol')
  if (length(nodes) == 0) {
    message("no children found")
    return(tibble::tibble())
  }
  group_nmz <- xml2::xml_find_all(html, '//div[@class = "row"]//div[@class = "ibox float-e-margins"]//lh')
  bb <- lapply(nodes, bold_children_each_node)
  if (length(group_nmz) > 0) {
    lst_nmz <- tolower(gsub("\\([0-9]+\\)|\\s", "", xml2::xml_text(group_nmz)))
    bb <- stats::setNames(bb, lst_nmz)
  }
  return(bb)
}

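# Parse one <ol> node into a tibble of child names and BOLD taxon IDs
bold_children_each_node <- function(x) {
  out <- lapply(xml2::xml_find_all(x, ".//a"), function(w) {
    # drop the trailing " [count]" from the link text
    nm <- gsub("\\s\\[[0-9]+\\]$", "", xml2::xml_text(w))
    # the taxon ID is the run of digits at the end of the href;
    # base R replaces strextract(), which is not defined in this snippet
    href <- xml2::xml_attr(w, "href")
    id <- regmatches(href, regexpr("[0-9]+$", href))
    data.frame(name = nm, id = id, stringsAsFactors = FALSE)
  })
  tibble::as_tibble(data.table::rbindlist(out))
}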

# Osmia (genus): 253 children
bold_children_one(id = 4940)
# Momotus (genus): 3 children
bold_children_one(id = 88899)
# Momotus aequatorialis (species): no children
bold_children_one(id = 115130)
# Osmia sp1 (species): no children
bold_children_one(id = 293378)
# Arthropoda (phylum): 27 children
bold_children_one(id = 82)
# Psocodea (order): 51 children
bold_children_one(id = 737139)
# Megachilinae (subfamily): 2 groups (tribes: 3, genera: 60)
bold_children_one(id = 4962)
# Stelis (genus): 78 taxa
bold_children_one(id = 4952)
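Because the scraping depends on a few regular expressions, it may help to see them applied to plain strings without hitting the site. In the sketch below the link text, bracketed counts, href paths, and group header are made up to match the shape the regexes expect; only the two taxon IDs come from the examples above.

```r
# Made-up link text and group header in the shape the regexes expect
link_text <- c("Osmia [10]", "Momotus [2]")
hrefs <- c(
  "/index.php/Taxbrowser_Taxonpage?taxid=4940",
  "/index.php/Taxbrowser_Taxonpage?taxid=88899"
)
group_header <- "Genera (60)"

# Strip the trailing " [count]" to get the taxon name
nm <- gsub("\\s\\[[0-9]+\\]$", "", link_text)

# Take the digits at the end of the href as the taxon ID
id <- regmatches(hrefs, regexpr("[0-9]+$", hrefs))

# Strip "(count)" and whitespace, then lower-case, to get the group label
grp <- tolower(gsub("\\([0-9]+\\)|\\s", "", group_header))
```

If BOLD changes the link text or URL pattern, these are the lines that would need updating.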

sckott commented 4 years ago

@devonorourke ^^

devonorourke commented 4 years ago

I'm in support of whatever you advise. Agreed about the challenge of web scraping. Anything you need from me?

sckott commented 4 years ago

@devonorourke see https://github.com/ropensci/taxize/issues/817