Closed devonorourke closed 4 years ago
thanks for the report @devonorourke
Not sure what the answer is off the top of my head. I'll poke around and see what I can find.
It'd be great if there was a way to implement taxize::downstream for BOLD, but as far as I can remember, I don't think they have a way to get children of a taxon, which is the basis for making downstream work
it seems like BOLD may follow Catalogue of Life taxonomy - I'm trying to get an answer on this
They're definitely not getting back to me.
They do appear to have children on each of their taxon pages, so we can scrape the names, BUT scraping is super fragile, so I'm somewhat reluctant to put this code in a package. This should work as is:
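# strextract() is not defined in this snippet; this is a minimal base R stand-in
strextract <- function(str, pattern) regmatches(str, regexpr(pattern, str))

# scrape the child taxa listed on a BOLD taxon page
bold_children_one <- function(id) {
  x <- crul::HttpClient$new(paste0("https://v4.boldsystems.org/index.php/Taxbrowser_Taxonpage?taxid=", id))
  res <- x$get()
  res$raise_for_status()
  html <- xml2::read_html(res$parse("UTF-8"))
  # each <ol> holds one group of child taxa
  nodes <- xml2::xml_find_all(html, '//div[@class = "row"]//div[@class = "ibox float-e-margins"]//ol')
  if (length(nodes) == 0) {
    message("no children found")
    return(tibble::tibble())
  }
  # group headings, e.g. "Tribes (3)"
  group_nmz <- xml2::xml_find_all(html, '//div[@class = "row"]//div[@class = "ibox float-e-margins"]//lh')
  bb <- lapply(nodes, bold_children_each_node)
  if (length(group_nmz) > 0) {
    # drop the counts and whitespace, then name each group by its rank
    lst_nmz <- tolower(gsub("\\([0-9]+\\)|\\s", "", xml2::xml_text(group_nmz)))
    bb <- stats::setNames(bb, lst_nmz)
  }
  return(bb)
}

# turn the links in one group into a table of names and taxon ids
bold_children_each_node <- function(x) {
  out <- lapply(xml2::xml_find_all(x, ".//a"), function(w) {
    # strip the trailing record count, e.g. " [12]"
    nm <- gsub("\\s\\[[0-9]+\\]$", "", xml2::xml_text(w))
    # the taxon id is the trailing number in the link
    id <- strextract(xml2::xml_attr(w, "href"), "[0-9]+$")
    data.frame(name = nm, id = id, stringsAsFactors = FALSE)
  })
  tibble::as_tibble(data.table::rbindlist(out))
}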
# Osmia (genus): 253 children
bold_children_one(id = 4940)
# Momotus (genus): 3 children
bold_children_one(id = 88899)
# Momotus aequatorialis (species): no children
bold_children_one(id = 115130)
# Osmia sp1 (species): no children
bold_children_one(id = 293378)
# Arthropoda (phylum): 27 children
bold_children_one(id = 82)
# Psocodea (order): 51 children
bold_children_one(id = 737139)
# Megachilinae (subfamily): 2 groups (tribes: 3, genera: 60)
bold_children_one(id = 4962)
# Stelis (genus): 78 taxa
bold_children_one(id = 4952)
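Since children are the basis for making downstream work, here is a rough sketch of how a children function could be applied recursively to walk down from one taxon. bold_downstream_sketch and its children_fun and max_depth arguments are hypothetical names, not part of taxize or bold. The sketch assumes children_fun returns what bold_children_one returns above: an empty data frame when there are no children, otherwise a list of tables with name and id columns.

```r
# Hypothetical helper (not in taxize or bold): collect descendants of a taxon
# by calling a children function repeatedly, down to `max_depth` levels.
bold_downstream_sketch <- function(id, children_fun, max_depth = 2) {
  empty <- data.frame(name = character(), id = character(),
                      parent = character(), stringsAsFactors = FALSE)
  if (max_depth < 1) return(empty)

  kids <- children_fun(id)
  # an empty data frame or an empty list means there are no children
  if (is.data.frame(kids) || length(kids) == 0) return(empty)

  # stack all groups (e.g. tribes, genera) into one table
  direct <- do.call(rbind, lapply(kids, as.data.frame))
  direct <- direct[, c("name", "id")]
  direct$id <- as.character(direct$id)
  direct$parent <- rep(as.character(id), nrow(direct))
  rownames(direct) <- NULL

  # recurse into each child with one less level to go
  deeper <- lapply(direct$id, bold_downstream_sketch,
                   children_fun = children_fun, max_depth = max_depth - 1)
  out <- do.call(rbind, c(list(direct), deeper))
  rownames(out) <- NULL
  out
}
```

With the scraper above this would be called as bold_downstream_sketch(4962, bold_children_one). Each level costs one page request per taxon, so a small max_depth is probably wise.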
@devonorourke ^^
I'm in support of whatever you advise. Agreed about the challenge of web scraping. Anything you need from me?
@devonorourke see https://github.com/ropensci/taxize/issues/817
I stumbled across this quirk while following the Readme example for pulling all the arthropod data at once from BOLD. I'm guessing this is a rare thing, or perhaps a non thing and I'm just screwing something up, but if not, it seemed worth mentioning:
Great, all the Classes are there. So far so good.
But because the class Insecta has like 89% of all records, I thought I'd remove them from the subsequent lapply(x.nms, bold_seqspec) call and pull out all the Insects and do those separately. So the next step was to generate a list of all Insect Orders:
Having spent more time staring at Insect Order names than I care to admit, I noticed that one was missing: Psocodea. In the y.checks object you'll notice that 'Psocoptera' is actually the one that is listed as missing, and it's because that name isn't used in the BOLD database but is used in NCBI. The BOLD list of all Insect Orders (here) lists Psocodea as having 42380 records, so it's not a trivial issue. Especially for those bark lice lovers out there... which apparently include the bats I study! If you try a search for Psocoptera it'll come up empty in BOLD.
I think this is one of those weird instances where the superOrder 'Psocodea' is used in BOLD... so the NCBI approach may be screwing up what we're looking for in BOLD sometimes.
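One possible stopgap, sketched here only as an illustration, is to swap known mismatched names before querying BOLD. The ncbi_to_bold lookup and the swap_for_bold function are hypothetical names. The only pair taken from this report is Psocoptera to Psocodea; any other entries would have to be checked by hand.

```r
# Hypothetical lookup: NCBI name -> name BOLD uses instead.
# Only the Psocoptera/Psocodea pair comes from this report.
ncbi_to_bold <- c(Psocoptera = "Psocodea")

# Replace any name found in the lookup and drop duplicates that result.
swap_for_bold <- function(nms, lookup = ncbi_to_bold) {
  hit <- nms %in% names(lookup)
  nms[hit] <- unname(lookup[nms[hit]])
  unique(nms)
}

orders <- c("Diptera", "Psocoptera", "Coleoptera")
swap_for_bold(orders)
```

The swapped vector could then be passed to the lapply(x.nms, bold_seqspec) call in place of the NCBI-derived names.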
Thanks for the consideration!