ropensci / bold

Interface to the Bold Systems barcode webservice
https://docs.ropensci.org/bold

`marker_code` or `markercode`... getting the "marker = ..." to function as expected #59

Closed devonorourke closed 4 years ago

devonorourke commented 5 years ago

Modifying from the example on this repo, let's pull a tiny dataset of two Arthropod classes and see if we can extract just our COI-5P sequences (and filter out the non COI-5P markers):

library("taxize")
library("bold")

x <- downstream("Arthropoda", db = "ncbi", downto = "class")
x.nms <- x$Arthropoda$childtaxa_name
x.checks <- bold_tax_name(x.nms)

So far so good, everything present. I'm going to select just two small non-Insect arthropod classes from that x.checks list:

tinylist <- c("Merostomata", "Cephalocarida")

Now let's apply that tiny list with bold_seqspec (not bold_seq like in Readme):

out <- lapply(tinylist, bold_seqspec)
out.df <- do.call(rbind.data.frame, out)

Pro: works great! Note, however, that there are several non-COI records in the markercode column. I didn't filter these out in the above argument, so that's okay!

unique(out.df$markercode)
##  >  [1] "COI-5P" ""       "ND3"    "ND6"    "ND1"    "COXIII" "COII"   "ND4"    "ND5-0"  "ND2"    "ND4L"   "CYTB"  

Maybe we can filter these out by passing the marker="COI-5P" argument within the lapply function?

out2 <- lapply(tinylist, bold_seqspec, marker="COI-5P")
out2.df <- do.call(rbind.data.frame, out2)
unique(out2.df$markercode)

Crud, that didn't work.

> unique(out2.df$markercode)
 [1] "COI-5P" "ND3"    "ND6"    "ND1"    "COXIII" "COII"   "ND4"    "ND5-0"  "ND2"    "ND4L"   "CYTB" 

I think this is because there is a pair of marker columns: marker_codes and markercode.

I can filter these things after the fact with something like:

out.df %>% filter(markercode == "COI-5P")

Where am I going wrong? Thanks Scott!

sckott commented 5 years ago

thanks @devonorourke for the report - i'll have a look

sckott commented 5 years ago

so we've run into this before - and there is documentation for it - see https://github.com/ropensci/bold/blob/master/R/bold_seqspec.R#L21-L28

Something isn't quite right on their end and you get back markers you don't ask for. So the only option is to filter by markercode after you get the data back.
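For illustration, here is a minimal base R sketch of that after-the-fact filtering. It assumes each element of the list is a data frame with a `markercode` column, as in the output shown earlier in this thread. The mock records and the `keep_marker` helper are hypothetical and not part of the bold package; in real use the list would come from `lapply(tinylist, bold_seqspec)`.

```r
# Mock stand-in for the list returned by lapply(tinylist, bold_seqspec).
# The values are illustrative only, not real BOLD records.
out <- list(
  data.frame(processid = c("A1", "A2", "A3"),
             markercode = c("COI-5P", "ND3", ""),
             stringsAsFactors = FALSE),
  data.frame(processid = c("B1", "B2"),
             markercode = c("CYTB", "COI-5P"),
             stringsAsFactors = FALSE)
)

# Hypothetical helper: keep only the rows whose markercode equals `marker`.
# The is.na() check drops rows with a missing markercode instead of
# returning NA rows.
keep_marker <- function(df, marker = "COI-5P") {
  df[!is.na(df$markercode) & df$markercode == marker, , drop = FALSE]
}

# Filter each data frame first, then bind the results together.
out.coi <- do.call(rbind, lapply(out, keep_marker))
unique(out.coi$markercode)
```

Filtering each element before `rbind` keeps the combined data frame smaller than binding everything and filtering afterwards, though either order should give the same rows.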

Does that sort this out?

devonorourke commented 5 years ago

Why build all these columns if you can't query every one of them?

I ended up just downloading the entire dataset and filtering after the fact like you suggested.

So, it sorts it out on your end, but it doesn't help me any :) ha

sckott commented 5 years ago

> Why build all these columns if you can't query every one of them?

what does that mean? does it mean you want the function to filter the data inside the fxn?

The BOLD team are unresponsive to my contacts about their services so we can't change anything on their end to make, e.g., marker queries actually work.

devonorourke commented 5 years ago

This is 100% a comment about Barcode of Life setup, 0% about your R package. What I'm saying is it's weird to me that you can download their specimen information that has something like 70 columns, yet you can only filter a handful of these, right?

What would be great is to be able to apply a filtering function for any of these fields, and not be restricted to just geo or marker etc (I think there are just 7 we can use from their online URL generator).

What if I wanted to be more specific than a country and search by lat/long? Or by date uploaded rather than just by institution? If the data is already in their database, I'm just wondering why it isn't set up on their end to leverage that additional information.
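Until something like that exists on the server side, a lat/long search can be approximated on the client after download. The sketch below is only that, a sketch: it assumes the downloaded data frame has numeric `lat` and `lon` columns in decimal degrees (check `names()` on your own download, and coerce with `as.numeric()` if they arrive as text). The mock records and the `in_bbox` helper are hypothetical.

```r
# Mock records; `lat` and `lon` are assumed column names (decimal degrees).
recs <- data.frame(
  processid = c("A1", "A2", "A3", "A4"),
  lat = c(41.5, 10.2, NA, 43.0),
  lon = c(-71.3, 105.8, -70.0, -72.1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep records that fall inside a lat/long bounding box.
# Records with missing coordinates are dropped.
in_bbox <- function(df, lat_min, lat_max, lon_min, lon_max) {
  ok <- !is.na(df$lat) & !is.na(df$lon) &
    df$lat >= lat_min & df$lat <= lat_max &
    df$lon >= lon_min & df$lon <= lon_max
  df[ok, , drop = FALSE]
}

nearby <- in_bbox(recs, lat_min = 40, lat_max = 45, lon_min = -75, lon_max = -70)
nearby$processid
```

The same pattern (build a logical vector, then subset) would extend to any other downloaded column, such as a date field, once its name and format are confirmed.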

All it means on my end is needing to further filter after downloading the entire dataset, so no big deal, just a big file.