Open ctekellogg opened 3 years ago
I'll let @zachary-foster respond on the taxa side of things as he's the maintainer of it.
Is your pr2 data a proper (SQL) database? Or a set of tabular files? Something else? It would be interesting to think about how to let users define their own data source, but it's quite complex since data can be so varied.
hoping it would be just that easy to create a new db endpoint in taxa and then use it in taxize
Unfortunately that's not quite how it works. I wish it was that easy! The taxon_database object only has the metadata you specify, whereas in taxize for each data source we have a small or sometimes large amount of code to figure out how to fetch data from the data source, then munge that data into (usually) a data.frame. Am I missing anything, Zach, that might be a bridge to making it happen?
manuscript suggest these two packages should work well together
I think the "work well together" is with respect to the taxonomic data alone, that is, that data retrieved from data sources in taxize could be handled/managed/filtered with taxa. And (see below) taxa can even be used within taxize to output taxa objects.
In terms of taxa and taxize integration, the current version of taxize does not use taxa pkg. BUT, the next major release does integrate taxa2 (https://github.com/zachary-foster/taxa2) - hopefully to be on CRAN soonish. In that taxize version we will use taxa2 to construct various objects of taxonomic data.
Hi! So, in response to this question:
Is your pr2 data a proper (SQL) database? Or a set of tabular files? Something else? It would be interesting to think about how to let users define their own data source, but it's quite complex since data can be so varied.
The PR2 database has multiple formats, including a SQL format (https://pr2-database.org/documentation/pr2-sqlite/). Maybe this is what I should have linked to in the taxon_database command? Or can I leverage this directly in taxize? I'm not the developer of PR2 - just a user.
hoping it would be just that easy to create a new db endpoint in taxa and then use it in taxize
Unfortunately that's not quite how it works. I wish it was that easy! The taxon_database object only has the metadata you specify, whereas in taxize for each data source we have a small or sometimes large amount of code to figure out how to fetch data from the data source, then munge that data into (usually) a data.frame. Am I missing anything, Zach, that might be a bridge to making it happen?
Oh, I see.
manuscript suggest these two packages should work well together
I think the "work well together" is with respect to the taxonomic data alone, that is, that data retrieved from data sources in taxize could be handled/managed/filtered with taxa. And (see below) taxa can even be used within taxize to output taxa objects.
Ah, I see. And I am sort of wanting it to go the other direction - make a database that taxize likes and then search it. I thought I had to do this using taxa (or the upcoming taxa2)...but maybe not?
In terms of taxa and taxize integration, the current version of taxize does not use taxa pkg. BUT, the next major release does integrate taxa2 (https://github.com/zachary-foster/taxa2) - hopefully to be on CRAN soonish. In that taxize version we will use taxa2 to construct various objects of taxonomic data.
Thanks for brainstorming about this with me! Colleen
@ctekellogg thanks for your responses.
Thanks for the details on the PR2 database. I'll have a look.
Is taxize::classification() the only taxize function you are thinking about? Are there other taxize functions that you'd want to use with PR2?
Am I missing anything Zach that might be a bridge to making it happen?
No, that's right Scott, the taxa database specification is just to do validity checks on taxon ID and taxon rank objects, so that they do not have invalid IDs/ranks for a given database. The ability to query a database is separate and is handled by taxize.
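To make the distinction concrete, here is a minimal base R sketch of what "metadata plus validity checks" means. The toy_db list, its fields, and both helper functions are hypothetical names invented for illustration; this is not the taxa package API, only the idea that such a description can validate IDs and ranks but has no way to fetch anything.

```r
# Hypothetical database description: metadata plus validity rules only.
# These names are illustrative and are NOT the taxa package API.
toy_db <- list(
  name = "toydb",
  id_regex = "^[0-9]+$",  # assumed rule: IDs are all digits
  ranks = c("kingdom", "class", "order", "family", "genus", "species")
)

# Check IDs and ranks against the rules.
# Nothing in this description knows how to query the database itself.
is_valid_id <- function(ids, db) grepl(db$id_regex, ids)
is_valid_rank <- function(ranks, db) tolower(ranks) %in% db$ranks

id_check <- is_valid_id(c("9606", "abc", "12a"), toy_db)
rank_check <- is_valid_rank(c("Genus", "tribe"), toy_db)
```

Querying (going from a name to a classification) needs separate code that knows where the data lives and how it is laid out, which is the part taxize implements per data source.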
And I am sort of wanting it to go the other direction - make a database that taxize likes and then search it.
If you have a local database, you should be able to parse it with taxa or just use it as a named character vector if it is a fasta file, or a data.frame/tibble if it is in a tabular format. It looks like the PR2 is released as an R package now (which is pretty cool):
devtools::install_github("vaulot/pr2database")
library(pr2database)
pr2
The table pr2 has all the info for that database, including IDs and classifications. Will that work? I can show you how to convert it into a taxa object if you need, but that won't make it work with taxize, and depending on what you want to do it might not be needed.
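As a rough sketch of using such a table directly, the lookup below goes from a species name to a long name/rank table, loosely resembling the shape of a classification result. Everything here is an assumption for illustration: toy_pr2 is a tiny made-up stand-in (not real PR2 records), the rank column names may not match the real pr2 table, and lookup_classification is a hypothetical helper, not a taxize or pr2database function.

```r
# Toy stand-in for the pr2 table. Column names and rows are assumptions
# made for illustration; they are not real PR2 records.
toy_pr2 <- data.frame(
  kingdom = c("KingdomA", "KingdomA", "KingdomA"),
  class   = c("ClassA", "ClassA", "ClassB"),
  genus   = c("Genusone", "Genusone", "Genustwo"),
  species = c("Genusone_alpha", "Genusone_beta", "Genustwo_gamma"),
  stringsAsFactors = FALSE
)
ranks <- c("kingdom", "class", "genus", "species")

# Hypothetical helper: return one row per rank for an exact species match,
# or NULL when the name is not in the table.
lookup_classification <- function(name, tax, rank_cols) {
  hit <- tax[tax$species == name, rank_cols, drop = FALSE]
  if (nrow(hit) == 0) return(NULL)
  data.frame(
    name = unlist(hit[1, ], use.names = FALSE),  # first match only
    rank = rank_cols,
    stringsAsFactors = FALSE
  )
}

result <- lookup_classification("Genusone_beta", toy_pr2, ranks)
```

With the real table, the same idea would apply after swapping in pr2 and its actual rank columns; exact matching is the simplest choice and will miss spelling variants or synonyms.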
Good point about the pr2 package.
I opened an issue in taxize https://github.com/ropensci/taxize/issues/866 - tldr, it's probably not doable but worth discussing at least, as it might be
Thank you both! (Sorry for the delay in my response.) Yes, I installed the pr2database package in R (before I wrote here), but then was struggling to figure out how I might search it in the manner that taxize searches databases to output a taxonomy for query (rather than using it to ID a sequence, which is more straightforward), and thus I decided to investigate these two packages. Would love to know how to convert it into a taxa object.
And yes, @sckott, I am primarily interested in using the taxize::classification() function at this time, if that is at all possible.
Thanks again!
searches databases to output a taxonomy for query
Can you explain this a bit more? Does this correspond to a certain function(s)?
Well, since I often work with amplicon sequencing data, my typical mode of operation is to QC the data and then classify the reads against a reference database using sequence classifiers within the QIIME2 platform. I believe you can also do this using some of the tools built into the pr2database R package, but I often use QIIME2 just because it is what I am familiar with. But in this pipeline you go from an unknown DNA sequence to a classified sequence / name. But, for one of my current projects, the dataset I am working on has an additional data type (microscopic counts of plankton) for which I don't have sequences to funnel through this bioinformatics pipeline - just species names from a microscopist. But I want the whole taxonomy for each taxon he found, so that I can directly compare with our genomics data. Sure, I can google each or download the taxonomy file for PR2 and use grep or something to get these details, but a colleague recently showed me your R packages and I thought: perfect, I can go from name to taxonomy (rather than from sequence to taxonomy) in a much more automated way. Ultimately, it works well (thank you!), but the databases in taxa and taxize do not tend to employ the most up-to-date taxonomies for protists so, I thought, why not see if I can get them to search the PR2 database rather than NCBI or WoRMS, etc.
Seems like it may be a bit of effort to actually make that happen on your end, and I don't want to trouble you too much (since I fully recognize this is a rather specific request), especially if it isn't a feature that would benefit your package overall. So, it is totally okay for you both to say no can do.
Colleen
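The name-to-taxonomy step described above could be sketched in base R roughly as follows. All data here are made up: toy_tax stands in for a PR2-style taxonomy table, micro stands in for the microscopy counts, and the column names are assumptions. The sketch also assumes the reference table writes species as Genus_species while the microscopist writes "Genus species", and it falls back to a genus-level match when the species is not found.

```r
# Made-up reference taxonomy (stand-in for a PR2-style table).
toy_tax <- data.frame(
  class   = c("ClassA", "ClassA", "ClassB"),
  genus   = c("Genusone", "Genusone", "Genustwo"),
  species = c("Genusone_alpha", "Genusone_beta", "Genustwo_gamma"),
  stringsAsFactors = FALSE
)

# Made-up microscopy counts with names as a microscopist might record them.
micro <- data.frame(
  observed = c("Genusone alpha", "Genustwo sp.", "Genusthree delta"),
  count    = c(12, 5, 3),
  stringsAsFactors = FALSE
)

# Normalise "Genus species" to "Genus_species" and pull out the genus.
micro$species_key <- gsub(" ", "_", micro$observed, fixed = TRUE)
micro$genus_key   <- sub(" .*$", "", micro$observed)

# Try a species-level match first, then fall back to the genus.
sp_idx <- match(micro$species_key, toy_tax$species)
ge_idx <- match(micro$genus_key, toy_tax$genus)
idx    <- ifelse(is.na(sp_idx), ge_idx, sp_idx)

micro$matched_at <- ifelse(!is.na(sp_idx), "species",
                           ifelse(!is.na(ge_idx), "genus", NA_character_))
micro$class <- toy_tax$class[idx]  # NA where nothing matched
micro$genus <- toy_tax$genus[idx]
```

The genus fallback takes the higher ranks from the first row of that genus, which assumes they are identical for every species in the genus. Unmatched names stay NA so they can be reviewed by hand.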
Thanks for the explanation!
So taxa, I think, is still useful for your use case of managing and doing any data munging on taxonomic data associated with any other data for those taxa.
For the checking names part with taxize, let's continue any discussion over in https://github.com/ropensci/taxize/issues/866 - I do want to do a trial run of users defining custom data sources using pr2 to see if it's feasible. No promises; worth exploring at least.
I am just about to leave for camping until Monday. I will get back to this then. Sorry for the delay!
Hi, I recently learned of taxa and taxize and have been exploring them a bit today, as I am trying to merge microscopy data with metabarcoding data so that I can compare what is observed by microscopy in the ocean with what we observe via sequencing. I have a list of species and genus names for the microscopy data but for the sequence data, which were annotated with the PR2 database (https://github.com/pr2database/pr2database/releases/tag/v4.12.0), I have a much more detailed taxonomy. I played around with taxize today and was able to get it to work for my list of taxa found by our microscopist for the built-in databases. But, is it possible to generate a taxa and taxize-friendly version of other publicly available databases, like PR2? I was thinking I could do this using the example you provide for NCBI. It seems to work without errors but then this taxa version of the PR2 database doesn't work with taxize (hoping to use classification to list the taxonomies for the critters found in our microscopy data)? I get the error "Error: the provided db value was not recognised". Maybe it isn't supposed to, but I guess I was secretly hoping it would be just that easy to create a new db endpoint in taxa and then use it in taxize. Or, probably more likely, I am missing some steps in between, since your taxa manuscript suggests these two packages should work well together.
Thank you so much for any insight you can provide. I'd really love to streamline this data merging process and I think your package(s) do exactly what I need...but just don't have the databases I need.
Thank you! Colleen