creating a taxonomy database for PR2

ctekellogg commented 3 years ago

Hi, I recently learned of taxa and taxize and have been exploring them a bit today, as I am trying to merge microscopy data with metabarcoding data so that I can compare what is observed by microscopy in the ocean with what we observe via sequencing. I have a list of species and genus names for the microscopy data, but for the sequence data, which were annotated with the PR2 database (https://github.com/pr2database/pr2database/releases/tag/v4.12.0), I have a much more detailed taxonomy. I played around with taxize today and was able to get it to work with the built-in databases for the list of taxa found by our microscopist. But is it possible to generate a taxa- and taxize-friendly version of other publicly available databases, like PR2? I was thinking I could do this using:

(pr2_taxadb <- taxon_database(
  name = "pr2",
  url = "https://github.com/pr2database/pr2database/releases/tag/v4.12.0",
  description = "PR2 Database",
  id_regex = "*"
))

following the example you provide for NCBI. It seems to work without errors, but then this taxa version of the PR2 database doesn't work with taxize (I was hoping to use classification to list the taxonomies for the critters found in our microscopy data). I get the error Error: the provided db value was not recognised. Maybe it isn't supposed to, but I guess I was secretly hoping it would be just that easy to create a new db endpoint in taxa and then use it in taxize. Or, probably more likely, I am missing some steps in between, since your taxa manuscript suggests these two packages should work well together.

Thank you so much for any insight you can provide. I'd really love to streamline this data merging process and I think your package(s) do exactly what I need...but just don't have the databases I need.

Thank you! Colleen

sckott commented 3 years ago

I'll let @zachary-foster respond on the taxa side of things as he's the maintainer of it.

Is your pr2 data a proper (SQL) database? Or a set of tabular files? Something else? It would be interesting to think about how to let users define their own data source, but it's quite complex since data can be so varied.

hoping it would be just that easy to create a new db endpoint in taxa and then use it in taxize

Unfortunately that's not quite how it works. I wish it was that easy! The taxon_database object only has the metadata you specify, whereas in taxize for each data source we have a small or sometimes large amount of code to figure out how to fetch data from the data source, then munge that data into usually a data.frame. Am I missing anything Zach that might be a bridge to making it happen?

manuscript suggest these two packages should work well together

I think the "work well together" is with respect to the taxonomic data alone, that is, that data retrieved from data sources in taxize could be handled/managed/filtered with taxa. And (see below) taxa even used within taxize to output taxa objects.

In terms of taxa and taxize integration, the current version of taxize does not use taxa pkg. BUT, the next major release does integrate taxa2 (https://github.com/zachary-foster/taxa2) - hopefully to be on CRAN soonish. In that taxize version we will use taxa2 to construct various objects of taxonomic data.

ctekellogg commented 3 years ago

Hi! So, in response to this question:

Is your pr2 data a proper (SQL) database? Or a set of tabular files? Something else? It would be interesting to think about how to let users define their own data source, but it's quite complex since data can be so varied.

The PR2 database has multiple formats, including a SQL format (https://pr2-database.org/documentation/pr2-sqlite/). Maybe this is what I should have linked to in the taxon_database command? Or can I leverage this directly in taxize? I'm not the developer of PR2 - just a user.

hoping it would be just that easy to create a new db endpoint in taxa and then use it in taxize

Unfortunately that's not quite how it works. I wish it was that easy! The taxon_database object only has the metadata you specify, whereas in taxize for each data source we have a small or sometimes large amount of code to figure out how to fetch data from the data source, then munge that data into usually a data.frame. Am I missing anything Zach that might be a bridge to making it happen?

Oh, I see.

manuscript suggest these two packages should work well together

I think the "work well together" is with respect to the taxonomic data alone, that is, that data retrieved from data sources in taxize could be handled/managed/filtered with taxa. And (see below) taxa even used within taxize to output taxa objects.

Ah, I see. And I am sort of wanting it to go the other direction - make a database that taxize likes and then search it. I thought I had to do this using taxa (or the upcoming taxa2)...but maybe not?

In terms of taxa and taxize integration, the current version of taxize does not use taxa pkg. BUT, the next major release does integrate taxa2 (https://github.com/zachary-foster/taxa2) - hopefully to be on CRAN soonish. In that taxize version we will use taxa2 to construct various objects of taxonomic data.

Thanks for brainstorming about this with me! Colleen

sckott commented 3 years ago

@ctekellogg thanks for your responses.

Thanks for the details on the PR2 database. I'll have a look.

Is taxize::classification() the only taxize function you are thinking about? Are there other taxize functions that you'd want to use with PR2?

zachary-foster commented 3 years ago

Am I missing anything Zach that might be a bridge to making it happen?

No, that's right, Scott. The taxa database specification is just there to do validity checks on taxon ID and taxon rank objects, so that they do not have invalid IDs/ranks for a given database. The ability to query a database is separate and is handled by taxize.

And I am sort of wanting it to go the other direction - make a database that taxize likes and then search it.

If you have a local database, you should be able to parse it with taxa, or just use it as a named character vector if it is a FASTA file, or as a data.frame/tibble if it is in a tabular format. It looks like PR2 is released as an R package now (which is pretty cool):

devtools::install_github("vaulot/pr2database")
library(pr2database)
pr2

The table pr2 has all the info for that database, including IDs and classifications. Will that work? I can show you how to convert it into a taxa object if you need, but that won't make it work with taxize, and depending on what you want to do it might not be needed.
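For the name-to-taxonomy lookup discussed in this thread, a plain data.frame filter may already be enough. The following is only a minimal sketch in base R, not something taxa or taxize provide: lookup_lineage() is a hypothetical helper, pr2_toy is a hand-typed stand-in for the real pr2 table, and the rank column names and the Genus_species spelling of species are assumptions that should be checked against names(pr2) before reuse.

```r
# Hand-typed stand-in for the `pr2` table (illustrative rows, not an export).
# Assumption: the rank columns are named as below; check names(pr2) first.
rank_cols <- c("kingdom", "supergroup", "division", "class",
               "order", "family", "genus", "species")

pr2_toy <- data.frame(
  kingdom    = c("Eukaryota", "Eukaryota", "Eukaryota"),
  supergroup = c("Hacrobia", "Stramenopiles", "Stramenopiles"),
  division   = c("Haptophyta", "Ochrophyta", "Ochrophyta"),
  class      = c("Prymnesiophyceae", "Bacillariophyta", "Bacillariophyta"),
  order      = c("Isochrysidales", "Bacillariophyta_X", "Bacillariophyta_X"),
  family     = c("Noelaerhabdaceae", "Polar-centric-Mediophyceae",
                 "Polar-centric-Mediophyceae"),
  genus      = c("Emiliania", "Thalassiosira", "Thalassiosira"),
  species    = c("Emiliania_huxleyi", "Thalassiosira_pseudonana",
                 "Thalassiosira_pseudonana"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map species or genus names to their lineage.
# Returns a named list with one element per query; NA when nothing matches.
lookup_lineage <- function(query, ref, ranks = rank_cols) {
  # Assumption: species are stored as Genus_species, so replace spaces.
  key <- gsub(" ", "_", trimws(query))
  # The reference table has one row per sequence; keep unique lineages only.
  lineages <- unique(ref[, ranks])
  result <- lapply(key, function(k) {
    sp <- lineages[which(lineages$species == k), , drop = FALSE]
    if (nrow(sp) > 0) return(sp)
    # No species match: fall back to a genus match, without the species column.
    gen <- unique(lineages[which(lineages$genus == k),
                           setdiff(ranks, "species"), drop = FALSE])
    if (nrow(gen) > 0) return(gen)
    NA
  })
  names(result) <- query
  result
}

res <- lookup_lineage(
  c("Emiliania huxleyi", "Thalassiosira", "Nonexistent taxon"),
  pr2_toy
)
```

The named list with NA for unmatched names is loosely modelled on how taxize::classification() reports results, but the element format here is just the matching rows of the table. With the real package loaded, passing pr2 instead of pr2_toy should work in the same way, provided the column names match.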

sckott commented 3 years ago

Good point about the pr2 package.

I opened an issue in taxize https://github.com/ropensci/taxize/issues/866 - tldr, it's probably not doable but worth discussing at least, as it might be

ctekellogg commented 3 years ago

Thank you both! (Sorry for the delay in my response.) Yes, I installed the pr2database package in R (before I wrote here), but then was struggling to figure out how I might search it in the manner that taxize searches databases to output a taxonomy for a query (rather than using it to ID a sequence, which is more straightforward), and thus I decided to investigate these two packages. I would love to know how to convert it into a taxa object.

And yes, @sckott, I am primarily interested in using the taxize::classification() function at this time, if that is at all possible.

Thanks again!

sckott commented 3 years ago

searches databases to output a taxonomy for a query

Can you explain this a bit more? Does this correspond to a certain function or functions?

ctekellogg commented 3 years ago

Well, since I often work with amplicon sequencing data, my typical mode of operation is to QC the data and then classify the reads against a reference database using sequence classifiers within the QIIME2 platform. I believe you can also do this using some of the tools built into the pr2database R package, but I often use QIIME2 just because it is what I am familiar with. In this pipeline you go from an unknown DNA sequence to a classified sequence / name. But for one of my current projects, the dataset I am working on has an additional data type (microscopic counts of plankton) for which I don't have sequences to funnel through this bioinformatics pipeline, just species names from a microscopist. I want the whole taxonomy for each taxon he found, so that I can directly compare with our genomics data. Sure, I can google each one, or download the taxonomy file for PR2 and use grep or something to get these details, but a colleague recently showed me your R packages and I thought: perfect, I can go from name to taxonomy (rather than from sequence to taxonomy) in a much more automated way. Ultimately, it works well (thank you!), but the databases in taxa and taxize do not tend to employ the most up-to-date taxonomies for protists, so I thought, why not see if I can get them to search the PR2 database rather than NCBI or WoRMS, etc.

Seems like it may be a bit of effort to actually make that happen on your end, and I don't want to trouble you too much (since I fully recognize this is a rather specific request), especially if it isn't a feature that would benefit your package overall. So, it is totally okay for you both to say no can do.

Colleen

sckott commented 3 years ago

Thanks for the explanation!

So I think taxa is still useful for your use case: managing and munging taxonomic data together with any other data you have for those taxa.

For the name-checking part with taxize, let's continue any discussion over in https://github.com/ropensci/taxize/issues/866 - I do want to do a trial run of users defining custom data sources using pr2 to see if it's feasible. No promises; worth exploring at least.

zachary-foster commented 3 years ago

I am just about to leave for camping until Monday. I will get back to this then. Sorry for the delay!

ropensci / taxa

creating a taxonomy database for PR2 #208