documentation on subsetting, and large data memory support?

nick-youngblut commented 7 years ago

The package description states that a feature of the package is to subset a biom file. Does this include subsetting the entire biom by samples and/or observations or just subsetting the OTU table, taxonomy table, etc.?

I was hoping to do all/most of my current microbiome analyses in R (easy data manipulation with dplyr/tidyr, dataset manipulation with phyloseq, and plotting with ggplot2). However, phyloseq doesn't scale for large datasets (>~1000 samples and >~100000 OTUs), and so I was hoping that I could just use biomformat, but I don't know if it has enough functionality. Maybe for now, I'm stuck with constantly switching between python (both v2.7 & v3.5) and R. Any suggestions would be much appreciated!

jnpaulson commented 7 years ago

I can't answer for Joey on phyloseq, though I am certain he has some good work arounds for the >~1000 samples etc.

That said, metagenomeSeq (+ metavizr) makes use of the eSet (ExpressionSet framework) from Bioconductor which scales nicely for ~1000 samples and over a hundred thousand OTUs and was developed with provenance/reproducibility in mind. As an example...

library(biomformat)
library(metagenomeSeq)
library(msd16s)

msd16s # see the details - this one only has 26k OTUs, 992 samples, but quadrupling the rows is fine...

# First we convert the msd16s data set to biom
x = MRexperiment2biom(msd16s) # longest phase, but not initially required
x # see the biom object
obj = biom2MRexperiment(x)

# your favorite subsetting... e.g.
controls = obj[,pData(obj)$Type=="Control"]

# then either send to phyloseq using Phyloseq's MRexperiment2phyloseq function
# or visualize using metavizr (a new interactive visualization tool). 
# Check out the [vignette](https://bioconductor.org/packages/release/bioc/vignettes/metavizr/inst/doc/IntroToMetavizr.html)

The plotting functions implemented in metagenomeSeq are not as comprehensive as those in phyloseq and one advantage to phyloseq (for some) is the ggplot connectivity, but check out metavizr and this is a FAST workaround so you don't need to use python nor write your own subsetting functions.

nick-youngblut commented 7 years ago

Great! Thanks for the suggestion! I'll give metagenomeSeq + eSet a shot.

While the phyloseq ggplot2 connectivity is great, I often need more flexibility with microbiome analyses. What I'm looking for is just a way to keep an entire dataset together (eg., OTU table, taxonomy table, 16S phylogeny), which can be easily summarized, parsed, and converted to other formats. The biom format is the standard, which is fine, but there seems to be limited support for biom in R (especially relative to python). In the past, I've just saved microbiome datasets as phyloseq R objects, but that method hasn't worked with larger microbiome datasets.

joey711 commented 7 years ago

Were I to have time to refactor phyloseq internals, I would probably have the default internal representation be a sparse (long/pivot/tidy) data.table, which scales really well to tables that are many GB in size.

This is really more a comment on the most vanilla of R matrix classes, base::matrix, which is what phyloseq currently uses for its OTU table, and where much of the large-data limits are arising. R itself is full of solutions for efficiently representing large tables/matrices.

On a practical note, when someone comes to me with a dataset claiming they have 100,000 OTUs, I usually find on further inspection that this is false. Unless you're working on a very diverse system or set of systems, your 100,000 molecular OTUs probably consist of thousands of actual sequence variants, and ~99,000 clusters of noise. See dada2 for an example of how you might avoid this problem before worrying about memory management, and most-importantly with data that is much more interpretable.

If you are in the minority of microbiome researchers working on specimens that are actually that diverse, I apologize that phyloseq/base::matrix is hitting these memory limits. As @jnpaulson indicated, standard Bioconductor infrastructure has also been solving this problem nicely, including with nice tools like DelayedArray, HDF5-backed classes, etc.; typically all working with canonical sets of methods that make it easy(ier) to jump between them.

To answer your question about the biomformat subset method, it should apply the subsetting to all components, similar to how phyloseq behaves. I'd love it if the data class in biomformat becomes good enough to replace the one in phyloseq (although phyloseq might need to extend it to keep the phylo-tree, since the official biom-format hasn't typically supported storing a tree).

I will close for now, but feel free to comment further, especially to post the solution to your problem so others may benefit.

nick-youngblut commented 7 years ago

Thanks @joey711 for the interesting discussion. In regards to large microbiome datasets, I am currently working with datasets containing a high number of samples, which can contribute to the number of OTUs. For example, one full dataset has ~3500 samples (16S rRNA MiSeq data). A stricter QC/denoising should maybe be applied, but I'm guessing that at least 10's of 1000's of the OTUs are real. I haven't done any testing, but phyloseq may not currently scale well with tables of less OTUs, but high numbers of samples (eg., 3000 samples x 10000 OTUs).

I'll definitely post a solution if I figure out a good one.

zachary-foster commented 7 years ago

Hello @nick-youngblut and @joey711! I am one of the authors of the taxa package and the author of metacoder. We are also trying to come up with scalable and intuitive storage/manipulation of large sets of taxonomic classifications associated with arbitrary user-generated data. We hope to make a standard in R that multiple packages can use. The taxmap object defined in the taxa package can store taxonomic classifications and associated data, such as OTU abundance matrices, and subset both the taxonomy and associated data with functions modeled after dplyr.

The largest data set I have used so far was ~5000 samples by ~50,000 OTUs, classified by ~1000 taxa. I have attached some example code using this dataset that you can play with if you like. We are still developing these functions, so I would be happy to hear of any problems you run into and how scalable things are. We are also working on making these classes compatible with phyloseq and the biome format (https://github.com/grunwaldlab/metacoder/issues/141), so this might be an option for subsetting biome files in the future. There is a function in the Github version of metacoder to read the biome output of QIIME, so I might be able to abstract that to biome in general.

Anyway, I just thought I would show what we are working on as another possible option in the future.

devtools::install_github("grunwaldlab/metacoder")
devtools::install_github("ropensci/taxa")
library(metacoder)

## Downloading the data

otu_file_url = "http://downloads.hmpdacc.org/data/HMQCP/otu_table_psn_v35.txt.gz"
mapping_file_url = "http://downloads.hmpdacc.org/data/HMQCP/v35_map_uniquebyPSN.txt.bz2"
download_file <- function(path) {
  temp_file_path <- file.path(tempdir(), basename(path))
  utils::download.file(url = path, destfile = temp_file_path, quiet = TRUE)
  path <- temp_file_path
  return(path)
}
otu_file <- download_file(otu_file_url)
mapping_file <- download_file(mapping_file_url)

## Parsing the data

# Read in raw file content
otu_data <- utils::read.table(otu_file, header = TRUE, skip = 1, comment.char = "",
                              stringsAsFactors = FALSE, sep = "\t")
colnames(otu_data)[1] <- "otu_id"

# Take a look at part of the data
print(otu_data[1:5, c(1:3, ncol(otu_data))])

# Make into a taxmap object
hmp_data <- taxa::parse_tax_data(otu_data,
                                 class_cols = "Consensus.Lineage",
                                 class_regex = "([a-z]{0,1})_{0,2}(.*)$",
                                 class_key = c(hmp_rank = "info", hmp_tax = "taxon_name"),
                                 class_sep = ";")
print(hmp_data)

<Taxmap>
  985 taxa: aab. Root, aac. Firmicutes, aad. Proteobacteria, aae. Bacteroidetes ... blu. , blv. Planifilum, blw. , blx. Dichelobacter
  985 edges: NA->aab, aab->aac, aab->aad, aab->aae, aab->aaf, aab->aag ... akc->blr, anf->bls, aia->blt, aos->blu, akc->blv, aqr->blw, amv->blx
  2 data sets:
    tax_data:
      # A tibble: 45,383 x 4,791
        taxon_id     otu_id X700114607 X700114380 X700114716 X700114798 X700114710 X700114614 X700114755 X700114715 X700114706 X700114613 X700114803
           <chr>      <chr>      <int>      <int>      <int>      <int>      <int>      <int>      <int>      <int>      <int>      <int>      <int>
      1      aqs   OTU_97.1          0          0          0          0          0          0          0          0          0          0          0
      2      aqt  OTU_97.10          0          0          0          0          0          0          0          0          1          0          0
      3      aqu OTU_97.100          0          0          0          0          0          0          0          0          0          0          0
      # ... with 4.538e+04 more rows, and 4778 more variables: X700114714 <int>, X700114482 <int>, X700114439 <int>, X700114658 <int>, X700114387 <int>,
      #   X700114441 <int>, X700114608 <int>, X700111759 <int>, X700114603 <int>, X700106988 <int>, X700113609 <int>, X700114612 <int>, X700110821 <int>,
      #   X700114486 <int>, X700114481 <int>, X700114377 <int>, X700114424 <int>, X700114712 <int>, X700114713 <int>, X700114554 <int>, X700114606 <int>,
      #   X700114335 <int>, X700114707 <int>, X700114442 <int>, X700114610 <int>, X700114437 <int>, X700114327 <int>, X700114328 <int>, X700114271 <int>,
      #   X700114223 <int>, X700114012 <int>, X700114117 <int>, X700111749 <int>, X700113560 <int>, X700114709 <int>, X700108534 <int>, X700113020 <int>,
      #   X700114005 <int>, X700114212 <int>, X700114274 <int>, X700114162 <int>, X700108165 <int>, X700113059 <int>, X700114330 <int>, X700111035 <int>,
      #   X700110827 <int>, X700114051 <int>, X700105883 <int>, X700114282 <int>, X700114279 <int>, X700113019 <int>, X700114750 <int>, X700114386 <int>,
      #   X700113502 <int>, X700109121 <int>, X700111515 <int>, X700110308 <int>, X700114113 <int>, X700113060 <int>, X700111520 <int>, X700114431 <int>,
      #   X700114006 <int>, X700111243 <int>, X700114105 <int>, X700114056 <int>, X700110426 <int>, X700106490 <int>, X700110433 <int>, X700114385 <int>,
      #   X700114438 <int>, X700111521 <int>, X700113016 <int>, X700111518 <int>, X700111306 <int>, X700114440 <int>, X700113652 <int>, X700114048 <int>,
      #   X700111449 <int>, X700110831 <int>, X700114112 <int>, X700111824 <int>, X700113017 <int>, X700114333 <int>, X700114004 <int>, X700110654 <int>,
      #   X700114611 <int>, X700114210 <int>, X700114339 <int>, X700111305 <int>, X700114170 <int>, X700111163 <int>, X700110657 <int>, X700110173 <int>,
      #   X700114711 <int>, X700113116 <int>, X700111581 <int>, X700111754 <int>, X700111517 <int>, X700114177 <int>, X700113605 <int>, ...
    class_data:
      # A tibble: 264,426 x 5
        taxon_id input_index hmp_rank    hmp_tax   regex_match
           <chr>       <int>    <chr>      <chr>         <chr>
      1      aab           1                Root          Root
      2      aac           1        p Firmicutes p__Firmicutes
      3      abc           1        c    Bacilli    c__Bacilli
      # ... with 2.644e+05 more rows
  0 functions:

## Simple plotting and filtering

# Plot all
hmp_data %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = taxon_names)

rplot

# Plot one taxon and subtaxa
hmp_data %>%
  filter_taxa(taxon_names == "Firmicutes", subtaxa = TRUE) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = taxon_names)

rplot01

# Remove one levels of the hierarchy
hmp_data %>%
  filter_taxa(n_supertaxa == 3, supertaxa = TRUE) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = taxon_names)

rplot02

# Remove intermediate levels of the hierarchy
hmp_data %>%
  filter_taxa(n_supertaxa != 3) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = taxon_names)

rplot03

# Randomly sample OTUs and remove taxa not sampled
hmp_data %>%
  sample_n_obs("tax_data", 100, drop_taxa = TRUE) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = taxon_names)

rplot04

# Break the taxonomy into multiple roots
hmp_data %>%
  filter_taxa(n_supertaxa != 0) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            tree_label = taxon_names,
            node_label = taxon_names)

rplot05

jnpaulson commented 7 years ago

@zachary-foster @joey711 @hcorrada @jmwagner

Zack, very interesting. I know there's a greater need for basic core structures implemented across packages for greater flexibility. It'd be great to have a greater discussion.

zachary-foster commented 7 years ago

@jnpaulson, yea I would be very interested in discussing making packages in the taxonomy/metagenomics R ecosystem more compatible! @sckott and I are working on making metacoder and taxize use the taxa package, so there will be plotting and taxonomy database-querying capabilities soon using these classes.

If others are interested in setting up a discussion on this topic, perhaps we can open an issue in the taxa repo on it or any other format/place that works.

joey711 commented 7 years ago

@zachary-foster @joey711 @hcorrada @jmwagner

That looks great!

If I had been able to foresee/use dplyr + tidyverse when I first began working on phyloseq in 2011, I would have definitely taken this approach. I had always wanted to separate the phyloseq analysis + graphics functionality from its data infrastructure. Definitely room for improvement, and I'm happy to support/contribute something with general community support.

Keep in mind there currently exists functionality along the lines in your example in phyloseq (e.g. filter_taxa is also a method in phyloseq), metagenomeSeq, and other packages. One way to help gain consensus and avoid toe-stepping is to acknowledge previous work/ideas in this description. Nevertheless, I think we can all agree a single "core" representation of the same data type is potentially an improvement for all of our work, a sign of maturity for microbiome analysis in R, and also hopefully provides a mechanism to "share the load" of supporting current and future file formats. I'm sure I have some currently open phyloseq bugs regarding exactly this type of functionality, and using standard tidyverse semantics is going to help user adoption for all of us.

Side note: The biom-format might not be ideal, but it isn't going anywhere. A lot of work has been done on it, much more on the python-side, and HDF5 much more so. I would not recommend reinventing the file representation, but the R-side data representation.

The hard part here is

(1) Agreeing on stuff so that we can get consensus community adoption, and
(2) future maintenance... who is going to fix bugs, add support when it is needed, answer user issues, etc.? It is well-known that it is hard to get funding to maintain software or databases. Good design choices can go a long way mitigating the amount of future maintenance work, and often also help with adoption, but some work will need to be done or we risk a situation where a bug in this consensus package breaks many packages, slows progress in those packages, etc.

Thoughts?

Consider me a hopeful future contributor ...

zachary-foster commented 7 years ago

That looks great!

Thanks!

I had always wanted to separate the phyloseq analysis + graphics functionality from its data infrastructure.

Yea, the same kind of thing happened with my metacoder package. Originally, all the data structures and manipulation functions were mixed in with the plotting, and then @sckott and I decided to combine our previously independent efforts to make general taxonomy classes into the taxa package, so very soon metacoder will only be plotting, analysis, and parsing specific to community analysis. I am now thinking about eventually moving the generic tree plotting to its own package sometime in the future.

Definitely room for improvement, and I'm happy to support/contribute something with general community support.

Awesome! taxa does not have general support yet since it has only been out for a few months, but we are hopeful. taxize is a pretty popular package and once it is adapted to use taxa, I think more people will adopt it taxa too. People also seem to like metacoder when I present it and it will based on taxa soon as well.

Keep in mind there currently exists functionality along the lines in your example in phyloseq (e.g. filter_taxa is also a method in phyloseq), metagenomeSeq, and other packages. One way to help gain consensus and avoid toe-stepping is to acknowledge previous work/ideas in this description.

Yes, I definitely don't want to annoy anyone by not giving them credit or push aside their contributions. To be honest, I started work on this as code for my own work and only later made it a general R package, so I don't have a lot of experience with other packages that do similar things. I did not know about phyloseq when I first started. It was not until I looked into writing a parser for phyloseq objects that I realized there is already a filter_taxa function there. I am still not sure what to do about the name conflict. Since the functions are named in a thematic way (e.g. arrange_taxa, sample_n_taxa, filter_taxa) I hesitate to change it, but having the conflict is confusing as well. I still have a lot to learn about what other people are doing along the same lines. I would like to do a systematic review of available resources in taxonomic/community analysis packages when I get a chance.

Nevertheless, I think we can all agree a single "core" representation of the same data type is potentially an improvement for all of our work, a sign of maturity for microbiome analysis in R, and also hopefully provides a mechanism to "share the load" of supporting current and future file formats. I'm sure I have some currently open phyloseq bugs regarding exactly this type of functionality, and using standard tidyverse semantics is going to help user adoption for all of us.

Yea, this is the idea that led @sckott and I to write the taxa package. Having many packages that all can use the same classes would encourage even more packages and users since people would not have to reinvent the parsing, storage, and manipulation of data.

The biom-format might not be ideal, but it isn't going anywhere. A lot of work has been done on it, much more on the python-side, and HDF5 much more so. I would not recommend reinventing the file representation, but the R-side data representation.

I agree. I would like to make a function that can write taxmap contents in biome files and another to read them such that there is no information loss. I am not planning on proposing a new file standard, but to just give users a few choices on how to save their data in currently supported file formats (biome, csv/tsv, fasta, nexus, etc). It would be nice if each pair of writers/parsers work losslessly so this framework can be used for subsetting and converting file formats.

Agreeing on stuff so that we can get consensus community adoption

Yea, this will be a challenge. I have approached this problem by trying to make code as abstract as possible, so that it should be adaptable to diverse problems. Once taxa is a bit more mature, I want to contact maintainers of leading microbiome analysis packages and offer to help make things more compatible.

who is going to fix bugs, add support when it is needed, answer user issues, etc.? It is well-known that it is hard to get funding to maintain software or databases. Good design choices can go a long way mitigating the amount of future maintenance work, and often also help with adoption, but some work will need to be done or we risk a situation where a bug in this consensus package breaks many packages, slows progress in those packages, etc.

Yea, this might be a problem. I am applying for a year-long ropensci fellowship to work on this which would help. Also @sckott is a maintainer of taxa and he (I think) works on ropensci packages full time. Hopefully, if a robust community standard emerges, there will be enough willing contributers that any one of them does not have to do too much.

Consider me a hopeful future contributor ...

Awesome, thanks! I am glad you are interested!

I need to take a closer look at phyloseq and metagenomeSeq so I can avoid any conflicts or redundancy in the future. Once I understand them and other similar packages better I would be happy to help making the major microbiome packages in R more compatible.

joey711 commented 7 years ago

Sounds great.

Side note: S4 methods with the same name don't necessarily clobber each other, since they have a formal argument signature. This is one of the ways Bioconductor works even with so many shared classes and methods in namespaces that get quite large. I'm happy to migrate any colliding functions in phyloseq to S4 methods to avoid this. You might want to consider doing the same, defensively, especially for a core package. I think the idea is that the generic classes + methods would be defined in the core package, and our own special-op packages would extend these.

zachary-foster commented 7 years ago

Yes, S4 might be the way to go. We currently use R6 wrapped in S3 in taxa actually, but I am concerned that this might slow adoption, so I am open to considering switching to S4 if it becomes a problem.

I think the idea is that the generic classes + methods would be defined in the core package, and our own special-op packages would extend these.

Agreed

joey711 commented 7 years ago

I actually like R6 a lot...

but, yeah, switching the S3 part to S4 might help with other stuff...

zachary-foster commented 7 years ago

Nice, i am glad to hear others like R6. I like it having learn object oriented languages, but I was not sure what most R users thought of it. I like the idea of moving the S3 wrappers to S4 if that will solve the naming conflicts. I will see what @sckott thinks about it.

sckott commented 7 years ago

Can we see what an example would look like of moving one of the taxa S3 functions to S4?

I'm not quite sure I understand the problem with S3? Maybe I missed it. Can someone clarify?

I see there is something you mentioned above @zachary-foster about a name conflict. fwiw, i think taxa is early enough that we can consider changing function names

zachary-foster commented 7 years ago

Can we see what an example would look like of moving one of the taxa S3 functions to S4?

I dont know yet, I have not looked into it much.

I'm not quite sure I understand the problem with S3? Maybe I missed it. Can someone clarify?

We hoped to not have to change the names of either phyloseq::filter_taxa or taxa::filter_taxa and we thought if both the phyloseq class and the taxmap class had S4 wrappers, then both functions would be called on the right classes. Perhaps making the phyloseq function S3 would work as well, but I think Bioconductor encourages S4? Thinking about it more, I am not sure how to do this or how complicated it would be. I don't have much experience with S4.

I see there is something you mentioned above @zachary-foster about a name conflict. fwiw, i think taxa is early enough that we can consider changing function names

That is probably the easiest solution and I would be ok with it. However, since the taxa package is meant to be the foundation for other packages, I think this is an issue that will come up repeatedly so it would be great If we could find a way to make taxa work with other packages that have identical function names and S3, S4, or R6.

sckott commented 7 years ago

I think bioc requires S4? I don't know S4 either :(

good point that we can't keep changing fxn names going forward, esp. after pkgs on cran/bioc use taxa

zachary-foster commented 7 years ago

I made an issue in the taxa repo so we can continue this conversion without cluttering this issue more.

https://github.com/ropensci/taxa/issues/99

joey711 commented 7 years ago

Before making design decisions on a package meant to be "core", I highly recommend reading Hadley's field guide to OO in R.

Quick comment, S4 is a complete OO, but uses functional semantics. R5 and R6 use the more common "reference" semantics that you've seen elsewhere. S3 is somewhat incomplete, dispatch not explicit, etc.

http://adv-r.had.co.nz/OO-essentials.html

joey711 / biom

documentation on subsetting, and large data memory support? #26