sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
478 stars 79 forks source link

Tutorial: From raw metagenome reads to phyloseq taxonomy table using sourmash gather and sourmash taxonomy #2156

Open taylorreiter opened 2 years ago

taylorreiter commented 2 years ago

Posted here! https://taylorreiter.github.io/2022-07-28-From-raw-metagenome-reads-to-phyloseq-taxonomy-table-using-sourmash-gather-and-sourmash-taxonomy/

ctb commented 4 months ago

working on fixing this up for fastgather here:

https://hackmd.io/aYV5HvqNRK63WDzB2ABOYw

ctb commented 4 months ago

From raw metagenome reads to phyloseq taxonomy table using sourmash gather and sourmash taxonomy (STAMPS 2024)

Updated by CTB 7/24/24 for STAMPS: uses fastgather now.

Introduction to gather and sourmash taxonomy

Sourmash gather is a tool that will tell you the minimum set of genomes in a database needed to “cover” all of the sequences in your sample of interest.

Sourmash gather is really accurate. Below I’ve included two plots from a recent preprint that details how sourmash gather works. They show that sourmash gather has the highest completeness and purity at the species level, while maintaining a low L1 norm error. Completeness is the ratio of taxa correctly identified in the ground truth (higher is better), purity is the ratio of correctly predicted taxa over all predicted taxa (higher is better), and L1 norm error is the ratio true positives to false positives (lower is better).

image image

Sourmash gather works best for metagenomes from environments that have been sequenced before or from which many genomes have been isolated and sequenced. Because k-mers are so specific, a genome needs to be in a database in order for sourmash gather to find a match. Sourmash gather won’t find much above species (k = 31) or genus (k = 21) similarity, so if most of the organisms in a sample are new, sourmash won’t be able to label them.

Sourmash taxonomy makes the sourmash gather output more interpretable. Previously, we only output the statistics about the genomes in a database that were found in a metagenome. While this information is useful, taxonomic labels make this type of information infinitely more useful. The taxonomy commands were added into sourmash in August 2021, but we haven’t done a good job advertising them or writing tutorials about how to use them and how to integrate with other visualization and analysis tools.

This blog post seeks to close that gap a bit by demonstrating how to go from raw metagenome reads to a phyloseq taxonomy table using sourmash gather and sourmash taxonomy to make the actual taxonomic assignments.

Tutorial Software

In the RStudio terminal:

conda create -y -n smash_to_phylo sourmash sourmash_plugin_branchwater
conda activate smash_to_phylo
mkdir ~/sourmash_tutorial
cd ~/sourmash_tutorial

Tutorial Data

We’ll use six samples from the iHMP IBD cohort. These are short shotgun metagenomes from stool microbiomes. While this is a longitudinal study, these are all time point 1 samples taken from different individuals. All individuals were symptomatic at time point 1, but three individuals were diagnosed with Crohn’s disease (cd) at the end of the year, while three individuals were not (nonIBD).

run_accession sample_id group
SRR5936131 PSM7J199 cd
SRR5947006 PSM7J1BJ cd
SRR5935765 HSM6XRQB cd
SRR5936197 MSM9VZMM nonibd
SRR5946923 MSM9VZL5 nonibd
SRR5946920 HSM6XRQS nonibd

If you want to start with .fastq files start by downloading the files from the European Nucleotide Archive. Since these are paired-end files, we’ll end up running 12 download commands. Read below before downloading

# Terminal

# SRR5936131
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR593/001/SRR5936131/SRR5936131_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR593/001/SRR5936131/SRR5936131_2.fastq.gz

# SRR5947006
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR594/006/SRR5947006/SRR5947006_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR594/006/SRR5947006/SRR5947006_2.fastq.gz

# SRR5935765
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR593/005/SRR5935765/SRR5935765_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR593/005/SRR5935765/SRR5935765_2.fastq.gz

# SRR5936197
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR593/007/SRR5936197/SRR5936197_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR593/007/SRR5936197/SRR5936197_2.fastq.gz

# SRR5946923
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR594/003/SRR5946923/SRR5946923_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR594/003/SRR5946923/SRR5946923_2.fastq.gz

# SRR5946920
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR594/000/SRR5946920/SRR5946920_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR594/000/SRR5946920/SRR5946920_2.fastq.gz

Suggested: If short on time :point_left: :::info

To download or not to download :stopwatch: To make this tutorial realistic, I’ve started with real FASTQ files from real metagenomes. If you have a slow internet connection, they may take a long time to download. If you would prefer to start this tutorial from the sourmash signatures, you can download those (small) files using the commands below in lieu of downloading the data and creating them yourself.

Note that the download step and the sketch step are the steps that take the longest, so starting from the signatures will save a substantial (hours?) amount of time.

# Terminal
wget -O sigs.tar.gz https://osf.io/download/9sfpt/
gunzip sigs.tar.gz
tar -xvf sigs.tar 

:::

Using sourmash gather to determine the composition of a metagenome

Using the FASTQ files, we’ll first generate FracMinHash sketches using sourmash sketch. These are the files that sourmash gather will ingest. Note: You do not have to do this if you downloaded the sketches

Skip this step if you downloaded the .sig files.

# Terminal
for infile in *_1.fastq.gz
do
    bn=$(basename ${infile} _1.fastq.gz)
    sourmash sketch dna -p k=21,k=31,k=51,scaled=1000,abund --merge ${bn} -o ${bn}.sig ${infile} ${bn}_2.fastq.gz
done

We also need a sourmash database. For this tutorial, we’ll use the GTDB rs207 representatives database. It’s relatively small (~2GB) so it will be faster to download and run, however there are many more prepared databases available, including the full GTDB rs207. The full database will generally match more sequences in the sample than the representatives database, so it would likely produce more complete results for most metagenomes.

# Terminal 
#cd sigs/ 

#wget -O gtdb-rs207.genomic-reps.dna.k31.zip https://osf.io/3a6gn/download

#@CTB: on Jetstream instances, instead link in:

ln -s /opt/shared/sourmash-data/gtdb-rs214-k31.zip .

::: info Choosing a Database :thinking_face: If you have time, space, and RAM, the minimum database I would recommend using is the full GTDB rs207 (~320k genomes). This database is larger than the GTDB representatives, but still allows you to leverage the benefits of the principled GTDB taxonomy. The GTDB databases will work well for well-studied environments like the human microbiomes. If you’re working with a sample from a more complex (ocean, soil) or understudied environment, you will probably see more matches when comparing your sample against all of GenBank. Additionally, GTDB only focuses on bacteria and archaea; if you’re interested in viruses, fungi, or protozoa, these genomes are only in the GenBank database. When using the GenBank databases, you can use multiple databases in a single gather command and multiple taxonomy/lineage files to the sourmash taxonomy command.

Database File size Time to run
GTDB rs207 representatives 1.7 GB 6 minutes
GTDB rs207 full 9.4 GB 30 minutes
GenBank (March 2022) 40 GB 2 hours

:::

Now, we’ll run sourmash fastgather to figure out the minimum set of genomes in our database that cover all of the known k-mers in our metagenomes.

# Terminal
for infile in *sig
do
    bn=$(basename $infile .sig)
    sourmash scripts fastgather ${infile} gtdb-rs214-k31.zip -o ${bn}_gather_gtdb_rs214.csv
done

Determining the taxonomic composition of a sample using sourmash taxonomy

Once we have our gather results, we have to translate them into a taxonomy using sourmash taxonomy. The genomes by default don’t have a taxonomic lineage, so without this information we wouldn’t be able to relate one genome to another very easily.

First, we have to download and format the taxonomy spreadsheet:

# Terminal
#wget -O gtdb-rs207.taxonomy.csv.gz https://osf.io/v3zmg/download
#gunzip gtdb-rs207.taxonomy.csv.gz
#sourmash tax prepare -t gtdb-rs207.taxonomy.csv -o gtdb-rs207.taxonomy.sqldb -F sql

# @CTB on STAMPS jetstream instead link it in:
ln -fs /opt/shared/sourmash-data/gtdb-rs214.lineages.sqldb .

We will use sourmash taxonomy annotate to add the taxonomic lineage to each gather match. This command adds a new column into the gather output csv file, and by default writes a new file containing all of the information.

# Terminal
for infile in *_gather_gtdb_rs214.csv
do
    sourmash tax annotate -g ${infile} -t gtdb-rs214.lineages.sqldb 
done

:::info Sourmash gather and taxonomy taking too long? Download the Sourmash gather and taxonomy here.

# Terminal
wget -O taxonomy.tar.gz https://osf.io/kf9n8/download
tar xf taxonomy.tar.gz

:::

Importing the gather results into a phyloseq object

Phyloseq (and its successors like speedyseq) organize microbial ecology data into R data structures. While these packages offer a lot of functionality for analyzing microbial abundance information, other packages also use phyloseq data structures even if they don’t use phyloseq functions.

One particularly handy function is tax_glom() which allows users to agglomerate counts up levels of taxonomy, for example summarizing all counts at the phylum level.

The sourmash gather -> taxonomy framework has all of the information we need to take advantage of these types of functionality, if only we can get all the data into the right format to be ingested by phyloseq!

We outline the commands needed below. These command are all executed in R. We’ll use the tidyverse for the majority of data wrangling before creating the actual phyloseq object.

In the Rstudio console: (make sure your working directory is sourmash_tutorial/)

# R Console
library(tidyverse)
library(phyloseq)

# create a metadata table -------------------------------------------------

# create a metadata that has the SRR run accession numbers and other phenotypic information about the samples
# usually, this might be in a csv or tsv file that might be read in with read_csv().
# here, we'll make it from the table earlier in the workflow.

run_accessions <- c("SRR5936131", "SRR5947006", "SRR5935765",
                    "SRR5936197", "SRR5946923", "SRR5946920")
sample_ids <- c("PSM7J199", "PSM7J1BJ", "HSM6XRQB",
                "MSM9VZMM", "MSM9VZL5", "HSM6XRQS")
groups <- c("cd", "cd ", "cd", "nonibd", "nonibd", "nonibd")

metadata <- data.frame(run_accessions = run_accessions, 
                       sample_ids = sample_ids, 
                       groups = groups) %>%
  column_to_rownames("run_accessions")

# read in sourmash gather & taxonomy results ------------------------------

# read in the sourmash taxonomy results from all samples into a single data frame
sourmash_taxonomy_results <- Sys.glob("*gather_gtdb_rs214.with-lineages.csv") %>%
  map_dfr(read_csv, col_types = "ddddddddcccddddcccdc") %>%
  mutate(match_name = gsub(" .*", "", match_name)) 

# We need two tables: a tax table and an "otu" table. 
# The tax table will hold the taxonomic lineages of each of our gather matches.
# To make this, we'll make a table with two columns: the genome match and the lineage of the genome.
# The "otu" table will have the counts of each genome in each sample.
# We'll call this our gather_table.

tax_table <- sourmash_taxonomy_results %>%
  select(match_name, lineage) %>%
  distinct() %>%
  separate(lineage, into = c("domain", "phylum", "class", "order", "family", "genus", "species"), sep = ";") %>%
  column_to_rownames("match_name")

gather_table <- sourmash_taxonomy_results %>% 
  mutate(n_unique_kmers = (unique_intersect_bp / scaled) * average_abund) %>% # calculate the number of uniquely matched k-mers and multiply it by average k-mer abundance
  select(query_name, match_name, n_unique_kmers) %>% # select only the columns that have information we need
  pivot_wider(id_cols = match_name, names_from = query_name, values_from = n_unique_kmers) %>% # transform to wide format
  replace(is.na(.), 0) %>% # replace all NAs with 0 
  column_to_rownames("match_name") # move the genome name to a rowname

# create a phyloseq object ------------------------------------------------

physeq <- phyloseq(otu_table(gather_table, taxa_are_rows = T),
                   tax_table(as.matrix(tax_table)),
                   sample_data(metadata))

From raw metagenome reads to taxonomic differential abundance with sourmash and corncob

Now we demonstrates how to do differential abundance analysis with the same data using corncob. Corncob works really well for microbial count data because it accounts for sparsity (lots of unobserved taxonomic lineages in samples) and variable sequencing depth.

Building the phyloseq object

We've already built a phyloseq object from the sourmash taxonomy output. Corncob doesn’t require a phyloseq object to run, but it does play well with phyloseq objects so we'll use that as a jumping point.

Differential abundance testing with corncob

Once we have the phyloseq object physeq, running corncob is fairly straightforward. In this case, our two groups are stool microbiomes from individuals with Crohn’s disease (cd) and individuals without Crohn’s disease (nonIBD).

Let’s start by testing for differential abundance for a specific genome that sourmash gather identified in our samples. We’ll test a genome that was identified in all of our samples. Using the below code block, we can take a look at 6 genomes that were identified in all samples.

# R Console
sourmash_taxonomy_results %>% 
  group_by(match_name) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  head()
# output
match_name                n
<chr>           <int>
1 GCA_001915545.1     6
2 GCA_003112755.1     6
3 GCA_003573975.1     6
4 GCA_003584705.1     6
5 GCA_900557355.1     6
6 GCA_900755095.1     6

Randomly pulling out one of these genomes, let’s take a look at its taxonomic lineage:

# R console
sourmash_taxonomy_results %>%
  filter(match_name == "GCA_003112755.1") %>%
  select(lineage) %>%
  distinct()

This genome is assigned to s__Lawsonibacter asaccharolyticus in the GTDB taxonomy.

# output

lineage
<chr>                                          
1 d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Oscillospiraceae;g__Lawsonibacter;s__Lawsonibacter asaccharolyticus

Testing differential abundance for a single genome

Corncob builds a model for each genome/ASV/taxonomic lineage separately, so we can start by building a model for this genome specifically:

# R Console 
library(corncob)
install.packages("janitor") # if necessary, comment out after installing
library(janitor)

corncob <- bbdml(formula = GCA_003112755.1 ~ groups,
                 phi.formula = ~ 1,
                 allow_noninteger = TRUE,
                 data = physeq)

corncob

We can plot the results:

# R Console
plot(corncob, color = groups)

While our results show that this genome is significantly differentially abundant in CD when compared to nonIBD, this result doesn’t seem super biologically meaningful; visually, the mean abundances of the genome aren’t that different between groups.

The corncob vignette includes a lot more information about creating and interpreting models, especially for more complex experimental designs.

Testing differential abundance at different levels of taxonomy

Because corncob integrates well with phyloseq, we can agglomerate to different levels of taxonomy and test for differential abundance systematically. Let’s test for differential abundance for all genuses observed in our data set.

# R Console

# Aggregate the data at the genus and family level
physeq_genus <- physeq %>%
  tax_glom("genus")

corncob_genus <- differentialTest(formula = ~ groups,
                                  phi.formula = ~ 1,
                                  formula_null = ~ 1,
                                  phi.formula_null = ~ 1,
                                  data = physeq_genus,
                                  allow_noninteger = T,
                                  test = "Wald", B = 100,
                                  fdr_cutoff = 0.1)

Then we can plot the results, showing only taxa that were significantly different at an FDR cutoff of 0.1

# R Console

plot(corncob_genus)

image

By default, the bbdml() function does not do multiple testing correction because it only performs one test. If you choose to do multiple tests, be sure to include p value correction. In contrast, differentialTest() does calculate the false discovery rate for all of the tests performed for that run.

Using corncob_genus$significant_models, we can see the results of corncob. Below, we’ll parse the output into a dataframe for easier downstream use of the corncob results.

First, we’ll write a function that takes the differentialTest() results object as input and parses it into a dataframe of taxa-labelled coefficients.

# R console

glance_significant_models <- function(corncob_results){
  df <- data.frame()
  for(i in 1:length(corncob_results$significant_taxa)){
    taxa_name <- corncob_results$significant_taxa[i]
    taxa_df <- corncob_results$significant_models[[i]]$coefficients %>%
      as.data.frame() %>%
      rownames_to_column("covariate") %>%
      mutate(taxa_name = taxa_name)
    df <- bind_rows(df, taxa_df)
  }
  df <- clean_names(df) %>%
    select(covariate, estimate, std_error, t_value, fdr = pr_t, taxa_name)
  return(df)
}

Then we’ll use this function to parse the results and filter to coefficients with a false discovery rate < 0.1.

# R Console

glance_significant_models(corncob_genus) %>%
  filter(fdr < 0.1) %>%
  filter(!grepl("Intercept", covariate))
covariate estimate std_error t_value fdr taxa_name
mu.groupsnonibd -2.6052725 0.6667681 -3.907314 0.05969535 GCF_002234575.2
mu.groupscd -1.4867950 0.4398837 -3.379973 0.07749496 GCF_000012825.1
mu.groupscd -0.8150979 0.2502086 -3.257673 0.08270706 GCF_009696065.1
mu.groupsnonibd -2.2554143 0.2852505 -7.906784 0.01562175 GCF_009696065.1
mu.groupscd 3.5106793 1.1003016 3.190652 0.08577909 GCA_003478935.1
mu.groupscd 3.4957077 1.1778541 2.967861 0.09725123 GCF_015557635.1
mu.groupscd 3.0423122 0.9160323 3.321185 0.07993957 GCF_900537995.1
mu.groupscd 3.5415046 1.1134170 3.180753 0.08624662 GCF_003478505.1
mu.groupsnonibd 3.2390747 1.0717325 3.022279 0.09425566 GCF_003478505.1
mu.groupscd 3.9035629 1.2645871 3.086828 0.09087019 GCF_015560765.1

Phyloseq has an …odd behavior of assigning a genome/OTU name instead of the lineage to aggregated taxa. The last parsing step we’ll perform is re-deriving the genus-level lineage for these results using the physeq_genus taxonomy table.

# R console

physeq_genus_tax_table <- tax_table(physeq_genus) %>% 
  as.data.frame() %>%
  rownames_to_column("taxa_name")

glance_significant_models(corncob_genus) %>%
  filter(fdr < 0.1) %>%
  filter(!grepl("Intercept", covariate)) %>%
  left_join(physeq_genus_tax_table)

Instead of printing this dataframe to the console, we can write it to a file to use later, for example for visualization.

# R Console

glance_significant_models(corncob_genus) %>%
  filter(fdr < 0.1) %>%
  filter(!grepl("Intercept", covariate)) %>%
  left_join(physeq_genus_tax_table) %>%
  write_tsv("corncob_results_genus.tsv")
covariate estimate std_error t_value fdr taxa_name
mu.groupsnonibd -2.6052725 0.6667681 -3.907314 0.05969535 GCF_002234575.2
mu.groupscd -1.4867950 0.4398837 -3.379973 0.07749496 GCF_000012825.1
mu.groupscd -0.8150979 0.2502086 -3.257673 0.08270706 GCF_009696065.1
mu.groupsnonibd -2.2554143 0.2852505 -7.906784 0.01562175 GCF_009696065.1
mu.groupscd 3.5106793 1.1003016 3.190652 0.08577909 GCA_003478935.1
mu.groupscd 3.4957077 1.1778541 2.967861 0.09725123 GCF_015557635.1
mu.groupscd 3.0423122 0.9160323 3.321185 0.07993957 GCF_900537995.1
mu.groupscd 3.5415046 1.1134170 3.180753 0.08624662 GCF_003478505.1
mu.groupsnonibd 3.2390747 1.0717325 3.022279 0.09425566 GCF_003478505.1
mu.groupscd 3.9035629 1.2645871 3.086828 0.09087019 GCF_015560765.1

Visualizing the taxonomic composition of metagenomes using sourmash and metacoder

Here, we’ll use the tidyverse and metacoder R packages to visualize the taxonomic composition of a metagenome produced from sourmash gather and sourmash taxonomy. Metacoder is an R package for visualizing heat trees from taxonomic data. First we’ll visualize the taxonomy of all of our samples, then we’ll color this visualization by taxa identified within a specific group, and lastly we’ll incorporate differential abundance results from the previous tutorial into the visualization.

Importing the sourmash gather and sourmash taxonomy results into a metacoder object

We’ll start by loading the necessary R packages, creating a metadata dataframe, and reading the sourmash taxonomy results into a dataframe.

# R console

library(tidyverse)
#@CTB you'll need to do this once, on the Jetstream instances: 
# install.packages("metacoder")
library(metacoder)

# create a metadata that has the SRR run accession numbers and other phenotypic information about the samples
# usually, this might be in a csv or tsv file that might be read in with read_csv().
# here, we'll make it from the table earlier in the workflow.

run_accessions <- c("SRR5936131", "SRR5947006", "SRR5935765",
                    "SRR5936197", "SRR5946923", "SRR5946920")
sample_ids <- c("PSM7J199", "PSM7J1BJ", "HSM6XRQB",
                "MSM9VZMM", "MSM9VZL5", "HSM6XRQS")
groups <- c("cd", "cd ", "cd", "nonibd", "nonibd", "nonibd")

metadata <- data.frame(run_accessions = run_accessions, 
                       sample_ids = sample_ids, 
                       groups = groups) 

# read in the sourmash taxonomy results from all samples into a single data frame
sourmash_taxonomy_results <- Sys.glob("*gather_gtdb_rs214.with-lineages.csv") %>%
  map_dfr(read_csv, col_types = "ddddddddcccddddcccdc") %>% # read in all of the taxonomy results
  mutate(match_name = gsub(" .*", "", match_name)) %>% # simplify the genome name to only include the accession
  mutate(n_unique_kmers = unique_intersect_bp / scaled) # calculate the number of uniquely matched k-mers

We’ll join the sourmash taxonomy results to the metadata we have to add information about our sample groups.

# R Console

sourmash_taxonomy_results <- sourmash_taxonomy_results %>%
  left_join(metadata, by = c("query_name" = "run_accessions"))

Then, we’ll parse the sourmash taxonomy results into a metacoder object. To start with, this object contains the input data we supplied it (query_name, groups, name (genome accession), n_unique_weighted_found, and lineage) and parsed taxonomic data derived from each observed lineage.

# R Console

#parse the taxonomy results into a metacoder object
obj <- parse_tax_data(sourmash_taxonomy_results %>% select(query_name, groups, match_name, n_unique_weighted_found, lineage),
                      class_cols = "lineage", # the column that contains taxonomic information
                      class_sep = ";", # The character used to separate taxa in the classification
                      class_regex = "^(.+)__(.+)$", # Regex identifying where the data for each taxon is
                      class_key = c(tax_rank = "info", # A key describing each regex capture group
                                    tax_name = "taxon_name"))

We can then calculate the number of abundance-weighted k-mers observed at each level of taxonomy by using the calc_taxon_abund().

R Console

#set number of unique k-mers assigned to a genome as the abundance information
obj$data$tax_abund <- calc_taxon_abund(obj, "tax_data", cols = c("n_unique_weighted_found"))

Visualizing the taxonomic composition of the metagenome samples

Taxonomy of all samples

Using the metacoder object, we can now visualize our abundance and taxonomy data using the heat_tree() function. Giving heat_tree() the full object, we’ll first visualize the taxonomic breakdown of all of our samples combined.

R Console

# generate a heat_tree plot with all taxa from all samples
set.seed(1) # This makes the plot appear the same each time it is run 
heat_tree(obj, 
          node_label = taxon_names,
          node_size = n_obs,
          node_size_axis_label = "k-mer abundance",
          node_color_axis_label = "samples with genomes",
          #output_file = "metacoder_all.png", # uncomment to save file upon running
          layout = "davidson-harel", # The primary layout algorithm
          initial_layout = "reingold-tilford") # The layout algorithm that initializes node locations

image

Conclusion

Metacoder offers many visualization options, including coloring by many groups. See the documentation and example for more.

Adapted from: Reiter,Taylor.(2022, August 29). Adventures in Seq Land.

ctb commented 4 months ago

Here's all the R code in that tutorial, primed to work with the CSV files from the attached zip file and the following conda install:

conda create -y -n smash_r r-corncob r-metacoder r-janitor r-tidyverse bioconductor-phyloseq

gather-for-phyloseq.zip

library(tidyverse)
library(phyloseq)

# create a metadata table -------------------------------------------------

# create a metadata that has the SRR run accession numbers and other phenotypic information about the samples
# usually, this might be in a csv or tsv file that might be read in with read_csv().
# here, we'll make it from the table earlier in the workflow.

run_accessions <- c("SRR5936131", "SRR5947006", "SRR5935765",
                    "SRR5936197", "SRR5946923", "SRR5946920")
sample_ids <- c("PSM7J199", "PSM7J1BJ", "HSM6XRQB",
                "MSM9VZMM", "MSM9VZL5", "HSM6XRQS")
groups <- c("cd", "cd ", "cd", "nonibd", "nonibd", "nonibd")

metadata <- data.frame(run_accessions = run_accessions, 
                       sample_ids = sample_ids, 
                       groups = groups) %>%
  column_to_rownames("run_accessions")

# read in sourmash gather & taxonomy results ------------------------------

# read in the sourmash taxonomy results from all samples into a single data frame
sourmash_taxonomy_results <- Sys.glob("*gather_gtdb_rs214.with-lineages.csv") %>%
  map_dfr(read_csv, col_types = "ddddddddcccddddcccdc") %>%
  mutate(match_name = gsub(" .*", "", match_name)) 

# We need two tables: a tax table and an "otu" table. 
# The tax table will hold the taxonomic lineages of each of our gather matches.
# To make this, we'll make a table with two columns: the genome match and the lineage of the genome.
# The "otu" table will have the counts of each genome in each sample.
# We'll call this our gather_table.

tax_table <- sourmash_taxonomy_results %>%
  select(match_name, lineage) %>%
  distinct() %>%
  separate(lineage, into = c("domain", "phylum", "class", "order", "family", "genus", "species"), sep = ";") %>%
  column_to_rownames("match_name")

gather_table <- sourmash_taxonomy_results %>% 
  mutate(n_unique_kmers = (unique_intersect_bp / scaled) * average_abund) %>% # calculate the number of uniquely matched k-mers and multiply it by average k-mer abundance
  select(query_name, match_name, n_unique_kmers) %>% # select only the columns that have information we need
  pivot_wider(id_cols = match_name, names_from = query_name, values_from = n_unique_kmers) %>% # transform to wide format
  replace(is.na(.), 0) %>% # replace all NAs with 0 
  column_to_rownames("match_name") # move the genome name to a rowname

# create a phyloseq object ------------------------------------------------

physeq <- phyloseq(otu_table(gather_table, taxa_are_rows = T),
                   tax_table(as.matrix(tax_table)),
                   sample_data(metadata))

# R Console
sourmash_taxonomy_results %>% 
  group_by(match_name) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  head()

# R console
sourmash_taxonomy_results %>%
  filter(match_name == "GCA_003112755.1") %>%
  select(lineage) %>%
  distinct()

library(corncob)
#install.packages("janitor") # if necessary, comment out after installing
library(janitor)

corncob <- bbdml(formula = GCA_003112755.1 ~ groups,
                 phi.formula = ~ 1,
                 allow_noninteger = TRUE,
                 data = physeq)

corncob

# R Console

# Aggregate the data at the genus and family level
physeq_genus <- physeq %>%
  tax_glom("genus")

corncob_genus <- differentialTest(formula = ~ groups,
                                  phi.formula = ~ 1,
                                  formula_null = ~ 1,
                                  phi.formula_null = ~ 1,
                                  data = physeq_genus,
                                  allow_noninteger = T,
                                  test = "Wald", B = 100,
                                  fdr_cutoff = 0.1)

plot(corncob_genus)

# R console

glance_significant_models <- function(corncob_results){
  df <- data.frame()
  for(i in 1:length(corncob_results$significant_taxa)){
    taxa_name <- corncob_results$significant_taxa[i]
    taxa_df <- corncob_results$significant_models[[i]]$coefficients %>%
      as.data.frame() %>%
      rownames_to_column("covariate") %>%
      mutate(taxa_name = taxa_name)
    df <- bind_rows(df, taxa_df)
  }
  df <- clean_names(df) %>%
    select(covariate, estimate, std_error, t_value, fdr = pr_t, taxa_name)
  return(df)
}

# R Console

glance_significant_models(corncob_genus) %>%
  filter(fdr < 0.1) %>%
  filter(!grepl("Intercept", covariate))

# R console

physeq_genus_tax_table <- tax_table(physeq_genus) %>% 
  as.data.frame() %>%
  rownames_to_column("taxa_name")

glance_significant_models(corncob_genus) %>%
  filter(fdr < 0.1) %>%
  filter(!grepl("Intercept", covariate)) %>%
  left_join(physeq_genus_tax_table)

# R Console

glance_significant_models(corncob_genus) %>%
  filter(fdr < 0.1) %>%
  filter(!grepl("Intercept", covariate)) %>%
  left_join(physeq_genus_tax_table) %>%
  write_tsv("corncob_results_genus.tsv")

# R console

library(tidyverse)
#@CTB you'll need to do this once, on the Jetstream instances: 
# install.packages("metacoder")
library(metacoder)

# create a metadata that has the SRR run accession numbers and other phenotypic information about the samples
# usually, this might be in a csv or tsv file that might be read in with read_csv().
# here, we'll make it from the table earlier in the workflow.

run_accessions <- c("SRR5936131", "SRR5947006", "SRR5935765",
                    "SRR5936197", "SRR5946923", "SRR5946920")
sample_ids <- c("PSM7J199", "PSM7J1BJ", "HSM6XRQB",
                "MSM9VZMM", "MSM9VZL5", "HSM6XRQS")
groups <- c("cd", "cd ", "cd", "nonibd", "nonibd", "nonibd")

metadata <- data.frame(run_accessions = run_accessions, 
                       sample_ids = sample_ids, 
                       groups = groups) 

# read in the sourmash taxonomy results from all samples into a single data frame
sourmash_taxonomy_results <- Sys.glob("*gather_gtdb_rs214.with-lineages.csv") %>%
  map_dfr(read_csv, col_types = "ddddddddcccddddcccdc") %>% # read in all of the taxonomy results
  mutate(match_name = gsub(" .*", "", match_name)) %>% # simplify the genome name to only include the accession
  mutate(n_unique_kmers = unique_intersect_bp / scaled) # calculate the number of uniquely matched k-mers

# R Console

sourmash_taxonomy_results <- sourmash_taxonomy_results %>%
  left_join(metadata, by = c("query_name" = "run_accessions"))

# R Console

#parse the taxonomy results into a metacoder object
obj <- parse_tax_data(sourmash_taxonomy_results %>% select(query_name, groups, match_name, n_unique_weighted_found, lineage),
                      class_cols = "lineage", # the column that contains taxonomic information
                      class_sep = ";", # The character used to separate taxa in the classification
                      class_regex = "^(.+)__(.+)$", # Regex identifying where the data for each taxon is
                      class_key = c(tax_rank = "info", # A key describing each regex capture group
                                    tax_name = "taxon_name"))

#R Console

#set number of unique k-mers assigned to a genome as the abundance information
obj$data$tax_abund <- calc_taxon_abund(obj, "tax_data", cols = c("n_unique_weighted_found"))

#R Console

# generate a heat_tree plot with all taxa from all samples
set.seed(1) # This makes the plot appear the same each time it is run 
heat_tree(obj, 
          node_label = taxon_names,
          node_size = n_obs,
          node_size_axis_label = "k-mer abundance",
          node_color_axis_label = "samples with genomes",
          #output_file = "metacoder_all.png", # uncomment to save file upon running
          layout = "davidson-harel", # The primary layout algorithm
          initial_layout = "reingold-tilford") # The layout algorithm that initializes node locations
ctb commented 4 months ago

updated so it should work with both gather and fastgather CSV files:

library(tidyverse)
library(phyloseq)

# create a metadata table -------------------------------------------------

# create a metadata that has the SRR run accession numbers and other phenotypic information about the samples
# usually, this might be in a csv or tsv file that might be read in with read_csv().
# here, we'll make it from the table earlier in the workflow.

run_accessions <- c("SRR5936131", "SRR5947006", "SRR5935765",
                    "SRR5936197", "SRR5946923", "SRR5946920")
sample_ids <- c("PSM7J199", "PSM7J1BJ", "HSM6XRQB",
                "MSM9VZMM", "MSM9VZL5", "HSM6XRQS")
groups <- c("cd", "cd ", "cd", "nonibd", "nonibd", "nonibd")

metadata <- data.frame(run_accessions = run_accessions, 
                       sample_ids = sample_ids, 
                       groups = groups) %>%
  column_to_rownames("run_accessions")

# read in sourmash gather & taxonomy results ------------------------------

# read in the sourmash taxonomy results from all samples into a single data frame
sourmash_taxonomy_results <- Sys.glob("*gather_gtdb_rs214.with-lineages.csv") %>%
  map_dfr(read_csv, col_types = "ddddddddcccddddcccdc") 
if (!("match_name" %in% names(sourmash_taxonomy_results))) {
  sourmash_taxonomy_results <- sourmash_taxonomy_results %>%
    rename(match_name = name)
}
sourmash_taxonomy_results <- sourmash_taxonomy_results %>%
  mutate(match_name = gsub(" .*", "", match_name)) 

# We need two tables: a tax table and an "otu" table. 
# The tax table will hold the taxonomic lineages of each of our gather matches.
# To make this, we'll make a table with two columns: the genome match and the lineage of the genome.
# The "otu" table will have the counts of each genome in each sample.
# We'll call this our gather_table.

tax_table <- sourmash_taxonomy_results %>%
  select(match_name, lineage) %>%
  distinct() %>%
  separate(lineage, into = c("domain", "phylum", "class", "order", "family", "genus", "species"), sep = ";") %>%
  column_to_rownames("match_name")

gather_table <- sourmash_taxonomy_results %>% 
  mutate(n_unique_kmers = (unique_intersect_bp / scaled) * average_abund) %>% # calculate the number of uniquely matched k-mers and multiply it by average k-mer abundance
  select(query_name, match_name, n_unique_kmers) %>% # select only the columns that have information we need
  pivot_wider(id_cols = match_name, names_from = query_name, values_from = n_unique_kmers) %>% # transform to wide format
  replace(is.na(.), 0) %>% # replace all NAs with 0 
  column_to_rownames("match_name") # move the genome name to a rowname

# create a phyloseq object ------------------------------------------------

physeq <- phyloseq(otu_table(gather_table, taxa_are_rows = T),
                   tax_table(as.matrix(tax_table)),
                   sample_data(metadata))

# R Console
sourmash_taxonomy_results %>% 
  group_by(match_name) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  head()

# R console
sourmash_taxonomy_results %>%
  filter(match_name == "GCA_003112755.1") %>%
  select(lineage) %>%
  distinct()

library(corncob)
#install.packages("janitor") # if necessary, comment out after installing
library(janitor)

corncob <- bbdml(formula = GCA_003112755.1 ~ groups,
                 phi.formula = ~ 1,
                 allow_noninteger = TRUE,
                 data = physeq)

corncob

# R Console

# Aggregate the data at the genus and family level
physeq_genus <- physeq %>%
  tax_glom("genus")

corncob_genus <- differentialTest(formula = ~ groups,
                                  phi.formula = ~ 1,
                                  formula_null = ~ 1,
                                  phi.formula_null = ~ 1,
                                  data = physeq_genus,
                                  allow_noninteger = T,
                                  test = "Wald", B = 100,
                                  fdr_cutoff = 0.1)

plot(corncob_genus)

# R console

glance_significant_models <- function(corncob_results){
  df <- data.frame()
  for(i in 1:length(corncob_results$significant_taxa)){
    taxa_name <- corncob_results$significant_taxa[i]
    taxa_df <- corncob_results$significant_models[[i]]$coefficients %>%
      as.data.frame() %>%
      rownames_to_column("covariate") %>%
      mutate(taxa_name = taxa_name)
    df <- bind_rows(df, taxa_df)
  }
  df <- clean_names(df) %>%
    select(covariate, estimate, std_error, t_value, fdr = pr_t, taxa_name)
  return(df)
}

# R Console

glance_significant_models(corncob_genus) %>%
  filter(fdr < 0.1) %>%
  filter(!grepl("Intercept", covariate))

# R console

physeq_genus_tax_table <- tax_table(physeq_genus) %>% 
  as.data.frame() %>%
  rownames_to_column("taxa_name")

glance_significant_models(corncob_genus) %>%
  filter(fdr < 0.1) %>%
  filter(!grepl("Intercept", covariate)) %>%
  left_join(physeq_genus_tax_table)

# R Console

glance_significant_models(corncob_genus) %>%
  filter(fdr < 0.1) %>%
  filter(!grepl("Intercept", covariate)) %>%
  left_join(physeq_genus_tax_table) %>%
  write_tsv("corncob_results_genus.tsv")

# R console

library(tidyverse)
#@CTB you'll need to do this once, on the Jetstream instances: 
# install.packages("metacoder")
library(metacoder)

# create a metadata that has the SRR run accession numbers and other phenotypic information about the samples
# usually, this might be in a csv or tsv file that might be read in with read_csv().
# here, we'll make it from the table earlier in the workflow.

run_accessions <- c("SRR5936131", "SRR5947006", "SRR5935765",
                    "SRR5936197", "SRR5946923", "SRR5946920")
sample_ids <- c("PSM7J199", "PSM7J1BJ", "HSM6XRQB",
                "MSM9VZMM", "MSM9VZL5", "HSM6XRQS")
groups <- c("cd", "cd ", "cd", "nonibd", "nonibd", "nonibd")

metadata <- data.frame(run_accessions = run_accessions, 
                       sample_ids = sample_ids, 
                       groups = groups) 

# read in the sourmash taxonomy results from all samples into a single data frame
sourmash_taxonomy_results <- Sys.glob("*gather_gtdb_rs214.with-lineages.csv") %>%
  map_dfr(read_csv, col_types = "ddddddddcccddddcccdc")
if (!("match_name" %in% names(sourmash_taxonomy_results))) {
  sourmash_taxonomy_results <- sourmash_taxonomy_results %>%
    rename(match_name = name)
}
sourmash_taxonomy_results <- sourmash_taxonomy_results %>%
  mutate(match_name = gsub(" .*", "", match_name)) %>% # simplify the genome name to only include the accession
  mutate(n_unique_kmers = unique_intersect_bp / scaled) # calculate the number of uniquely matched k-mers

# R Console

sourmash_taxonomy_results <- sourmash_taxonomy_results %>%
  left_join(metadata, by = c("query_name" = "run_accessions"))

# R Console

#parse the taxonomy results into a metacoder object
obj <- parse_tax_data(sourmash_taxonomy_results %>% select(query_name, groups, match_name, n_unique_weighted_found, lineage),
                      class_cols = "lineage", # the column that contains taxonomic information
                      class_sep = ";", # The character used to separate taxa in the classification
                      class_regex = "^(.+)__(.+)$", # Regex identifying where the data for each taxon is
                      class_key = c(tax_rank = "info", # A key describing each regex capture group
                                    tax_name = "taxon_name"))

#R Console

#set number of unique k-mers assigned to a genome as the abundance information
obj$data$tax_abund <- calc_taxon_abund(obj, "tax_data", cols = c("n_unique_weighted_found"))

#R Console

# generate a heat_tree plot with all taxa from all samples
set.seed(1) # This makes the plot appear the same each time it is run 
heat_tree(obj, 
          node_label = taxon_names,
          node_size = n_obs,
          node_size_axis_label = "k-mer abundance",
          node_color_axis_label = "samples with genomes",
          #output_file = "metacoder_all.png", # uncomment to save file upon running
          layout = "davidson-harel", # The primary layout algorithm
          initial_layout = "reingold-tilford") # The layout algorithm that initializes node locations
ChillarAnand commented 4 months ago

@taylorreiter Thanks for taking the time to write the detailed tutorial. This saved me a lot of time!

Ivy-ops commented 3 months ago

@ctb , thanks for the nice tutorial!! It helps a lot. May I know how to generate the phy_tree and add it to the phyloseq data object? Thank you!!