grunwaldlab / metacoder

Parsing, Manipulation, and Visualization of Metabarcoding/Taxonomic data
http://grunwaldlab.github.io/metacoder_documentation

Phyloseq and metacoder #141

Closed wipperman closed 5 years ago

wipperman commented 7 years ago

Can you point me to a tutorial or accessor function for how to take an OTU table + metadata and convert this to a taxmap object, which can then be used to make plots? Is this possible with the package? I see that it is in the future directions in the paper, but am unable to figure it out myself (although am able to run all of the software and the examples!). Thanks so much for the help.

tarunaaggarwal commented 6 years ago

Hey @zachary-foster... So I'm very close to getting the graph for only fungal groups but keep running into an edge_label error.

obj = parse_qiime_biom("otu_table_mc2_w_tax_BlankOTUsRemoved_BlankSamplesRemoved_nem46.biom", 
                       class_regex = "^D?_?[0-9]*_?_?(.+)$")
print(obj)

obj$data$otu_table <- zero_low_counts(obj, "otu_table", min_count = 1)
no_reads <- obj$data$otu_table[, 1] == 0
sum(no_reads)

# Convert counts to proportions
obj$data$otu_table <- calc_obs_props(obj,
                                     dataset = "otu_table",
                                     cols = obj$data$sam_data$sample_ids)

# Calculate per-taxon proportions 
obj$data$otu_table <- calc_taxon_abund(obj, 
                                       dataset = "otu_table", 
                                       cols = obj$data$sam_data$sample_ids)

# construct heat tree
obj %>%
  filter_taxa(taxon_names == "Fungi", subtaxa = TRUE) %>%
  heat_tree(obj,
            node_size = n_obs,
            node_color = n_obs,
            node_label = taxon_names,
            tree_label = taxon_names)

Error:

Error in check_element_length(c("node_size", "edge_size", "node_label_size",  : 
  Length of argument'edge_size' must be a factor of the length of 'taxon_id'

I tried setting edge_size to taxon_names, but then I get a similar error for edge_label, node_label_size, edge_label_size, etc.

Please let me know what I'm doing wrong here. Thank you so much!

zachary-foster commented 6 years ago

Hi @tarunaaggarwal, the error occurs because you passed obj to heat_tree explicitly even though it was already piped in via %>%. The extra obj was therefore matched to the next unspecified parameter, which happened to be edge_size. Sorry for the unhelpful error message. It's a common enough mistake that I should add some code to check for it (#231).

If you have not used %>% before: it takes what came before and uses it as the first input to the next function. For example, the two examples below do the exact same thing:

obj %>%
  filter_taxa(taxon_names == "Fungi", subtaxa = TRUE) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = taxon_names,
            tree_label = taxon_names)

just_fungi <- filter_taxa(obj, taxon_names == "Fungi", subtaxa = TRUE)
heat_tree(just_fungi,
          node_size = n_obs,
          node_color = n_obs,
          node_label = taxon_names,
          tree_label = taxon_names)
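
To see why the extra obj ends up in a parameter like edge_size, here is a minimal sketch using a made-up function, show_args(), which is not part of metacoder and only reports which of its optional arguments received a value. It assumes the magrittr package (which provides %>%) is installed.

```r
library(magrittr)  # provides %>%

# Hypothetical stand-in for a plotting function with optional arguments.
# It reports which optional arguments received a value.
show_args <- function(data, size = NULL, color = NULL) {
  c(size = !is.null(size), color = !is.null(color))
}

x <- 1:3

# Correct: x is piped in as `data`; only `size` is set.
piped <- x %>% show_args(size = 2)

# Equivalent call without the pipe.
direct <- show_args(x, size = 2)

# Mistake: x is piped in AND passed again. This is really
# show_args(x, x, size = 2), so the second x fills the next
# unnamed slot, which here is `color`.
doubled <- x %>% show_args(x, size = 2)

print(piped)
print(doubled)
```

In the doubled call, color is reported as filled even though it was never named, which is the same thing that happened to edge_size in the heat_tree call.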

grabear commented 6 years ago

@zachary-foster @tarunaaggarwal

Hello guys. From what I can tell the issue might not be with metacoder, but with the way you've parsed your data in phyloseq. After taking a look at your .biom files that you linked above it looks like you've used the SILVA database to annotate your data. Is that correct @tarunaaggarwal?

If you've used SILVA, it designates ranks with the D_0__, D_1__, etc. prefixes (unlike GreenGenes, which uses prefixes like k__, p__, c__, etc.).

If so, you can use the function that I've created and submitted as a PR to phyloseq here: https://github.com/joey711/phyloseq/pull/854

If you have the function loaded, then you'll end up doing something like this:

library(phyloseq)

# Import biom data
silva_biom <- system.file("extdata", "SILVA_OTU_table.biom", package="phyloseq")
# Create a phyloseq object using the SILVA parsing function
# (parse_taxonomy_silva_128 must already be loaded, e.g. from the PR above)
silva_phyloseq <- import_biom(BIOMfilename = silva_biom, parseFunction = parse_taxonomy_silva_128)

grabear commented 6 years ago

I made a gist: https://gist.github.com/grabear/018e86413b19b62a6bb8e72a9adba349

tarunaaggarwal commented 6 years ago

Hey @zachary-foster - Thank you for explaining the %>% usage. I have only used it a couple of times in the past and I keep forgetting that it operates like the |. It worked! Check out the lovely graph. Once I get the for loop to work, I will post my code in a separate issue page for anyone in the future who wishes to use it.

Hi @grabear - I did indeed use the SILVA database. You caught that fast! 👍 Thank you for sending your R code. I'm sure it will be a huge time saver for our lab since we use Phyloseq a bunch. So does this mean you use SILVA database as well? How do you like the SILVA taxonomy for Qiime? We have been thinking about fixing the taxonomy manually but it is such a HUGE task. Just wondering what your opinion is.

Thanks fellas! Appreciate your help!

[Image: screen shot 2018-04-27 at 10 44 53 am]

grabear commented 6 years ago

The microbiome project I'm working on is the first one I've done. So it's also the first time I used metacoder. But in my journey I read this journal article.

I like that SILVA is more up to date (29/09/2016) vs GreenGenes (May 2013).

There's also this:

SILVA, being the largest of the three 16S based taxonomies, shares the most taxonomic units with NCBI

That whole article was helpful to me.

zachary-foster commented 6 years ago

Thanks for your input @grabear!

@tarunaaggarwal:

Great! I am glad it worked. Does anyone know where Ascomycota and Basidiomycota are in SILVA's taxonomy? It seems like they should be there.

Once I get the for loop to work, I will post my code in a separate issue page for anyone in the future who wishes to use it.

Cool, thanks!

tarunaaggarwal commented 6 years ago

Hey @zachary-foster - I'm getting back to metacoder this week, so I'm sorry I only just noticed that you asked a question in your response. I just grep'd for D_5__Ascomycota and D_5__Basidiomycota in the taxonomy file that comes with SILVA (v132), using consensus_taxonomy_all_levels.txt, and found 3402 and 2443 OTUs, respectively.

Were you using the file with 7 levels?

grabear commented 6 years ago

I believe Qiime uses the 7 level taxonomy file @tarunaaggarwal. Not sure of other software that uses SILVA, so that's my only reference.

tarunaaggarwal commented 6 years ago

Hey @grabear - I thought you can specify any taxonomy file with any number of levels in Qiime. Either way, if I need more levels, I just replace the taxonomy strings with all 14 levels using a quick Python script.

So how is Metacoder working out for you? I ask because I have been thinking of ways to make it work best for our lab. Ideally, I want to import the biom table (without taxonomy info), mapping file and a taxonomy file into metacoder without Phyloseq. Do you know how to do that?

zachary-foster commented 6 years ago

@tarunaaggarwal, I just ask because I did not see any fungal phyla in your plot and having them there would make it look better perhaps. Did they get filtered out before plotting?

tarunaaggarwal commented 6 years ago

Hey @zachary-foster! Oh I see. Those levels were not present in the 7 level taxonomy file we used to classify the OTUs I believe. Hence, they were missing. I have another question for you Zach. Is it possible to import the biom table (without taxonomy info), mapping file and a taxonomy file into metacoder without Phyloseq?

zachary-foster commented 6 years ago

So the biom file has an abundance matrix but no tax info? If you send me an example of each, I can probably tell you how to do it. Anything tabular will work for sure. I do have a biom parser as well, although I have never tried reading a file without taxonomy info.

tarunaaggarwal commented 6 years ago

Thanks @zachary-foster ! Here is the folder containing both types of biom tables - with and without tax. The one without taxonomy info is accompanied with a taxonomy file. I hope I'm not creating too much work for you. THANKS for your help!

grabear commented 6 years ago

Is there some reason that you need to use a biom file without taxonomies?

tarunaaggarwal commented 6 years ago

@grabear Sort of. We just reassigned taxonomy with SILVA 132, and I'd rather find a way to work with the new taxonomy within R than have to add it to the biom table and refilter all over again. If it's not feasible, it's not the end of the world.

zachary-foster commented 6 years ago

@tarunaaggarwal, no problem! I found a way to do it:

With taxonomy

library(metacoder)
with_tax <- parse_qiime_biom("biom-table-with-tax/otu_table_mc2_w_tax_BlankOTUsRemoved_BlankSamplesRemoved.biom")
print(with_tax)
## <Taxmap>
##   521 taxa: ab. D_0__Eukaryota ... ub. D_11__Norrlinia[truncated]
##   521 edges: NA->ab, ab->ac, ab->ad ... kd->tz, lq->ua, lr->ub
##   1 data sets:
##     otu_table:
##       # A tibble: 30,871 x 283
##         taxon_id otu_id   MEMB.nem.105 MEMB.nem.117 MEMB.nem.156
##         <chr>    <chr>           <dbl>        <dbl>        <dbl>
##       1 bh       FJ48040…           0.           0.           0.
##       2 ls       AF50812…           0.           0.           0.
##       3 lt       AY63045…           0.           0.           0.
##       # ... with 3.087e+04 more rows, and 278 more variables:
##       #   MEMB.nem.157 <dbl>, MEMB.nem.167 <dbl>,
##       #   MEMB.nem.168 <dbl>, MEMB.nem.176 <dbl>,
##       #   MEMB.nem.190 <dbl>, MEMB.nem.198 <dbl>,
##       #   MEMB.nem.200 <dbl>, MEMB.nem.22 <dbl>,
##       #   MEMB.nem.232 <dbl>, MEMB.nem.272 <dbl>, …
##   0 functions:

Without taxonomy

# Get OTU abundance
without_tax <- biomformat::read_biom("biom-table-without-tax/otu_table_mc2.biom")
## Warning in strsplit(msg, "\n"): input string 1 is invalid in this locale
otu_table <- dplyr::as_tibble(as.matrix(biomformat::biom_data(without_tax)))

# Get taxonomy file 
taxonomy <- readr::read_tsv("biom-table-without-tax/rep_set_tax_assignments.txt",
                            col_names = c("otu_id", "tax", "some_number"))
## Parsed with column specification:
## cols(
##   otu_id = col_character(),
##   tax = col_character(),
##   some_number = col_double()
## )
# Combine both in a taxmap object 
obj <- parse_tax_data(taxonomy,
                      class_cols = "tax", class_sep = ";",
                      datasets = list(otu_table = otu_table),
                      mappings = c("{{index}}" = "{{index}}"))
print(obj)
## <Taxmap>
##   2375 taxa: aab. D_0__Eukaryota ... dnj. D_14__
##   2375 edges: NA->aab, aab->aac ... czz->dni, daa->dnj
##   2 data sets:
##     tax_data:
##       # A tibble: 31,848 x 4
##         taxon_id otu_id    tax                       some_number
##         <chr>    <chr>     <chr>                           <dbl>
##       1 dab      New.Refe… D_0__Eukaryota;D_1__Opis…       0.700
##       2 amc      New.Refe… D_0__Eukaryota;D_1__Opis…       0.820
##       3 amd      New.Clea… D_0__Eukaryota;D_1__Opis…       1.00 
##       # ... with 3.184e+04 more rows
##     otu_table:
##       # A tibble: 31,848 x 305
##         taxon_id MEMB.nem.105 MEMB.nem.117 MEMB.nem.156
##         <chr>           <dbl>        <dbl>        <dbl>
##       1 dab                4.           1.          36.
##       2 amc                0.           0.           0.
##       3 amd                0.           0.           0.
##       # ... with 3.184e+04 more rows, and 301 more variables:
##       #   MEMB.nem.157 <dbl>, MEMB.nem.167 <dbl>,
##       #   MEMB.nem.168 <dbl>, MEMB.nem.176 <dbl>,
##       #   MEMB.nem.190 <dbl>, MEMB.nem.198 <dbl>,
##       #   MEMB.nem.200 <dbl>, MEMB.nem.22 <dbl>,
##       #   MEMB.nem.232 <dbl>, MEMB.nem.272 <dbl>, …
##   0 functions:
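
One thing to keep in mind with the code above: the "{{index}}" = "{{index}}" mapping pairs row 1 of otu_table with row 1 of the taxonomy table, and so on, so it relies on both tables listing OTUs in the same order. The sketch below uses made-up toy data (the OTU IDs, sample names, and taxonomy strings are invented) to show one way to check and enforce that order with base R before building the taxmap object.

```r
# Toy abundance matrix: rows are OTUs (note the order), columns are samples.
abund <- matrix(c(4, 0, 0,
                  1, 0, 0),
                nrow = 3,
                dimnames = list(c("otu_b", "otu_a", "otu_c"),
                                c("sample_1", "sample_2")))

# Toy taxonomy table listing the same OTUs in a different order.
taxonomy <- data.frame(
  otu_id = c("otu_a", "otu_b", "otu_c"),
  tax = c("D_0__Eukaryota;D_1__Opisthokonta",
          "D_0__Eukaryota;D_1__Archaeplastida",
          "D_0__Eukaryota;D_1__SAR"),
  stringsAsFactors = FALSE
)

# Find, for each taxonomy row, the matching row of the abundance matrix.
row_order <- match(taxonomy$otu_id, rownames(abund))
stopifnot(!anyNA(row_order))  # every OTU must be present in both tables

# Reorder the abundance rows so that row i matches taxonomy row i.
abund_aligned <- abund[row_order, , drop = FALSE]
otu_table <- data.frame(otu_id = rownames(abund_aligned),
                        abund_aligned,
                        row.names = NULL,
                        stringsAsFactors = FALSE)

print(otu_table)
```

If the stopifnot check fails, some OTUs are missing from one of the files and index-based mapping would silently pair the wrong rows.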

If you want to remove prefixes like "D_0__" and suffixes like "(animals)" from the names, you can use the class_key and class_regex options, if you know some regex.
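
As a rough illustration of the regex idea, the sketch below tests a pattern with base R before handing it to parse_tax_data. The taxon names are made-up examples in SILVA style, and the pattern is only one possible choice, not something taken from the metacoder documentation. The commented-out call at the end shows where the pattern would go, assuming the default class_key (a single capture group holding the taxon name).

```r
# Made-up taxon names in the style SILVA uses.
raw_names <- c("D_0__Eukaryota",
               "D_3__Metazoa (Animalia)",
               "D_5__Ascomycota")

# One capture group for the name itself. The "D_<number>__" prefix and an
# optional trailing " (...)" note sit outside the group and are dropped.
name_regex <- "^D_[0-9]+__(.*?)(?: *\\(.*\\))?$"

# Check what the capture group keeps, using base R with Perl-style regex.
clean_names <- sub(name_regex, "\\1", raw_names, perl = TRUE)
print(clean_names)

# Sketch of how the pattern could be passed on (not run here):
# obj <- parse_tax_data(taxonomy,
#                       class_cols = "tax", class_sep = ";",
#                       class_regex = name_regex,
#                       datasets = list(otu_table = otu_table),
#                       mappings = c("{{index}}" = "{{index}}"))
```

Testing the pattern on a handful of real names first makes it easier to spot names it does not handle before the whole taxonomy is parsed.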

tarunaaggarwal commented 6 years ago

@zachary-foster Nice! I will try this right away. And how to import the mapping file please?

zachary-foster commented 6 years ago

Oh yea, the mapping file. It has no taxonomic info associated with it, so it can stay a separate table. The taxmap object is only concerned with data that has taxonomic info associated with it, unlike phyloseq objects. You could put the mapping file in there, but it would not help anything. I like the readr package for tabular data:

mapping <- readr::read_tsv("18S-euk-QIIME-mapping-MEmicrobiome-FINAL-26Jul17.txt")

If you want, you could add it to the taxmap object like so:

obj$data$mapping <- readr::read_tsv("18S-euk-QIIME-mapping-MEmicrobiome-FINAL-26Jul17.txt")

obj$data is a list, so you can put anything you want there, but it will not make things easier unless what you put there is named by taxon IDs.
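
Keeping the mapping as its own table still makes it easy to use alongside the taxmap data. As a small sketch with invented data (the column names sample_id and group are hypothetical, not taken from the actual mapping file), the mapping can be used to pick out the sample columns that belong to one group:

```r
# Toy per-taxon abundance table: one row per taxon, one column per sample.
otu_table <- data.frame(taxon_id = c("ab", "ac", "ad"),
                        s1 = c(4, 0, 2),
                        s2 = c(1, 0, 0),
                        s3 = c(36, 5, 0),
                        stringsAsFactors = FALSE)

# Toy mapping (sample metadata) table: one row per sample.
mapping <- data.frame(sample_id = c("s1", "s2", "s3"),
                      group = c("treatment", "treatment", "control"),
                      stringsAsFactors = FALSE)

# Sample IDs that belong to one group, looked up in the mapping table.
treatment_samples <- mapping$sample_id[mapping$group == "treatment"]

# Sum the abundance of each taxon over just those samples.
treatment_total <- rowSums(otu_table[, treatment_samples, drop = FALSE])
names(treatment_total) <- otu_table$taxon_id
print(treatment_total)
```

The same lookup can supply the sample names for arguments such as cols in the calculations shown earlier in this thread.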

tarunaaggarwal commented 6 years ago

Morning @zachary-foster - may I please have your email address?

zachary-foster commented 6 years ago

Sure, it's zacharyfoster1989 a gmail.com

tarunaaggarwal commented 6 years ago

Thanks! I emailed ya.

zachary-foster commented 5 years ago

Closing due to inactivity. If there are still unresolved issues, feel free to reopen this issue or open a new issue.

sunilmundra commented 5 years ago

Hi @zachary-foster, I am trying to make a differential heat tree for two categories (birch vs spruce). Everything works fine when I plot the overall tree and the trees for the individual categories, but when I try to make a differential tree comparing the two categories based on the log2 mean ratio, I get an error (see code and error below).

###################
# Differential tree
###################
metacoder.data$taxon_names

# first calculate the differences in abundances based on the two groupings
families$data$diff_table <- compare_groups(metacoder.data, data = "tax_by_host", #_by_host OR specify which information to use - here it is the abundance of the reads - you could also do the proportion in the samples using tax_data instead
                                           cols = c("Birch","Spruce"), #specify the names of the columns containing your OTU counts
                                           groups = c("Birch","Spruce"))
families$data$diff_table$log2_median_ratio
heat_tree(families,
      node_label = taxon_names, #specifies what names to give the circles (nodes)
      node_size = n_obs, #size nodes by total number of reads
      node_color = log2_median_ratio, #specifies the differences between groups
      edge_color = log2_median_ratio,
      #edge_size = total,#thickness of lines determined by total # of reads
      node_color_interval = c(-2, 2),
      node_color_range = c("cadetblue", "grey75", "darkorange1"),
      node_label_size_range = c(0.007,0.06),#adjust the min-max numbers to change the relative size of the text
      node_size_axis_label = "Total Taxon Abundance",
      node_color_axis_label = "Differentially Occurring Samples")

Error in check_element_length(c("node_size", "edge_size", "node_label_size", :
  Length of argument'node_color' must be a factor of the length of 'taxon_id'

There seems to be a problem with how node_color is defined, and I cannot figure out how to get it right.

Looking forward to hearing from you.

Regards, Sunil

zachary-foster commented 5 years ago

Hi Sunil,

What does the print out for families look like before plotting? Thanks