grunwaldlab / metacoder

Parsing, Manipulation, and Visualization of Metabarcoding/Taxonomic data
http://grunwaldlab.github.io/metacoder_documentation
Other
135 stars 28 forks source link

How to filter heattree from undesired taxa #252

Closed emankhalaf closed 5 years ago

emankhalaf commented 5 years ago

Hi, I have been using metacoder to visualize core microbiome as no. of obs (otus), and now I would like to display the core as read count (abundance). My otu_table is unfiltered, I mean it has non-sense taxa to my analyses such as chloroplasts and mitochondria. I wonder how can I clean the heattree from undesired taxa or apply any filtration criteria. Thanks

zachary-foster commented 5 years ago

Hi @emankhalaf! I can t give specific advice without knowing more about how your data is organized.

I have been using metacoder to visualize core microbiome as no. of obs (otus), and now I would like to display the core as read count (abundance).

First, you need to calculate read abundance per taxon. You can use calc_taxon_abund to do this. Look at some of the examples in ?calc_taxon_abund (function help info) an let me know if you have questions. Once you do that, you replace n_obs in your heat_tree code to whatever the column name of interest that calc_taxon_abund returns. There are many examples of this process in the documentation:

https://grunwaldlab.github.io/metacoder_documentation/example.html

My otu_table is unfiltered, I mean it has non-sense taxa to my analyses such as chloroplasts and mitochondria. I wonder how can I clean the heattree from undesired taxa or apply any filtration criteria.

Use filter_taxa to remove these. I would make a list of taxa you want to remove and filter on that like so:

taxa_to_remove <- c("chloroplast", "mitochondria", ...)
obj <- filter_taxa(obj, taxon_names %in% taxa_to_remove, invert = TRUE, subtaxa = TRUE)

This will remove all those taxa and any of their subtaxa.

Make sense?

emankhalaf commented 5 years ago

Thank you very much. I will give it a try and be back for my trial updates. One more thing, I have a script to remove D_0, D_1, .... from taxa display. I can use it with phyloseq object, but do not know how to remove taxa level code from the otu-table in a text format. Thanks again

zachary-foster commented 5 years ago

I can use it with phyloseq object, but do not know how to remove taxa level code from the otu-table in a text format.

by "text format" do you mean in a file? If so, you can use find ("D_") and replace ("") in a text editor restricted to just the area with the headers ("in selection").

if you mean in a taxmap object, you can use sub on the column names to do the same find and replace.

emankhalaf commented 5 years ago

Hi, I run the following:

my_table <- read.table("kit.txt", sep = "\t", header = TRUE)
my_table$OTU_ID <- as.character(my_table$OTU_ID)
my_table$taxonomy <- as.character(my_table$taxonomy)
taxa_to_remove <- c("D_3__Chloroplast", "D_4__Mitochondria")
my_tablef <- filter_taxa(my_table, taxon_names %in% taxa_to_remove, invert = TRUE, subtaxa = TRUE)

and got this error:

Error: Unsupported class: data.frame

emankhalaf commented 5 years ago

Hi, Sorry, I should apply the function on taxmap. I did as follow, however, I still have the same count?

my_table$OTU_ID <- as.character(my_table$OTU_ID)
my_table$taxonomy <- as.character(my_table$taxonomy)
parsed <- parse_tax_data(my_table, class_cols = "taxonomy", class_sep = ";")

Convert to proportions

sample_cols <- colnames(my_table)[startsWith(colnames(my_table), "X")]
parsed$data$tax_data <- calc_obs_props(parsed, "tax_data", cols = sample_cols)
taxa_to_remove <- c("Chloroplast", "Mitochondria")
parsedf <- filter_taxa(parsed, taxon_names %in% taxa_to_remove, invert = TRUE, subtaxa = FALSE, supertaxa = FALSE)

Sum counts per taxon

parsedf$data$taxon_counts <- calc_taxon_abund(parsedf, "tax_data", cols = sample_cols)

Get mean proportion

parsedf$data$taxon_counts$mean_prop <- rowMeans(parsedf$data$taxon_counts[, sample_cols])

Test plot

heat_tree(parsedf, 
          node_size = n_obs, 
          node_color = mean_prop,
          node_label = taxon_names)

IN phyloseq object, I should have 593 taxa, and here after filtration I got 630 taxa. Also, the tree branches in terms of abundance look different from the one generated without filtration. I am not sure if am doing it correctly or not? Thanks

emankhalaf commented 5 years ago

Hi, I am trying the script in the link (example document) that you provided. When come to step obj <- parse_tax_data(hmp_otus, class_cols = "taxonomy", # the column that contains taxonomic information class_sep = ";", # The character used to separate taxa in the classification class_regex = "^(.+)__(.+)$", # Regex identifying where the data for each taxon is class_key = c(tax_rank = "info", # A key describing each regex capture group tax_name = "taxon_name"))

I got this error: Error: The following 6 of 6 input(s) could not be matched by the regex supplied: Bacteria, Proteobacteria, Gammaproteobacteria, Enterobacteriales, Enterobacteriaceae, Pantoea 5. stop(call. = FALSE, paste0(collapse = "", c("The following ", sum(not_matching), " of ", length(input), " input(s) could not be matched by the regex supplied:\n", limited_print(input[not_matching], prefix = " ", type = "silent")))) 4. validate_regex_match(x, class_regex) 3. FUN(X[[i]], ...) 2. lapply(parsed_tax, function(x) { validate_regex_match(x, class_regex) output <- data.frame(stringr::str_match(x, class_regex), stringsAsFactors = FALSE) ... 1. parse_tax_data(my_table, class_cols = "taxonomy", class_sep = ";", class_regex = "^(.+)__(.+)$", class_key = c(tax_rank = "info", tax_name = "taxon_name"))


I do not know what does it mean. By the way the order of columns in my otu table is different from the posted example. I mean it starts with OTU_ID, samples, then taxonomy. Whereas the posted example has taxonomy (lineage) after OTU_ID. Does it make any difference in analysis. Also, I need to identify taxa location, so I can filter the taxmap correctly. Your help is much appreciated.

zachary-foster commented 5 years ago

Its hard for me to say what is going on without running the code or seeing output. If you send be "kit.txt" it would probably be easier to debug. You can email it to me if you would rather not post it here.

IN phyloseq object, I should have 593 taxa, and here after filtration I got 630 taxa.

Odd, how did you get that number for the phyloseq? Phyloseq might be calculating it differently (e.g. things like "unknown" in multiple places counting as one).

I am trying the script in the link (example document) that you provided.

Are you using the provided dataset and getting that error? That error means that regex "^(.+)__(.+)$" cannot match Gammaproteobacteria etc, which makes sense because there is no underscore in Gammaproteobacteria. Since they are plain taxon names you can omit the class_regex and class_key options and it should work.

By the way the order of columns in my otu table is different from the posted example.

Yea, almost every dataset I see is different from eachother.

Does it make any difference in analysis.

Not really, but the particular code used will change.

Also, I need to identify taxa location, so I can filter the taxmap correctly.

What do you mean by "taxon location"

emankhalaf commented 5 years ago

Yes, when remove class_regex and class_key, it works. But, when filter taxmap from otu with read count <10, in the confirmation step no_reads <- rowSums(parsed$data$tax_data[, kit_samples$SampleID]) == 0 sum(no_reads)

I found sampleIDs in metadata file are different from sampleIDs in taxmap (as X was added pre-each sampleID). Also, I still see chloroplast in the tree as cyanobacteria. May you please post your e.mail, so I can send the files. I need to proceed to differential abundance responsive to treatment, but I can not. Thanks in advance

zachary-foster commented 5 years ago

Sure, it is "zacharyfoster1989@gmail.com"

zachary-foster commented 5 years ago

I have emailed you an example analysis that might help.

I found sampleIDs in metadata file are different from sampleIDs in taxmap (as X was added pre-each sampleID).

They were added by read.table, because the the column names started with numbers, which base R does not like. In the example I sent you over email, I use the readr package which avoids this.

Also, I still see chloroplast in the tree as cyanobacteria.

I was able to filter them out using your code for the most part. Had to add subtaxa = TRUE. see email

IN phyloseq object, I should have 593 taxa, and here after filtration I got 630 taxa.

I am not seeing either of those numbers, but there are a lot of placeholder taxa ("uncultured..", "Ambiguous_taxa"). Phyloseq might be counting those differently because they are in multiple locations in the taxonomy.

emankhalaf commented 5 years ago

Hello,

Thank you very much, I just still have two more inquiries: 1- for taxa that log2 median ratio with negative value, that means the abundance of these taxa was significantly reduced upon pathogen treatment (of course in case I am using p-value <0.05 as a threshold for displaying data), is that true? so we can calculate differential abundance as +/- impact of the treatment. 2- About the node color label; is it OTU count per taxon, or No. of taxa (I can say it is no. of taxa, taken in account OTU count per each taxon). But which label is more accurate?

Thanks again

zachary-foster commented 5 years ago

1- for taxa that log2 median ratio with negative value, that means the abundance of these taxa was significantly reduced upon pathogen treatment (of course in case I am using p-value <0.05 as a threshold for displaying data), is that true? so we can calculate differential abundance as +/- impact of the treatment.

Yea, depending on order of treatment_1 and treatment_2 in the output. One of the resons I use the log of the ratio is that 0 means no difference and differences are symmetric around zero. For example, a log2 median ratio of -2 means the median abundance of "treatment_2" is 4 times larger than "treatment_1"

> log2(4/1)
[1] 2
> log2(1/4)
[1] -2

2- About the node color label; is it OTU count per taxon, or No. of taxa (I can say it is no. of taxa, taken in account OTU count per each taxon). But which label is more accurate?

If you use n_obs on an OTU table, then it is number of OTUs per taxon. n_subtaxa returns number of subtaxa per taxon, but I doubt that is what want in this case.

emankhalaf commented 5 years ago

Oh, great, your answers improve my understanding to the code. I think in my case it is OTU count per taxon. I also used heat-tree function with phyloseq object filtered by using a specific prevalence threshold, that returns a limited no. of taxa (for example 15 taxa) and the tree display (node size and color) is mainly based on taxa count (not OTU count) as it is prevalence not abundance. is that correct as well?

emankhalaf commented 5 years ago

Hi Zac, As I try to understand why no. of taxa is different in metacoder from what I got in phyloseq after filtering Mitochondria and chloroplasts. I wonder if using the following code: filter_taxa(parsed, taxon_names %in% taxa_to_remove, subtaxa = TRUE, supertaxa = TRUE) should remove the entire feature (OTU) of filtered taxa not just only the subtaxa, that is why there is a difference in taxa count? Also, when I re-run the code I got this error: Error in filter_taxa(parsed, taxon_names %in% taxa_to_remove, subtaxa = TRUE, : unused arguments (subtaxa = TRUE, supertaxa = TRUE)

3. eval(lhs, parent, parent) 2. eval(lhs, parent, parent) 1. filter_taxa(parsed, taxon_names %in% taxa_to_remove, subtaxa = TRUE, supertaxa = TRUE) %>% print_tree()

even if I use only subtaxa = TRUE, I got an error. I re-checked the files and found no problem (they are the same)

May you please help me to figure it out? Best

zachary-foster commented 5 years ago

Hello @emankhalaf,

. I wonder if using the following code: filter_taxa(parsed, taxon_names %in% taxa_to_remove, subtaxa = TRUE, supertaxa = TRUE) should remove the entire feature (OTU) of filtered taxa not just only the subtaxa, that is why there is a difference in taxa count?

Did you forget to include invert = TRUE? That code actually removes everything not in "taxa_to_remove", which is not what you intended I think.

Error in filter_taxa(parsed, taxon_names %in% taxa_to_remove, subtaxa = TRUE, : unused arguments

You might have loaded phyloseq after metacoder and phyloseq::filter_taxa is being used on a taxmap object, which will not work. Unfortunaty both packages have a function called "filter_taxa". Try using "taxa::filter_taxa" instead of "filter_taxa" to specific which function to use.

As I try to understand why no. of taxa is different in metacoder from what I got in phyloseq after filtering Mitochondria and chloroplasts.

If you send me the code for doing the comparison, i can probably tell you why. Hard to say otherwise. I am curious. I expect it has to do with the difference in how metacoder and phyloseq store the taxonomy information.

emankhalaf commented 5 years ago

I will send you the code. However, I detached phyloseq and the function is working now. When I used subtaxa and supertaxa = True, so nothing is remained in the table, as either has the same root (Bacteria). It sounds that there is no way to remove mitochondria and chloroplast entirely (entire line) from taxmap. Please help, I like metacoder taxonomic representation.

zachary-foster commented 5 years ago

It sounds that there is no way to remove mitochondria and chloroplast entirely (entire line) from taxmap

Are you taking about the taxonomy table that is included when converting a phyloseq object or the taxonomy defined in the taxmap object? By default, when taxa are removed from the taxonomy, and data assigned to those taxa are reassigned to a supertaxon that was not removed. If you dont want this to happen, use the reassign_obs option. That either takes TRUE/FALSE or a value for each table. For example:

library(phyloseq)
library(metacoder)
data(GlobalPatterns)
x <- parse_phyloseq(GlobalPatterns)

taxa_to_remove <- c("Sulfolobales")
filter_taxa(x, taxon_names %in% taxa_to_remove, subtaxa = TRUE, invert = TRUE, reassign_obs = FALSE)
filter_taxa(x, taxon_names %in% taxa_to_remove, subtaxa = TRUE, invert = TRUE, reassign_obs = c(tax_data = FALSE))

Does that do what you want?

Also, if you are talking about the taxonomy table called "tax_data" that is included by "parse_phyloseq", then you don't really need to worry about subsetting it, but the above method should work for it too. Thats just another table in a taxmap object and could be deleted without losing any information, since all the taxonomy information is stored in the taxmap itself.

emankhalaf commented 5 years ago

In phyloseq, I already filtered mitochondria and chloroplast, then I subset the object to visualize a specific group of samples, then I used the following:


x1 <- parse_phyloseq(crop_working)

# Plot data
heat_tree(x1,
          node_size = n_obs,
          node_color = n_obs,
          node_label = taxon_names,
          node_color_axis_label = "OTU count per taxon",
            layout = "davidson-harel", 
            initial_layout = "reingold-tilford") 
But the displayed tree includes the taxa of the entire object not the subset. I do not know what is wrong in the code!
Thanks
emankhalaf commented 5 years ago

To solve the problem, I created separate objects for each group of samples (based on sequences) and now is working. But, I noticed sth, that if the no. of obs is for example 1579 (shown in R environment), the key of the tree showed the maximum OTU count 1560. Does parsing filter the OTU count? Also, the tree key shows only OTU count/taxon not including the other statistical information (mean-prop) displayed on the other side of the key. (Just a plain display for the microbial community) I wonder which is more acceptable, to display the tree with only OTU count or return back to taxonomy table and try to filter mitochondria and chloroplasts lines manually and run metacoder code? Sorry for the lengthy conversation, but I really need to know which is more scientifically correct?

emankhalaf commented 5 years ago

Hi Zac, I did a manual filtration for the taxonomy table and re-run the code. I got the same count as in parse-phyloseq, but the node colors are different. I will send you both htmls in case of using phyloseq with metacoder and metacoder only(taxonomy table). Also, when I run parsed$data$taxon_counts <- calc_taxon_abund(parsed, "tax_data") then heat-tree to display the tree by only read count, I got the following error: Error in eval(x$expr, data, x$env) : object 'taxon_counts' not found

Lastly, When I checked your metacoder publication, in figure 3 how can you display two legends for the same tree. The tree is very informative showing the contribution of each taxon in the total read count (microbial community). May you please provide me with guidelines to generate similar ones. I have 8 sets of data and I would like to compare between them visually.

zachary-foster commented 5 years ago

Hi @emankhalaf,

I will be camping for the next few days without internet, so I will try to figure this out on Monday.

emankhalaf commented 5 years ago

No problem. Have a nice weekend.

emankhalaf commented 5 years ago

Hi@Zachary Foster, Something weird I noticed, sometimes the tree includes a separate branch of archaea (diseased tissues), and sometimes does not (healthy tissues), however archaea read count is reduced upon infection, which means they should be more abundant and represented in healthy tree. May you please explain that to me! Thanks

zachary-foster commented 5 years ago

But, I noticed sth, that if the no. of obs is for example 1579 (shown in R environment), the key of the tree showed the maximum OTU count 1560. Does parsing filter the OTU count?

No, the parsing functions do not remove any data. That is probably the maximum number of OTUs for a single taxon (Bacteria I would guess), but you also have some other things like Archea taking up some of those total OTUs.

Also, the tree key shows only OTU count/taxon not including the other statistical information (mean-prop) displayed on the other side of the key. (Just a plain display for the microbial community) I wonder which is more acceptable, to display the tree with only OTU count or return back to taxonomy table and try to filter mitochondria and chloroplasts lines manually and run metacoder code?

I am not sure I understand. I get the impression that you are trying to manipulate the "tax_data" taxonomy table in order to influence which taxa are printed in the tree. This will not work, because the "tax_data" table is not used to define the taxonomy after using "parse_phyloseq". You can delete or ignore that table; it is only useful for phyloseq objects. Deleting rows from that table will not change the number of taxa. The taxonomy in a taxmap object (format used by metacoder) is stored separately from the user-defined tables.

Also, when I run parsed$data$taxon_counts <- calc_taxon_abund(parsed, "tax_data") then heat-tree to display the tree by only read count, I got the following error: Error in eval(x$expr, data, x$env) : object 'taxon_counts' not found

'taxon_counts' is the name of the table, you need to use the name of the column in that table with the counts you want to plot. To get counts for groups of samples, use the "groups" option.

Lastly, When I checked your metacoder publication, in figure 3 how can you display two legends for the same tree.

The second legend is for edge size/color. If you use the edge_color or edge_size options for heat_tree, the second legend should automatically appear (if the edge sizes mean something different than the node sizes). You can see an example of this in the "Setting edge attributes" section here:

https://grunwaldlab.github.io/metacoder_documentation/workshop--05--plotting.html

That example also shows plotting read depth per taxon for sample groups.

Something weird I noticed, sometimes the tree includes a separate branch of archaea (diseased tissues), and sometimes does not (healthy tissues), however archaea read count is reduced upon infection, which means they should be more abundant and represented in healthy tree.

I cant say why that would happen without seeing the code/data.

emankhalaf commented 5 years ago

Sorry for my delayed reply as I was in a sick leave. Thank you very much for your answers, really very helpful. With regard to the last question, I will send you the code and the data. (I can see archaea branch when I use parse-phyloseq), but not always displayed in metacoder?? Thanks again

zachary-foster commented 5 years ago

It is not displayed in the heat trees because of the filter_taxa(taxon_names == "Bacteria", subtaxa = TRUE) before the heat tree call:

parsedf %>%
  filter_taxa(taxon_names == "Bacteria", subtaxa = TRUE) %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs, # n_obs is a function that calculates, in this case, the number of OTUs per taxon
            node_color = mean_prop, # A column from `obj$data$diff_table`
            node_size_axis_label = "OTU count per taxon",
            node_color_axis_label = "mean_prop",
            layout = "davidson-harel", # The primary layout algorithm
            initial_layout = "reingold-tilford", # The layout algorithm that initializes node locations
            output_file = "bacteria_dif.pdf")

That removes any taxa that are not in bacteria before plotting. If you look at the object, archea are there:

> parsedf
<Taxmap>
  1294 taxa: aab. Bacteria, aac. Archaea, aad. Proteobacteria ... byj. Ambiguous_taxa, byk. Ambiguous_taxa, byl. uncultured soil bacterium
  1294 edges: NA->aab, NA->aac, aab->aad, aab->aae, aab->aaf, aab->aag, aab->aah ... bhv->byg, bjp->byh, bjq->byi, bgk->byj, bao->byk, bjr->byl
  2 data sets:
    tax_data:
      # A tibble: 1,638 x 43
        taxon_id `16T20` `16T71` `16T89` `16T35` `16T91` `16T93` `16T28` `16T47` `16T80` `16T13` `16T27` `16T43`  `16T1` `16T63` `16T66` `16T58` `16T84`
        <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
      1 atu      0.00718  0.0379  0.256  0.883    0.0569  0.0764 0.0238  0.0302   0.0191 8.59e-2  0.601   0.135  0.490    0.198  0.114   0.00822  0.0561
      2 aka      0.0242   0.0634  0.0283 0.0152   0.0908  0.0115 0.00861 0.440    0.848  6.04e-3  0.0136  0.0407 0.00752  0.0232 0.00205 0.0681   0.0463
      3 atv      0.00740  0.150   0      0.00647  0.0237  0.138  0.00188 0.00955  0.0258 2.68e-4  0       0.0154 0.0499   0.0557 0.337   0.0219   0.0556
      # ... with 1,635 more rows, and 25 more variables: `16T87` <dbl>, `16T39` <dbl>, `16T42` <dbl>, `16T6` <dbl>, `16T16` <dbl>, `16T62` <dbl>,
      #   `16T94` <dbl>, `16T26` <dbl>, `16T48` <dbl>, `16T88` <dbl>, …
    taxon_counts:
      # A tibble: 1,294 x 44
        taxon_id `16T20`  `16T71` `16T89` `16T35` `16T91` `16T93` `16T28` `16T47` `16T80` `16T13` `16T27` `16T43` `16T1` `16T63` `16T66` `16T58` `16T84`
      * <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
      1 aab       0.956  10.00e-1 0.994     1     9.99e-1   1      0.933    1       1       1       1       1      1       1       1     0.998   9.99e-1
      2 aac       0.0435  9.73e-5 0.00635   0     8.50e-4   0      0.0665   0       0       0       0       0      0       0       0     0.00162 8.90e-4
      3 aad       0.547   8.89e-1 0.747     0.968 8.34e-1   0.622  0.468    0.936   0.983   0.941   0.952   0.847  0.957   0.977   0.825 0.518   6.23e-1
      # ... with 1,291 more rows, and 26 more variables: `16T87` <dbl>, `16T39` <dbl>, `16T42` <dbl>, `16T6` <dbl>, `16T16` <dbl>, `16T62` <dbl>,
      #   `16T94` <dbl>, `16T26` <dbl>, `16T48` <dbl>, `16T88` <dbl>, …
  0 functions:

If you remove the filtering before the plot, you will see a small tree for archaea next to the bacteria tree (it is a different tree because there is no "life" taxon that they are both a part of). Since the archaea are so much less diverse, I would just plot them on their own tree so they dont get lost visually in all the bacteria:

parsedf %>%
    filter_taxa(taxon_names == "Archaea", subtaxa = TRUE) %>%
    heat_tree(node_label = taxon_names,
              node_size = n_obs, # n_obs is a function that calculates, in this case, the number of OTUs per taxon
              node_color = mean_prop, # A column from `obj$data$diff_table`
              node_size_axis_label = "OTU count per taxon",
              node_color_axis_label = "mean_prop",
              layout = "davidson-harel", # The primary layout algorithm
              initial_layout = "reingold-tilford", # The layout algorithm that initializes node locations
              output_file = "archaea.png")

archaea

Or maybe just display that info in a table or in the text, since there are so few.

zachary-foster commented 5 years ago

Closing due to inactivity. If there are still unresolved issues, feel free to reopen this issue or open a new issue.

emankhalaf commented 5 years ago

Thanks so much for your help and sorry for not responding as I had head injury leading to concussion since December 2nd and so far I am not able to look at any monitor, just very limited use of cell phone. Thanks again. Eman

On Mon, Jan 14, 2019, 6:46 PM Zachary Foster <notifications@github.com wrote:

Closed #252 https://github.com/grunwaldlab/metacoder/issues/252.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/grunwaldlab/metacoder/issues/252#event-2073261586, or mute the thread https://github.com/notifications/unsubscribe-auth/AZOrV0j310V6egRz3HJ0jjFiNavbq0wKks5vDRbKgaJpZM4YapVl .