grunwaldlab / metacoder

Parsing, Manipulation, and Visualization of Metabarcoding/Taxonomic data
http://grunwaldlab.github.io/metacoder_documentation
Other
136 stars 28 forks source link

Adding color to nodes based on specific values of observations. #210

Open grabear opened 6 years ago

grabear commented 6 years ago

Based on my current workflow, I'm generating a taxmap object with metacoder. And then I'm plotting the tree like this:

# Phyloseq...
# Data manipulation....
# Data manipulation....
# Metacoder/Taxmap
...
heat_tree(title = sprintf("Bacterial Abundance at %s Level", rank_list[[i]]),
              # NODE
              ## The node size is relevant to the Abundance level
              ## The node color is relevant to wheather the abundance is higher in control vs stressed animals
              ## The node labels are relevant to significant taxon names.
              node_size = n_obs,
              node_color = log2_median_ratio,
              node_label = ifelse(wilcox_p_value < 0.05, taxon_names, NA),
              node_label_size = 1,
              ### The color red indicates higher abundance in Stressed animals
              ### The color blue indicates higher abundance in Control animals
              ### The color grey represents no difference in Control vs Stressed abundance
              ### Colorblind node_color_range = c("navy", "grey80", "greenyellow"),
              ### Colorblind node_color_range = c("navy", "grey80", "yellow3"),
              node_color_range = c("blue", "grey90", "red"),
              node_color_trans = "linear",
              node_color_interval = c(-4, 4),
              node_size_axis_label = "Size: Number of OTUs",
              node_color_axis_label = "Color: Increased in Stressed (red) vs.\n Increased in Control (blue)\n",
              ### The labels are only for significant (pvalue < 0.05) abundance changes
              ### The labels are green to offset the blue/red colors.
              node_label_color = wilcox_p_value,
              node_label_color_range = c("darkgreen"),
              node_label_max = 1000,
              # EDGE (Branch)
              ## The edge size is relevant to the level of abundance.
              ## The edge color is relevant to significant levels of abundance.
              edge_color = wilcox_p_value,
              ### The color interval is set from 0 to 1 (p-value is a percentage).
              ### The color range is a vector consisting of a dark color followed 
              ### by 19 of the same color.  This divides the interval by 20 so that
              ### every edge below 0.05 (5%; 1/20) is the dark color.  
              ### So our p-value is the dark color.
              edge_color_range = c("honeydew2", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50"),
              edge_color_trans = "linear",
              edge_color_interval = c(0, 1),
              edge_size_axis_label = "Size: Number of OTUs",
              edge_color_axis_label = "Color: Significant Changes \nAmong Treatments", 
              # PLOT Options
              initial_layout = "reingold-tilford",
              layout = "davidson-harel", 
              repel_labels = TRUE, 
              repel_force = 3,
              overlap_avoidance = 3, 
              make_edge_legend=FALSE)

For my edge_color you can see that I did some fairly hacky things. As you can see below I wanted to use the p value (<0.05 is grey50) to color the edges. In order to do this I set my interval up with c(0,1), and the color range is a list with the significant color (honeydew2), and 19 other values (grey50). While I'm aware I could probably use _edge_color = ifelse(wilcox_p_value < 0.05, 0, 1) and edge_colorrange = c("honeydew2", "grey50"), this works for now and is irrelevant to my point.

## The edge size is relevant to the level of abundance.
              ## The edge color is relevant to significant levels of abundance.
              edge_color = wilcox_p_value,
              ### The color interval is set from 0 to 1 (p-value is a percentage).
              ### The color range is a vector consisting of a dark color followed 
              ### by 19 of the same color.  This divides the interval by 20 so that
              ### every edge below 0.05 (5%; 1/20) is the dark color.  
              ### So our p-value is the dark color.
              edge_color_range = c("honeydew2", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50", "grey50"),
              edge_color_trans = "linear",
              edge_color_interval = c(0, 1),

My main concern is the node color:

             ## The node labels are relevant to significant taxon names.
              node_color = log2_median_ratio,
              ### The color red indicates higher abundance in Stressed animals
              ### The color blue indicates higher abundance in Control animals
              ### The color grey represents no difference in Control vs Stressed abundance
              ### Colorblind node_color_range = c("navy", "grey80", "greenyellow"),
              ### Colorblind node_color_range = c("navy", "grey80", "yellow3"),
              node_color_range = c("blue", "grey90", "red"),
              node_color_trans = "linear",
              node_color_interval = c(-4, 4),

According to the docs for the heat_tree object in the color section:

Colors

The colors of nodes, edges, labels, and trees can be mapped to arbitrary numbers. This is useful for highlighting groups of taxa. Only the relative size of numbers is used, not the values themselves.

Is there any way to set up the node color scale so that the numbers represent the actual values (log2_median_ratio in my case), and so that the colors span the desired ranges (negative numbers are red, positive are blue in my case)? Or is there at least a way to map the node_color_interval to the actual data?

My current plans are to add a column to the diff_table using (using mutateobs) that contains : <Taxon_name (log2_medianratio)> And then I'll use that variable as the node_label. And this will at least let us see the corresponding values.

zachary-foster commented 6 years ago

Is there any way to set up the node color scale so that the numbers represent the actual values (log2_median_ratio in my case), and so that the colors span the desired ranges (negative numbers are red, positive are blue in my case)?

I am not sure I understand. To make the colors, only relative differences are used, but the values passed in are what is printed on the legend. Since you have:

              node_color_range = c("blue", "grey90", "red"),
              node_color_trans = "linear",
              node_color_interval = c(-4, 4),

it should work the way I imagine you want. You will be displaying log2_median_ratios between -4 and 4, with no transformation ("linear") and 0 will be "grey90". Any log2_median_ratio greater than or equal to 4 will be red etc. Is that not how it looks or is there something else you want to do?

Or is there at least a way to map the node_color_interval to the actual data?

node_color_interval accepts numbers from the data, "log2_median_ratio" in this case, so c(-4, 4) means "log2_median_ratio"s between -4 and 4. The *_range options are the ones that take values for plotting.

Sorry if that is not helpful, I am not sure what the problem is yet

grabear commented 6 years ago

I know this is confusing. Part of my issue was with really understanding what the interval was doing. Now that I get it, I will just leave the node_color_interval alone.

The problem is with the node_color_range. Since my log2_median_ratio has a minimum of -6 and a maximum of 8, the color scale is not where I desire it to be. 0.0281 is "grey90" instead of 0.

Does this make more sense? I feel like of overshared lol.

grabear commented 6 years ago

In particular for log2_median_ratio this problem that I'm having with the node_color_range, doesn't help with representing the data properly.

# formula used for log2_median_ratio in example diff table
...
  log_ratio <- log2(median(abund_1) / median(abund_2))

Higher abund_1 means a positive number. Higher in abund_2 means a negative number.

zachary-foster commented 6 years ago

When I use a diverging scale with node_color_range, I always set node_color_interval to be symmetric, so things get centered at 0. It will still show the actual values on the legend (node_color_interval does not scale anything if thats what you are concerned with), but values outside the interval will look the same as the values on the limit. For example if node_color_interval = c(-1, 1) and you plot a value of 2, it will have the same color as a value of 1.

So, why not set node_color_interval to c(-6, 6) or c(-8, 8)? That way it is centered on 0. You can also automate it with something like c(-1, 1)*max(abs(range(log2_median_ratio))) or c(-1, 1)*min(abs(range(log2_median_ratio)))

zachary-foster commented 6 years ago

Maybe there is a bug? Can you attach a the graph that results from the code you posted?

grabear commented 6 years ago

Setting the interval that way to center it on 0 is the reason I was using c(-4, 4). For some reason I interpreted the interval as a parameter that scaled the node color values. But instead of altering the scale (which is holistic), you alter what range you're looking at (which is like subsetting).

It will still show the actual values on the legend (node_color_interval does not scale anything if thats what you are concerned with), but values outside the interval will look the same as the values on the limit.

Thank you for that, and the proceeding explanation. I have a much better idea of what's happening.

When setting thenode_color_interval to NULL, I'm assuming that it reverts to the min/max range. Or c(-6, 8). So the legend wouldn't be symmetric, and the center would be off. If that's the case, then there should be no bug.

species class

The above were generated by:

# Other manipulations
...
# Change i to 3 for class and 7 for species
htrees[[rank_list[[i]]]] <- test %>% filter_taxa(n_supertaxa < i) %>% 
  heat_tree(title = sprintf("Bacterial Abundance at %s Level", rank_list[[i]]),
            # NODE
            ## The node size is relevant to the Abundance level
            ## The node color is relevant to wheather the abundance is higher in control vs stressed animals
            ## The node labels are relevant to significant taxon names.
            node_size = n_obs,
            node_color = log2_median_ratio,
            node_label = ifelse(wilcox_p_value < 0.05, Taxon_l2mr, NA),
            node_label_size = 1,
            ### The color red indicates higher abundance in Stressed animals
            ### The color blue indicates higher abundance in Control animals
            ### The color grey represents no difference in Control vs Stressed abundance
            ### Colorblind node_color_range = c("navy", "grey80", "greenyellow"),
            ### Colorblind node_color_range = c("navy", "grey80", "yellow3"),
            node_color_range = c("blue", "grey90", "red"),
            node_color_trans = "area",
            #node_color_interval = c(-6.26339904556416, 8.05383581749738),
            node_size_axis_label = "Size: Number of OTUs",
            node_color_axis_label = "Color: Increased in Stressed (red) vs.\n Increased in Control (blue)\n",
            ### The labels are only for significant (pvalue < 0.05) abundance changes
            ### The labels are green to offset the blue/red colors.
            node_label_color = wilcox_p_value,
            node_label_color_range = c("darkgreen"),
            node_label_max = 1000,
            # EDGE (Branch)
            ## The edge size is relevant to the level of abundance.
            ## The edge color is relevant to significant levels of abundance.
            edge_color = ifelse(wilcox_p_value < 0.05, 0, 1),
            ### The color interval is set from 0 to 1 (p-value is a percentage).
            ### The color range is a vector consisting of a dark color followed 
            ### by 19 of the same color.  This divides the interval by 20 so that
            ### every edge below 0.05 (5%; 1/20) is the dark color.  
            ### So our p-value is the dark color.
            edge_color_range = c("honeydew2", "grey50"),
            edge_color_trans = "linear",
            edge_color_interval = c(0, 1),
            edge_size_axis_label = "Size: Number of OTUs",
            edge_color_axis_label = "Color: Significant Changes \nAmong Treatments", 
            # PLOT Options
            initial_layout = "reingold-tilford",
            layout = "davidson-harel", 
            repel_labels = TRUE, 
            repel_force = 3,
            overlap_avoidance = 3, 
            make_edge_legend=FALSE)

If this isn't enough, then I can add you to my private repository that contains my analysis.

zachary-foster commented 6 years ago

As far as I can see, that output looks right. I am happy to see that newlines work in the legend axis labels; I dont think that was ever tested.

When setting the node_color_interval to NULL, I'm assuming that it reverts to the min/max range. Or c(-6, 8). So the legend wouldn't be symmetric, and the center would be off. If that's the case, then there should be no bug.

Yea, the default it to show the whole range. The heat_tree function does not know what kind of plot you are trying to make so it just shows all the data. If I could think of a way of identifying that a diverging scale is being used, I could change the default behavior, but it sounds a bit too complicated for a minor gain.

I did just have the idea that maybe the raw stats that colors are mapped to could be specified for non-linear mappings. Something like node_color_range = c("blue", "grey90" = 0, "red") or node_color_range = c("blue", "0" = "grey90", "red"), meaning that 0 values would be colored "grey90" and the other would be, by default, the min and max. Then you could shown the whole range and still be centered on 0.

Sorry for the confusion. Are there any parts of the documentation that could be improved that would have made things easier to understand?

grabear commented 6 years ago

I did just have the idea that maybe the raw stats that colors are mapped to could be specified for non-linear mappings. Something like node_color_range = c("blue", "grey90" = 0, "red") or node_color_range = c("blue", "0" = "grey90", "red"), meaning that 0 values would be colored "grey90" and the other would be, by default, the min and max. Then you could shown the whole range and still be centered on 0.

This is what I was looking for! Through this thread I realized what I wanted wasn't available.

Sorry for the confusion. Are there any parts of the documentation that could be improved that would have made things easier to understand?

No worries! We just have different perspectives. But I'll look at the interval/range documentation tomorrow, and make a PR linked to here.

grabear commented 6 years ago

Something like node_color_range = c("blue", "grey90" = 0, "red") or node_color_range = c("blue", "0" = "grey90", "red"), meaning that 0 values would be colored "grey90" and the other would be, by default, the min and max

This could also be implemented by changing how the interval works. So that you could do something like the following:

element_interval <- c(-6, 0, 8)
# or
element_interval <- c(-6, -4, -2, 0, 2, 4, 6, 8)
# or takes functions
element_interval <- c(min, median, max)

Example implementation:

# for each interval have a color
node_color_range <- c("red", "blue")
node_color_interval <- c(-6, 0, 8) # this would fade into blue at 0 without being symetrical
# or
node_color_range <- c("red", "grey", "blue")
node_color_interval <- c(-6, 0, 8)

Just brainstorming. Take it with a grain of salt. Working on docs now.

grabear commented 6 years ago

212

zachary-foster commented 6 years ago

Thanks for the ideas! I will have to think about this more. It would be nice to make a color center on a value, but I am not sure of the best way to implement this yet.