microbiome / miaViz

Microbiome Analysis Plotting and Visualization
https://microbiome.github.io/miaViz
Artistic License 2.0

plotAbundance improvements #132

Closed TuomasBorman closed 4 weeks ago

TuomasBorman commented 5 months ago

1.

When sample names are plotted, they overlap each other and cannot be read.

library(miaViz)
data("GlobalPatterns")

tse <- GlobalPatterns
plotAbundance(tse, rank = "Phylum", add_x_text = TRUE)

image

Some other functions seem to have an angle_x_text parameter, but plotAbundance has no option to rotate the text.

Also, we could consider whether sample names could be taken from colData(tse). For example, paired samples must currently have unique names, but a better option would be to allow shared names so that one can easily see which samples come from the same patient.
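One way to get shared labels while keeping unique sample IDs is to leave the IDs on the axis and relabel them at draw time. The following is only a sketch with plain ggplot2 on toy data; the `patient` column and all values are made up for illustration, and this is not how plotAbundance works internally.

```r
library(ggplot2)

# Toy long-format data: four uniquely named samples from two patients.
df <- data.frame(
    sample  = rep(c("S1", "S2", "S3", "S4"), each = 2),
    patient = rep(c("patientA", "patientA", "patientB", "patientB"), each = 2),
    taxon   = rep(c("Firmicutes", "Bacteroidetes"), times = 4),
    value   = c(0.6, 0.4, 0.3, 0.7, 0.5, 0.5, 0.8, 0.2)
)

# Named vector: unique sample ID -> shared label taken from the metadata.
sample_info <- unique(df[, c("sample", "patient")])
labels <- setNames(sample_info$patient, sample_info$sample)

# The x axis still uses unique IDs; only the printed labels are shared.
p <- ggplot(df, aes(x = sample, y = value, fill = taxon)) +
    geom_col() +
    scale_x_discrete(labels = labels)
```

Because the bars are still keyed by the unique sample ID, two samples from the same patient stay separate bars even though they print the same label.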

2.

If a user wants to compare abundances between groups, or if samples are paired, for instance, our current solution might be suboptimal.

library(patchwork)
library(miaViz)
data("GlobalPatterns")

tse <- GlobalPatterns
p <- plotAbundance(tse, rank = "Phylum", features = "SampleType")
wrap_plots(p, ncol = 1,  heights = c(0.95,0.05))

image

It might be hard to read the plot when there are multiple groups (space between groups might help).

Another option would be to plot abundances as shown here in figure 1b

TuomasBorman commented 5 months ago

Also consider plotting more than 20 (maybe 25) taxa with discrete colors. As seen in the plots above, the colors are on a continuous scale, which makes them hard to read. If there are 20 or fewer taxa, the color scale is discrete.
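As a rough illustration of a discrete palette for 25 taxa, the sketch below builds one named color per taxon with base R. The taxon names and the choice of the "Dark 3" HCL palette are assumptions made for the example, not what miaViz uses.

```r
# Hypothetical taxon names; in practice these would come from rowData.
taxa <- paste0("Phylum", seq_len(25))

# One colour per taxon from a qualitative HCL palette (base R, grDevices).
pal <- setNames(grDevices::hcl.colors(length(taxa), palette = "Dark 3"), taxa)
```

A named vector like this can be passed to ggplot2::scale_fill_manual(values = pal). With this many classes, neighboring hues are still close, so a discrete scale helps the legend more than it helps telling adjacent bars apart.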

Daenarys8 commented 3 months ago

Also related: https://github.com/microbiome/OMA/issues/197

Daenarys8 commented 3 months ago

There are three options to display sample names without cluttering.

TuomasBorman commented 3 months ago

Thanks, theme(axis.text.x = element_text(angle = 45, hjust = 1)) seems to solve the problem of sample names.
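For reference, a self-contained sketch of that theme call on toy data (plain ggplot2; the sample names and values are invented):

```r
library(ggplot2)

# Toy data with long sample names that would overlap when drawn horizontally.
df <- data.frame(
    sample = rep(paste0("long_sample_name_", 1:10), each = 2),
    taxon  = rep(c("Firmicutes", "Bacteroidetes"), times = 10),
    value  = rep(c(0.6, 0.4), times = 10)
)

p <- ggplot(df, aes(x = sample, y = value, fill = taxon)) +
    geom_col() +
    # Rotate the labels 45 degrees and right-align them under their ticks.
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

When plotAbundance returns a single ggplot object, as in the calls in this thread that append labs() to it, the same theme() call can be added to its result with +.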

A couple more things came to my mind while generating plots in one project.

# Prepare data
library(miaViz)
data("Tengeler2020")
tse <- Tengeler2020
tse <- tse[, 1:20]

colData(tse)[["patient"]] <- rep(paste0("patient", seq_len(4)), each = ncol(tse) / 4)
colData(tse)[["sampletype"]] <- factor(rep(paste0("sampletype", seq_len(2)), ncol(tse) / 10))
tse <- tse[, 1:19]

1. Order of taxa

Sometimes the user wants to define the order of taxa. For instance, there might be specific taxa that the user wants listed first. For example, in figure 3 here, "Other" is plotted first: https://www.researchgate.net/publication/347867791_The_Urinary_Microbiome_in_Postmenopausal_Women_with_Recurrent_Urinary_Tract_Infections/figures

For instance, below Firmicutes is plotted first. I am not sure what the best way to achieve the desired behavior is. (Maybe we could check if values are factors and get the order from levels?)

asd <- c("Firmicutes" = "1_Firmicutes")
rowData(tse)[["Phylum"]][ rowData(tse)[["Phylum"]] == names(asd) ] <- asd
plotAbundance(tse, rank = "Phylum", as.relative = TRUE)

image
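The factor idea could look roughly like the sketch below. The helper `taxa_order` is hypothetical and not part of miaViz; it only shows the proposed rule that factors keep their level order and anything else falls back to alphabetical order.

```r
# Hypothetical helper: derive the plotting order of taxa from a
# rowData-like column.
taxa_order <- function(x) {
    if (is.factor(x)) {
        # Respect the user-defined level order, ignoring unused levels.
        levels(droplevels(x))
    } else {
        # Fall back to alphabetical order for character input.
        sort(unique(as.character(x)))
    }
}

phylum <- c("Firmicutes", "Other", "Bacteroidetes", "Firmicutes")

# Character input: alphabetical order.
by_name <- taxa_order(phylum)

# Factor input: the order given in the levels.
by_levels <- taxa_order(
    factor(phylum, levels = c("Other", "Firmicutes", "Bacteroidetes"))
)
```

This would avoid renaming taxa (such as the "1_Firmicutes" prefix above) just to influence sorting.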

2. Displaying column variable

When we want to display sample type, for instance, the type is plotted as colors. However, it might be better to have it as its own facet?

Below is our current solution

p <- plotAbundance(tse, rank = "Phylum", as.relative = TRUE, col.var = "sampletype", order.col.by = "sampletype")
library(patchwork)
wrap_plots(p, ncol = 1, heights = c(0.95,0.05))

image

Behind the link, in figure 2, you can see how the same thing is achieved with facets: https://www.researchgate.net/publication/347867791_The_Urinary_Microbiome_in_Postmenopausal_Women_with_Recurrent_Urinary_Tract_Infections/figures
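A faceted layout of this kind can be sketched with plain ggplot2 as follows. The data are toy values, and this is an illustration of the idea rather than a proposed implementation for plotAbundance.

```r
library(ggplot2)

# Toy long-format data: five samples split over two sample types.
df <- data.frame(
    sample     = rep(paste0("S", 1:5), each = 2),
    sampletype = rep(c("sampletype1", "sampletype2", "sampletype1",
                       "sampletype2", "sampletype1"), each = 2),
    taxon      = rep(c("Firmicutes", "Bacteroidetes"), times = 5),
    value      = c(0.6, 0.4, 0.3, 0.7, 0.5, 0.5, 0.8, 0.2, 0.1, 0.9)
)

p <- ggplot(df, aes(x = sample, y = value, fill = taxon)) +
    geom_col() +
    # One panel per sample type. Free x scale and free x space so that each
    # panel shows only its own samples and bars keep the same width.
    facet_grid(cols = vars(sampletype), scales = "free_x", space = "free_x")
```

The gap between panels also gives the visual separation between groups that was asked for earlier in the thread.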

3. Paired samples

Sometimes we have samples that are drawn from the same patient (for instance, at varying time points). Currently, we do not have a method for that kind of plot. The best that can be done currently is this:


tse_list <- splitOn(tse, "sampletype")

plot_list <- lapply(tse_list, function(x){
    colnames(x) <- x$mappac_id
    p <- plotAbundance(x, as.relative = TRUE, rank = "Phylum", add_x_text = TRUE) +
        labs(title = unique(colData(x)[["sampletype"]]))
    return(p)
})
wrap_plots(plot_list, ncol = 1)

image

but as you can see, the samples do not match. (Maybe we could add missing samples, for instance in the figure above, to sampletype2?)

@Daenarys8 Can you check if you can find solutions for these? We can then discuss more how to implement them.

Daenarys8 commented 3 months ago

I checked some of these and it is interesting because we do have

  1. order.col.by which can order the taxa but with the downside of ordering the counts as well. Perhaps we could modify it a little.

plotAbundance(tse, rank = "Phylum", order.col.by = "Firmicutes")

Rplot

  2. With some modification to .features_plotter or .abund_plotter, we can display column values with facet_wrap. On second thought, if the whole idea of .features_plotter was for column plots, we could remove it entirely and modify .abund_plotter to consume col.var as the condition for such a plot.

plotAbundance(tse, rank = "Phylum", order.col.by = "Firmicutes", col.var = "sampletype")

Rplot01

The above plot could be much better though.

  3. Hmm, I am a bit confused by this third aspect. We earlier cut the data down to 19 samples, each belonging to only one sampletype: 10 to one and 9 to the other. If I understand correctly, the sample is not missing from sampletype2; it is just not of that sampletype. However, perhaps I misunderstood and thought of it differently.

plot_list <- lapply(tse_list, function(x){
    p <- plotAbundance(x, as.relative = TRUE, rank = "Phylum", add_x_text = TRUE, order.col.by = "Firmicutes")
    return(p)
})
wrap_plots(plot_list, ncol = 1)

Rplot02

TuomasBorman commented 3 months ago

Looks very nice.

1. Perhaps this is enough. I still have to test it.

2. Looks good.

3.

As you can see from my plot, sample 10 is missing from sampletype2. You are correct that it was never there in the first place (we do not have a sample for "sample10" - "sampletype2"). However, because of the missing sample, the samples are misaligned between the plots. The plot would be tidier if sampletype2 and sampletype1 were aligned with each other. (It would be easier to read and, in practice, we would not need the sample labels anymore.)

However, I am wondering what the best way to showcase paired samples is. One option is to add an "empty sample" in place of each missing sample (here "sample10" - "sampletype2").
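The "empty sample" option can be sketched in base R on a long-format table. The data below are toy values and the column names are assumptions; the point is only that every patient and sampletype combination gets a row, with NA where no sample exists.

```r
# Toy long-format data: patient3 has no sampletype2 sample.
df <- data.frame(
    patient    = c("patient1", "patient1", "patient2", "patient2", "patient3"),
    sampletype = c("sampletype1", "sampletype2", "sampletype1",
                   "sampletype2", "sampletype1"),
    value      = c(0.6, 0.4, 0.3, 0.7, 0.5)
)

# All patient x sampletype combinations that a fully paired design would have.
full <- expand.grid(
    patient          = unique(df$patient),
    sampletype       = unique(df$sampletype),
    stringsAsFactors = FALSE
)

# Left join: combinations without data get value = NA (an "empty sample").
padded <- merge(full, df, by = c("patient", "sampletype"), all.x = TRUE)
padded <- padded[order(padded$sampletype, padded$patient), ]
```

If such a table is plotted with patient on the x axis and one facet per sampletype on a shared x scale, each patient occupies the same position in every facet and the missing sample shows up as a gap.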

Can you check if this is already solved in some papers? We could then get ideas from them.

TuomasBorman commented 2 months ago

1. That also orders the data based on a certain feature. However, my collaborator wants the "unidentified" taxa to be at the bottom of the graph.

We could add an additional parameter to .order_abund_feature_data(?) that controls which feature is at the bottom of the graph. It could work a little like order.col.by, but without ordering the samples (only the order of the color bars).
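A minimal sketch of such a control, written as a standalone helper: `move_level_last` is a hypothetical name, not an existing miaViz function. It assumes ggplot2's default stacking, in which the last factor level ends up at the bottom of each bar.

```r
# Hypothetical helper: move one feature to the end of the level order
# without touching the order of the samples.
move_level_last <- function(x, feature) {
    x <- as.factor(x)
    stopifnot(feature %in% levels(x))
    factor(x, levels = c(setdiff(levels(x), feature), feature))
}

phylum <- factor(
    c("Other", "Proteobacteria", "Bacteroidetes", "Other"),
    levels = c("Bacteroidetes", "Other", "Proteobacteria")
)

# "Other" is now the last level; the values themselves are unchanged.
reordered <- move_level_last(phylum, "Other")
```

Only the levels change, so the samples keep whatever order they already had.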

2.

The idea of .features_plotter is to visualize a colData variable. However, it can also visualize continuous variables, which facets cannot. To me, facets look better for categorical variables. However, some people might prefer the current option.

That is why I think we should have an option for this. Maybe a facet.cols = FALSE argument that creates facets from col.var.

3.

As already mentioned, we should handle missing samples if the user wants to visualize paired samples. There could be a paired = TRUE option that makes sure the order of samples stays the same in all facets (so that they are comparable).

Can you create a draft that takes these into account? Let's then discuss the best approach, as this might be a somewhat complex issue that requires restructuring the function.

antagomir commented 2 months ago

1) Clarify the relation to the order.row.by argument; should this be "bottom.row", or should we just provide examples of how the user can specify arbitrary sorting?

2) not sure if I understood but sounds worth testing

3) good

TuomasBorman commented 2 months ago

1. One option could be that the user specifies the order with factor levels. That might be the easiest. So instead of characters, the rowData variable could be a factor.

2. The point was that sample information is now plotted as a separate plot. However, these groups could also be plotted as facets. Facets only work for categorical variables, not for numeric ones, which is why we should keep the current functionality as well.

One problem is that many different options make the function more complex for the user.

antagomir commented 2 months ago

1. The user could provide the ordering of the levels in order.row.by?
2. OK. Either support both options, or provide separate solutions and explain all of them and their differences in a single place (the examples in the function manpage, and/or OMA?)

TuomasBorman commented 2 months ago

1.

That is not possible. The user can only specify "name" (alphabetical order), "abund" (abundance), or "revabund" (reverse abundance).

The idea is to get this kind of plot. Here the "Other" group is not interesting, so it is at the bottom. I found that some papers have this kind of plot.

image

antagomir commented 2 months ago

1. But it could be: if the user provides a single string, it is handled as you describe; if the user provides a factor with many levels (as many as there are features), it could be used to determine the order?

TuomasBorman commented 2 months ago

1.

That might be the easiest and most transparent solution. However, we should check that the elements of the vector match the features.

If the user wants to agglomerate the data, it might not be clear what those names are. We could disable the vector option if the user wants to agglomerate.

(The same solution could work for columns also)
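The proposal discussed here (a single keyword, or an explicit ordering that is checked against the features) could be sketched as below. `resolve_feature_order` is a hypothetical helper, not miaViz code; the `abundance` argument and the direction assumed for "abund" (most abundant first) are assumptions made for the example.

```r
# Hypothetical helper, not part of miaViz.
# order.by:  one of "name", "abund", "revabund", or a factor / character
#            vector that lists every feature exactly once.
# abundance: named numeric vector with one total per feature.
resolve_feature_order <- function(order.by, abundance) {
    features <- names(abundance)
    keywords <- c("name", "abund", "revabund")
    # Case 1: a single keyword selects one of the built-in orderings.
    if (length(order.by) == 1L && as.character(order.by) %in% keywords) {
        return(switch(as.character(order.by),
            name     = sort(features),
            abund    = features[order(abundance, decreasing = TRUE)],
            revabund = features[order(abundance, decreasing = FALSE)]
        ))
    }
    # Case 2: an explicit ordering; factors contribute their level order.
    ord <- if (is.factor(order.by)) levels(order.by) else as.character(order.by)
    if (anyDuplicated(ord) > 0 || !setequal(ord, features)) {
        stop("'order.by' must list every feature exactly once.")
    }
    ord
}

abundance <- c(Firmicutes = 50, Bacteroidetes = 30, Other = 20)
by_keyword <- resolve_feature_order("abund", abundance)
by_vector  <- resolve_feature_order(c("Other", "Firmicutes", "Bacteroidetes"), abundance)
```

The same check would apply to columns if sample names are used in place of feature names.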

antagomir commented 2 months ago

Sounds good. There could be an informative warning if the user tries to do both.

TuomasBorman commented 1 month ago

@Daenarys8 Would you be able to create a draft for these?

TuomasBorman commented 1 month ago

I am currently working on this and will hopefully get something out tomorrow.