compare_groups with a small number of samples.

LeonardosMageiros commented 3 years ago

Hi,

I would like to calculate the differences between two groups of metagenomic samples using the compare_groups function. My concern is that I have a low number of samples in each group (4 vs 3). If I am not mistaken FDR correction is not recommended with such low numbers. If I apply FDR in my case I have no significant results which does not make sense as my groups are influent and effluent of a Waste Water treatment plant.

Do you think it would be ok to use the default Wilcoxon Rank Sum test and skip the FDR step for my purposes? Maybe you are aware of a better approach?

Thank you very much in advance Leonardos

zachary-foster commented 3 years ago

The wilcox test is a very conservative one since it minimizes the number of assumptions about the distribution of the data, and therefore has low power to detect differences. If you dont have 5+ replicates, you probably won't see much significant differences. For example:

> wilcox.test(c(1, 2, 3), c(10, 23, 40))

    Wilcoxon rank sum exact test

data:  c(1, 2, 3) and c(10, 23, 40)
W = 0, p-value = 0.1
alternative hypothesis: true location shift is not equal to 0

Adding a FDR correction makes it even harder to get significant results. I don't know whether or not low numbers of samples justify not using an FDR correction. I have not heard that but it could be true. There are other multiple comparison corrections you could try.

You can also try using Deseq2 to do the significance test using the experimental branch of metacoder:

devtools::install_github('grunwaldlab/metacoder@deseq2')

Check out the function "calc_diff_abund_deseq2". It might be more sensitive than the Wilcox test and is more appropriate for the data anyway. In the future it will be the default instead of the wilcox test.

LeonardosMageiros commented 3 years ago

Thank you very much fro your quick response. Indeed wilcox test is not ideal to detect significant differences.

If you know a more appropriate test for this scenario please do let me know as I am not sure what I should use.

I would be very glad to test the calc_diff_abund_deseq2 function. Do you have a tutorial for that function? Up till now I was using the following code:

wwtp_obj$data$otu_counts <- wwtp_obj$data$otu_counts[c("taxon_id", "OTU", wwtp_meta$ID)]

wwtp_obj$data$otu_counts <- calc_obs_props(wwtp_obj, "otu_counts")
wwtp_obj$data$tax_abund <- calc_taxon_abund(wwtp_obj, "otu_counts", cols = wwtp_meta$ID)

wwtp_obj$data$diff_table <- compare_groups(wwtp_obj, data = "tax_abund",
                                                cols = wwtp_meta$ID,
                                                groups = wwtp_meta$Group)

If I get things right I should replace compare_groups() with calc_diff_abund_deseq2(). Is that correct? Should I remove calc_obs_props() and calc_taxon_abund() as the new function operates on raw counts?

Thank you very much in advance for your time.

zachary-foster commented 3 years ago

Yea, I would DESeq2 function. It is now in the main development branch, which you can install with this command:

devtools::install_github("grunwaldlab/metacoder")

Yea, you should remove calc_obs_props, but keep calc_taxon_abund if you are interested in taxon differences as opposed to OTU/ASV differences.

You can see an example of the process on the help page once the updated package is installed: ?calc_diff_abund_deseq2

Take a look at that and let me know if you have questions.

LeonardosMageiros commented 3 years ago

Hi again,

I believe I have executed successfully calc_diff_abund_deseq2 function on my dataset using the following commands:

`wwtp_obj$data$otu_counts <- wwtp_obj$data$otu_counts[c("taxon_id", "OTU", wwtp_meta$ID)]

wwtp_obj$data$tax_abund <- calc_taxon_abund(wwtp_obj, "otu_counts", cols = wwtp_meta$ID)

wwtp_obj$data$diff_table <- calc_diff_abund_deseq2(wwtp_obj, data = "tax_abund", cols = wwtp_meta$ID, groups = wwtp_meta$Group)`

After the execution of the last command, diff_table looks like that:

then I am running:

And I have the following error:

Do you have any idea on what is going wrong?

I was also thinking that with my sampling size (n=4vs3) any test result would be unreliable to some extend. Is there a way to just "describe" the differences on my abundances on a heat tree? Something like comparing the average values of the two groups and plotting (ie coloring the nodes) the difference might be the best solution in my case. Would that be something simple to do?

Thank you in advance for your time.

Best Leonardos

LeonardosMageiros commented 3 years ago

Hi,

Just to add on my previous comment I managed to simply compare the means by adding func = function(x, y){mean(x) - mean(y)} to the compare_groups function.

I also managed to visualize the calc_diff_abund_deseq2 by changing the node_color to node_color = ifelse(is.na(padj) | padj > 0.05, 0, log2FoldChange)

Can you please briefly tell me what calc_diff_abund_deseq2 is doing under the hood?

Thx Leonardos

zachary-foster commented 3 years ago

Sorry for not replying to your previous message, I dont remember seeing it.

Do you have any idea on what is going wrong?

Are you still getting that error or have you fixed it.

Is there a way to just "describe" the differences on my abundances on a heat tree? .. Just to add on my previous comment I managed to simply compare the means by adding

Yea, you can just plot the differences from either compare_groups or calc_diff_abund_deseq2 without considering p value.

Can you please briefly tell me what calc_diff_abund_deseq2 is doing under the hood?

It is basically just a wrapper for DESeq2. I don't understand DESeq2 well enough to summarize how it works. Unlike the wilcox rank sum test of compare_groups, DESeq2's test is designed for proportional data like RNA-Seq and metabarcoding, so it seems like a better tool for the job if you want a p-value associated with your differences. You can read more about DESeq2 in its publication:

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8

LeonardosMageiros commented 3 years ago

Yes yes I managed to work things my self. So everything works for me now.

When you say "plot the differences from either compare_groups or calc_diff_abund_deseq2 without considering p value" you mean plot the log2FoldChange without modifying it based on the p-value. Is that correct?

Would the approach of func = function(x, y){mean(x) - mean(y)} that I describe above, be better for that purpose or should I stick to log fold changes?

I ll take a look in DESeq2 and I ll let you know if I have any questions.

Thank a lot for being so helpful Best Leo

grunwaldlab / metacoder

compare_groups with a small number of samples. #312