kassambara / ggpubr

'ggplot2' Based Publication Ready Plots
https://rpkgs.datanovia.com/ggpubr/

Tell stat_compare_means() the column to use for pairing #560

Open Mchicken1988 opened 1 year ago

Mchicken1988 commented 1 year ago

Hi,

first of all I want to say that I love your packages. Thanks for your work!

I'm preparing plots to compare intron retention levels of paired normal and tumor patient samples (1 normal and 1 tumor sample per patient).

Regarding my issue: I build my plots from a combination of geom_boxplot(), geom_point() and geom_path(), which is essentially what ggpaired() does. I add the P value of a paired Wilcoxon test using stat_compare_means(paired = TRUE). However, the reported P value differed from the one I get from wilcox.test(). The cause was that the data.frame I used for plotting was not sorted by patient ID, so stat_compare_means() paired the samples incorrectly. I solved the problem by sorting the data.frame by patient ID with dplyr::arrange() before preparing the plots.
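The mismatch described above can be reproduced without any plotting. The following is a minimal base R sketch, assuming (as reported here) that a paired test applied to a long-format data.frame matches observations by their row position within each group. The toy values and the names `df`, `tumor_by_id`, `p_row_order` and `p_by_id` are made up for illustration.

```r
# Hypothetical toy data: tumor rows are stored in reverse patient order
df <- data.frame(
  sampleType = rep(c("Normal", "Tumor"), each = 6),
  ID         = c(1:6, 6:1),
  value      = c(0.10, 0.20, 0.30, 0.40, 0.50, 0.60,
                 0.66, 0.55, 0.44, 0.33, 0.22, 0.11)
)

normal <- df[df$sampleType == "Normal", ]
tumor  <- df[df$sampleType == "Tumor", ]

# Pairing by row position: patient 1 (normal) is matched with patient 6 (tumor), etc.
p_row_order <- wilcox.test(normal$value, tumor$value, paired = TRUE)$p.value

# Pairing by patient ID: reorder the tumor rows to follow the normal IDs first
tumor_by_id <- tumor[match(normal$ID, tumor$ID), ]
p_by_id <- wilcox.test(normal$value, tumor_by_id$value, paired = TRUE)$p.value

print(c(row_order = p_row_order, by_id = p_by_id))
```

In this toy data every patient has a slightly higher tumor value, so the ID-based pairing gives a much smaller P value than the row-order pairing.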

My question is whether it would make sense to add a parameter to stat_compare_means() that names the column to use for pairing (in my case, the column containing the patient IDs), or at least to warn the user that the data.frame must be sorted.
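Until something like this exists, a user-side check can catch the problem before plotting. This is only a sketch of the idea: `ids_aligned()` and `warn_if_unpaired()` are hypothetical helpers, not part of ggpubr, and they assume one row per patient per group.

```r
# Hypothetical helper (not part of ggpubr): TRUE only if every group lists
# the same IDs in the same row order, so row-wise pairing equals ID pairing.
ids_aligned <- function(data, group, id) {
  ids_by_group <- split(data[[id]], data[[group]])
  reference <- ids_by_group[[1]]
  all(vapply(ids_by_group, function(ids) identical(ids, reference), logical(1)))
}

# Hypothetical wrapper: warn instead of silently pairing the wrong samples
warn_if_unpaired <- function(data, group, id) {
  if (!ids_aligned(data, group, id)) {
    warning("IDs are not in the same order in every group; ",
            "sort by '", id, "' before running a paired test.")
  }
  invisible(data)
}

# Toy data with tumor rows in reverse patient order
df <- data.frame(
  sampleType = rep(c("Normal", "Tumor"), each = 6),
  ID         = c(1:6, 6:1),
  value      = seq(0.1, 1.2, by = 0.1)
)
df_sorted <- df[order(df$sampleType, df$ID), ]

aligned_before <- ids_aligned(df, "sampleType", "ID")
aligned_after  <- ids_aligned(df_sorted, "sampleType", "ID")
```

Calling `warn_if_unpaired(df, "sampleType", "ID")` right before ggpaired() or stat_compare_means(paired = TRUE) would flag the unsorted case, while the sorted copy passes silently.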

Here is a small example that creates an unsorted dummy data.frame and a sorted copy and passes each to ggpaired(). As one can see, the data.frame needs to be sorted to get the correct pairing and the correct P value.

library(ggpubr)
library(dplyr)
set.seed(123)

# Dummy data: tumor rows are stored in reverse patient order (ID 10 to 1)
unsorted <- data.frame(sampleType = c(rep("Normal", 10), rep("Tumor", 10)),
                       value = c(runif(10, 0, 0.2), runif(10, 0, 0.4)),
                       ID = c(1:10, 10:1))

# Same data, sorted so both groups list the patients in the same order
sorted <- unsorted %>% arrange(sampleType, ID)

# Unsorted input: samples are paired by row position, not by patient ID
ggpaired(unsorted, x = "sampleType", y = "value",
         color = "sampleType", line.color = "gray", line.size = 0.4,
         palette = "npg") +
  stat_compare_means(paired = TRUE)

# Sorted input: row position now matches patient ID
ggpaired(sorted, x = "sampleType", y = "value",
         color = "sampleType", line.color = "gray", line.size = 0.4,
         palette = "npg") +
  stat_compare_means(paired = TRUE)

Best, Mario

b-niu commented 2 months ago

Hi @Mchicken1988, I'm running into the same problem with stat_compare_means(), and compare_means() seems to have the same limitation: neither function currently offers a parameter to specify which column holds the pairing ID.

I understand that this might be a feature that's still being developed or considered for the open-source software, and I appreciate the hard work that goes into maintaining and improving it.

Thanks a lot.