ggobi / ggally

R package that extends ggplot2
http://ggobi.github.io/ggally/

Hypothesis tests for ggduo #286

Open ewenharrison opened 6 years ago

ewenharrison commented 6 years ago

Sorry if this has come up already. Well done in maintaining such a great package.

Any thoughts about adding pairwise tests to your ggduo function?

Continuous-continuous - correlation as in ggpairs.
Continuous-discrete - Kruskal-Wallis.
Discrete-discrete - chi-sq/Fisher’s.

Multiple testing a-go-go, but would be very useful.

Would likely need an option on whether to test across missing values as well.

Can I help?

Ewen finalfit

schloerke commented 6 years ago

That's a great idea. Could even be used within ggpairs as well.

For my ease of use, assuming you're looking at the first four columns of reshape's tips:

> reshape::tips[1:4] %>% head()
#   total_bill  tip    sex smoker
# 1      16.99 1.01 Female     No
# 2      10.34 1.66   Male     No
# 3      21.01 3.50   Male     No
# 4      23.68 3.31   Male     No
# 5      24.59 3.61 Female     No
# 6      25.29 4.71   Male     No

What would be the corresponding R commands you'd use to test the pairs (ex: 1-2, 1-3, and 3-4)?

For my sanity as well, could you also confirm what is a "good" result and a "bad" result for each one? (ex: low p-value is good)

Yes, controlling for missing values will be important. I believe I would have to defer to how the correlation plots use it where it's either all of X and all of Y, or it is X and Y where both X and Y are not NA. It becomes very confusing when a third column (which is not displayed in the plot) controls which combinations are used.
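To make the two missing-value conventions concrete, here is a minimal sketch using base R only. The data frame `d` is toy data made up for illustration (it is not from `tips`), and `z` stands in for a third column that is not displayed in the panel.

```r
# Toy data (made up for illustration): x and y are plotted, z is not.
d <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 1, 4, 3, NA, 5),
  z = c(NA, NA, 1, 1, 1, 1)
)

# Pairwise deletion: a row is dropped only if x or y is NA.
r_pairwise <- cor(d$x, d$y, use = "pairwise.complete.obs")
n_pairwise <- sum(complete.cases(d[, c("x", "y")]))

# Listwise deletion over the whole data frame: NAs in z also remove rows,
# even though z is not part of this panel.
d_complete <- na.omit(d)
r_listwise <- cor(d_complete$x, d_complete$y)
n_listwise <- nrow(d_complete)

c(pairwise = n_pairwise, listwise = n_listwise)
```

Here the pairwise version uses 5 rows and the listwise version only 3, so the same x-y panel can report a different correlation depending on a column that is not shown.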

Thank you in advance

ewenharrison commented 6 years ago

Great, thanks for getting back to me.

1-2

For continuous vs. continuous, you could take the Pearson correlation coefficient (as you do already) together with the p-value from cor.test.

library(reshape)
with(tips, cor.test(total_bill, tip))$p.value

The two values could be shown together.

1-3

For continuous vs. discrete, you could use the (Wilcoxon-)Kruskal-Wallis test. This reduces to a Mann-Whitney / Wilcoxon rank sum test for two groups, but has the advantage of continuing to work when there are >2 groups for the discrete variable. I think you would only need to show the p-value.

stats::kruskal.test(total_bill ~ sex, data=tips)$p.value
stats::kruskal.test(total_bill ~ smoker, data=tips)$p.value

3-4

For the discrete test, you could use a simple chi-squared test. This will give a warning when the expected count in any cell is <5, but that is ok given that the overall aim here is "big picture".

with(tips, stats::chisq.test(sex, smoker))$p.value

There will be people who feel strongly about non-hypothesis based multiple testing, but again this would be just a summary function with further analyses done following it.

So you could flag a different colour, say red, when the p-value was less than 0.05.
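Putting the three cases together, a rough sketch of how a test could be picked from the column types might look like this. `pair_p_value()` and `p_colour()` are hypothetical helper names, not part of GGally, and the built-in `CO2` data is used only so the example runs without installing reshape; `tips` columns could be passed in the same way.

```r
# Hypothetical helper (not part of GGally): choose a test from the column types.
pair_p_value <- function(x, y) {
  x_num <- is.numeric(x)
  y_num <- is.numeric(y)
  if (x_num && y_num) {
    # continuous vs. continuous: p-value for the Pearson correlation
    stats::cor.test(x, y)$p.value
  } else if (x_num || y_num) {
    # continuous vs. discrete: Kruskal-Wallis across the groups
    value <- if (x_num) x else y
    group <- if (x_num) y else x
    stats::kruskal.test(value, factor(group))$p.value
  } else {
    # discrete vs. discrete: chi-squared test on the contingency table
    # (chisq.test() warns when expected counts are small)
    stats::chisq.test(table(x, y))$p.value
  }
}

# Hypothetical helper: flag p-values below the threshold in red.
p_colour <- function(p, alpha = 0.05) if (p < alpha) "red" else "black"

# Built-in CO2 data, used here only so the sketch is self-contained.
p_cc <- pair_p_value(CO2$conc, CO2$uptake)     # continuous-continuous
p_cd <- pair_p_value(CO2$uptake, CO2$Type)     # continuous-discrete
p_dd <- pair_p_value(CO2$Type, CO2$Treatment)  # discrete-discrete

sapply(c(p_cc, p_cd, p_dd), p_colour)
```

With the thread's data this would be called as, for example, `pair_p_value(tips$total_bill, tips$sex)`. Note that these tests silently drop missing values, so NA handling would still need its own option.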

For missing data, these tests will delete any NA missing values pair/list-wise. You could consider having an na_include logical argument in ggduo. When na_include = FALSE, data = na.omit(data). When na_include = TRUE, adjust the tests to include NAs if present, e.g. chisq.test(table(sex, smoker, useNA = "ifany"))$p.value
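As a small illustration of what `useNA = "ifany"` changes, the sketch below compares the two contingency tables. The data frame is toy data made up for this example, and `na_include` above remains a proposed argument, not an existing ggduo option.

```r
# Toy data (made up for illustration): one missing value in each factor.
d <- data.frame(
  sex    = factor(c("F", "M", "M", "F", NA, "M", "F", "M")),
  smoker = factor(c("No", "No", "Yes", NA, "Yes", "No", "Yes", "Yes"))
)

# Default: table() drops rows with an NA in either variable.
tab_drop <- with(d, table(sex, smoker))

# useNA = "ifany": NA becomes its own level, so no rows are dropped.
tab_keep <- with(d, table(sex, smoker, useNA = "ifany"))

sum(tab_drop)  # 6 rows contribute
sum(tab_keep)  # all 8 rows contribute

# Counts this small trigger chisq.test()'s approximation warning,
# which is suppressed here to keep the example quiet.
p_keep <- suppressWarnings(stats::chisq.test(tab_keep)$p.value)
```

Treating NA as a level tests whether missingness itself is associated with the other variable, which is a different question from the complete-case test.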

schloerke commented 6 years ago

Great!

More tweaks to ponder. How would you approach it when data has been grouped (like with color)?

As for ggpairs, are there any other tests that can be done for the pairs? I'm thinking for ggpairs, we have the whole other triangle that could be utilized.

We even could do a test for each individual variable...

ewenharrison commented 6 years ago

Testing across 3 variables (x, y and group) would be most useful when looking for interactions (lm(y~x*group)), but I don’t think this is required. If the hypothesis tests are done with regression, then a lot of results will be generated for factors with many levels. Becomes too complicated and beyond the scope of the function. So I would allow grouping for visualisation, but for the hypothesis test just look at x and y.
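To show how quickly the regression route multiplies results, here is a minimal sketch with the built-in `iris` data (chosen only because it ships with R; it is not the data discussed in this thread), fitting `lm(y ~ x * group)` with a three-level grouping factor.

```r
# Interaction model: one continuous x, one grouping factor with 3 levels.
fit <- lm(Sepal.Length ~ Sepal.Width * Species, data = iris)

# Each coefficient gets its own p-value in the summary table.
coef_table <- coef(summary(fit))
n_coef <- nrow(coef_table)

# The interaction alone contributes one row per non-reference level.
interaction_terms <- grep(":", rownames(coef_table), value = TRUE)

n_coef
interaction_terms
```

A group with k levels yields 2k coefficients in this model, so a single panel would need to display many p-values rather than one.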

Again, I don’t think a test on the diagonal / single variable is required. There would be other options, single sample t-test against a mean, but too complicated and not required.