ggobi / ggally

R package that extends ggplot2
http://ggobi.github.io/ggally/

Proposal of a cross-tabulation matrix plot #317

Closed larmarange closed 4 years ago

larmarange commented 4 years ago

Hi,

I have developed a ggcross() function for quick and easy cross-tabulation graphs, with the possibility of showing Pearson's residuals. Please see the examples at http://larmarange.github.io/JLutils/reference/ggcross.html

Would it be of interest for GGally? If so, would you consider a pull request?

image

schloerke commented 4 years ago

@larmarange Yes, please!

I have a couple of ideas, let me know what you think...


Thank you again for reaching out!

dicook commented 4 years ago

I really like this! I'm going to leave it to @schloerke to make the decision on the pull request.


larmarange commented 4 years ago

Dear @schloerke

This code was developed some months ago and probably needs to be refactored. I'm open to any suggestions.

For example, the internal function .tidy_chisq() could now be replaced by augment() method from broom (if such import is possible in GGally).

First of all, I do not know yet whether ggally_cross() should be a wrapper around ggcross() or whether they should be two independent functions. Perhaps it would be easier if the expected default output of ggally_cross() was defined:

Not a problem of course for ggally_cross() to adopt ggally_* API and to use hjust and vjust for positioning squares.

Regarding the cuts and the residuals, so far I have used Pearson's residuals; in some of the literature, they are compared against ±2.

However, some other authors recommend using the standardized residuals (stdres) proposed by Agresti instead. According to him:

When H0 is true, each standardized residual has a large-sample standard normal distribution. A standardized residual having absolute value that exceeds about 2 when there are few cells or about 3 when there are many cells indicates lack of fit of H0 in that cell. (Under H0, we expect about 5% of the standardized residuals to be farther from 0 than ±2 by chance alone.)
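To make the difference between the two kinds of residuals concrete, here is a minimal base R sketch. It assumes the built-in HairEyeColor dataset purely for illustration (it is not the data used in this thread) and relies only on the residuals and stdres components returned by chisq.test().

```r
# Hair x Eye cross-tabulation from the built-in HairEyeColor dataset
# (chosen only for illustration; not the data discussed in this thread)
tab <- margin.table(HairEyeColor, c(1, 2))

test <- chisq.test(tab)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson <- test$residuals

# Standardized residuals: Pearson residuals rescaled using the
# row and column margins, as reported by chisq.test()
stdres <- test$stdres

# Flag cells against the +/-2 and +/-3 rules of thumb quoted above
flag_2 <- abs(stdres) > 2
flag_3 <- abs(stdres) > 3

round(stdres, 2)
```

Because the standardized residuals divide by an extra factor smaller than 1, they are never smaller in absolute value than the Pearson residuals, so the same ±2 cut flags at least as many cells.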

larmarange commented 4 years ago

Trying to brainstorm on what you said, I have the feeling that current behaviour of ggcross() could be achieved using ggpairs() and dedicated ggally_* functions.

First of all, it could be relevant to develop a generic stat_cross() that could be used in different ggally_* functions. I tried to develop a first draft visible here: http://larmarange.github.io/JLutils/reference/stat_cross.html

Using such stat, it would be easier to develop several ggally_* functions, for example:

etc. Of course to be discussed.

schloerke commented 4 years ago

I'm really enjoying the idea of ggcross() wrapping around ggpairs()/ggduo() with many ggally_*() functions.

I don't know if we should combine these into smaller functions, or if they should be separate. Ex: ggally_prop(type = "row") vs ggally_prop_row() and ggally_chisq(type = "stdres") vs ggally_chisq_stdres(). Both ggally_row_prop() and ggally_col_prop() would be very similar plots, just different numbers and same with ggally_chisq_stdres() and ggally_chisq_test().

However, if there is a benefit / default value we can provide, then I'm all for separating the functions. (Such as the lower triangle of ggpairs uses row and the upper triangle uses col.)


For example, the internal function .tidy_chisq() could now be replaced by augment() method from broom (if such import is possible in GGally).

broom is already Suggested, so please feel free to use it!

Regarding the cuts and the residual, so far I used the Pearson's residuals and in some literature, they compare them around +- 2.

Thank you for looking into this! Maybe we could default to switch(resid_type, pearson = c(-2, 2), std_resid = c(-3, -2, 2, 3))? This would allow for "significant" and "very significant" results to be determined by the user.

Could also default to the same value to keep colors consistent. (Which may not be a bad idea.)
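A rough sketch of how such a default could look. The helper name default_breaks() and the argument name resid_type are hypothetical here, not part of GGally; only the break values come from the suggestion above.

```r
# Hypothetical helper (not a GGally function): choose default cut points
# depending on the type of residual being displayed
default_breaks <- function(resid_type = c("pearson", "std_resid")) {
  resid_type <- match.arg(resid_type)
  switch(
    resid_type,
    pearson = c(-2, 2),
    std_resid = c(-3, -2, 2, 3)
  )
}

# Bin some example residual values using the standardized-residual breaks
values <- c(-3.5, 0, 2.5)
bins <- cut(values, breaks = c(-Inf, default_breaks("std_resid"), Inf))
bins
```

With the four-break version, the two outer bins would correspond to "very significant" cells and the two bins next to the centre to "significant" ones.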

larmarange commented 4 years ago

Hi @schloerke

I have explored a few options and drafted some code (not properly documented nor committed at this stage) in order to give you some proposals for discussion.

The main idea is to propose several ggally_*() functions and then some high-level wrapper around ggduo().

ggally_*() proposals

ggally_count()

ggally_count() is mainly intended to be used in ggpairs(). It is inspired by ggally_ratio() but takes a global colour aesthetic into account. Another difference is that the rectangles are deliberately centred so that the visualisation is symmetrical with respect to x and y. It relies on geom_tile() and therefore adapts nicely regardless of the number of variables in the plot.

There is also a diagonal version of it. An example:

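library(GGally)
# tips is assumed to come from the reshape package, as in the GGally examples
data(tips, package = "reshape")

ggpairs(
  tips,
  columns = c("sex", "smoker", "day", "time"),
  upper = list(discrete = "count"),
  diag = list(discrete = "countDiag"),
  lower = list(discrete = "ratio")
)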

image

ggpairs(
  tips, 
  columns = c("sex", "smoker", "day", "time"), 
  upper = list(discrete = "count"), 
  diag = list(discrete = "countDiag"), 
  lower = list(discrete = "ratio"),
  mapping = aes(colour = sex)
)

image

ggally_cross() and ggally_table()

ggpairs(
  tips,
  columns = c("sex", "smoker", "day", "time"),
  upper = list(discrete = "cross"),
  diag = list(discrete = "tableDiag"),
  lower = list(discrete = "table"),
  mapping = aes(color = sex)
)

image

They both use the proposed stat_cross(). They are not intended to be split by colour, but the colour aesthetic is used when it corresponds to x or y.

ggally_cross() relies on geom_point(), so it is sometimes necessary to pass a max_size parameter. It is therefore not the best choice for ggpairs().

Both accept additional parameters to display the various statistics derived from the chi-square test. By default, ggally_cross() uses the observed counts for size and the standardized residuals for fill.

ggally_table(
  tips, 
  aes(x = day, y = sex), 
  border_colour = "black", 
  additional_mapping = aes(
    fill = after_stat(stdres), 
    label = scales::percent(after_stat(prop))
  )
) + scale_fill_steps2(breaks = c(-4, -2, 2, 4))

image

The idea would be to use them in particular with global wrappers (i.e. ggcross() and ggtable()) to reproduce the current behaviour of the original ggcross() proposal.

ggally_autopoint()

Using ggforce::geom_autopoint(), it can also display points for categorical variables. A diag version could also be developed.

ggpairs(
  tips, 
  upper = list(continuous = "autopoint", discrete = "autopoint", combo = "autopoint"), 
  diag = list(discrete = "autopointDiag", continuous = "autopointDiag"), 
  mapping = aes(color = sex)
)

image

ggally_colbar() and ggally_rowbar()

ggpairs(
  tips, 
  columns = c("sex", "smoker", "day", "time"), 
  upper = list(discrete = "colbar"), 
  lower = list(discrete = "rowbar"), 
  diag = list(discrete = "countDiag")
)

image

Visual representation with stacked bars of row and col proportions, using the proposed stat_prop().

Here again, the purpose would be to use them mainly with global wrappers around ggduo().
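For reference, the row and column proportions that such stacked bars would represent can be sketched in base R with prop.table(). This is only an illustration of the quantities involved, using the built-in HairEyeColor dataset as an assumed example; it is not the implementation of the proposed stat_prop().

```r
# Hair x Eye cross-tabulation from the built-in HairEyeColor dataset
# (chosen only for illustration)
tab <- margin.table(HairEyeColor, c(1, 2))

# Row proportions: each row sums to 1
row_prop <- prop.table(tab, margin = 1)

# Column proportions: each column sums to 1
col_prop <- prop.table(tab, margin = 2)

round(row_prop, 3)
round(col_prop, 3)
```

The two tables are built from the same counts and differ only in the margin used as denominator, which is why the corresponding plots look so similar.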

Other points of discussion

Probably to be discussed in a separate issue.

It is possible to pass a theme to all graphs of a ggmatrix using ggpairs(...) + theme(...).

Would it be possible to extend this to scale_*() and coord_*()? For some wrappers (for example around ggally_table()) it would be very useful to fix the colour scale. Otherwise, another system to pass the colour scale would have to be developed. But I have the feeling that extending +.gg would be a better and nicer approach.

How to proceed?

Once the different proposals have been discussed, how do you want to proceed? Do you prefer several PRs or just one big PR?

Best regards and sorry for this long message

schloerke commented 4 years ago

How to proceed

If possible, smaller PRs are typically easier to manage. If it seems easier to combine different ideas into one PR, great! (There is no hard rule on this. If it ends up being one PR, no problem.)


Is it possible to pass a theme?

GGally::ggpairs(iris, 1:2) + ggplot2::theme_bw()
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2

Created on 2020-04-23 by the reprex package (v0.3.0)


Would it be possible to extend it to scales_* and to coord_*.

Currently no. But there have been other requests. It's worth making a new GitHub issue. The only real difficulty is cleanly specifying the locations the scale should be used.


Plots

larmarange commented 4 years ago

If possible, smaller PRs are typically easier to manage. If it seems easier to combine different ideas into one PR, great! (There is no hard rule on this. If it ends up being one PR, no problem.)

OK. I will divide the work into different PRs. For each PR, do you want just the code or should I include the result of devtools::document()?

Would it be possible to extend it to scales_* and to coord_*.

Currently no. But there have been other requests. It's worth making a new GitHub issue. The only real difficulty is cleanly specifying the locations the scale should be used.

I'm not sure it would be easy to indicate locations using + syntax.

Maybe one possibility would be to develop a specific function add_element(function, element, cols, rows) for adding an element only to certain cells, while + would always apply to all cols and all rows?

  • ggally_count(): Can we lighten the color to match the existing color value in ggally_ratio()?

Yes, it would be possible by adding a default_fill argument (to be used when no fill mapping is provided).

  • ggally_table(): The color mechanism idea is great! It might be good to add that in the future version of ggally_cor()

Not sure about ggally_cor(), as the current behaviour is quite nice: black for the global statistic + colour for each sub-group.

  • ggally_cross(): Is it possible to reverse the color and fill values? Allowing fill to match the ggpairs's color and the border to represent the chisq value. Worth a try. Motivation is the chisquare value can be found if looked at, but the regular color value is dominant, making it easier to scan / match. (Note for later... the scale labels seem off)

For me, the purpose of ggally_cross() is mainly to be used by a global ggcross() wrapper and to display residuals, not to fit nicely in the context of ggpairs().

It would be possible to develop an alternate version for ggpairs() whose primary objective would be to display the number of observations and the colours passed globally. In fact, it would be a variant of ggally_count() in terms of the data presented. But I can add a ggally_count2() derived from ggally_cross().

  • ggally_autopoint(): Neat! This is a great addition! (In the example plot, size is numeric and should not be jittered)

Where is size jittered in the example? It is not the case when comparing size with another continuous variable. When compared to a categorical variable, it is jittered only on the categorical axis (as expected with geom_autopoint()).

schloerke commented 4 years ago

@larmarange Have we merged (or created) PRs for all new functions that we plan to add for GGally v2.0.0?

Is there anything else we would like to add before the release? Thank you so much for your work! And thank you for the quick turnaround once we were notified by CRAN of when we had to resubmit!

The only thing I have (currently) deferred is https://github.com/ggobi/ggally/issues/345

larmarange commented 4 years ago

Almost. I just need to incorporate your comments on #351 and to finalise the documentation/vignettes of all merged PRs (cf. #350 and #334).

I will work on both today.