njtierney / brolgar

BRowse Over Longitudinal Data Graphically and Analytically in R
http://brolgar.njtierney.com/

Including an option for a shadow for all data #105

Open emitanaka opened 3 years ago

emitanaka commented 3 years ago

One option that would be nice is to include the shadow of all data as an option for facet_sample or facet_strata. Below is not from using these facets but an example of how a few individuals are highlighted in individual facets but the grey lines show the data for all individuals. I find this "shadow" technique makes comparisons of the individuals to others easier.

[Image: Screen Shot 2020-11-01 at 9 29 40 pm]

I also think this will be a great combination with the facet_wrap_paginate. This way if there are too many individuals, rather than fitting too many in one facet or choosing a select few, there is an option to print them in a separate graph with an option like page.

dicook commented 3 years ago

Can this be done using a regular facet (facet_wrap, facet_grid)? It seems it should generally be an argument for these functions too. I just found an example; it's messy, though:

https://stackoverflow.com/questions/35550411/plotting-the-whole-data-within-each-facet-using-facet-wrap-and-ggplot2

njtierney commented 3 years ago

I quite like the appeal of this! I think it helps demonstrate more clearly what facet_sample/facet_strata are doing. Thanks for the idea, Emi.

However, I'm not sure of the best way to go about it, as (I think?) it involves adding another geom layer. While in most cases I would expect people to use geom_point(), I think I might need to capture the geom used in order to replicate it as the background.

Another option could be to overwrite the fill/colour/alpha argument to be the variable matching which facet it is in. However, I guess that would override whatever the user might have set for that, which perhaps isn't the worst thing?

emitanaka commented 3 years ago

Yeah, I see now this is hard to implement. A Facet has no way of changing color. @dicook, it would be nice to have in facet_wrap or facet_grid, but the way it's programmed, the facet is independent of Geom and Stat, so I don't think it's possible without changing the way ggplot2 works. The only way to have it for facet_sample is to include a Geom layer as well, but it looks like there's no way to know what other geoms have been used at that point, and it also requires the data to be inherited by this geom too... so it looks too hard.

njtierney commented 3 years ago

I'd like to keep this open to consider a way around it, though. Even if it requires some other functions, I think it is possible.

emitanaka commented 3 years ago

I think this should not be implemented in facet_sample, though. In the grammar of graphics, you would assume that the facet_* family is linked to Facet, and that this is an independent component which shouldn't affect a Geom or Stat. Otherwise, it violates the principles of the grammar of graphics.

emitanaka commented 3 years ago

(Apologies in advance, this is long.) The result below was unexpected:

library(brolgar)
library(tidyverse)

set.seed(1)

# * I need to have it as tsibble for it to work otherwise it produces error
# * can't people define the key in the facet_sample as an option? 
# * Why the dependency on tsibble? 
# People should have the choice to use data.frame, tibble or tsibble? Most of tidyverse works with data.frame and tibble, not just the latter. Shouldn't brolgar be the same?
df <- ChickWeight %>% 
  as_tsibble(key = Chick, index = Time, regular = FALSE) 

ggplot(df, aes(Time, weight, group = Chick)) +
  # hmm this doesn't work as expected with facet_sample
  geom_line(data = mutate(df, Chick2 = Chick),
            aes(group = Chick2), color = "gray") +
  geom_line() + 
  facet_sample()

The shadow got sampled too. It looks like tsibble preserves the key even after renaming the key column, and facet_sample is using this to sample it.
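The key behaviour can be checked directly. This is a minimal sketch, assuming the tsibble and dplyr packages are installed: adding a copy of the key column with mutate() leaves the tsibble's key unchanged, so the data passed to the shadow layer still carries `Chick` as its key. That facet_sample() relies on exactly this metadata is an inference from the plot, not something confirmed from the brolgar source.

```r
library(tsibble)
library(dplyr)

df <- as_tsibble(ChickWeight, key = Chick, index = Time, regular = FALSE)

# Add a copy of the key column, as done for the shadow layer
shadow <- mutate(df, Chick2 = Chick)

# The key is still `Chick`; `Chick2` is just an ordinary column
key_vars(shadow)

# Converting to a plain tibble drops the tsibble class and its key metadata
shadow_plain <- as_tibble(shadow)
class(shadow_plain)
```

If the key metadata is what gets sampled, passing a plain tibble or data.frame to the shadow layer might avoid the problem, but that is untested here.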


# without brolgar

## no randomisation & no n_per_facet below 
## but does the shadow trick as expected
ChickWeight %>% 
  mutate(facet_group = as.numeric(Chick) %% 12 + 1 ) %>% 
  ggplot(aes(Time, weight, group = Chick)) +
  geom_line(data = rename(ChickWeight, Chick2 = Chick),  # trick to repeat in each facet
            aes(group = Chick2), color = "gray") +
  geom_line() + 
  facet_wrap(~facet_group)

Created on 2020-11-02 by the reprex package (v0.3.0.9001)

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting value
#>  version R version 4.0.1 (2020-06-06)
#>  os macOS Catalina 10.15.7
#>  system x86_64, darwin17.0
#>  ui X11
#>  language (EN)
#>  collate en_AU.UTF-8
#>  ctype en_AU.UTF-8
#>  tz Australia/Melbourne
#>  date 2020-11-02
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package * version date lib source
#>  anytime 0.3.9 2020-08-27 [1] CRAN (R 4.0.2)
#>  assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.0.0)
#>  backports 1.1.10 2020-09-15 [1] CRAN (R 4.0.2)
#>  blob 1.2.1 2020-01-20 [2] CRAN (R 4.0.0)
#>  brolgar * 0.0.6.9100 2020-10-30 [1] Github (njtierney/brolgar@28e95bb)
#>  broom 0.7.0 2020-07-09 [1] CRAN (R 4.0.2)
#>  cellranger 1.1.0 2016-07-27 [2] CRAN (R 4.0.0)
#>  cli 2.1.0 2020-10-12 [1] CRAN (R 4.0.2)
#>  colorspace 1.4-1 2019-03-18 [1] CRAN (R 4.0.2)
#>  crayon 1.3.4 2017-09-16 [2] CRAN (R 4.0.0)
#>  curl 4.3 2019-12-02 [2] CRAN (R 4.0.0)
#>  DBI 1.1.0 2019-12-15 [1] CRAN (R 4.0.2)
#>  dbplyr 1.4.4 2020-05-27 [1] CRAN (R 4.0.2)
#>  digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
#>  distributional 0.2.1 2020-10-06 [1] CRAN (R 4.0.2)
#>  dplyr * 1.0.1 2020-07-26 [1] Github (tidyverse/dplyr@16647fc)
#>  ellipsis 0.3.1 2020-05-15 [2] CRAN (R 4.0.0)
#>  evaluate 0.14 2019-05-28 [2] CRAN (R 4.0.0)
#>  fabletools 0.2.1 2020-09-03 [1] CRAN (R 4.0.2)
#>  fansi 0.4.1 2020-01-08 [2] CRAN (R 4.0.0)
#>  farver 2.0.3.9000 2020-07-24 [1] Github (thomasp85/farver@f1bcb56)
#>  forcats * 0.5.0 2020-03-01 [2] CRAN (R 4.0.0)
#>  fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#>  generics 0.0.2 2018-11-29 [2] CRAN (R 4.0.0)
#>  ggplot2 * 3.3.2 2020-06-19 [1] CRAN (R 4.0.2)
#>  glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#>  gtable 0.3.0 2019-03-25 [2] CRAN (R 4.0.0)
#>  haven 2.3.1 2020-06-01 [2] CRAN (R 4.0.0)
#>  highr 0.8 2019-03-20 [2] CRAN (R 4.0.0)
#>  hms 0.5.3 2020-01-08 [2] CRAN (R 4.0.0)
#>  htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2)
#>  httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
#>  jsonlite 1.7.1 2020-09-07 [1] CRAN (R 4.0.2)
#>  knitr 1.29 2020-06-23 [1] CRAN (R 4.0.2)
#>  labeling 0.4.2 2020-10-20 [1] CRAN (R 4.0.2)
#>  lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0)
#>  lubridate 1.7.9 2020-06-08 [2] CRAN (R 4.0.1)
#>  magrittr 1.5 2014-11-22 [2] CRAN (R 4.0.0)
#>  mime 0.9 2020-02-04 [2] CRAN (R 4.0.0)
#>  modelr 0.1.8 2020-05-19 [2] CRAN (R 4.0.0)
#>  munsell 0.5.0 2018-06-12 [2] CRAN (R 4.0.0)
#>  pillar 1.4.6 2020-07-10 [1] CRAN (R 4.0.1)
#>  pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.0.0)
#>  purrr * 0.3.4 2020-04-17 [2] CRAN (R 4.0.0)
#>  R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2)
#>  Rcpp 1.0.5 2020-07-06 [1] CRAN (R 4.0.0)
#>  readr * 1.3.1 2018-12-21 [2] CRAN (R 4.0.0)
#>  readxl 1.3.1 2019-03-13 [2] CRAN (R 4.0.0)
#>  reprex 0.3.0.9001 2020-08-08 [1] Github (tidyverse/reprex@9594ee9)
#>  rlang 0.4.8 2020-10-08 [1] CRAN (R 4.0.2)
#>  rmarkdown 2.3 2020-06-18 [1] CRAN (R 4.0.2)
#>  rstudioapi 0.11 2020-02-07 [2] CRAN (R 4.0.0)
#>  rvest 0.3.6 2020-07-25 [1] CRAN (R 4.0.2)
#>  scales 1.1.1 2020-05-11 [2] CRAN (R 4.0.0)
#>  sessioninfo 1.1.1 2018-11-05 [2] CRAN (R 4.0.0)
#>  stringi 1.4.6 2020-02-17 [2] CRAN (R 4.0.0)
#>  stringr * 1.4.0 2019-02-10 [2] CRAN (R 4.0.0)
#>  styler 1.3.2 2020-02-23 [1] CRAN (R 4.0.1)
#>  tibble * 3.0.4 2020-10-12 [1] CRAN (R 4.0.2)
#>  tidyr * 1.1.2 2020-08-27 [1] CRAN (R 4.0.2)
#>  tidyselect 1.1.0 2020-05-11 [2] CRAN (R 4.0.0)
#>  tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 4.0.2)
#>  tsibble 0.9.3.9000 2020-11-01 [1] Github (tidyverts/tsibble@e749eb6)
#>  vctrs 0.3.2.9000 2020-07-26 [1] Github (r-lib/vctrs@df8a659)
#>  withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.2)
#>  xfun 0.16 2020-07-24 [1] CRAN (R 4.0.2)
#>  xml2 1.3.2 2020-04-23 [2] CRAN (R 4.0.0)
#>  yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /Users/etan0038/Library/R/4.0/library
#> [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
```

emitanaka commented 3 years ago

Thinking further about this. Don't really think it's a good idea to include shadow in facet_sample but a separate function might be good, e.g. shadow_sample
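As a rough illustration of what such a separate function could look like, here is a sketch using only ggplot2. The name `shadow_line()` and its arguments are hypothetical and are not part of brolgar or ggplot2. It relies on documented ggplot2 behaviour: a layer whose data lacks the facetting variable is repeated in every panel.

```r
library(ggplot2)

# Hypothetical helper (not a brolgar or ggplot2 function): returns a grey
# geom_line() layer built from `data` with the facetting column(s) dropped,
# so that the layer is drawn in every facet panel.
shadow_line <- function(data, facet_vars, colour = "gray", ...) {
  background <- as.data.frame(data)
  keep <- setdiff(names(background), facet_vars)
  background <- background[, keep, drop = FALSE]
  geom_line(data = background, colour = colour, ...)
}

chicks <- as.data.frame(ChickWeight)

p <- ggplot(chicks, aes(Time, weight, group = Chick)) +
  shadow_line(chicks, facet_vars = "Diet") +  # all chicks, in every panel
  geom_line() +                                # only the chicks in each Diet
  facet_wrap(~Diet)
p
```

The caller still has to say which geom to shadow and which columns define the facets, which is the cost of keeping the facet itself independent of the geoms.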

njtierney commented 3 years ago

Thanks for taking the time to post this, @emitanaka! I really appreciate it.

RE

# 1 I need to have it as tsibble for it to work otherwise it produces error
# 2 can't people define the key in the facet_sample as an option? 
# 3 Why the dependency on tsibble? 
# People should have the choice to use data.frame, tibble or tsibble? Most of tidyverse works with data.frame and tibble, not just the latter. Shouldn't brolgar be the same?

As you say, people should be able to use data.frame - I've been chipping away at this at #84 so this is definitely on the cards in the future.

The reason it is only for tsibble at the moment is that it was easier to build an approach that worked using tsibble only. In the first iteration of brolgar I only used data.frame, and the user had to provide the key and/or index each time as arguments to the function, which was repetitive. In the next iteration I provided support for both tsibble and data.frame, but as I was still in the initial design phase of brolgar, a lot of things were changing, which meant updating code in two places: one for data.frame and one for tsibble. So I decided to just use tsibble for most things until I was more or less set on the design. I think I'm getting there now, so it's a good time to move some methods to use data.frame.

brolgar is somewhat opinionated, as I think that longitudinal data should be represented as a tsibble, as the benefits are numerous, and this package is built to be used first and foremost with tidyverts over tidyverse. There are other longitudinal data analysis packages in R that do not use tsibble but the user must provide some version of a key and index to use all the functions, but I think that is abstracted away nicely with tsibble.
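To make the "abstracted away" point concrete, here is a small sketch assuming only the tsibble package. The key and index are declared once when the tsibble is created, and downstream code can read them back instead of asking the user again. The helper `obs_per_key()` is hypothetical and written for illustration; it is not a brolgar function.

```r
library(tsibble)

# Declare the key (individual) and index (time) once
chicks <- as_tsibble(ChickWeight, key = Chick, index = Time, regular = FALSE)

key_vars(chicks)   # name of the key column
index_var(chicks)  # name of the index column

# Hypothetical helper: counts observations per individual without the
# caller having to name the key column. Handles a single key column only.
obs_per_key <- function(.data) {
  stopifnot(is_tsibble(.data))
  key <- key_vars(.data)[1]
  table(.data[[key]], dnn = key)
}

counts <- obs_per_key(chicks)
head(counts)
```

With a plain data.frame, the same helper would need an extra argument naming the key column on every call.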

Some of these methods don't just sit squarely in the time series world, so data.frame methods will be made available soon. However, I won't be supporting data.frame workflows for everything: features, for example, will only support tsibble. Primarily, I will get methods working well for tsibble before getting them to work for data.frame.

Regarding creating the plots with the background shadow, I need to improve documentation (#106) so users can see how the facets are created and how they can avoid using facet_sample/strata if they would like (there is a small example in sample_frac_keys but there should be more).

Here is how to get a similar plot using the functions that power facet_sample()/facet_strata():

library(tsibble)
library(brolgar)
library(ggplot2)

df <- ChickWeight %>% 
  as_tsibble(key = Chick, index = Time, regular = FALSE) 

# establish the "foreground" elements
chick_subplots <- df %>% 
  # number of facets * number of individuals per facet
  sample_n_keys(size = 12 * 3) %>%
  stratify_keys(n_strata = 12)

ggplot(df,
       aes(x = Time,
           y = weight,
           group = Chick)) + 
  # plot data as background
  geom_line(color = "gray") +
  # plot foreground
  geom_line(data = chick_subplots) + 
  facet_wrap(~.strata)

Created on 2020-11-03 by the reprex package (v0.3.0)

I'm not sure I fully agree that a facet that alters the data/geoms violates the principles of the grammar of graphics, although I think it could potentially be a bit dangerous. @dicook pointed out to me that the scales argument of facet_* allows you to change the facets individually. Another example of a geom changing data is geom_miss_point() in naniar, which adds data to display missingness, and the example in the documentation on extending facets alters the data for each plot using bootstrapping. But I can appreciate that the facet_strata/facet_sample functions are a bit tricky to reason about because they change the underlying data structure. Hopefully some improvements in documentation can make that a bit better, but I think it is sort of the tradeoff for the feature.
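For reference, the scales argument mentioned here can be seen in a short ggplot2-only sketch. The choice of ChickWeight and of facetting by Diet is only for illustration.

```r
library(ggplot2)

# Default: every panel shares the same x and y ranges
p_fixed <- ggplot(ChickWeight, aes(Time, weight, group = Chick)) +
  geom_line() +
  facet_wrap(~Diet)

# Same layers, but each panel now gets its own y-axis range
p_free <- ggplot(ChickWeight, aes(Time, weight, group = Chick)) +
  geom_line() +
  facet_wrap(~Diet, scales = "free_y")

p_fixed
p_free
```

Here the facet changes how each panel is drawn (its axis range) without any change to the geom or the data, which is a milder form of the facet influencing the panels individually.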

I'm not even sure it's possible for the facet to detect the type of geom call used, so this add_background argument might not even work, but I really like the idea, just not sure what the right abstraction is. Perhaps it will end up doing something similar to gghighlight, but in the interim I'll include the above code in examples so people can see the "recipe" for adding background data to plots.

Thanks for the thoughtful discussion, Emi, interested to hear your thoughts.