Restructure incidence object as data frame to allow for facetting

thibautjombart commented 5 years ago

This is a tricky one, since we aggregate data into the $counts matrix, but it would be nice to be able to combine plotting incidence objects with one group for filling and facet_grid() using other criteria. It would look something like:

ggplot(x) + geom_histogram(aes(x = dates, fill = aaa)) + facet_grid(bbb ~ ccc)

where aaa, bbb, and ccc are factors.
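For reference, here is a small runnable version of that target plot built directly from a raw linelist rather than from an incidence object. The toy data frame and its column values are made up for illustration; only the column names follow the snippet above.

```r
library(ggplot2)

# Toy linelist (invented data): one row per case, three factor columns.
set.seed(42)
n_cases <- 300
x <- data.frame(
  dates = as.Date("2020-01-01") + sample(0:90, n_cases, replace = TRUE),
  aaa   = factor(sample(c("f", "m"), n_cases, replace = TRUE)),
  bbb   = factor(sample(c("north", "south"), n_cases, replace = TRUE)),
  ccc   = factor(sample(c("confirmed", "suspected"), n_cases, replace = TRUE))
)

# Weekly bins (binwidth is in days for a Date axis), filled by one factor
# and facetted by the other two.
p <- ggplot(x) +
  geom_histogram(aes(x = dates, fill = aaa), binwidth = 7) +
  facet_grid(bbb ~ ccc)
```

This works because ggplot2 does the binning itself from the raw dates, which is exactly what an already-aggregated $counts matrix cannot offer.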

This may need some rethinking of our internal data representation, so I appreciate it may be a mid-to-long-term change.

zkamvar commented 5 years ago

This is related to #76

zkamvar commented 5 years ago

As I mentioned in #76, $counts could be an array, but yes, that would take a LOT of refactoring.

In the meantime, I could write a vignette on how to work through this with the tidyverse to address issues like #86

pbkeating commented 5 years ago

This would be very useful. Wanted to make some epicurves yesterday and used ggplot as couldn't facet with incidence.

zkamvar commented 5 years ago

Note that this was brought up in Ben Bolker's review of the incidence paper:

The main use of the package is for converting from line lists to aggregated incidence data. It would be useful (I can't tell if it's possible) to easily be able to aggregate data that are already in date/incidence form to coarser scales, or to approximately disaggregate incidence data.

My initial thoughts on this:

[this] would require re-structuring the incidence class OR creating a new class (flexincidence?) that could simply just store the dates, intervals, and grouping info and generate incidence objects on the fly (I think Jun Cai suggested something like this sometime in the past).

Thinking about it, this new object can simply inherit from a data frame so that it can easily be plugged into existing data manipulation architecture. We would need to store the aggregation information as attributes (date column, interval, date range, and grouping), but that shouldn't be too difficult.

The user should be able to use the accessors to get the dates, range, counts, etc. One of the challenges will be how to represent faceted counts when someone wants to use get_counts(). We could return an array (as suggested above), but that may be overkill when all the user would really need is an aggregated long data frame.
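A minimal base R sketch of that idea, under stated assumptions: the constructor new_flexincidence() and the accessor long_counts() are hypothetical names (borrowing the "flexincidence?" suggestion above) and do not exist in the incidence package. The interval is simplified to a whole number of days. The object keeps one row per case, stores the aggregation settings as attributes, and builds a long table of counts on demand.

```r
# Hypothetical constructor: a data frame subclass carrying aggregation metadata.
new_flexincidence <- function(x, date_col, interval = 1L, groups = character(0)) {
  stopifnot(
    is.data.frame(x),
    date_col %in% names(x),
    inherits(x[[date_col]], "Date"),
    all(groups %in% names(x))
  )
  attr(x, "date_col")   <- date_col
  attr(x, "interval")   <- as.integer(interval)  # bin width in days (assumption)
  attr(x, "groups")     <- groups
  attr(x, "date_range") <- range(x[[date_col]], na.rm = TRUE)
  class(x) <- c("flexincidence", class(x))
  x
}

# Hypothetical accessor: aggregate on the fly into a long data frame.
long_counts <- function(x) {
  dates <- x[[attr(x, "date_col")]]
  first <- attr(x, "date_range")[1]
  width <- attr(x, "interval")
  # Left edge of the bin that each date falls into
  bin <- first + (as.integer(dates - first) %/% width) * width
  group_cols <- lapply(setNames(nm = attr(x, "groups")), function(g) x[[g]])
  by <- c(list(dates = bin), group_cols)
  aggregate(data.frame(n = rep(1L, length(bin))), by = by, FUN = sum)
}

ll <- data.frame(
  onset  = as.Date("2020-01-01") + c(0, 1, 6, 7, 8, 15),
  gender = c("f", "m", "f", "f", "m", "m")
)
x   <- new_flexincidence(ll, "onset", interval = 7, groups = "gender")
cnt <- long_counts(x)  # columns: dates, gender, n
```

Returning a long data frame rather than an array keeps the result directly usable with ggplot2 and facet_grid(), at the cost of not listing empty date/group combinations.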

caijun commented 5 years ago

I just came up with an idea for implementing facetting for incidence plots in an easy, efficient, and elegant way. Why not make the best use of the existing facetting functionality provided by the ggplot2 package? If incidence plots can be regarded as an extension of a ggplot2 Stat or Geom (e.g., geom_line or geom_bar), then facetting is inherently available. Fortunately, since version 2.0.0, ggplot2 has provided the ggproto system for extending ggplot2 by creating a new stat, geom, or theme.

Looking at the very first example from the Extending ggplot2 vignette, it draws the convex hull of a set of points by creating a new stat.

StatChull <- ggproto("StatChull", Stat,
                     compute_group = function(data, scales) {
                       data[chull(data$x, data$y), , drop = FALSE]
                     },

                     required_aes = c("x", "y")
)

stat_chull <- function(mapping = NULL, data = NULL, geom = "polygon",
                       position = "identity", na.rm = FALSE, show.legend = NA, 
                       inherit.aes = TRUE, ...) {
  layer(
    stat = StatChull, data = data, mapping = mapping, geom = geom, 
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}

Test the new stat_chull stat.

ggplot(mpg, aes(displ, hwy, colour = drv)) + 
  geom_point() + 
  stat_chull(fill = NA)

Once the stat_chull stat is created, ggplot2 gives a lot for free, including the facetting functionality. For example,

ggplot(mpg, aes(displ, hwy, colour = drv)) + 
  geom_point() + 
  stat_chull(fill = NA) + 
  facet_grid(. ~ drv)

Therefore, instead of changing the storage structure, we can re-implement the incidence plots as a ggplot2 extension, for instance, a bar geom called geom_incidence(). The graphics and customization features provided by ggplot2, including facetting, then come built in. I believe the amount of code to be rewritten would be much smaller.
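To make this concrete, here is a rough sketch of such a layer written as a stat rather than a geom. StatIncidence and stat_incidence() are hypothetical names, not part of incidence or ggplot2, and the binning is deliberately simplified: bins are a fixed number of days wide and aligned to the Date origin (1970-01-01) so that they line up across groups and panels. It assumes ggplot2 >= 3.3.0 for after_stat().

```r
library(ggplot2)

# Hypothetical stat: bins dates into fixed-width intervals and counts cases.
StatIncidence <- ggproto("StatIncidence", Stat,
  required_aes = "x",
  default_aes = aes(y = after_stat(count)),
  compute_group = function(data, scales, interval = 7) {
    # On a date scale, x arrives here as numeric days since 1970-01-01.
    left <- floor(data$x / interval) * interval
    runs <- rle(sort(left))
    data.frame(
      x     = runs$values + interval / 2,  # bar centre
      count = runs$lengths,                # cases in the bin
      width = interval                     # bar width in days
    )
  }
)

stat_incidence <- function(mapping = NULL, data = NULL, geom = "bar",
                           position = "stack", interval = 7, na.rm = FALSE,
                           show.legend = NA, inherit.aes = TRUE, ...) {
  layer(
    stat = StatIncidence, data = data, mapping = mapping, geom = geom,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(interval = interval, na.rm = na.rm, ...)
  )
}

# Toy linelist (invented data) to exercise the layer with facets.
set.seed(1)
linelist <- data.frame(
  onset  = as.Date("2020-01-01") + sample(0:60, 200, replace = TRUE),
  gender = sample(c("f", "m"), 200, replace = TRUE),
  region = sample(c("north", "south"), 200, replace = TRUE)
)

p <- ggplot(linelist, aes(onset, fill = gender)) +
  stat_incidence(interval = 7) +
  facet_grid(region ~ .)
```

Note that this operates on the raw linelist: the counting happens inside the layer, per group and per panel, which is what lets facet_grid() work without any extra code.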

thibautjombart commented 5 years ago

I like the idea, but it looks like this would not be using the incidence objects any more, but the raw linelist instead, right? Looks like a slightly different problem from the idea of having an object x which can be used to derive tables of case counts, with a plot(x) that can be facetted.

caijun commented 5 years ago

I think the incidence objects can be kept. We can implement the idea within the existing plot.incidence() method and return a ggplot2 object, which supports chaining further layers with +. For instance,

myplot <- function(my.obj) {
  # some conversions can be done before plot
  p <- ggplot(my.obj, aes(displ, hwy, colour = drv)) + 
    geom_point() + 
    stat_chull(fill = NA)
  return(p)
}

myplot(mpg) + 
  facet_grid(. ~ drv)

zkamvar commented 5 years ago

I think @caijun has the right idea here. If we are going to move to a framework where incidence objects are created on the fly, there's no reason why we can't also create ggplot2 geom functions since these would inherently rely on the same underlying architecture to create the incidence object in the first place.

This way people can do something like:

incidence(x, dates = date_of_onset, interval = "1 ISO week", group = gender) %>%
  ggplot() +
  geom_incidence(show_cases = TRUE) + # no extra arguments since the internal data is already an incidence object
  scale_fill_incidence(pal = 1) +
  scale_x_incidence() +
  facet_grid(aaa ~ bbb+ccc)

or

ggplot(x, aes(date_of_onset, fill = gender)) +
  geom_incidence(interval = "1 ISO week", show_cases = TRUE) +
  scale_fill_incidence(pal = 1) +
  scale_x_incidence(interval = "1 ISO week") +
  facet_grid(aaa ~ bbb+ccc)

Theoretically, even adding the fit objects should work (though it will be wonky since the users would have to use the %>% syntax instead of +)

Of course, plot.incidence() should still produce the same plots as it always did with the difference being that the internals will use the above construct:

incidence(x, dates = date_of_onset, interval = "1 ISO week", group = gender) %>%
  plot(show_cases = TRUE) +
  facet_grid(aaa ~ bbb + ccc)

We've already seen that several users want to do things with the epicurve that can't be done with the current framework, because the data structure represents an immutable summary. So it makes sense to show people that they can use the plot.incidence() method for a quick visualisation and the ggplot2 geoms for a more customized interface.

caijun commented 5 years ago

Theoretically, even adding the fit objects should work (though it will be wonky since the users would have to use the %>% syntax instead of +)

%>% and + can be redefined according to our needs in the incidence package, as they are also objects. Moreover, I like the functions geom_incidence(), scale_fill_incidence(), and scale_x_incidence() (used for customizing the x-axis time labels).

zkamvar commented 4 years ago

The more I think about it, the more I'm thinking that we should port the internal functionality to the {tsibble} package. It has everything that we have except for the plotting. All of the steps below can be abstracted away for our users and they can return an incidence object if they want or they can return a tsibble. Our plotting can take care of tsibble objects as well. The only problem is that it will make the incidence package heavier.

library(tsibble)
library(dplyr)
library(aweek)
set_week_start("Saturday")
ll <- outbreaks::ebola_sim_clean$linelist
ll %>%
  as_tsibble(key = case_id, index = date_of_onset) %>%
  index_by(week = ~as.Date(aweek::as.aweek(.))) %>%
  group_by(gender) %>%
  summarize(n = n())
#> # A tsibble: 698 x 3 [1D]
#> # Key:       gender [2]
#>    gender week           n
#>    <fct>  <date>     <int>
#>  1 f      2014-04-07     1
#>  2 f      2014-04-21     1
#>  3 f      2014-04-25     1
#>  4 f      2014-04-26     1
#>  5 f      2014-04-27     1
#>  6 f      2014-05-01     2
#>  7 f      2014-05-03     1
#>  8 f      2014-05-04     1
#>  9 f      2014-05-06     2
#> 10 f      2014-05-07     2
#> # … with 688 more rows

Created on 2019-12-16 by the reprex package (v0.3.0)

thibautjombart commented 4 years ago

Sounds great! I really like the idea of relying on tsibble for the heavy lifting. So in short, an incidence object would essentially be a tsibble with a standardised column for dates and some validated info on the interval (possibly stored as an attribute)?

As far as I can tell, re-implementing old features should be easy:

get_counts(): merely count(dates, a, b)
get_n(): could be nrow(), or maybe even accept optional grouping count(a, b)
get_dates(): would become select(dates)
group_names(): setdiff(names(x), "dates")
get_interval(): should be easy, only depends on what we do with this (stored as attribute? calculated on the fly?)

Importantly, that also means people will be able to use standard dplyr commands if they are more familiar with them (e.g. count rather than get_counts).
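A quick sketch of how those accessors could look, assuming the simplest possible structure: a plain data frame (standing in for a tsibble) with one row per case, a standardised dates column, and the interval stored as an attribute. The function bodies are illustrative guesses following the list above, not the package's implementation.

```r
library(dplyr)

# Assumed structure: one row per case, a `dates` column holding the start
# of each interval, grouping columns alongside, interval as an attribute.
x <- data.frame(
  dates  = as.Date("2020-01-06") + c(0, 0, 7, 7, 7, 14),
  gender = c("f", "m", "f", "f", "m", "m")
)
attr(x, "interval") <- "1 week"

# Illustrative re-implementations of the old accessors.
get_counts   <- function(x, ...) count(x, dates, ...)
get_n        <- function(x, ...) if (...length() == 0) nrow(x) else count(x, ...)
get_dates    <- function(x) select(x, dates)
group_names  <- function(x) setdiff(names(x), "dates")
get_interval <- function(x) attr(x, "interval")

get_counts(x)          # cases per date
get_counts(x, gender)  # cases per date and gender
get_n(x)               # total number of cases
get_n(x, gender)       # cases per gender
```

Storing the interval as an attribute is the simpler of the two options mentioned above, but attributes can be dropped by some data frame operations, so computing it on the fly may turn out to be more robust.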

reconhub / incidence

Restructure incidence object as data frame to allow for facetting #104