mjskay / ggdist

Visualizations of distributions and uncertainty
https://mjskay.github.io/ggdist/
GNU General Public License v3.0
854 stars 26 forks source link

position_jitter() leads to duplication and separation of pointintervals #38

Closed dylanhmorris closed 3 years ago

dylanhmorris commented 4 years ago

Attempt to jitter a default pointinterval (or any pointinterval showing multiple ranges) leads to multiple separated pointintervals for each set of observations. I believe this occurs because the multiple bounds are constructed by the layering of multiple geom_pointinterval() instances on top of one another. These are then treated as separate items to jitter by position_jitter().

Reproducible example:

library(ggplot2)
library(tidybayes)
library(tibble)
library(magrittr)

## create some data
data <- tibble(
    xs = rep(c(rnorm(1), rnorm(1)), 20),
    ys = rnorm(40),
    groups = rep(rep(1:4), 10))

## plotting gives us overplotted pointintervals
data %>% ggplot(
             aes(x = xs,
                 y = ys,
                 point_fill = factor(groups),
                 group = groups)) +
    stat_pointinterval(shape = 21) +
    scale_fill_distiller()

overplotted.pdf

## we try to jitter and this leads to duplication
jitter <- position_jitter(width = 0.1)
data %>% ggplot(
             aes(x = xs,
                 y = ys,
                 point_fill = factor(groups),
                 group = groups)) +
    stat_pointinterval(shape = 21,
                       position = jitter) +
    scale_fill_distiller()

failed_jitter.pdf

mjskay commented 4 years ago

Copying this workaround here from the twitter thread in case anyone wanders along looking for it:

image

My rough guess on this is that fixing it might require a bunch of re-architecting of slabinterval internals... unless I'm mistaken the problem is that the two intervals are different rows in the data table, which then means they are adjusted separately by position_jitter. Fixing this might require changing to list columns internally and nesting both intervals within the same row. On the one hand there might be some benefits to this approach (like allowing removal of redundant points). On the other hand it would require re-writing a bunch of stuff, and even then I'm not even sure it would completely work (e.g. in cases where someone tries to map aesthetics across different intervals within the same estimate, as one tends to do with colors in stat_interval). Would have to check that everything works fine there --- I can't remember off the top of my head if that mapping happens before the position adjustment, in which case it should work okay...

In the meantime, the above workaround should work :)

mjskay commented 3 years ago

Yeah, the list approach won't work because of the need for people to be able to do additional mappings on the .width computed variable. So I think the above workaround of re-using the seed for sub-geoms is the best that can be done in this case.