tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org
Other
6.45k stars 2.02k forks source link

Access to layer information during build stages #3175

Open thomasp85 opened 5 years ago

thomasp85 commented 5 years ago

I'm opening this issue as a more general discussion ground for the discussions going on in #3170 (which fixes #3116). The reason for this is that in my attempt to implement #3062 I've run into more or less the same issues but from the facet side, and I think the time is up for dealing without it in a more general way.

The problem: Throughout ggplot2's code base the layer data is the sole representative of the layer. It is not assumed that the layer have any other information of interest to share. This makes for clean code. We don't need to pass the layer objects around so we get nice separation of responsibility. But it also means that the layer cannot tell the outside world about any special property it has. Generally this has not been a problem as any specifics has been relevant to the geom and stat exclusively (part of the internal of the layer).

One solution: Take the approach given in #3170 and generalise it, so that any function receiving layer data also receives a layer_param object, as well as formalising setting up this object (probably within setup_layer()). This will require sweeping changes to many methods across many classes (internally and in extension packages) so It's certainly less attractive because of that. On the other hand it will result in an architecture that is consistent (using a params object). We may be able to modify the ggproto dispatch to quietly ignore unknown arguments in order to avoid breaking extension packages but it would remove the safety of early failure when misspelling arguments in code.

Another solution: Attenuate the data object with layer param attributes - this would allow everything to continue to work as it has done before, but will require internal changes to make sure the data keeps the information as it is passed around. A layer_param attribute could keep all relevant information that we would otherwise pass around as a layer_param argument. This will undoubtedly result in some more ugly code though helper functions could probably tidy most of it up. Worst is probably that we now divert into two different ways of passing around information between objects.

A kind of middle ground: Each layer data could also simply get an attribute (or an additional column) telling the index of the layer in the layer stack, and e.g. the facets could seep out layer information from all the layers during setup and then take care of matching that information with the data itself if needed. I'm not sure there's any benefit to this other than avoiding to put arbitrary amount of information in the data attributes. It would certainly result in even more housekeeping code than option two

I'm sure there are solutions I haven't considered so I'd be happy to take a broad discussion about how we want to solve it. I'm certain this is something we want to solve in general in order for development to move forward. (as an aside (can't find the issues right now), it will also allow for the layers to take additional params needed by extensions without triggering the unknown argument warning—I remember ggplotly requesting this at least.

@hadley, @karawoo, @clauswilke, @yutannihilation looking forward to your input

clauswilke commented 5 years ago

I definitely don't like adding the layer_param as attribute to the data. The data gets handed back and forth to numerous functions, many of which get overwritten by extension packages, and you'd have to constantly check that the attribute wasn't removed. That just doesn't seem like a clean approach.

Also, it's not entirely clear to me where you would need this info. Can you provide some example scenarios? E.g., how would you like to implement #3062? It may help with thinking things through.

yutannihilation commented 5 years ago

any function receiving layer data also receives a layer_param object

Though layer_param attribute doesn't seem a good idea, I feel we might need a new class that contains data and layer_param together if they are so tightly paired. But, I don't know if this is even possible (probably not).

I too want to see some example scenario.

thomasp85 commented 5 years ago

Ok, the prime example that provoked the issue is the idea of allowing layers to state whether they are fixed (should appear on every panel). The current situation where we must modify the data to not contain the faceting variable is often cumbersome (see #3062 for more information).

In order to achieve this we'd need to pass a fixed parameter to the layer, but more than that, we'd need the faceting object to know that it should ignore the layer data when calculating the layout (in order for the faceting to not contain panels unique to the fixed data). This requires Facet$compute_layout() to know about the fixed parameter, but currently it only receives the list of layer data and the facet param. The layout could add the fixed information to the facet params during setup if it was provided with the list of layers (it's not, currently), and Facet$compute_layout() could then match the information with the data it receives.

The next step is to properly assign PANEL information to the data, which happens in Facet$map_data(). This function is called once per layer so the information about the layer position in the stack is lost, and so is the possibility of matching it with the correct fixed parameter.

I'm well aware that this could be solved for the faceting system quite easily with some changes, and the number of facet extensions are low so we could quite easily mitigate any breakage, but the issue of layer specific information has now come up twice in a very short time and I feel we should align on a general approach rather than keep duct taping it onto the current code.

yutannihilation commented 5 years ago

Thanks, I agree that it's very likely for us to face the need for making the same change as #3170 on another function, and your example seems the case. But, I'm just not sure if we can generalise it to all functions that uses layer data (probably because I don't understand the whole picture of the building stages).

clauswilke commented 5 years ago

I think we need to be systematic and enumerate all the places where possibly layer parameters could be needed. I believe it's going to be geoms, facets, and stats. We have a solution for geoms that works and doesn't break any existing packages. I think the same solution could be implemented for stats. (And for stats, the compute_panel() function already has the ... argument, so we could even hand the layer parameters down to that function from compute_layer().) For facets, I think it's Ok to break backwards compatibility, since fewer extension packages implement new facets. Anything else?

I prefer the solution where we hand layer_params explicitly to the functions that need them, because that is in line with the rest of the ggplot2 architecture.

If we ever implement layer-specific color scales, I think they should be managed by the layer, so color scales wouldn't need to know about layers.

Anything I'm forgetting?

paleolimbot commented 5 years ago

Just a quick comment... @thomasp85 's solution number two is not too bad to implement ( see https://github.com/tidyverse/ggplot2/compare/master...paleolimbot:Issue-3175-layer-params-data-attr ), although it doesn't help with the facet issue (and it feels a little like duct tape).

yutannihilation commented 5 years ago

Anything else?

For completeness,

Does this make sense...? Probably, I'm still failing to see the scope of this discussion. If the above ggprotos are not the ones we need to consider here, please point out.

clauswilke commented 5 years ago

However, all these objects are stateless, so I'm afraid we'd have to drill down to specific methods that will require layer_param.

yutannihilation commented 5 years ago

I'm afraid so too...

teunbrand commented 9 months ago

To weight in on this discussion that has been dormant for the past 4 years; there is already a data structure that is excellent for keeping related data together without resorting to attributes or inventing a new class of objects. That structure is the good old dataframe. You could use list-columns for any non-atomic information you might want to put in there and have 1 row per layer.

library(ggplot2)
# tibble for nice printing, vctrs::data_frame works just as well
tibble::tibble(
  index = 1:2,
  data = list(mtcars, mpg),
  layer = list(geom_point(), geom_line()),
  fixed = c(TRUE, FALSE)
)
#> # A tibble: 2 × 4
#>   index data                layer      fixed
#>   <int> <list>              <list>     <lgl>
#> 1     1 <df [32 × 11]>      <LyrInstn> TRUE 
#> 2     2 <tibble [234 × 11]> <LyrInstn> FALSE

Created on 2023-11-13 with reprex v2.0.2

In theory, you could even make LayerInstance stateless by keeping the geom_params, stat_params, aes_params etc as columns in the dataframe.