Split data at the top - Githubissues

vincentarelbundock commented 1 month ago

Follow-up to this comment about introducing a by argument to type_*() functions: https://github.com/grantmcdermott/tinyplot/pull/168#discussion_r1685553602

Is there a reason to wait so long before splitting the data? It seems that in some cases we wait until nearly the end to create split_data, and in many other cases like hist or density, we need to insert special catches near the top of the function. I feel like that introduces a lot of complexity.

Have you considered splitting the data at the very top, and storing it in a nested list?

split_data = list(
  facet_01 = list(by_01 = list(x, y), by_02 = list(x, y)),
  facet_02 = list(by_01 = list(x, y), by_02 = list(x, y))
)

Then, all plotting functions can operate on individual elements in exactly the same way. And if the list is of length 1, we know there are no facets.

grantmcdermott commented 1 month ago

Thanks for raising. Ignoring all the roxygen2 (documentation) preamble, here is the basic list of operations that currently happen inside the tinyplot function before split_data:

L512-L580 Capture device settings and pass par arguments.
L581-L606 Deparse arguments before they are evaluated. (Necessary for axis labels etc.)
L607-L751 Special catches for certain plot types (density, hist, etc.). We need to transform the underlying series according to the plot type at hand. We also need to do this on the scale of the original data (i.e., before splitting) to make sure the split data inherit transformation arguments correctly (e.g., break points across the full span of the data / joint distribution).
L752-L800 Some more adjustments, this time for "ribbon-alike" plot types that require data to be sorted along the x variable. We have to make some additional adjustments to, say, sort within facet grid variables too if those are present.
L801-L819 Establishing the extent (range) of the x and y variables, which we'll use for setting the plot window later on.
L820-829 Determining whether the "by" grouping variable is discrete or continuous. Not only do we need this for the legend logic, but it also affects our split logic (since we can't split the data on a continuous variable and instead use an alternative "1:100 interval" based logic for color assignment).
L830 Split data logic starts...

Stepping back and summarizing: there are about 320 lines of code before we get to data splitting. While we could certainly farm out some of sections to standalone functions to minimize code lines (a la https://github.com/grantmcdermott/tinyplot/pull/171), for the most part I think the basic order would need to remain unchanged. It seems to me that 1-4 and 6 would still have to come before split_data, since they either adjust the underlying (raw) variables or affect the splitting logic. We could move item 5 to afterwards, but that's less than 20 lines so not much of a gain.

Of course, it simply may be that the current flow is just an artifact of the code logic that I settled on (which might not be optimal). I'm definitely open to revising if you want to propose an alternative logic and codeflow. One thing I want to flag, though is that the interaction of grouping (by) and faceting gets tricky. Partly this has do with the fact that grouping variables can be nested within and without facets. But it also relates to some annoying limitations of the base graphics device (see the workaround here, for example, where we manually restore par(mfg) because it gets automatically reset after the plot window is divided) .

Feel free to push back if you think I'm missing something. Again, I'd love to simplify and further optimize the code logic if at all possible.

vincentarelbundock commented 1 month ago

Thanks, that's super helpful.

My idea was to move 3&4 after 7. The benefits would be:

Can apply functions like stats::loess() to transform x and y in each split_data[[i]] directly with a simple loop.
No need to implement the split logic for each type. Right now, we need split-related code in both the core function and also in the histogram-specific code. This might lead to big complications if we push forward the idea of type_*() functions, where each of them would have to implement its own by splits.
Related to 2, might allow us to simplify existing histogram code, etc.

The downside is things like identifying global breaks for histograms. But that can be a one-linear like:

get_breaks(unlist(lapply(split_data, \(x) x[["x"]])))

But that seems both easier to code, and less common than splitting, which happens for every time.

vincentarelbundock commented 3 weeks ago

closing as ill-conceived and under-specified proposal. Would be fixed by https://github.com/grantmcdermott/tinyplot/pull/198

grantmcdermott / tinyplot

Split data at the top #172