matplotlib / data-prototype

https://matplotlib.org/data-prototype
BSD 3-Clause "New" or "Revised" License

Ideas regarding "nu" #26

ksunden closed this issue 1 year ago

ksunden commented 1 year ago

Mutation as the name for "nu"

data-prototype has a concept of "nu" for performing data-to-data transforms.

Firstly, this name is not descriptive at all, and while it makes sense in the context of a pure math description, programmers are unlikely to have that context. A more descriptive name is preferable.

Thus I propose the term "mutator", though I am certainly open to other options. The term "mutate", while used in a few docstrings, tests, and variable names in mpl, is not really used in any type names or public signatures outside of Transforms.mutated[xy]?, which return booleans. It also has the advantage of sharing a vowel sound with "nu", which may help those familiar with the mathematical framing connect the concepts.

Kinds of Mutators

compute

{'x': A} -> {'x': B}

Using the same variable name but achieving a (potentially) different value.

Identity is a subset of this.

spelling on current main:

nu={"x": lambda x: x+1}

rename

{'x': A} -> {'y': A}

Actually somewhat redundant, as it is equivalent to reuse + delete.

spelling on current main:

nu={"y": lambda x: x}

reuse

{'x': A} -> {'x': A, 'y': A}

e.g. "color" expanding to "facecolor" and "edgecolor"

spelling on current main:

nu={"y": lambda x: x, "x": lambda x: x}

(or by including "x" in the expected/required keys, which gives it a default identity, and providing only the y lambda)

combine

{'x': A, 'y': B} -> {'z': Z}

spelling on current main:

nu={"z": lambda x, y: x+y}

spelling with #17:

mutual mutation

{'x': A, 'y': B} -> {'x': C, 'y': D}

Importantly, the computation for C and D both depend on the values from A and B.

Potentially has some performance concerns: often C and D can actually be computed together, but some frameworks may require computing them separately.

spelling on current main:

nu={"x": lambda x, y: x+y, "y": lambda x, y: x-y}

spelling with #17:

NOT POSSIBLE.

While in most cases #17 will upcast a single function to a list containing only that one function (plus units, if applicable), unlike main it applies operations sequentially. Thus the value of x gets overridden by the first process and is no longer the same when processing y.

If x=1 and y=2, then main with the nu specified above will give an output of x=3, y=-1, while #17 will give an output of x=3, y=1.
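The difference can be sketched with two hypothetical helpers (assumed semantics, not the actual implementations): one applying every function against the original inputs, as main does, and one writing results back sequentially, as #17 does:

```python
import inspect

def _call(func, data):
    # call func with arguments pulled from `data` by parameter name
    params = inspect.signature(func).parameters
    return func(**{name: data[name] for name in params})

def apply_simultaneous(nu, data):
    # main-style sketch: every function sees the original input values
    out = dict(data)
    for key, func in nu.items():
        out[key] = _call(func, data)
    return out

def apply_sequential(nu, data):
    # #17-style sketch: each result is written back before the next runs
    out = dict(data)
    for key, func in nu.items():
        out[key] = _call(func, out)
    return out

nu = {"x": lambda x, y: x + y, "y": lambda x, y: x - y}
data = {"x": 1, "y": 2}
apply_simultaneous(nu, data)  # -> {'x': 3, 'y': -1}
apply_sequential(nu, data)    # -> {'x': 3, 'y': 1}
```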

deletion

{'x': A} -> {}

spelling on current main:

Neither provide a nu for "x" nor include it in the required/expected keys (as those get a default identity).

chaining

{'x': A} -> {'y': B} -> {'z': C}

Importantly, each step may include more complex operations.

spelling on current main:

NOT POSSIBLE, at least not in an elegant/composable way

nu={"z": lambda x: (lambda y: y+1)(x) + 1}

This is kind of the idea, but it doesn't allow inspection or mutation of the internal structure, nor does it provide a way to e.g. automatically add units anywhere other than strictly before or strictly after.

If you also want to keep "y" in the final output, you need to pass (and compute) it separately.

spelling with #17:

nu={"x": [lambda x: x+1, lambda x: x+1], "z": lambda x: x}

(which will necessarily keep both x and z, set to the same value)

or

nu={"y": lambda x: x+1, "z": lambda y: y+1}

(which will necessarily keep both y and z, with different values)

While chaining was the purpose of #17, its implementation is less elegant than I would like. It works reasonably well when chaining things with the same name, but falls apart rather quickly when trying to change names, as in this example.

The deeply ingrained order dependence feels awkward and likely to do things that are not intended.

E.g. in the last example, did the user intend for the y in the computation of z to be the newly modified version? (Perhaps not, but maybe.) If you flip the order of y and z it looks the same, but on that branch it actually behaves differently.
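That order dependence can be sketched with a hypothetical `apply_sequential` helper modeling #17's write-back behavior (assumed semantics, not the actual implementation); here y is also present in the input so both orderings run:

```python
import inspect

def apply_sequential(nu, data):
    # #17-style sketch: each result is written back before the next
    # function runs, so later functions see updated values.
    out = dict(data)
    for key, func in nu.items():
        params = inspect.signature(func).parameters
        out[key] = func(**{name: out[name] for name in params})
    return out

data = {"x": 1, "y": 10}
apply_sequential({"y": lambda x: x + 1, "z": lambda y: y + 1}, data)
# -> {'x': 1, 'y': 2, 'z': 3}   (z sees the newly computed y)
apply_sequential({"z": lambda y: y + 1, "y": lambda x: x + 1}, data)
# -> {'x': 1, 'y': 2, 'z': 11}  (z sees the original y)
```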

But I think having intermediate values is useful.

A proposal

The behavior on main has advantages, including order independence of nu and relatively easy computation with multiple inputs and outputs.

The behavior on #17 allows treating units as just another nu function, i.e. separating individual transforms into single logical functions. It also has the advantage of being able to use intermediate calculations, though with a significant drawback of order dependence and not being the most understandable system.

#17 introduces a list of functions for each variable to accomplish its goals.

The proposal then is to invert that a bit and instead of having a list of functions for each variable, to have a list of "mutation stages", each of which act as the behavior on main today.

Thus if you want precisely the behavior of main, it is identical to just having a list of one stage.

But if you want intermediate values (and units behavior), you add separate stages.

I've not yet written code for this, but I don't think it'll be that hard to do so.

I think I would lean towards separate objects to manage the interactions, rather than relying on a pure list of dictionaries.

This would allow us to give stages names, which in turn allows a (relatively) ergonomic way of saying: [MyStagePreUnits("pre units", ...), "units", MyStage("post units", ...)]

Mutation stages could each have "expected/required" keys, rather than just one overall set (with the default being to pass through every input key plus every nu).
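A rough sketch of the staged idea (hypothetical helpers; the names and pass-through default are assumptions, not a worked-out design): each stage behaves like main, applying all of its functions against the same input, and stages chain sequentially to give named intermediate values:

```python
import inspect

def apply_stage(stage, data):
    # One stage behaves like main: all functions in the stage see the
    # same input data, so within a stage order does not matter.
    out = dict(data)
    for key, func in stage.items():
        params = inspect.signature(func).parameters
        out[key] = func(**{name: data[name] for name in params})
    return out

def apply_stages(stages, data):
    # Stages run sequentially, so each stage sees the previous
    # stage's output as named intermediate values.
    for stage in stages:
        data = apply_stage(stage, data)
    return data

# chaining {'x': A} -> adds 'y' -> adds 'z', one rename per stage
stages = [
    {"y": lambda x: x + 1},
    {"z": lambda y: y * 2},
]
apply_stages(stages, {"x": 1})  # -> {'x': 1, 'y': 2, 'z': 4}
```

A single-element `stages` list reduces exactly to main's behavior, which is the compatibility property described above.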

More radical ideas/fallout that may be enabled (but I haven't thought through completely)

These ideas may fall a little far into "I have a hammer, so everything looks like a nail" territory, but I could see a path where each of these makes sense.

story645 commented 1 year ago

My major qualm with "mutation" is that, to me at least, it implies a change of structure, and by definition nus don't change structure. I use "encoders" b/c in the data viz literature it's fairly common to see the variable->visual mapping described as an encoding, but I get why that may be too specific for your purposes.

I'm slightly confused by how you're spelling rename and reuse, but I think you're saying those are nus with different input and output types but are doing an identity computation? Where different could be as basic as the name?

But I think having intermediate values is useful

For what it's worth, totally agree. At least as part of the first pass to make sure the function stack is being executed correctly. (Nus are associative but not commutative in that way.)

The proposal then is to invert that a bit and instead of having a list of functions for each variable, to have a list of "mutation stages", each of which act as the behavior on main today.

I've been thinking about this as layered wrappers, but either way I think you're right that the intended call stack needs to be super clear.

Would decouple the majority of the stack from matplotlib specific code, potentially making this idea viable for other plotting/data analysis libraries

When I've played around with that, the biggest downside was losing mpl specific optimizations.

tacaswell commented 1 year ago

"mutator" is not a great name because it implies in place which I do not think we want. Agree it is unfortunate that we already use the words "conversions" and "transform" in the code base.

I think that both the unit conversion and what we currently call "transforms" are part of this stack (the latter being the specialized post-unit x, y -> x, y step, as are the norm + colormap), so I think it is worth trying to re-claim one of those names.

While in most cases, #17 will upcast a single function to a list containg only that one function (plus units, if applicable), unlike main, it does operations sequentially. Thus the value of x gets overridden by the first process, and it is not the same when processing y

Couldn't we solve this by changing it to start each of the lists with the initial data and then merge the results at the end?

I would also have the expectation that for

nu={"x": [lambda x: x+1, lambda x: x+1], "z": lambda x: x}

and {'x': 1}, to get back {'x': 3, 'z': 1}, as the order in which we choose to evaluate 'x' and 'z' for the output should not matter (yes, with dictionary and kwarg stability it could matter, and it would be well defined / stable / etc., but I think it is too surprising).
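A sketch of that fix (hypothetical helper, assumed semantics): each variable's chain starts from the initial data and the results are merged at the end, so evaluation order cannot matter:

```python
import inspect

def _call(func, data):
    # call func with arguments pulled from `data` by parameter name
    params = inspect.signature(func).parameters
    return func(**{name: data[name] for name in params})

def apply_from_initial(nu, data):
    # Run each variable's chain starting from the initial data, then
    # merge the results, so evaluation order cannot matter.
    out = dict(data)
    for key, funcs in nu.items():
        if not isinstance(funcs, list):
            funcs = [funcs]
        local = dict(data)  # every chain starts from the same input
        for func in funcs:
            local[key] = _call(func, local)
        out[key] = local[key]
    return out

nu = {"x": [lambda x: x + 1, lambda x: x + 1], "z": lambda x: x}
apply_from_initial(nu, {"x": 1})  # -> {'x': 3, 'z': 1}
```

Under this sketch, the mutual mutation case above also matches main: with nu={"x": lambda x, y: x + y, "y": lambda x, y: x - y} and x=1, y=2, both chains start from the initial values and produce x=3, y=-1.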

We have a bunch of short-cuts where we push the affine part of transformations down into the renderer (e.g. this is why https://github.com/matplotlib/matplotlib/blob/1e8821dff3aeebf0654ec66d5ebf97080768ed09/lib/matplotlib/backend_bases.py#L195-L197 takes a trans argument). If we do go with an object rather than just dictionaries, then in principle it should be possible for those things to describe themselves well enough to participate in the affine transformation, and for these objects to detect if they have been handed one of our mtransforms.TransformNode instances and merge with their neighbors.

As an aside, even though we do push the affine transforms down, we still pre-compute what those transforms are based on the current view limits. So if we tried to push it to a GPU, you could do loupe-style zooming / panning / rotation (well, maybe not on text), because you could stick the same transform on the outside of whatever linear transform we passed in (I think). But you could not e.g. change the limits, because that would require re-computing the transforms, and we do not currently have a way to pass how to do that to the backend.

It is probably worth your time @ksunden to read and fully understand how matplotlib.transforms deals with both affine/non-affine merged transforms and separable/non-separable (x, y) computations (e.g. (linear, linear) is separable; any map transforms are not).