neuroinformatics-unit / movement

Python tools for analysing body movements across space and time
http://movement.neuroinformatics.dev
BSD 3-Clause "New" or "Revised" License

[PROPOSAL] Interface for derived (kinematic) variables #162

Closed: niksirbi closed this issue 4 months ago

niksirbi commented 5 months ago

This issue is meant for discussing the way we compute, access and store variables derived from pose tracks in movement. Apologies for the very long read, but this is an important design choice and warrants some deliberation. The conversation started during our last meeting with @lochhh and @sfmig. This is my attempt to write it up.

Context: movement's dataset structure

Predicted poses (pose tracks) are represented in movement as an xarray.Dataset object - hereafter referred to as a movement dataset (ds in the example code snippets below).

Right after loading, each movement dataset contains two data variables stored as xarray.DataArray objects:

[figure: diagram of the movement dataset structure - the position and confidence data variables and the dims they span]

You can think of each data variable as a multi-dimensional pandas.DataFrame or as a numpy.ndarray with labelled axes. In xarray terms, each axis (e.g. time) is called a dimension (dim), while the labelled 'ticks' along each axis are called coordinates (coords).

Grouping data variables together in a dataset makes sense when they share some common dims. In the movement dataset, the two variables share three of the four dims (see the figure above).

Other related data that do not constitute arrays but instead take the form of key-value pairs can be stored as attributes - i.e. inside the attrs dictionary.

All data variables and attributes can be conveniently accessed via the usual . attribute syntax, e.g.:

position = ds.position   # equivalent to ds["position"]
confidence = ds.confidence   # equivalent to ds["confidence"]
fps = ds.fps  # equivalent to ds.attrs["fps"]
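
For concreteness, here is a minimal sketch of how such a dataset could be constructed by hand (the sizes, keypoint names and fps below are made up; in practice movement's loading functions build this for you):

import numpy as np
import xarray as xr

# hypothetical dataset: 100 frames, 2 individuals, 3 keypoints, 2 spatial dims
rng = np.random.default_rng(0)
ds = xr.Dataset(
    data_vars={
        "position": (
            ("time", "individuals", "keypoints", "space"),
            rng.random((100, 2, 3, 2)),
        ),
        "confidence": (
            ("time", "individuals", "keypoints"),
            rng.random((100, 2, 3)),
        ),
    },
    coords={
        "time": np.arange(100) / 30,  # in seconds, assuming 30 fps
        "individuals": ["ind1", "ind2"],
        "keypoints": ["snout", "centre", "tail_base"],
        "space": ["x", "y"],
    },
    attrs={"fps": 30},
)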

Problem formulation

The position and confidence data variables (+ some attributes) are created automatically after loading predicted poses from one of our supported pose estimation frameworks.

The question is what to do with variables that movement derives from these 'primary' variables. For purposes of illustration, we will consider three example variables (each an xarray.DataArray, all appearing in the snippets below):

- velocity: the first derivative of position, in cartesian coordinates
- velocity_pol: the velocity vector transformed to polar coordinates
- speed: the magnitude (norm) of the velocity vector

Alternatives

Each of the above derived xarray.DataArray objects could be requested and stored in a variety of ways. Below, I'll go through some alternatives, and attempt to supply pros/cons for each:

1. Status quo: derived variables as accessor properties

The status quo relies on extending xarray using accessors. In short, accessors are xarray's way of adding domain-specific functionality to its xarray.DataArray and xarray.Dataset objects. This is strongly preferred over the standard way of extending objects (inheritance).

Accordingly, we have implemented a MoveAccessor, which extends xarray.Dataset and is accessed via the keyword "move". For example:

ds.cumsum()  # built-in xarray method
ds.sizes  # built-in xarray attribute

ds.move.validate()  # our movement-specific validation method
ds.move.velocity  # our movement-specific velocity property

Currently, derived variables can be computed via the accessor - ds.move.velocity. Under the hood, when we access the property for the first time, velocity is computed and stored as a data variable within the original dataset, alongside position and confidence. Once computed, it can be accessed in the same way as the 'primary' variables - i.e. ds.velocity or ds["velocity"].
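
To make the mechanism concrete, here is a bare-bones sketch of how such a cached accessor property can be registered (illustrative only; the differentiate("time") call stands in for the actual kinematics code):

import xarray as xr

@xr.register_dataset_accessor("move")
class MoveAccessor:
    """Sketch of a Dataset accessor with a cached velocity property."""

    def __init__(self, ds: xr.Dataset):
        self._obj = ds

    @property
    def velocity(self) -> xr.DataArray:
        # compute on first access, then cache as a data variable in the dataset
        if "velocity" not in self._obj:
            self._obj["velocity"] = self._obj["position"].differentiate("time")
        return self._obj["velocity"]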

All currently implemented kinematic variables - displacement, velocity, and acceleration - behave in this way. Through PR #155, so do their polar transformations.

velocity = ds.move.velocity
velocity_pol = ds.move.velocity_pol
speed = ds.move.speed

Pros

Cons

2. Getting derived variables via accessor methods

This alternative still relies on the MoveAccessor, but exposes the derived variables via custom methods instead of custom properties.

For example:

velocity = ds.move.compute_velocity(coordinates="cartesian")
velocity_polar = ds.move.compute_velocity(coordinates="polar")

speed = ds.move.compute_speed()
# the above could be an alias for something like
speed = ds.move.compute_velocity(coordinates="polar").sel(space_pol="rho").squeeze()

Each of these methods would return a separate xarray.DataArray object which would NOT be automatically stored in the original dataset.

If the user wishes to store these in the dataset, they could do so explicitly:

ds["velocity"] = ds.move.compute_velocity()

Pros

Cons

3. A mix of accessor properties and methods

From the above, it seems like using accessor properties duplicates data, while using accessor methods duplicates computation. Maybe it's possible to strike a balance between the two.

This mixed approach could look something like this:

velocity = ds.move.velocity
velocity_pol = velocity.move.cart2pol()
speed = velocity.move.magnitude()

This variant would require us to provide an extra accessor to extend xarray.DataArray objects and specifically operate on data variables that contain an appropriate spatial dimension (this is where the cart2pol and magnitude methods would be implemented).
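
For illustration, a rough sketch of what such a DataArray accessor could look like, assuming the spatial dimension is named space with x/y coordinates, and using the space_pol / rho / phi naming from the snippets above:

import numpy as np
import xarray as xr

@xr.register_dataarray_accessor("move")
class MoveDataArrayAccessor:
    """Sketch of a DataArray accessor for variables with a spatial dimension."""

    def __init__(self, da: xr.DataArray):
        if "space" not in da.dims:
            raise ValueError("the 'move' accessor requires a 'space' dimension")
        self._obj = da

    def magnitude(self) -> xr.DataArray:
        # Euclidean norm across the spatial dimension
        return xr.apply_ufunc(
            np.linalg.norm,
            self._obj,
            input_core_dims=[["space"]],
            kwargs={"axis": -1},
        )

    def cart2pol(self) -> xr.DataArray:
        # rho = vector magnitude, phi = angle in the x-y plane
        rho = self.magnitude()
        phi = np.arctan2(self._obj.sel(space="y"), self._obj.sel(space="x"))
        return xr.concat([rho, phi], dim="space_pol").assign_coords(
            space_pol=["rho", "phi"]
        )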

Pros

Cons

4. Use both accessor properties and methods

Another approach could be to always supply both alternatives 1 and 2 for every variable, so the user could choose between them:

# This would automatically store the variables
# in the dataset, as in alternative 1
velocity = ds.move.velocity
velocity_pol = ds.move.velocity_pol

# This would NOT automatically store the variables
# in the dataset, as in alternative 2
velocity = ds.move.compute_velocity(coordinates="cartesian")
velocity_pol = ds.move.compute_velocity(coordinates="polar")

Pros

Cons

5. Forget about accessors

We can always abandon the accessor way of doing things, and (given that inheritance and composition are discouraged for xarray objects) forget about object-oriented programming (OOP) altogether.

We could instead rely on analysis and utility functions that take one xarray.DataArray, apply some operation to it, and return another xarray.DataArray, e.g.:

from movement.analysis import kinematics as kin
from movement.utils.vector import cart2pol, magnitude

velocity = kin.compute_velocity(ds["position"])
velocity_pol = cart2pol(velocity)
speed = magnitude(velocity)

By the way, the above is already possible (apart from the magnitude() function, which could easily be added - see the sketch below).
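
For what it's worth, a minimal sketch of that missing function (later in this thread the same utility is referred to as norm), assuming the input has a dimension named space:

import numpy as np
import xarray as xr

def magnitude(da: xr.DataArray) -> xr.DataArray:
    """Euclidean norm over the spatial dimension (sketch only)."""
    return xr.apply_ufunc(
        np.linalg.norm, da, input_core_dims=[["space"]], kwargs={"axis": -1}
    )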

Pros

Cons

My personal take

After considering these alternatives, I lean towards sticking with the status quo (alternative 1) - i.e. every derived variable is an accessor property, and they all get stored as data variables in the dataset, duplication be damned.

This means that users will have to get used to the slightly strange .move syntax and behaviour, but at least these will be consistent throughout and there will be one main syntax to learn.

Power users who wish to override the default 'magic' behaviour can do so by using alternative 5, which already works anyway (and is what actually happens under the hood).

That said, I'm open to counter-arguments, and there may well be alternatives I haven't considered, so please chime in @neuroinformatics-unit/behaviour @b-peri !

niksirbi commented 5 months ago

Some further thoughts on this, based on today's meeting:

Accessor object names

The name MoveAccessor is quite cryptic, as most people won't have encountered xarray accessors before. I propose renaming it to MovementDataset - as in the movement-specific extension of an xarray.Dataset. It may better convey the fact that this is basically an xarray.Dataset, but with some specific added functionality. Similarly, if we also end up building an xarray.DataArray accessor (if we go with alternative 3 above), we can call that class MovementDataArray. Alternative shorter names could be MoveDataset and MoveDataArray, but those may be confused with functions, i.e. something that moves a dataset. In either case, I think we can keep using the move keyword for accessing the added methods and properties.

Data variables in polar coordinates

We speculated that polar coordinate systems may not be used that often in practice. Therefore it's probably fine to omit properties like velocity_pol from the accessor. If people want to convert from cartesian to polar, they can use the utils.vector module directly (as in alternative 5), i.e.:

from movement.utils.vector import cart2pol, magnitude

velocity = ds.move.velocity  # we still keep the cartesian versions in the accessor
velocity_pol = cart2pol(velocity)
speed = magnitude(velocity)

As mentioned in alternative 3, we can also implement vector utils as custom methods of a DataArray accessor:

velocity = ds.move.velocity
velocity_pol = velocity.move.cart2pol()
speed = velocity.move.magnitude()

I currently tend to like this option. As a drawback, I had mentioned that people would need to know that "speed" is the magnitude of the velocity vector, but perhaps that's not a bad thing. It will make everyone aware of how we define and use these terms.

I'm still somewhat undecided though...

lochhh commented 5 months ago

I think movement, as a Python "toolbox", should offer "tools" and let users decide what variables to store and under what names.

If we go for alternative 1 (i.e. storing everything) and the user decides to modify ds.position (e.g. by applying a confidence filter), we either need to update every stored property derived from ds.position or rely on the user to do so.

from movement.filtering import filter_by_confidence
from movement.analysis import kinematics as kin
from movement.utils.vector import norm  # not implemented yet

ds.move.velocity  # this stores velocity in ds
ds = filter_by_confidence(ds, threshold=0.6, print_report=True)  # this (atm) filters position only

# neither of the below recomputes velocity based on the filtered position
ds.move.velocity
ds.velocity

# need to explicitly recompute velocity
# (bracket syntax is required when assigning data variables)
ds["velocity"] = kin.compute_velocity(ds.position)

Conversely, if we only provide methods that do not store anything (i.e. no property accessors) across the package:

ds["velocity"] = ds.move.compute_velocity()  # preferred method
ds["velocity"] = kin.compute_velocity(ds.position)  # alternative

ds = filter_by_confidence(ds, threshold=0.6, print_report=True)
ds["velocity"] = ds.move.compute_velocity()  # recompute velocity

ds["speed"] = norm(ds.velocity)

Pros

- we do not need to decide what to store
- consistent interface across the package (all methods do not store anything?)
- MoveAccessor methods do not need to make assumptions about what the variable is called in the dataset
- no hidden surprises (i.e. no intermediary variables stored in ds); everything is explicit

Cons

- relies on users to store variables
- still need the MoveAccessor for providing convenience methods
- still need to decide which methods qualify for the MoveAccessor - or could we possibly override __getattr__ in MoveAccessor, as is done in Traja? (see the sketch below)
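
For reference, the __getattr__ delegation could look something like this (purely illustrative; it assumes every forwarded method operates on the position variable):

import xarray as xr
from movement.analysis import kinematics as kin

@xr.register_dataset_accessor("move")
class MoveAccessor:
    """Sketch: forward ds.move.compute_*() calls to the kinematics module."""

    def __init__(self, ds: xr.Dataset):
        self._obj = ds

    def __getattr__(self, name):
        # only called when normal attribute lookup fails, e.g.
        # ds.move.compute_velocity() -> kin.compute_velocity(ds["position"])
        if name.startswith("compute_") and hasattr(kin, name):
            return lambda: getattr(kin, name)(self._obj["position"])
        raise AttributeError(name)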

niksirbi commented 5 months ago

Thanks for the response @lochhh!

I kind of like your approach, though I need to sleep over it, and come back to it tomorrow.

I like having some frequently used functions as accessor methods (ds.move.compute_velocity()) and forgetting about custom properties for now. We leave it to the users to decide what to name the derived arrays and where to store them. That means we cannot make any assumptions about those names for some downstream action - i.e. we cannot have another method that assumes the existence of a "velocity" data variable - but that's probably for the better. The only things we can rely on are those that get created (and validated) when loading from a file.

In this model, we could even have convenience methods in the accessor, like:

def compute_speed(self):
    velocity = self.compute_velocity()
    return norm(velocity)

def compute_distance_traveled(self, start, stop):
    # 'start'/'stop' rather than 'from'/'to', as 'from' is a reserved keyword
    displacement = self.compute_displacement()
    return displacement.sel(time=slice(start, stop)).cumsum("time")

This would not store velocity, or displacement, and only return what the user asked for. If the user had already computed velocity at a previous step, they could always directly do norm(ds.velocity) to get speed, as in your example.

If we ever add custom accessor properties, they have to be about something that doesn't change with filtering/interpolation etc., to avoid the re-computing problem you described.

I have two further (related) questions:

  1. Do we keep vector utils like norm in a separate module as they are now, or do we create a DataArray accessor that offers them as methods? I now kind of think having a second accessor may be overkill.
  2. Should we reconsider what the filtering functions do? Currently they take a Dataset (with both position and confidence) and also return a Dataset, in which only the position variable has been updated. We could re-implement them as accessor methods that return a DataArray, e.g.:
    ds["position"] = ds.move.filter_by_confidence(threshold=0.9)  # this would overwrite the existing position variable
    ds["position_filtered"] = ds.move.filter_by_confidence(threshold=0.9)  # this would add a new data array

    I don't like the first one, because then you can no longer plot the position before/after filtering to see the effects. On the other hand, if you take the second option and then do ds.move.compute_velocity(), you will still be using the unfiltered position. We could make the compute_velocity() method accept the name of the position data variable as an argument (i.e. position is the default, but you can override it with position_filtered or anything else; see the sketch below). But this starts getting complex again, so perhaps we're better off leaving the filtering functions as they are.
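
For concreteness, the hypothetical data-variable argument mentioned above could look like this inside the accessor:

from movement.analysis import kinematics as kin

class MoveAccessor:
    def __init__(self, ds):
        self._obj = ds

    def compute_velocity(self, data_var="position"):
        # 'data_var' is hypothetical: it lets users point the computation at
        # e.g. "position_filtered" instead of the default "position"
        return kin.compute_velocity(self._obj[data_var])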

lochhh commented 5 months ago

I like the convenience methods you suggested, and if we decide to go for this "store-nothing" model, more of these methods can be added as required.

I am not sure we need a DataArray accessor. The example you provided:

ds["position"] = ds.move.filter_by_confidence(threshold=0.9)  # this would overwrite the existing position variable

is somewhat confusing, as I would expect the entire ds to be filtered, not just position. Perhaps we could refactor the filtering module to accept and return both DataArrays and Datasets: the DataArray input makes explicit which variable the filter applies to, while the Dataset input allows the filtering to be applied to all or selected variables stored in ds, via an additional argument:

ds = filter_by_confidence(ds, threshold=0.6) # apply filter on all data_vars, default for Dataset 
ds = filter_by_confidence(ds, threshold=0.6, data_vars=["position", "velocity"]) # apply filter on selected data_vars
position = filter_by_confidence(ds.position, threshold=0.6) 
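
A rough sketch of such a dual-input filter (with a confidence argument added for the DataArray case, since the snippet above leaves open where the mask would come from):

import xarray as xr

def filter_by_confidence(data, threshold=0.6, confidence=None, data_vars=None):
    """Sketch: accept either a Dataset or a DataArray.

    Points whose confidence falls below the threshold are set to NaN.
    For a Dataset, the mask comes from its own "confidence" variable;
    for a DataArray, a matching confidence array must be passed in.
    """
    if isinstance(data, xr.Dataset):
        mask = data["confidence"] >= threshold
        filtered = data.copy()
        for var in data_vars or [v for v in data.data_vars if v != "confidence"]:
            filtered[var] = data[var].where(mask)
        return filtered
    return data.where(confidence >= threshold)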

With the above, we could perhaps also add this to the MoveAccessor as a convenience method that takes a Dataset (with optional args) as input - although I'm not sure how much of a difference this makes:

ds = ds.move.filter_by_confidence(threshold=0.6) 
ds = ds.move.filter_by_confidence(threshold=0.6, data_vars=["position", "velocity"]) 

If we can override __getattr__ as mentioned in the previous comment, these are all trivial to add, otherwise we end up doing double the amount of work.

niksirbi commented 5 months ago

I quite like this:

ds = filter_by_confidence(ds, threshold=0.6, data_vars=["position", "velocity"]) # apply filter on selected data_vars
# and for convenience
ds = ds.move.filter_by_confidence(threshold=0.6, data_vars=["position", "velocity"]) 

The only downside is that, conceptually, I'd expect people to first filter position and then derive velocity, rather than deriving velocity and then filtering both. But I'm fine with giving people options, as long as our examples reflect best practice. This is relevant for your work, @b-peri.

If we can override __getattr__ as mentioned in the previous comment, these are all trivial to add, otherwise we end up doing double the amount of work.

I don't see why we shouldn't (but maybe there are downsides we'll bump into).

niksirbi commented 5 months ago

It sounds to me that we have an agreement @lochhh. I'll keep the discussion open till the end of this week, and then I'd say we can move on with the implementation.

Would you be willing to take up the conversion of the existing properties to this?

from movement.utils.vector import cart2pol

velocity = ds.move.compute_velocity()
velocity_pol = cart2pol(velocity)

lochhh commented 5 months ago

Yep, happy to do this once we run the idea by everyone in this week's behav meeting.

niksirbi commented 5 months ago

Summary 2024-04-30

This is my attempt at summarising the above discussion, with an emphasis on the points that @lochhh and I agree on.

  1. We provide methods via the accessor that do not store anything automatically, i.e. the user chooses where to store the resulting data variable and how to name it.

    ds["velocity"] = ds.move.compute_velocity()
    ds["acceleration"] = ds.move.compute_acceleration()
  2. Vector utils like norm or cart2pol are provided by a separate module:

    from movement.utils.vector import cart2pol, norm
    
    ds["velocity_pol"] = cart2pol(ds.velocity)
    ds["speed"] = norm(ds.velocity)

    Again, the user has the freedom to add the derived quantities to the dataset (as above) or to store them in separate DataArray objects, or to even put them into a different Dataset.

  3. We may provide some additional convenience methods via the accessor that use some of the vector utils under the hood, e.g.:

    def compute_speed(self):
        velocity = self.compute_velocity()
        return norm(velocity)
    
    def compute_distance_traveled(self, start, stop):
        # 'start'/'stop' rather than 'from'/'to', as 'from' is a reserved keyword
        displacement = self.compute_displacement()
        return displacement.sel(time=slice(start, stop)).cumsum("time")

    This is relevant for #147

  4. We should explore the option of passing more than one data variable to the filtering functions (at the user's own risk):

    filter_by_confidence(ds, threshold=0.6, data_vars=["position", "velocity"])

    This would return a new dataset, with the specified data_vars updated.