Pass data arrays to filtering functions instead of a dataset

neuroinformatics-unit / movement

Python tools for analysing body movements across space and time

http://movement.neuroinformatics.dev

BSD 3-Clause "New" or "Revised" License

77 stars 7 forks source link

Pass data arrays to filtering functions instead of a dataset #191

Open sfmig opened 1 month ago

sfmig commented 1 month ago

Is your feature request related to a problem? Please describe. Almost all filtering/smoothing operations are meant to operate on a single data array (in most cases the position data variable). However, they are currently designed such that they take in an entire movement dataset (so position + confidence) and return a different dataset in whuch only the position variable has been updated.

This made sense for our first filtering function - filter_by_confidence(), which needs both variables, but it no longer makes sense for the other filters like median_filter() and savgol_filter()

Describe the solution you'd like All filtering functions should explicitly accept a DataArray and return the modified DataArray. This is the transparent, no-magic option. For the special case of filter_by_confidence(), we will have to explicitly provide the confidence data array as a 2nd argument.

Describe alternatives you've considered The alternative would be to leave the filter_by_confidence() as is, and only change the other functions. This would make the interface somewhat inconsistent though.

Additional context Modifying filter_by_confidence() to accept both position and confidence data arrays as separate arguments would be a bit awkward. Under the hood, we would have to validate that those match (in shape and labels), and this is almost equivalent to re-building a movement dataset. It feels like we would be disassembling a dataset to pass it to this function, reassemble it internally to do the validation and thresholding, and then return only position.

We've already touched on this topic in #162 but we didn't agree on a solution. Me and @niksirbi agree that the current status-quo feels counter-intuitive. Thoughts @lochhh and @b-peri ?

niksirbi commented 1 month ago

An alternative is to separate filtering functions into multiple modules (which we'll end up doing anyway eventually).

The categories I can envision would be:

Outlier detection

This could reside in movement.prep.detect_outliers. The functions in here would introduce NaNs in the position data variable. The filter_by_confidence() function could go here (maybe renamed as by_confidence(). It could still accept the whole dataset as input.

Future work on #152 and #151 can also go here and be named by_pose_plausibility() and by_temporal_smoothness(). The question is whether they should also accept and return a whole movement dataset, to be consistent with by_confidence(). Technically we wouldn't need to use confidence in these functions, but I may be wrong.

Smoothing

This could reside in movement.prep.smooth and can contain median_filter(), savgol_filter() and similar, which would be altered to accept and return DataArray objects.

Interpolation

This could reside in movement.prep.interpolate and contain interpolate_over_time() and similar. It would also be altered to accept and return DataArray objects.

Utils

This would contain utilities useful for all the above, like report_nan_values()

lochhh commented 1 month ago

I agree with passing data arrays into the filters (and in general utility functions), rather than the entire dataset. As for the special case of filter_by_confidence, wdyt about simply renaming this to filter_position_by_confidence to 1. make explicit that this applies specifically to position and confidence; 2. distinguish this from other generic functions?

sfmig commented 1 month ago

Modifying filter_by_confidence() to accept both position and confidence data arrays as separate arguments would be a bit awkward. Under the hood, we would have to validate that those match (in shape and labels), and this is almost equivalent to re-building a movement dataset.

I kind of imagined xarray would complain if the dimensions and coordinates of position and confidence don't match. Is that not true?

wdyt about simply renaming this to filter_position_by_confidence

I like the explicitness. But we may want to leave flexibility for the case in which someone may want to filter by confidence a derived variable of position. For example, if from position I compute a data array that contains the distance from every keypoint to a fixed point in the arena, I may want to later remove from this data array all values that relate to a keypoint with low confidence (I did this a lot in my research).

I know it is not the classic pipeline we usually think about (remove outliers -> then derive variables from position) but I like that movement gives users flexibility in terms of how / in which order they should analyse their data.

On the categories of pre/post processing, I think it may be early to constrain. I would suggest having loose categories at this point (maybe just one), since adapting things later shouldn't be too costly right?

niksirbi commented 1 month ago

I kind of imagined xarray would complain if the dimensions and coordinates of position and confidence don't match. Is that not true?

Yes, that is indeed true.

I think the main problem is that filter_by_confidence() is and will always be a special case, different to all other filters that operate on a single array.

How about getting rid of it altogether? The only thing the function does is the following:

ds_thresholded = ds.copy()
ds_thresholded.update(
    {"position": ds.position.where(ds.confidence >= threshold)}
)
if print_report:
    report_nan_values(ds, "input dataset")
    report_nan_values(ds_thresholded, "filtered dataset")
return ds_thresholded

So, by getting rid of it we would lose the convenient report printing, and the automatic logging of the operation provided by the @log_to_attrs decorator.

Instead we could simply recommend doing the following:

position_filtered = ds.position.where(ds.confidence >= threshold)

That's just native xarray syntax and works for any two arrays (whenever one variable has to be filtered based on a different variable).

I do feel bad for all the hard work @b-peri did on the nan reports, but these were not in vain, because they'll still work for the other filtering functions, and users can always call the report_nan_values() utility on-demand.

Regarding the @log_to_attrs decorator, we'd have to re-think it anyway, because if the things being filtered are DataArray objects, it doesn't make much sense to log operations in the Dataset object. Doing operations on the DataArray level gives much greater flexibility to users, but makes the whole "reproducible logged workflows" dream I had much more complicated, so maybe we should abandon it for now. We can re-think this problem once movement is more mature and the concept of "workflows" makes more sense.

What do you think about this more "radical" approach? To summarise:

get rid of filter_by_confidence function
recommend using native xarray.where syntax in the relevant examples
drop the @log_to_attrs utility
interpolate_over_time, median_filter and savgol_filter should accept and return DataArray objects, instead of Dataset.

lochhh commented 1 month ago

We can continue using the @log_to_attrs at each of these functions that take DataArray as input to track data provenance; the logs will persist even if the user stores this DataArray back into the Dataset.

We didn't like the idea of a DataArray accessor, but I drafted it here to better visualise its usefulness (or not). This would allow us to continue logging to da.attrs while doing, e.g.:

position.move.filter_gt(confidence, threshold=0.6): filter position based on confidence > 0.6, but really the name should be filter_by_other_gt_threshold - I'm struggling to come up with a better one.
velocity.move.filter_lt(position, threshold=[123,456]): filter velocity based on invalid position < (123,456)

Again this seems like an overkill, as mentioned before, we can simply do:

position.where(confidence > threshold)
velocity.where(position < threshold)

Plus having the DataArray accessor, e.g. ds.position.move.filter_ might confuse users more, since we have ds.move.compute_kinematics.

Perhaps we could consider

ds["position"] = filter_by_other_ge_threshold(position, other=confidence, threshold=0.6)
ds["velocity"] = filter_by_other_ge_threshold(velocity, other=confidence, threshold=0.6)
# and for convenience
ds = ds.move.filter_by_confidence(threshold=0.6, data_vars=["position", "velocity"])

likewise

ds["position"] = savgol_filter(position, window_length=1, polyorder=2)
ds["velocity"] = savgol_filter(velocity, window_length=1, polyorder=2)
# and for convenience
ds = ds.move.savgol_filter(window_length=1, polyorder=2, data_vars=["position", "velocity"])

sfmig commented 1 month ago

Comments to @niksirbi's message

get rid of filter_by_confidence function

I am not sure, I do think it is useful for users to wrap these lines, even if they are a few. They will be used quite a lot....

recommend using native xarray.where syntax in the relevant examples

On the other hand, the .where syntax is simple, clear and flexible (i.e. valid for any two arrays of same shape) which I really like.

drop the @log_to_attrs utility

I think logging for reproducibility it's a cool idea that we should keep in mind, but agree that maybe we can revisit this once movement is pass the toddler phase :P Shall we make an issue to keep an eye on it?

interpolate_over_time, median_filter and savgol_filter should accept and return DataArray objects, instead of Dataset.

100% yes 👍 👍

sfmig commented 1 month ago

Comments to @lochhh

We didn't like the idea of a DataArray accessor, but I drafted it here to better visualise its usefulness (or not).

I really like this prototyping approach, thanks for doing it.

Again this seems like an overkill

I agree - it is quite clear when the syntax is compared directly.

Plus having the DataArray accessor, e.g. ds.position.move.filter_ might confuse users more

Agreed (it confuses me a bit at least).

Perhaps we could consider ...

IIUC, we are considering three main syntaxes for filtering by confidence:

.filter_by_confidence: would filter using a specific data array (confidence) and a specific method (thresholding and taking greater or equal values).
- .where syntax: would filter based on any kind of condition defined on any data arrays of the same shape. So less constrained to a specific data array or method than the previous one.
.filter_by_other_ge_threshold: somewhat in between. It would filter based on a specific method (thresholding with greater or equal) but on any data array of matching shape.

So there is a choice in how much we constrain the function. My (mild) opinion is that we make "constrained" utils if they wrap a specific case that is used a lot. The justification being that it would save the user a lot of typing.

With this in mind, my thoughts atm are:

that the .where syntax is attractive because it is flexible (which is good) and simple enough so that it does not require a wrapping.
I think it may be useful to still have a (quite constrained) .filter_by_confidence, just because it will be used a lot. But passing data arrays (rather than a data structure).
I find the third approach (filter_by_other_ge_threshold-like) feels unnecessarily restrictive. For example, the user may filter by another data array, but not necessarily by 'greater or equal' (vs for confidence, where constraining by greater or equal will be the case 99% of the time).

On the accessor method vs functional approach, I like the functional one a bit more. I find it less confusing, but that may be my specific background with programming (matlab cough cough).

lochhh commented 1 month ago

To clarify, I was referring to having two main syntaxes for filtering:

(recommended) via the move accessor, where (as @sfmig said) more constrained things happen, i.e.
```
ds = ds.move.filter_by_confidence(threshold=0.6, data_vars=["position", "velocity"]) 
ds = ds.move.savgol_filter(window_length=1, polyorder=2, data_vars=["position", "velocity"]) 
```
Under the hood, these call for every data_var, the more general (less constrained) functions in the filtering module that take DataArrays as input:
- filter_by_other_<comparator>_threshold, where <comparator> can be any comparison operator (similar implementation),
  - savgol_filter,
  - median_filter, etc.

via the less constrained filtering module

from movement.filtering import filter_by_other_ge_threshold, savgol_filter

ds["position"] = filter_by_other_ge_threshold(ds.position, other=confidence, threshold=0.6)
ds["velocity"] = filter_by_other_ge_threshold(ds.velocity, other=confidence, threshold=0.6)

ds["position"] = savgol_filter(ds.position, window_length=1, polyorder=2)
ds["velocity"] = savgol_filter(ds.velocity, window_length=1, polyorder=2)

sfmig commented 1 month ago

aah gotcha! I don't feel strongly either way tbh... 🤔

What about just having these options?

# savgol & most filters: 
# take data array as first input, rest of inputs are filter-specific parameters
ds["position"] = savgol_filter(ds.position, window_length=1, polyorder=2)
ds["velocity"] = savgol_filter(ds.velocity, window_length=1, polyorder=2)

# filter by confidence following the same structure for the function signature 
# (confidence as a filter-specific parameter)
ds["position"] = filter_by_confidence(ds.position, confidence_array=ds.confidence, threshold=0.6)

# then use .where for more flexible conditions
ds["position"] = ds.position.where(
    ds.position.sel(space="x") <= 6000 & 
    ds.position.sel(space="x") >= 2000
)

niksirbi commented 1 month ago

Thanks a lot @sfmig and @lochhh for the back and forth! These discussions are super useful. Below I give a summary of my current take on this after reading your messages. If I've misunderstood/misrepresented your views, let me know.

The thing we all agree on

All functions in the filtering.py module (which may be split in future) should accept 1 DataArray as input and return the filtered DataArray as output. So as in the examples you both gave:

ds["position"] = savgol_filter(ds.position, window_length=1, polyorder=2)

We won't plan to implement a DataArray accessor. Operations on arrays will use the above functional paradigm.

Accessor methods for filters

I also like the idea of having accessor methods that would call the above filters, but can work on multiple arrays at once:

ds = ds.move.savgol_filter(window_length=1, polyorder=2, data_vars=["position", "velocity"])

This might look like "two ways of doing the same thing", but I don't see it as such. The whole point of having datasets is that you can group multiple arrays together, so it makes sense to provide methods that simplify operations on multiple related arrays.

So to summarise, functions in filtering.py are purely functional, DataArray in, DataArray out, which is consistent with how the kinematics.py module works. We can have accessor methods for the same thing, assuming that they provide the "added value" of conveniently operating on multiple arrays or needing to combine information from >1 arrays (which will be my next point).

What about `filter_by_confidence`?

This function, since it combines information across multiple arrays, makes sense to be implemented as an accessor method:

ds = ds.move.filter_by_confidence(threshold=0.6, data_vars=["position", "velocity"])

I also expect this to be heavily used, so having a convenient entry point (even if it's "just a wrapper around where") is useful.

That said, I don't see much value in having more generic functions of the filter_by_other_ge_threshold type (neither as functions nor as accessor methods). If people want to do such custom stuff, they might as well just use where.

lochhh commented 1 month ago

Thanks, very nicely summarised, @niksirbi!

What about filter_by_confidence?

This function, since it combines information across multiple arrays, makes sense to be implemented as an accessor method:
ds = ds.move.filter_by_confidence(threshold=0.6, data_vars=["position", "velocity"]) 
I also expect this to be heavily used, so having a convenient entry point (even if it's "just a wrapper around where") is useful.

That said, I don't see much value in having more generic functions of the filter_by_other_ge_threshold type (neither as functions nor as accessor methods). If people want to do such custom stuff, they might as well just use where.

I feel like there would be a "surprise" element here: Why is filter_by_confidence only available via the accessor, which is why I thought might be nice to have the more generic filter_by_other... On second thought, @sfmig's suggestion below would be a great fit, so we continue having both, consistent with the other filters.

# filter by confidence following the same structure for the function signature 
# (confidence as a filter-specific parameter)
ds["position"] = filter_by_confidence(ds.position, confidence_array=ds.confidence, threshold=0.6)

The main benefit of preferring the above functions to the simpler .where is that we keep a log of the actions.

niksirbi commented 1 month ago

I'm fine with that solution! A clarification regarding the logs, I see how they'll still work if people use the accessor methods, but is there a way to preserve that when they are just using the DataArray functions? I thought that wasn't possible, but your comment above suggests otherwise. Do Datasets inherit attributes from DataArrays if say one of them get's filtered?

lochhh commented 1 month ago

The attributes are attached to the DataArray instead of the Dataset, so you would need to do ds.position.logs instead of ds.logs. But because we have the accessor, we can implement logs property for the Dataset which concats all data_vars.logs.

niksirbi commented 1 month ago

Hmm nice idea, I could see that working... Since you already have some drafts going on, are you ok with undertaking to implement the changes we decided on here (assuming @sfmig agrees)? There's no particular rush on this, you may take your time.

sfmig commented 1 month ago

nice, I like it! I agree with @niksirbi's summary, with the additional @lochhh's suggestion on having filter_by_confidence in both formats. It feels logical and consistent 🤩

neuroinformatics-unit / movement

Pass data arrays to filtering functions instead of a dataset #191

Outlier detection

Smoothing

Interpolation

Utils

The thing we all agree on

Accessor methods for filters

What about filter_by_confidence?

What about filter_by_confidence?

What about `filter_by_confidence`?

What about `filter_by_confidence`?