sfmig opened this issue 1 month ago
An alternative is to separate filtering functions into multiple modules (which we'll end up doing anyway eventually). The categories I can envision would be (see the illustrative sketch after this list):

- One could reside in `movement.prep.detect_outliers`. The functions in here would introduce NaNs in the `position` data variable. The `filter_by_confidence()` function could go here (maybe renamed as `by_confidence()`); it could still accept the whole dataset as input. Future work on #152 and #151 can also go here, named `by_pose_plausibility()` and `by_temporal_smoothness()`. The question is whether they should also accept and return a whole `movement` dataset, to be consistent with `by_confidence()`. Technically we wouldn't need to use `confidence` in these functions, but I may be wrong.
- Another could reside in `movement.prep.smooth` and could contain `median_filter()`, `savgol_filter()` and similar, which would be altered to accept and return `DataArray` objects.
- A third could reside in `movement.prep.interpolate` and contain `interpolate_over_time()` and similar. It would also be altered to accept and return `DataArray` objects.
- A last one would contain utilities useful for all the above, like `report_nan_values()`.
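For illustration, under this layout the imports would read something like the following (all module paths and names here are proposals from this comment, not existing `movement` API; the utilities module name is my guess):

```python
# Hypothetical module layout sketched above
from movement.prep.detect_outliers import by_confidence
from movement.prep.smooth import median_filter, savgol_filter
from movement.prep.interpolate import interpolate_over_time
from movement.prep.utils import report_nan_values
```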
I agree with passing data arrays into the filters (and in general utility functions), rather than the entire dataset. As for the special case of `filter_by_confidence`, wdyt about simply renaming this to `filter_position_by_confidence`, to (1) make explicit that this applies specifically to `position` and `confidence`, and (2) distinguish this from other generic functions?
> Modifying `filter_by_confidence()` to accept both `position` and `confidence` data arrays as separate arguments would be a bit awkward. Under the hood, we would have to validate that those match (in shape and labels), and this is almost equivalent to re-building a `movement` dataset.
I kind of imagined `xarray` would complain if the dimensions and coordinates of `position` and `confidence` don't match. Is that not true?
> wdyt about simply renaming this to `filter_position_by_confidence`
I like the explicitness. But we may want to leave flexibility for the case in which someone may want to filter by confidence a derived variable of `position`. For example, if from `position` I compute a data array that contains the distance from every keypoint to a fixed point in the arena, I may want to later remove from this data array all values that relate to a keypoint with low confidence (I did this a lot in my research).
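Something like this sketch (assuming a movement-style `ds` with matching `position` and `confidence` arrays; the fixed point and threshold are made up):

```python
import xarray as xr

# A fixed point in the arena, broadcastable against position's "space" dim
fixed_point = xr.DataArray([100.0, 200.0], coords={"space": ["x", "y"]}, dims="space")

# Derived variable: Euclidean distance of every keypoint to the fixed point
distance = ((ds.position - fixed_point) ** 2).sum(dim="space") ** 0.5

# Later, mask out distance values that came from low-confidence keypoints
distance_filtered = distance.where(ds.confidence >= 0.6)
```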
I know it is not the classic pipeline we usually think about (remove outliers -> then derive variables from `position`), but I like that `movement` gives users flexibility in terms of how / in which order they should analyse their data.

On the categories of pre/post processing, I think it may be early to constrain. I would suggest having loose categories at this point (maybe just one), since adapting things later shouldn't be too costly, right?
> I kind of imagined `xarray` would complain if the dimensions and coordinates of `position` and `confidence` don't match. Is that not true?
Yes, that is indeed true.

I think the main problem is that `filter_by_confidence()` is and will always be a special case, different to all other filters that operate on a single array. How about getting rid of it altogether? The only thing the function does is the following:

```python
ds_thresholded = ds.copy()
ds_thresholded.update(
    {"position": ds.position.where(ds.confidence >= threshold)}
)
if print_report:
    report_nan_values(ds, "input dataset")
    report_nan_values(ds_thresholded, "filtered dataset")
return ds_thresholded
```
So, by getting rid of it we would lose the convenient report printing, and the automatic logging of the operation provided by the `@log_to_attrs` decorator. Instead we could simply recommend doing the following:

```python
position_filtered = ds.position.where(ds.confidence >= threshold)
```

That's just native `xarray` syntax and works for any two arrays (whenever one variable has to be filtered based on a different variable). I do feel bad for all the hard work @b-peri did on the NaN reports, but these were not in vain, because they'll still work for the other filtering functions, and users can always call the `report_nan_values()` utility on demand.
Regarding the `@log_to_attrs` decorator, we'd have to re-think it anyway, because if the things being filtered are `DataArray` objects, it doesn't make much sense to log operations in the `Dataset` object. Doing operations at the `DataArray` level gives much greater flexibility to users, but makes the whole "reproducible logged workflows" dream I had much more complicated, so maybe we should abandon it for now. We can re-think this problem once `movement` is more mature and the concept of "workflows" makes more sense.
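(For context, the decorator in question appends a record of each operation to the output's attributes. A minimal sketch of the idea, not the actual `movement` implementation:)

```python
import functools
from datetime import datetime

def log_to_attrs(func):
    """Append a record of the wrapped operation to the result's attrs."""
    @functools.wraps(func)
    def wrapper(data, *args, **kwargs):
        result = func(data, *args, **kwargs)
        log_entry = {
            "operation": func.__name__,
            "datetime": str(datetime.now()),
            **{k: str(v) for k, v in kwargs.items()},
        }
        # attrs is a plain dict on both DataArray and Dataset
        result.attrs.setdefault("log", []).append(log_entry)
        return result
    return wrapper
```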
What do you think about this more "radical" approach? To summarise:

- get rid of the `filter_by_confidence` function
- recommend using native `xarray.where` syntax in the relevant examples
- drop the `@log_to_attrs` utility
- `interpolate_over_time`, `median_filter` and `savgol_filter` should accept and return `DataArray` objects, instead of `Dataset`

We can continue using `@log_to_attrs` at each of these functions that take a `DataArray` as input to track data provenance; the logs will persist even if the user stores this `DataArray` back into the `Dataset`.
We didn't like the idea of a `DataArray` accessor, but I drafted it here to better visualise its usefulness (or not). This would allow us to continue logging to `da.attrs` while doing, e.g.:

- `position.move.filter_gt(confidence, threshold=0.6)`: filter `position` based on `confidence > 0.6`, but really the name should be `filter_by_other_gt_threshold` - I'm struggling to come up with a better one.
- `velocity.move.filter_lt(position, threshold=[123, 456])`: filter `velocity` based on invalid `position < (123, 456)`.

Again this seems like an overkill; as mentioned before, we can simply do:

```python
position.where(confidence > threshold)
velocity.where(position < threshold)
```

Plus having the `DataArray` accessor, e.g. `ds.position.move.filter_`, might confuse users more, since we have `ds.move.compute_kinematics`.
Perhaps we could consider:

```python
ds["position"] = filter_by_other_ge_threshold(position, other=confidence, threshold=0.6)
ds["velocity"] = filter_by_other_ge_threshold(velocity, other=confidence, threshold=0.6)
# and for convenience
ds = ds.move.filter_by_confidence(threshold=0.6, data_vars=["position", "velocity"])
```

and likewise:

```python
ds["position"] = savgol_filter(position, window_length=1, polyorder=2)
ds["velocity"] = savgol_filter(velocity, window_length=1, polyorder=2)
# and for convenience
ds = ds.move.savgol_filter(window_length=1, polyorder=2, data_vars=["position", "velocity"])
```
**Comments to @niksirbi's message**

> get rid of the `filter_by_confidence` function

I am not sure; I do think it is useful for users to wrap these lines, even if they are a few. They will be used quite a lot...

> recommend using native `xarray.where` syntax in the relevant examples

On the other hand, the `.where` syntax is simple, clear and flexible (i.e. valid for any two arrays of the same shape), which I really like.

> drop the `@log_to_attrs` utility

I think logging for reproducibility is a cool idea that we should keep in mind, but I agree that maybe we can revisit this once `movement` is past the toddler phase :P Shall we make an issue to keep an eye on it?

> `interpolate_over_time`, `median_filter` and `savgol_filter` should accept and return `DataArray` objects, instead of `Dataset`

100% yes 👍 👍
**Comments to @lochhh's message**

> We didn't like the idea of a `DataArray` accessor, but I drafted it here to better visualise its usefulness (or not).

I really like this prototyping approach, thanks for doing it.

> Again this seems like an overkill

I agree - it is quite clear when the syntax is compared directly.

> Plus having the `DataArray` accessor, e.g. `ds.position.move.filter_`, might confuse users more

Agreed (it confuses me a bit at least).

> Perhaps we could consider ...
IIUC, we are considering three main syntaxes for filtering by confidence:

- `.filter_by_confidence`: would filter using a specific data array (`confidence`) and a specific method (thresholding and taking greater-or-equal values).
- `.where` syntax: would filter based on any kind of condition defined on any data arrays of the same shape. So less constrained to a specific data array or method than the previous one.
- `.filter_by_other_ge_threshold`: somewhat in between. It would filter based on a specific method (thresholding with greater-or-equal) but on any data array of matching shape.

So there is a choice in how much we constrain the function. My (mild) opinion is that we make "constrained" utils if they wrap a specific case that is used a lot, the justification being that it would save the user a lot of typing. With this in mind, my thoughts atm are:

- The `.where` syntax is attractive because it is flexible (which is good) and simple enough that it does not require wrapping.
- I would keep `.filter_by_confidence`, just because it will be used a lot, but passing data arrays (rather than a data structure).
- Anything in between (`.filter_by_other_ge_threshold`-like) feels unnecessarily restrictive. For example, the user may filter by another data array, but not necessarily by "greater or equal" (vs for confidence, where constraining by greater-or-equal will be the case 99% of the time).

On the accessor method vs functional approach, I like the functional one a bit more. I find it less confusing, but that may be my specific background with programming (matlab cough cough).
To clarify, I was referring to having two main syntaxes for filtering:

1. (recommended) via the `move` accessor, where (as @sfmig said) more constrained things happen, i.e.:

```python
ds = ds.move.filter_by_confidence(threshold=0.6, data_vars=["position", "velocity"])
ds = ds.move.savgol_filter(window_length=1, polyorder=2, data_vars=["position", "velocity"])
```

Under the hood, these call, for every `data_var`, the more general (less constrained) functions in the `filtering` module that take `DataArray`s as input (see the sketch after these examples):

- `filter_by_other_<comparator>_threshold`, where `<comparator>` can be any comparison operator (similar implementation),
- `savgol_filter`, `median_filter`, etc.

2. via the less constrained `filtering` module:

```python
from movement.filtering import filter_by_other_ge_threshold, savgol_filter

ds["position"] = filter_by_other_ge_threshold(ds.position, other=ds.confidence, threshold=0.6)
ds["velocity"] = filter_by_other_ge_threshold(ds.velocity, other=ds.confidence, threshold=0.6)
ds["position"] = savgol_filter(ds.position, window_length=1, polyorder=2)
ds["velocity"] = savgol_filter(ds.velocity, window_length=1, polyorder=2)
```
aah gotcha! I don't feel strongly either way tbh... 🤔

What about just having these options?

```python
# savgol & most filters:
# take a data array as first input; the rest are filter-specific parameters
ds["position"] = savgol_filter(ds.position, window_length=1, polyorder=2)
ds["velocity"] = savgol_filter(ds.velocity, window_length=1, polyorder=2)

# filter by confidence following the same structure for the function signature
# (confidence as a filter-specific parameter)
ds["position"] = filter_by_confidence(ds.position, confidence_array=ds.confidence, threshold=0.6)

# then use .where for more flexible conditions
# (note the parentheses: comparisons must be grouped before combining with &)
ds["position"] = ds.position.where(
    (ds.position.sel(space="x") <= 6000)
    & (ds.position.sel(space="x") >= 2000)
)
```
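For what it's worth, the DataArray-level `filter_by_confidence` in this scheme could reduce to a thin wrapper around `where` (a sketch under the signature proposed above, not a final implementation):

```python
import xarray as xr

def filter_by_confidence(
    data: xr.DataArray,
    confidence_array: xr.DataArray,
    threshold: float = 0.6,
) -> xr.DataArray:
    """Set values of `data` to NaN wherever confidence is below `threshold`.

    xarray takes care of broadcasting/aligning the two arrays in `where`.
    """
    # @log_to_attrs and report_nan_values could wrap/follow this, as discussed
    return data.where(confidence_array >= threshold)
```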
Thanks a lot @sfmig and @lochhh for the back and forth! These discussions are super useful. Below I give a summary of my current take on this after reading your messages. If I've misunderstood/misrepresented your views, let me know.

All functions in the `filtering.py` module (which may be split in future) should accept one `DataArray` as input and return the filtered `DataArray` as output, as in the examples you both gave:

```python
ds["position"] = savgol_filter(ds.position, window_length=1, polyorder=2)
```

We won't plan to implement a `DataArray` accessor. Operations on arrays will use the above functional paradigm.

I also like the idea of having accessor methods that would call the above filters, but can work on multiple arrays at once:

```python
ds = ds.move.savgol_filter(window_length=1, polyorder=2, data_vars=["position", "velocity"])
```

This might look like "two ways of doing the same thing", but I don't see it as such. The whole point of having datasets is that you can group multiple arrays together, so it makes sense to provide methods that simplify operations on multiple related arrays.

So to summarise, functions in `filtering.py` are purely functional, `DataArray` in, `DataArray` out, which is consistent with how the `kinematics.py` module works. We can have accessor methods for the same thing, assuming that they provide the "added value" of conveniently operating on multiple arrays or needing to combine information from more than one array (which will be my next point).

What about `filter_by_confidence`? This function, since it combines information across multiple arrays, makes sense to be implemented as an accessor method:

```python
ds = ds.move.filter_by_confidence(threshold=0.6, data_vars=["position", "velocity"])
```

I also expect this to be heavily used, so having a convenient entry point (even if it's "just a wrapper around `where`") is useful.

That said, I don't see much value in having more generic functions of the `filter_by_other_ge_threshold` type (neither as functions nor as accessor methods). If people want to do such custom stuff, they might as well just use `where`.
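(As an accessor method, that could amount to something like this sketch, reusing the `where` pattern per data variable; names are the proposed ones, not existing API:)

```python
import xarray as xr

@xr.register_dataset_accessor("move")
class MoveAccessor:
    def __init__(self, ds: xr.Dataset):
        self._ds = ds

    def filter_by_confidence(self, threshold=0.6, data_vars=("position",)):
        """Mask each requested variable where confidence < threshold."""
        ds = self._ds.copy()
        for var in data_vars:
            ds[var] = ds[var].where(ds.confidence >= threshold)
        return ds
```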
Thanks, very nicely summarised, @niksirbi!

> What about `filter_by_confidence`? This function, since it combines information across multiple arrays, makes sense to be implemented as an accessor method:
>
> `ds = ds.move.filter_by_confidence(threshold=0.6, data_vars=["position", "velocity"])`
>
> I also expect this to be heavily used, so having a convenient entry point (even if it's "just a wrapper around `where`") is useful. That said, I don't see much value in having more generic functions of the `filter_by_other_ge_threshold` type (neither as functions nor as accessor methods). If people want to do such custom stuff, they might as well just use `where`.

I feel like there would be a "surprise" element here: why is `filter_by_confidence` only available via the accessor? That is why I thought it might be nice to have the more generic `filter_by_other..`. On second thought, @sfmig's suggestion below would be a great fit, so we continue having both, consistent with the other filters:

```python
# filter by confidence following the same structure for the function signature
# (confidence as a filter-specific parameter)
ds["position"] = filter_by_confidence(ds.position, confidence_array=ds.confidence, threshold=0.6)
```

The main benefit of preferring the above functions to the simpler `.where` is that we keep a log of the actions.
I'm fine with that solution!
A clarification regarding the logs: I see how they'll still work if people use the accessor methods, but is there a way to preserve that when they are just using the `DataArray` functions? I thought that wasn't possible, but your comment above suggests otherwise. Do `Dataset`s inherit attributes from `DataArray`s if, say, one of them gets filtered?
The attributes are attached to the `DataArray` instead of the `Dataset`, so you would need to do `ds.position.logs` instead of `ds.logs`. But because we have the accessor, we can implement a `logs` property for the `Dataset` which concats all the `data_vars`' logs.
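Something along these lines (a sketch, assuming each array's log lives under a "log" key in its attrs):

```python
import xarray as xr

@xr.register_dataset_accessor("move")
class MoveAccessor:
    def __init__(self, ds: xr.Dataset):
        self._ds = ds

    @property
    def logs(self) -> list:
        """Concatenate the logs of all data variables that carry one."""
        return [
            entry
            for da in self._ds.data_vars.values()
            for entry in da.attrs.get("log", [])
        ]
```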
Hmm, nice idea, I could see that working... Since you already have some drafts going on, are you ok with undertaking the implementation of the changes we decided on here (assuming @sfmig agrees)? There's no particular rush on this, you may take your time.
Nice, I like it! I agree with @niksirbi's summary, with @lochhh's additional suggestion of having `filter_by_confidence` in both formats. It feels logical and consistent 🤩
**Is your feature request related to a problem? Please describe.**

Almost all filtering/smoothing operations are meant to operate on a single data array (in most cases the `position` data variable). However, they are currently designed such that they take in an entire `movement` dataset (so `position` + `confidence`) and return a different dataset in which only the `position` variable has been updated. This made sense for our first filtering function - `filter_by_confidence()`, which needs both variables - but it no longer makes sense for the other filters like `median_filter()` and `savgol_filter()`.

**Describe the solution you'd like**

All filtering functions should explicitly accept a `DataArray` and return the modified `DataArray`. This is the transparent, no-magic option. For the special case of `filter_by_confidence()`, we will have to explicitly provide the `confidence` data array as a 2nd argument.

**Describe alternatives you've considered**

The alternative would be to leave `filter_by_confidence()` as is, and only change the other functions. This would make the interface somewhat inconsistent, though.

**Additional context**

Modifying `filter_by_confidence()` to accept both `position` and `confidence` data arrays as separate arguments would be a bit awkward. Under the hood, we would have to validate that those match (in shape and labels), and this is almost equivalent to re-building a `movement` dataset. It feels like we would be disassembling a dataset to pass it to this function, reassemble it internally to do the validation and thresholding, and then return only `position`.

We've already touched on this topic in #162 but we didn't agree on a solution. @niksirbi and I agree that the current status quo feels counter-intuitive. Thoughts, @lochhh and @b-peri?