Need to re-design dataset validators?

niksirbi commented 3 weeks ago

The problem

Prior to PR #201, a "movement dataset" was synonymous with a "poses dataset", because movement only supported pose tracking data. For this reason, we were using the ValidPosesDataset validator everywhere:

- when loading data from file
- prior to saving data to file
- sometimes before computing a derivative (though that is now being removed, see #206).

Going forward, movement will support bbox-tracking data as well (and perhaps other types in the future). We would still like the accessor to work for both poses and bboxes, i.e. both should still count as a movement dataset. But this means we have to fundamentally re-design our validation strategy (we can't keep using the same ValidPosesDataset validator for everything).

Potential solution

Probably we will end up defining several "entities":

- the movement dataset, which is the "base" dataset with minimal requirements. Perhaps its self.validate() method should only check that the position and confidence data variables exist, and only require the space and time dimensions. Maybe these checks should be implemented as part of a ValidMovementDataset validator. If I'm not mistaken, these checks are sufficient for all our current filters and kinematic variable computations to work, i.e. all accessor methods will/should work for a valid movement dataset.
- the bboxes dataset, which is a subcategory of the movement dataset and additionally requires a shape array. This should be validated via the ValidBboxesDataset validator, which should run when loading/saving bbox data from/to files and before any operation that only works on bboxes. The individuals dimension should perhaps be optional, but that is a separate discussion (see below).
- the poses dataset, which is also a subcategory of the movement dataset and additionally requires a keypoints dimension. This is based on the fact that a "pose" is usually defined as a set of keypoints. The ValidPosesDataset validator could be used to validate such data during I/O and before any computation that only works on pose tracks.
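A minimal sketch of the three-level scheme above, not movement's actual implementation. It assumes the dataset exposes data_vars and dims mappings, as an xarray.Dataset does, and uses a SimpleNamespace stand-in so the snippet runs without xarray. The function names validate_movement, validate_bboxes and validate_poses are hypothetical.

```python
from types import SimpleNamespace

# Requirements taken from the proposal above; all function names are hypothetical.
BASE_VARS = ("position", "confidence")
BASE_DIMS = ("time", "space")


def validate_movement(ds, extra_vars=(), extra_dims=()):
    """Raise ValueError if ds lacks any required data variable or dimension."""
    missing_vars = [v for v in (*BASE_VARS, *extra_vars) if v not in ds.data_vars]
    missing_dims = [d for d in (*BASE_DIMS, *extra_dims) if d not in ds.dims]
    if missing_vars or missing_dims:
        raise ValueError(
            f"Missing data variables {missing_vars} and dimensions {missing_dims}."
        )


def validate_bboxes(ds):
    """Base checks plus the 'shape' data variable."""
    validate_movement(ds, extra_vars=("shape",))


def validate_poses(ds):
    """Base checks plus the 'keypoints' dimension."""
    validate_movement(ds, extra_dims=("keypoints",))


# Stand-in for an xarray.Dataset: point tracks without a keypoints dimension.
ds = SimpleNamespace(
    data_vars={"position": None, "confidence": None},
    dims={"time": 100, "space": 2, "individuals": 3},
)
validate_movement(ds)  # passes: the base requirements are met
try:
    validate_poses(ds)
except ValueError as err:
    poses_error = str(err)  # reports the missing 'keypoints' dimension
```

Note that the individuals dimension is never checked here, so it stays optional at every level.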

The above arrangement has some kinks though. For example, what should we do in cases where only 1 keypoint is being tracked per individual (as in the Aeon dataset)? That's not a "pose" strictly speaking, but it can very well be accommodated within a poses dataset with a single keypoint. However, this raises the question of whether a singleton keypoints dimension should exist in such a case (as @vigji has raised; see this issue and this Zulip thread). As an alternative, we could agree that all point tracking data is "poses" and make the keypoints dimension optional (i.e. a poses dataset would essentially be the same as the "base" movement dataset).

Related to the above, I think the individuals dimension, plus any other extra dimensions (like views), should always be optional, i.e. their presence/absence should not be validated by the dataset validators, and they should only be created and validated when and as needed (basically agreeing with what was expressed in the Zulip thread).

The question is, can we restructure dataset validation in a way that accommodates something like the above scheme, with the kinks ironed out? I'm fully open to better ideas on this.

vigji commented 3 weeks ago

Just read this thread. As I am trying things out locally to move #197 forward (struggling with the test structure rn), I would like to make sure I do not end up finding solutions for the validator that are then superseded by a redefinition of those classes. What do you think, @niksirbi?

For what it's worth, I think it makes sense to start allowing optional dimensions, such as keypoints or individuals, as early as possible. But I do not know the class structure in enough detail to really give an insightful opinion!

niksirbi commented 3 weeks ago

Hey @vigji, basically we have two design constraints right now, and both of them hinge on redefining the validators:

1. accommodate data from bbox tracking as well as pose tracking experiments
2. allow flexibility in number of dimensions (make many of them optional), which is what you brought up

I think it would be ideal if the re-designed validators solved both problems in one sweep, especially because they are somewhat interrelated. I agree with you that it's better to tackle such issues early rather than when the project is more mature. This means that the validators and I/O functions are about to go through an unstable period until we settle on a new structure that works.

Regarding your experiments in #197, I'd say feel free to continue experimenting on point 2, but don't worry about getting any of the code "camera ready" just yet, because we will likely have to alter it to match the ongoing changes.

Regarding the structure of tests, is there anything we can do to help? I'd be open to hopping on a quick Zoom call some time next week if that'll help clarify things.

vigji commented 3 weeks ago

> Regarding your experiments in https://github.com/neuroinformatics-unit/movement/pull/197, I'd say feel free to continue experimenting on point 2, but don't worry about getting any of the code "camera ready" just yet, because likely we'd have to alter it to match the ongoing changes.

Ok!

> Regarding the structure of tests, is there anything we can do to help? I'd be open to hopping on a quick Zoom call some time next week if that'll help clarify things.

I'll dm you on Zulip :)

niksirbi commented 2 weeks ago

@b-peri had a good idea that might help with this:

When we load data, we know what type it is (poses or bboxes), so we could add a dataset attribute (e.g. ds.tracking_type) that keeps that information. Subsequent validation can be done taking into account the value of this attribute. For example:

if ds.tracking_type == "bboxes", check that the shape data variable is present.

The more general validation steps (e.g. existence of space and time dimensions) can run independently of the value of this attribute, while more specialised validation will depend on the tracking type.
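The attribute-based idea could look roughly like the sketch below. It is an illustration only: validation_errors and TYPE_REQUIREMENTS are hypothetical names, the dataset is assumed to expose data_vars, dims and attrs as an xarray.Dataset does, and treating a missing or unknown tracking_type as an error is an assumption, not something decided in this thread.

```python
from types import SimpleNamespace

# Extra requirements per tracking type (hypothetical table based on this thread).
TYPE_REQUIREMENTS = {
    "poses": {"data_vars": (), "dims": ("keypoints",)},
    "bboxes": {"data_vars": ("shape",), "dims": ()},
}


def validation_errors(ds):
    """Return a list of problems found in ds; an empty list means it is valid."""
    # General checks run regardless of the tracking type.
    errors = [f"missing dimension '{d}'" for d in ("time", "space") if d not in ds.dims]
    errors += [
        f"missing data variable '{v}'"
        for v in ("position", "confidence")
        if v not in ds.data_vars
    ]
    # Specialised checks depend on the attribute stored at load time.
    tracking_type = ds.attrs.get("tracking_type")
    if tracking_type not in TYPE_REQUIREMENTS:
        # Assumption: an absent or unrecognised type is reported as a problem.
        errors.append(f"unknown tracking_type {tracking_type!r}")
        return errors
    extra = TYPE_REQUIREMENTS[tracking_type]
    errors += [
        f"{tracking_type}: missing dimension '{d}'"
        for d in extra["dims"]
        if d not in ds.dims
    ]
    errors += [
        f"{tracking_type}: missing data variable '{v}'"
        for v in extra["data_vars"]
        if v not in ds.data_vars
    ]
    return errors


# Stand-in for an xarray.Dataset loaded from a bbox file but lacking 'shape'.
ds = SimpleNamespace(
    data_vars={"position": None, "confidence": None},
    dims={"time": 100, "space": 2},
    attrs={"tracking_type": "bboxes"},
)
errors = validation_errors(ds)  # one entry, for the missing 'shape' variable
```

Collecting errors in a list rather than raising on the first one is a design choice made here for illustration; it lets a caller report every problem at once.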

neuroinformatics-unit / movement

Need to re-design dataset validators? #210
