neuroinformatics-unit / movement

Python tools for analysing body movements across space and time
http://movement.neuroinformatics.dev
BSD 3-Clause "New" or "Revised" License

Define data model for representing points and trajectories #12

Closed niksirbi closed 1 year ago

niksirbi commented 1 year ago

Define custom classes for representing points (animal body parts) and series of points (animal trajectories) in space.

These could be sub-classes of np.record and np.recarray respectively, to access fields (e.g. 'x', 'y', 'name', 'confidence') as attributes.

Note numpy.record and numpy.recarray are NumPy data structures for handling structured data with multiple fields per element. A numpy.record represents a single structured element, while a numpy.recarray is an array of such elements. The key difference between a regular structured array and a numpy.recarray is that fields in a recarray can be accessed using attribute notation (e.g., recarray.field_name) instead of indexing notation (e.g., array['field_name']). This provides more convenient syntax, but with slightly lower performance compared to structured arrays.
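To illustrate the difference, here is a minimal sketch using hypothetical pose fields (the field names and values are made up for this example):

```python
import numpy as np

# A structured array: one element per tracked point, fields accessed by key.
dtype = [("x", "f8"), ("y", "f8"), ("name", "U16"), ("confidence", "f8")]
points = np.array(
    [(10.0, 20.0, "snout", 0.95), (12.5, 21.0, "tail_base", 0.88)],
    dtype=dtype,
)
xs_by_key = points["x"]  # indexing notation

# Viewing the same buffer as a recarray enables attribute notation.
rec = points.view(np.recarray)
xs_by_attr = rec.x        # attribute notation, same data
first = rec[0]            # a single np.record element
first_name = first.name   # fields of a record are also attributes
```

Note that `.view(np.recarray)` does not copy the data; both notations read the same underlying buffer.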

This is the approach SLEAP takes. We could also directly use or subclass the SLEAP objects.

niksirbi commented 1 year ago

In a discussion with @lochhh, we agreed to first try using the SLEAP data model. The main roadblock for this is SLEAP not supporting recent Python versions. The SLEAP developers suggested using sleap-io, a separate Python package which reimplements their data model and deserialization routines - see this thread.

talmo commented 1 year ago

Just adding some more thoughts on this -- we've gone back and forth a lot on the appropriate data structure for pose data.

SLEAP's object-oriented model is clean and Pythonic (it's basically a bunch of dataclasses), and maps well onto common serialization formats like JSON/YAML/HDF5. It also makes it easy to translate to standardized formats like NWB's ndx-pose. It's also flexible in that you can have variable numbers of instances per frame, and have the ability to link together attributes like tracks/identities or skeletons with individual animal instances.

The downside is that it's not always the most efficient depending on the access pattern. When you're doing labeling, random access creation of a single point or instance is necessary since users label one animal at a time. But imagine repeated serialization/deserialization -- if you have a Python object for every point, you're going to be instantiating hundreds of thousands to millions of little objects!

When you're doing complex queries, it's super inefficient. Consider the use case where you want to ask for all the frames in which there are N animals with body part pairs A and B within distance K of each other. This now requires a full iteration over all T frames (where T >> 1e6 oftentimes), and every instance within the frame, resulting in an O(T * N) operation -- assuming the labels are stored sequentially and not hashed by something else (like in multi-video projects).
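For contrast, a simplified variant of that query (part A and part B of the same animal within distance K) becomes a few vectorized operations if the poses live in a dense array. The `(time, animals, bodyparts, xy)` layout, the random data, and the indices below are all hypothetical, just to sketch the access pattern:

```python
import numpy as np

# Hypothetical dense layout: poses[t, animal, bodypart, coord]
T, n_animals, n_parts = 1000, 3, 5
rng = np.random.default_rng(0)
poses = rng.uniform(0, 100, size=(T, n_animals, n_parts, 2))

A, B, K = 0, 1, 10.0  # body part indices and distance threshold (made up)

# Pairwise A-B distance for every animal in every frame, in one shot.
dist = np.linalg.norm(poses[:, :, A] - poses[:, :, B], axis=-1)  # (T, n_animals)

# Frames where at least one animal satisfies the distance criterion.
frames = np.nonzero((dist < K).any(axis=1))[0]
```

No per-frame Python loop and no per-point objects; the cost is a handful of array passes instead of millions of attribute lookups.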

I think the best of both worlds -- and what we'd eventually like to have in sleap-io -- would be to have a thin object-oriented access layer backed by a pandas DataFrame that has good support for cythonized or otherwise vectorized operations on the backend. Libraries like sqlalchemy achieve this to some extent, allowing for different access patterns via DAO/ORM/CRUD type patterns. Alternatively, just having different backends optimized for different use cases might be cleaner and reduce the abstraction overhead.
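One hypothetical shape such a thin access layer could take (class and column names invented for illustration): lightweight view objects that hold no point data themselves, only a reference into a shared DataFrame backend.

```python
import pandas as pd

class Instance:
    """Hypothetical thin view of one animal instance.

    Holds no coordinates itself; all point data lives in the shared
    DataFrame, so bulk queries can stay vectorized on the backend.
    """

    def __init__(self, df: pd.DataFrame, frame: int, track: str):
        self._df, self._frame, self._track = df, frame, track

    @property
    def points(self) -> pd.DataFrame:
        mask = (self._df["frame"] == self._frame) & (self._df["track"] == self._track)
        return self._df.loc[mask, ["bodypart", "x", "y"]]

# Toy backend with made-up column names and values.
df = pd.DataFrame({
    "frame": [0, 0, 1],
    "track": ["mouse_1", "mouse_1", "mouse_1"],
    "bodypart": ["snout", "tail_base", "snout"],
    "x": [1.0, 2.0, 1.5],
    "y": [3.0, 4.0, 3.5],
})
inst = Instance(df, frame=0, track="mouse_1")
```

The object layer gives readable per-instance access while frame-level or dataset-level queries bypass it and hit the DataFrame directly.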

If you're going down the object-oriented model route, consider using a framework like attrs or plain dataclasses for readability and reducing boilerplate. See also these considerations with regards to performance and usability: [1] [2] [3]

In any case, give it a go for your test cases, benchmark it, and feel free to reach out if you need any feedback or have any for us!

niksirbi commented 1 year ago

Thank you for chiming in on this @talmo. Since this project is still in early development, we are fully open to discussing basic design considerations. We want to choose data structures that will not make our lives difficult down the line.

The SLEAP data model appealed to us precisely because of the flexibility you mentioned (and a desire to not reinvent the wheel), but the performance considerations may indeed become a bottleneck. Not so much for our envisioned alpha product (import, smooth and plot tracks) but definitely for more complex kinematic analyses like the example you mentioned.

I am keen to stay in touch and follow the developments over at sleap-io, given that your team has thought about these issues for much longer than we have.

For now, we will likely try adopting the sleap-io model as is, and implement changes on the backend as things evolve. If the backend approach you end up with is good enough for our needs, we are happy to adopt it. Otherwise, we'll have to design backends tailored to our needs.

Just out of curiosity, have you given Dask much thought? We have benefited from Dask in other unrelated projects, but haven't yet thought through if/how to apply it to pose data. In case you have considered it and think it's a dead end, let us know.

Also thanks for the attrs references, I will read through and reconsider my use of Pydantic.

niksirbi commented 1 year ago

After some research and internal discussions, we decided to try using xarray.DataArray as a backend for pose tracking data.

DataArray is an N-dimensional generalisation of pandas Series.

Multiple DataArray objects can also be put into an xarray.Dataset, aligned along shared dimensions. For example we could create a Dataset corresponding to a collection of videos, with the pose tracks of each video being stored in a separate DataArray object.
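A minimal sketch of what this could look like; the dimension names, keypoint labels, and array shape below are assumptions for illustration, not a settled schema:

```python
import numpy as np
import xarray as xr

# Hypothetical layout: pose tracks as a (time, individuals, keypoints, space) array.
rng = np.random.default_rng(0)
tracks = xr.DataArray(
    rng.uniform(0, 100, size=(100, 2, 3, 2)),
    dims=("time", "individuals", "keypoints", "space"),
    coords={
        "individuals": ["mouse_0", "mouse_1"],
        "keypoints": ["snout", "centre", "tail_base"],
        "space": ["x", "y"],
    },
)

# Label-based indexing: x-coordinate of mouse_0's snout across time.
snout_x = tracks.sel(individuals="mouse_0", keypoints="snout", space="x")

# Multiple videos could share one Dataset, one DataArray per video.
ds = xr.Dataset({"video_0": tracks})
```

Selections are made by label rather than positional index, so code stays readable even if the dimension order changes.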

| xarray Pros | xarray Cons |
| --- | --- |
| label-based indexing | not as widely known as numpy/pandas |
| numpy-like vectorisation and broadcasting | will require some learning for devs |
| pandas-like aggregation + groupby | |
| Dask integration for parallel computing | |

I'll give it a try and see if we can discover some unknown "cons" before we fully commit to it as a backend.

talmo commented 1 year ago

Would definitely recommend xarray over numpy recarray. If using this for prediction results only, then this should work great.

If using it for training data, I'd advise checking out some of the discussions in https://github.com/rly/ndx-pose/pull/9 for workflow-specific considerations. Basically, you may not want to over-optimize for timeseries since most annotation for pose is done in single images that are explicitly not consecutive in time.

niksirbi commented 1 year ago

> Would definitely recommend xarray over numpy recarray. If using this for prediction results only, then this should work great.

Thanks for the input! Most of the things we want to do will operate on the prediction results only. movement is meant for post-SLEAP/DLC analysis, meaning we use already predicted poses as the input.