Open niksirbi opened 2 weeks ago
Is the aim to pick:
Sorry, accidentally opened this issue without a description. I'm in the process of editing it now (it's going to be extensive).
EDIT: Issue description has been updated now.
FWIW @niksirbi I agree with your summary. Although I fully expect to agree with the next person who makes a reasoned argument!
I expect this issue to be less important over time as we gradually support more formats.
Just putting myself here to follow the thread. This will be very relevant to VAME, as we might incorporate movement's standard into VAME's intermediate data steps as well as its data ingestion.
I did a little experiment to test saving a movement dataset to netCDF files, and it works as expected, apart from the fact that attributes have to be made serialisable. The attrs could be sanitised with a thin wrapper, or alternatively we could take care to only define attrs in serialisable formats to begin with.
```python
import tempfile
from pathlib import Path

import xarray as xr

from movement import sample_data

ds = sample_data.fetch_dataset("SLEAP_three-mice_Aeon_proofread.analysis.h5")
print(ds)

# A temporary path to save the data
temp_dir = tempfile.TemporaryDirectory()
temp_dir_path = Path(temp_dir.name)
save_path = temp_dir_path / "saved_data.nc"

# Make all attrs serialisable (for netCDF)
for key, value in ds.attrs.items():
    # Convert Path objects to strings
    if isinstance(value, Path):
        ds.attrs[key] = str(value)
    # Convert None to empty string
    elif value is None:
        ds.attrs[key] = ""

# Save the data to a netCDF file
ds.to_netcdf(save_path)

# Load the saved data from the netCDF file
loaded_ds = xr.load_dataset(save_path)
print(loaded_ds)

# Check that the loaded dataset is identical to the one we saved
assert ds.identical(loaded_ds)

# Clean up
temp_dir.cleanup()
```
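The "thin wrapper" option mentioned above could look roughly like the following sketch. The helper name `sanitise_attrs` is hypothetical (not part of movement's API), and it only handles the two cases that came up in the experiment, Path objects and None:

```python
from pathlib import Path


def sanitise_attrs(attrs: dict) -> dict:
    """Return a copy of ``attrs`` with netCDF-incompatible values converted.

    Path objects become strings and None becomes an empty string;
    everything else is passed through unchanged.
    """
    sanitised = {}
    for key, value in attrs.items():
        if isinstance(value, Path):
            sanitised[key] = str(value)
        elif value is None:
            sanitised[key] = ""
        else:
            sanitised[key] = value
    return sanitised


# Example: mixed attrs as they might appear on a movement dataset
attrs = {"source_file": Path("data.h5"), "fps": 30.0, "time_unit": None}
print(sanitise_attrs(attrs))
```

A wrapper like this could be applied just before `to_netcdf()`, leaving the in-memory dataset's attrs untouched.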
This write-up was prompted by this zulip topic.
The problem
We have so far taken a pluralistic approach to file formats, i.e. we load from and write to multiple formats (as interoperability is at the core of our mission). That said, our existing saving functions are essentially limited to DeepLabCut and SLEAP files, which means we can only save pose tracks + associated confidence scores. We are also in the process of adding support for ndx-pose, but the scope of ndx-pose is also limited to pose estimates (and their associated training data). I think it's high time to decide on a movement-native format for saving our datasets to file.
Requirements for this file format: it should be able to hold any movement dataset - including poses, bounding boxes, their associated metadata, as well as any variables/metrics derived from them (e.g. speed, head direction, etc.) - and movement should be able to load such files back without losing information.
Candidate formats
These are the ones I've thought of so far, feel free to add to this list.
xarray-supported formats: netCDF-4 and zarr
netCDF files are essentially HDF5 with a specific data model (see this paper about the HDF5-netCDF relationship). This format is popular in the geosciences, especially for atmospheric and oceanographic data, but it should support any grouping of scientific arrays with metadata. This would be the easiest and most natural format for us, given that xarray was explicitly built around the netCDF data model, and it offers an in-built to_netcdf() method.
Pros: it maps directly onto xarray objects, and supports compression, chunking, and metadata.
Cons: few people outside the geosciences will know what a .nc file is.
xarray also offers methods to save data to zarr, see existing issue. I won't discuss zarr separately, because its pros and cons are similar to netCDF's. In summary, if we go for an HDF5-like format, it should be netCDF-4, and we might as well offer the zarr option.
Parquet
See existing issue, and the related discussion in the idtracker GitLab.
Apache Parquet is an open-source, columnar storage file format designed for efficient data storage and retrieval. It's supported in Python via the pyarrow library. It supports compression, (probably) metadata storage, and allows for efficient read access. It's favoured by @roaldarbol, who develops animovement. Its cons are the same as for netCDF and zarr, plus we'd have to implement functions for going back and forth between the xarray and the "tidy" representations.
csv
This is the most 'transparent' option: almost everyone (in research) is familiar with it, and users can easily inspect and edit the files without installing any software (a text editor is enough).
Its cons should be obvious from the above discussion: no compression, chunking, or metadata support.
We already sort-of support csv, since save_poses.to_dlc_file() can write DLC-style csv files. But as discussed in the Parquet issue, we'd ideally want a "tidy" dataframe (in pandas) which we can then export to either Parquet or csv format (and which can in turn be read by animovement).
My current take
netCDF should be the default movement-native format, as it would allow us to seamlessly read/write all the info contained within xarray objects, without doing any work (I think - remains to be tried). It should be best for "internal" uses, i.e. writing intermediate derivatives (e.g. filtered data) and loading them later for other analysis steps.
The tidy representation (exported to Parquet) should be the interoperable option: it can be read by animovement (+ maybe idtracker), and is perhaps more intuitive to some people (compared to multi-dimensional labelled arrays). There should also be an option to export the tidy dataframe to csv. Despite csv's inefficiency (and clunkiness when it comes to storing metadata), I expect many users will be eager to just open the file in Excel / Google Sheets and the like.