tee-ar-ex / trx-python

Python implementation of the TRX file format
https://tee-ar-ex.github.io/trx-python/
BSD 2-Clause "Simplified" License

Thoughts on storing data with tabular file formats? #60

Closed: psadil closed this 9 months ago

psadil commented 11 months ago

Thanks for all the hard work on this format! This is a follow-up to a question from the presentation given in the Open Science Room at OHBM (2023). If this is not the right place for the follow-up, please feel free to move it.

I’m speaking as someone who is mainly a user of tractography data, but who also has a fair amount of experience in the data science ecosystem. From that perspective, parts of the TRX format's development look like they duplicate work already done for modern tabular formats. I wonder whether TRX could leverage those tools, particularly Apache Arrow (with Apache Parquet as the corresponding on-disk file format) and DuckDB.

For example, the dps and dpp subdirectories look very similar to a standard Arrow table, with the files in each folder corresponding to columns. Arrow/Parquet has solid support for working with data on disk, allows control over the datatype of each column, supports metadata, and is implemented in several languages (e.g., C++, Rust, Java, Julia, Go). The format also brings several niceties: higher-level packages and language bindings exist for performing analyses (py-polars, ibis, R, MATLAB, JavaScript), most of the high-level packages are built for efficient multi-threading, there is CUDA support, and the Arrow/Parquet format is actively used and tested by a wide community.
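To make the comparison concrete, here is a minimal sketch of why a folder of per-column files reads like a columnar table. It uses only the Python standard library, not Arrow or trx-python. The `<name>.<dtype>` file naming, the `TYPECODES` mapping, and the column names `mean_fa` and `cluster_id` are assumptions made for illustration and are not taken from the TRX specification.

```python
import array
import tempfile
from pathlib import Path

# Hypothetical mapping from a dtype file extension to an array typecode.
TYPECODES = {"float32": "f", "float64": "d", "uint16": "H", "int32": "i"}


def write_column(folder, name, dtype, values):
    """Write one column as a raw binary file named <name>.<dtype>."""
    path = folder / f"{name}.{dtype}"
    with open(path, "wb") as f:
        array.array(TYPECODES[dtype], values).tofile(f)
    return path


def read_table(folder):
    """Read every <name>.<dtype> file in a folder into {name: column}."""
    table = {}
    for path in sorted(folder.iterdir()):
        # The file name carries both the column name and its datatype.
        name, dtype = path.name.rsplit(".", 1)
        column = array.array(TYPECODES[dtype])
        with open(path, "rb") as f:
            column.frombytes(f.read())
        table[name] = column
    return table


with tempfile.TemporaryDirectory() as tmp:
    # A stand-in for a dps-like folder: one file per column, one row per streamline.
    dps = Path(tmp) / "dps"
    dps.mkdir()
    write_column(dps, "mean_fa", "float32", [0.5, 0.25, 0.75])
    write_column(dps, "cluster_id", "uint16", [1, 1, 2])
    table = read_table(dps)

# Every column must have the same number of rows for the folder to act as a table.
n_rows = {len(column) for column in table.values()}
```

The resulting `table` is a mapping from column name to a typed array, which is the same shape of data that a columnar library takes as input. The bookkeeping done by hand here (datatype per column, equal column lengths) is what a tabular format would handle on its own.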

Using something like Arrow adds a dependency, which is a cost. However, I wonder if this may be worthwhile, especially if it could make it simpler to develop and maintain lower-level implementations of the TRX format, like the C++ one.

A modern tabular format could be adopted in different ways. One approach could replace the TRX subfolders with individual Parquet files. Another might link information across the metadata and subfolders through a DuckDB database. I wonder what the initial impressions of this suggestion are, and what kinds of demonstrations or evidence the community might find informative.

Thanks!