tee-ar-ex / trx-python

Python implementation of the TRX file format
https://tee-ar-ex.github.io/trx-python/
BSD 2-Clause "Simplified" License

NPY file format for regular matrix data #21

Open Lestropie opened 2 years ago

Lestropie commented 2 years ago

This idea was going to result in #15 getting peppered with repetitive comments, so I'm going to write it here separately instead.

TRX currently has novel handling of matrix dimensions & datatype for various data files, achieved via file names. When looking through the code in #15 I also see what looks like novel enumeration / single-character encoding of data type. This may be creating a novel solution for a problem for which many solutions already exist.

The NPY format provides an established solution for these issues. Matrix dimensions and data type (including endianness) are encoded in the file header as part of a dictionary literal. I recently implemented C++ support for that format in https://github.com/MRtrix3/mrtrix3/pull/2437. Using this file format as part of the higher-order TRX format would be fairly trivial for Python, and in particular would allow reading / writing of data with no dependence on TRX libraries; for other languages the overhead would be no greater than that demanded by the current specification. There are two potential downsides. First, features such as matrix dimensionality / size and data type would no longer be visible from a filesystem view (though they could be pretty easily seen just using head). Second, memory-mapping implementations would need to support loading from a non-zero offset into a file (which shouldn't be difficult; it's a common operation). But the upsides in terms of not reinventing the wheel may more than offset that.
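To make the two points above concrete (a self-describing header, and memory-mapping from a non-zero offset), here is a minimal Python sketch. It is an illustration only, not part of trx-python: the file name `positions.npy`, the array contents, and the helper `read_npy_header` are hypothetical, and it only handles NPY versions 1.0 and 2.0.

```python
import os
import tempfile

import numpy as np


def read_npy_header(path):
    """Return (shape, fortran_order, dtype, data_offset) for an NPY file.

    Hypothetical helper for illustration; supports format versions 1.0 and 2.0.
    """
    with open(path, "rb") as f:
        version = np.lib.format.read_magic(f)
        if version == (1, 0):
            header = np.lib.format.read_array_header_1_0(f)
        elif version == (2, 0):
            header = np.lib.format.read_array_header_2_0(f)
        else:
            raise ValueError(f"unsupported NPY version {version}")
        shape, fortran_order, dtype = header
        # After the header has been parsed, the file position is where the
        # raw array data begins.
        return shape, fortran_order, dtype, f.tell()


# Hypothetical stand-in for per-vertex data: 5 points, 3 coordinates each.
positions = np.arange(15, dtype=np.float16).reshape(5, 3)
path = os.path.join(tempfile.mkdtemp(), "positions.npy")
np.save(path, positions)

# The start of the file is a magic string followed by a readable dict
# literal, which is what a tool like head would show.
with open(path, "rb") as f:
    first_bytes = f.read(128)

shape, fortran_order, dtype, offset = read_npy_header(path)

# Memory-map the payload, skipping the header via a non-zero offset.
mapped = np.memmap(path, dtype=dtype, mode="r", shape=shape, offset=offset,
                   order="F" if fortran_order else "C")

print(first_bytes[:6], shape, dtype, offset)
```

In Python the same result is available in one call with `np.load(path, mmap_mode="r")`; the explicit version is shown because it mirrors what an implementation in another language would have to do.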

arokem commented 2 years ago

I think that's potentially a good idea. Do you happen to know the state of npy support in other languages? In particular, Javascript and Matlab seem to be relevant.

Lestropie commented 2 years ago

Hadn't looked into it previously. It's a pretty simple format, so I wouldn't expect the amount of code required for compatibility to be huge.

For Matlab this seems to be the leading candidate, but memory-mapping is limited due to row-major vs. column-major ordering compatibility. For Javascript it looks like there are multiple efforts to build a NumPy-comparable library rather than just interfacing with the data format itself.
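The row-major vs. column-major limitation can be illustrated in Python without reference to any particular Matlab library. The sketch below assumes a reader that can only interpret raw bytes in column-major order; the helper `header_and_payload` and the example array are hypothetical. It shows that NPY records the ordering in its header (`fortran_order`), that a row-major payload appears transposed to a column-major reader, and that a payload saved in Fortran order can be interpreted directly.

```python
import io

import numpy as np


def header_and_payload(array):
    """Serialise an array to NPY in memory; return header fields and raw data.

    Hypothetical helper for illustration; assumes np.save emits version 1.0,
    which it does for small, simple headers like the ones used here.
    """
    buf = io.BytesIO()
    np.save(buf, array)
    buf.seek(0)
    version = np.lib.format.read_magic(buf)
    if version != (1, 0):
        raise ValueError(f"unexpected NPY version {version}")
    shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(buf)
    return shape, fortran_order, dtype, buf.read()


points = np.arange(12, dtype=np.float32).reshape(4, 3)

# Default NumPy layout: row-major (C order).
shape_c, fortran_c, dtype_c, raw_c = header_and_payload(points)

# A column-major reader given row-major bytes can only view them as the
# transpose: the dimensions must be reversed to land on the right elements.
seen_column_major = np.frombuffer(raw_c, dtype=dtype_c).reshape(
    shape_c[::-1], order="F")

# Saving a Fortran-ordered copy sets fortran_order in the header, and the
# payload can then be read column-major with the stated shape.
shape_f, fortran_f, dtype_f, raw_f = header_and_payload(
    np.asfortranarray(points))
direct_column_major = np.frombuffer(raw_f, dtype=dtype_f).reshape(
    shape_f, order="F")

print(fortran_c, fortran_f, seen_column_major.shape, direct_column_major.shape)
```

Whether a given reader transposes, copies, or refuses such data is an implementation choice; the point is only that the header carries enough information to make that decision.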

Lestropie commented 2 years ago

Bit of background on this issue listing:

I've been working for an extended period of time on defining diffusion models as part of BIDS Derivatives (https://github.com/bids-standard/bids-bep016), and that is now part of a bigger project that aims to extend the concept to a range of connectivity-based derivatives, including tractography and tractometry.

To me, it might not make sense to define TRX as "one of many possible formats" to be used for tractography data, specifically because it encapsulates concepts like data per vertex / data per streamline / streamline groups and data per group that will be requisite for derivative data standardisation whereas others don't. Trying to add such functionalities to pre-existing formats might end up with a duplication of the effort that's gone into TRX. And it would potentially make the specification unclean if having to explain not only different data formats, but also drastically different scopes of those formats, and ways in which one might need to post hoc adopt concepts of one format to overcome the limitations of others. Having TRX as "the" tractography format for BIDS derivatives might actually be cleaner long-term.

But BIDS Derivatives is long-term going to include all sorts of things, and IMO the more basic principles that can be identified and adhered to from the outset the better. The current TRX mechanism for identifying datatype and number of columns is entirely context-specific. Conversely, I can foresee NPY being potentially applicable to a very wide range of 1D and regular 2D data; tractography would just be one specific use case of that format.

So transitioning to NPY might give TRX a better chance of taking a more prominent position in the BIDS Derivatives context, and might contribute to accelerating software support. It's not my dictatorial decision to make, and it's very hypothetical, but I will be registering the BIDS Extension Proposal ID for tractography some time soon, so this is an impending decision that's on my mind.