tee-ar-ex / trx-python

Python implementation of the TRX file format
https://tee-ar-ex.github.io/trx-python/
BSD 2-Clause "Simplified" License

Data compression #8

Open Lestropie opened 3 years ago

Lestropie commented 3 years ago

Given recent comments in dipy/dipy#2229, and further back in nipy/nibabel#942, I think it is worth dedicating a discussion thread to the issue of data compression.

I think it is fair to say that this particular format proposal has prioritised extensibility and simplicity over compression. But for some users / developers, compression may be the higher priority. So here we can discuss whether the gains in this department from the current TRX proposal are adequate, whether it is possible to bootstrap some kind of explicit data compression into this format proposal, or whether a drastically different proposal would be necessary to satisfy such demands.

Please amend or correct as necessary, and comment as you see fit.

Where compression may be possible in TRX:

  1. Use of float16s for vertex locations -> halves storage over float32 (see the sketch after this list)
  2. No need for delimiters -> one less vertex per streamline compared to delimiter-based formats, though explicitly storing offsets reduces that gain slightly
  3. Compatible with downsampling, i.e. storing fewer vertices per streamline than were used in generating the streamline; this can be a naive integer-factor reduction in vertex count, or based on estimation of error accumulation (e.g. https://www.sciencedirect.com/science/article/pii/S1053811914010635); in either case, for algorithms with small step size, this can be the primary source of storage reduction
  4. Zipping of the full directory structure (albeit, depending on implementation, this may impact IO substantially), though yields here tend to be minimal due to the nature of the data
  5. Support for more dedicated compression strategies as in-place alternatives for vertex position (or indeed DPP / DPS / DPG) data within TRX.
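To make points 1 and 2 concrete, here is a minimal numpy sketch (the array names are illustrative only, not a statement of the TRX spec) contrasting a delimiter-based float32 layout with an offsets-plus-float16 layout:

```python
import numpy as np

# Two short streamlines: 3 and 2 vertices respectively.
streamlines = [
    np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]),
    np.array([[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]]),
]

# Delimiter-based layout (TCK-like): float32 vertices, with a NaN
# triplet appended after each streamline as a separator.
delim = np.full((1, 3), np.nan, dtype=np.float32)
delimited = np.concatenate(
    [np.concatenate([s.astype(np.float32), delim]) for s in streamlines])

# Offsets-based layout: one flat float16 vertex array plus an
# offsets array marking where each streamline begins.
positions = np.concatenate(streamlines).astype(np.float16)
offsets = np.cumsum([0] + [len(s) for s in streamlines[:-1]]).astype(np.uint64)

print(delimited.nbytes)                   # 84: 7 rows x 3 coords x 4 bytes
print(positions.nbytes + offsets.nbytes)  # 46: 5 x 3 x 2 bytes + 2 x 8 bytes
```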

Point 5 may be interesting to explore: there have been a number of manuscripts published explicitly on storage reduction of streamline vertex information, so having the option to use one such format within TRX for the vertex position information would theoretically be possible. It would make the technical specification considerably more complex, though technically one could defer to external specifications for datasets that do so. I'm not a fan of this personally; I'm simply stating that it's a prospect.

In the converse case, i.e. a format designed from the ground up to prioritise compression: a couple of these already exist, but they don't have the extensibility of TRX. Trying to extend the capabilities of compressed streamline data storage could look something like bootstrapping compression into the .trk format, or it could be something entirely novel. It probably falls upon someone who prioritises compression to propose something that still has a chance of satisfying the desires of those prioritising extensibility. But this is likely one of multiple dimensions of difference that caused the original discussion in nipy/nibabel#942 to go slightly astray, so it warrants greater precision here.

frheault commented 3 years ago

This is indeed a very important discussion to have, and I agree with everything you said.

At the SCIL we compress all tractograms using linearization only by default, and I personally worked on it. That's an easy 10x speed-up / size reduction at minimal cost (provided downstream algorithms take it into account).

One additional point: all data per vertex/streamline in TRK is float32, so the size can be divided by a factor of 2 (float16 or int16), 4 (int8), or even 32 (bit arrays). This can add up to a lot.

A linearized float16 tractogram, plus metadata with well-chosen dtypes, can be 20-30x smaller than a typical 0.2mm step size TRK. Whatever ends up being the final specification of the file format, I believe the most important gain will come from reducing streamline data via lossy compression rather than from other format design decisions.

frheault commented 3 years ago

@Lestropie also, I thought a bit about QFib. In a future iteration of TRX (or maybe right now), it could be possible to allow either a positions.3.float16, or an initial_positions.3.float16 together with a sphere.3.float16 and a vectors_indices.uint16.

I don't know if you are familiar with QFib (lossy) compression. Basically, starting from a single point, you step forward using a parcellated sphere: instead of writing the full vector at each step, you write a single integer indexing into the sphere's vectors.

This effectively divides the size of the file by 3, AND it can be combined with linearisation. However, it obviously comes with a slowdown on access, because you have to compute the real positions on-the-fly.
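As a toy illustration of the idea (this is not the actual QFib codec; the sphere construction here is a stand-in, and the arrays correspond loosely to the hypothetical initial_positions / sphere / vectors_indices fields above):

```python
import numpy as np

def fibonacci_sphere(n=256):
    """Quasi-uniform unit vectors (a stand-in for QFib's parcellated
    sphere; the real codec uses its own scheme)."""
    i = np.arange(n)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i
    z = 1.0 - 2.0 * (i + 0.5) / n
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def encode(streamline, sphere):
    """Keep the first vertex; replace each subsequent vertex by the index
    of the sphere direction closest to the actual step direction."""
    steps = np.diff(streamline, axis=0)
    step_size = np.linalg.norm(steps, axis=1).mean()  # assumes fixed step size
    dirs = steps / np.linalg.norm(steps, axis=1, keepdims=True)
    indices = np.argmax(dirs @ sphere.T, axis=1).astype(np.uint16)
    return streamline[0], step_size, indices

def decode(first, step_size, indices, sphere):
    """Re-integrate positions from the quantised directions."""
    return np.vstack([first,
                      first + np.cumsum(sphere[indices] * step_size, axis=0)])

# Synthetic constant-step-size streamline (a helix).
sphere = fibonacci_sphere()
t = np.linspace(0, 4 * np.pi, 200)
sl = np.stack([np.cos(t), np.sin(t), 0.1 * t], axis=1)
first, step, idx = encode(sl, sphere)
print(np.abs(decode(first, step, idx, sphere) - sl).max())  # accumulated error
```

Note how the error accumulates along the streamline as quantised steps are re-integrated, which is the lossy trade-off being discussed.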

I think that could be a nice thing to support in future if people REALLY want smaller files; it could be made compatible with the file format. I would personally recommend this only for data storage or data transfer (and then uncompressing back to a standard positions.3.float16 for manipulation).

Basically, my hypothetical recommendations:

soichih commented 3 years ago

@frheault What exactly do you mean by "linearisation"? Are you referring to simply dropping vertices that can be dropped without sacrificing too much precision?

frheault commented 3 years ago

@soichih Pretty much; it consists of removing 'mostly' collinear points. https://pubmed.ncbi.nlm.nih.gov/25592997/ And, more specifically on linearisation: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5483435/

The idea is that keeping 100 points on a straight line is not worth it if you can drop to 5-10 points within a margin of error of 0.1mm. What is nice about this is that it does not require any change to TRK, TCK, VTK or TRX.
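For illustration, a single-pass toy version of the idea (not the algorithm from the cited papers; `tol` is the allowed deviation in mm):

```python
import numpy as np

def linearize(streamline, tol=0.1):
    """Keep a vertex only if it deviates from the chord joining the last
    kept vertex to the next vertex by more than `tol`.  A toy,
    single-pass version of the idea, not the published algorithm."""
    kept = [0]
    for i in range(1, len(streamline) - 1):
        a, b = streamline[kept[-1]], streamline[i + 1]
        chord = b - a
        # Perpendicular distance of vertex i from the chord a -> b.
        proj = a + chord * np.dot(streamline[i] - a, chord) / np.dot(chord, chord)
        if np.linalg.norm(streamline[i] - proj) > tol:
            kept.append(i)
    kept.append(len(streamline) - 1)
    return streamline[kept]

# 100 samples along a nearly straight segment collapse to a handful.
t = np.linspace(0.0, 1.0, 100)[:, None]
sl = t * np.array([100.0, 0.0, 0.0]) + np.random.default_rng(0).normal(0, 0.01, (100, 3))
print(len(sl), "->", len(linearize(sl)))
```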

soichih commented 3 years ago

Thanks @frheault, that makes sense.

While on this subject though: is there an algorithm that compresses tractography by turning coordinates into a smaller set of bezier/spline coordinates? I think a curve would fit the original data points better, assuming that brain tractograms are normally smooth curves.

Also, do you know of an algorithm that not only compresses each streamline, but compresses across multiple streamlines? As the number of fibers increases, there will be more and more similar streamlines. An algorithm could then apply similarity-based Huffman encoding to avoid storing full vertex information for every streamline. If no such algorithm exists, I'd like to write one over the holiday break for fun.

frheault commented 3 years ago

Regarding an approach based on bezier/splines: I remember a reviewer on my paper telling me to mention it (I don't think I did). While interesting, this drastically changes how people interact with the data; also, streamlines are not always smooth, so spline fine-tuning is harder than one might think. And evaluating the spline to recover 3D points will slow down the conversion. I believe this makes it much less interesting as a possibility.

The QFib approach also slows down access, but at least conversion back to a full 3D point representation is likely faster. And I can imagine a way to make it fit inside the TRX structure with small modifications.

For your last idea, I actually did implement something like that for bundles; it does not work for tractograms and scales very badly: https://github.com/scilus/scilpy/blob/master/scripts/scil_remove_similar_streamlines.py. It's a hard problem; if you have time to actually make it work well at the tractogram level, that would be very nice!

I think an approach using QuickBundles at multiple clustering distances (e.g. ranging from None to 2, 4, 8, 16, 32mm) would create a simple 'multi-resolution' representation that could then be hidden 'easily' in the data_per_streamline, making it easy for the user to load the appropriate/desired resolution. (Of course, I am seeing this from the point of view of TRX, or backward compatibility with TRK.)
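A hedged sketch of what that could look like; the labelling scheme and array names are hypothetical, and only the QuickBundles calls are real dipy API:

```python
import numpy as np
from dipy.segment.clustering import QuickBundles

# Toy tractogram: 100 random-walk streamlines of 20 points each.
rng = np.random.default_rng(0)
streamlines = [np.cumsum(rng.standard_normal((20, 3)), axis=0).astype(np.float32)
               for _ in range(100)]

# Cluster at increasing distances; record, per streamline, the centroid
# index at each threshold.  These small integers could be stored as
# hypothetical data_per_streamline entries, one per resolution level.
labels = {}
for threshold in (2.0, 4.0, 8.0, 16.0, 32.0):
    clusters = QuickBundles(threshold=threshold).cluster(streamlines)
    lab = np.empty(len(streamlines), dtype=np.uint32)
    for k, cluster in enumerate(clusters):
        lab[cluster.indices] = k
    labels[threshold] = lab
    print(threshold, len(clusters))  # fewer clusters at coarser thresholds
```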

I don't know of anything dedicated to this idea. I could easily imagine an algorithm that stores reference streamlines, with the others represented as vectors of deltas/differences. There is a lot of room for great compression algorithms; however, the hard part (if you want it to catch on) is making one that is compatible with TCK/TRK, or maybe a future TRX.
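A purely hypothetical sketch of that reference-plus-delta idea, assuming all streamlines have been resampled to the same vertex count so that deltas are well-defined:

```python
import numpy as np

def delta_encode(streamlines, references):
    """For each streamline (all resampled to the same vertex count), store
    the index of the closest reference plus a float16 residual.  A real
    scheme would also handle flipped point orderings and entropy-code
    the residuals; this is purely hypothetical."""
    encoded = []
    for sl in streamlines:
        dists = [np.linalg.norm(sl - ref, axis=1).mean() for ref in references]
        i = int(np.argmin(dists))
        encoded.append((i, (sl - references[i]).astype(np.float16)))
    return encoded

def delta_decode(encoded, references):
    return [references[i] + res.astype(np.float32) for i, res in encoded]

# Toy data: 3 reference shapes, 1000 streamlines that are small
# perturbations of them (64 vertices each).
rng = np.random.default_rng(0)
references = [np.cumsum(rng.standard_normal((64, 3)), axis=0).astype(np.float32)
              for _ in range(3)]
streamlines = [references[rng.integers(3)]
               + rng.standard_normal((64, 3)).astype(np.float32) * 0.1
               for _ in range(1000)]
roundtrip = delta_decode(delta_encode(streamlines, references), references)
print(np.abs(roundtrip[0] - streamlines[0]).max())  # float16 residual error
```

The residuals are small and narrowly distributed, which is exactly what makes them amenable to aggressive quantisation or entropy coding.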

If you are planning to work on a streamline compression algorithm, I would 'challenge' or 'recommend' that you try to make it compatible with a representation that would 'fit' within the specification of TRK (with metadata) or TRX (which could be a bit more flexible). That way, if it works, it could at least be used right away by those who wish to further compress data.

Lestropie commented 3 years ago

Regarding an approach based on bezier/splines: I remember a reviewer on my paper telling me to mention it (I don't think I did). While interesting, this drastically changes how people interact with the data; also, streamlines are not always smooth, so spline fine-tuning is harder than one might think. And evaluating the spline to recover 3D points will slow down the conversion. I believe this makes it much less interesting as a possibility.

I think maybe this concept needs to be broken into two separate components:

Your response here suggests that you are interpreting the proposal as one of transformation of streamlines from an ordered set of vertices in 3D to some fundamentally different basis, e.g. polynomials in each axis or Fourier coefficients or the like. As you say, this has been done in certain contexts, and I agree that this would be too severe a change in representation to be considered as a core component of something like TRX.

If I instead interpret the original question through my own personal lens:

is there an algorithm that compresses tractography by turning coordinates into a smaller set of bezier/spline coordinates? I think a curve would fit the original data points better, assuming that brain tractograms are normally smooth curves.

In MRtrix3 we have for many years implicitly interpreted streamlines as points on a Hermite spline (publication; see Appendix 3).
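For readers unfamiliar with it, here is a minimal sketch of cubic Hermite evaluation between stored vertices, with tangents estimated by central differences (a Catmull-Rom-style choice; see the linked publication for the exact formulation MRtrix3 uses, including its tension parameter):

```python
import numpy as np

def hermite(p0, p1, m0, m1, t):
    """Cubic Hermite basis: position at parameter t in [0, 1] along the
    segment from p0 to p1, with endpoint tangents m0 and m1."""
    return ((2*t**3 - 3*t**2 + 1) * p0 + (t**3 - 2*t**2 + t) * m0
            + (-2*t**3 + 3*t**2) * p1 + (t**3 - t**2) * m1)

def resample_segment(vertices, i, n=10):
    """Sample the spline between vertices i and i+1, with tangents from
    central differences (requires 1 <= i <= len(vertices) - 3)."""
    m0 = 0.5 * (vertices[i + 1] - vertices[i - 1])
    m1 = 0.5 * (vertices[i + 2] - vertices[i])
    return np.array([hermite(vertices[i], vertices[i + 1], m0, m1, t)
                     for t in np.linspace(0.0, 1.0, n)])

# Densify one segment of a coarsely-sampled arc.
t = np.linspace(0.0, np.pi, 8)
vertices = np.stack([np.cos(t), np.sin(t), t], axis=1)
dense = resample_segment(vertices, 2)  # 10 points between vertices 2 and 3
```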

I think what @soichih is actually reaching for here is: instead of doing a basic linearisation, where consecutive vertices are skipped if they are sufficiently collinear with the previous sample & tangent, re-pose the problem as one of minimising the number of vertices stored for a streamline whilst keeping the distance between the continuous spline inferred from those samples and the continuous spline inferred from the original densely-sampled streamline below some threshold.

This would be a really cool project. It's similar to the concept of linearisation, but it would involve optimising the complete stored representation of each streamline rather than making a single pass through the data retaining or rejecting vertices, and it would be based on an inferred smooth trajectory rather than a piecewise linear one. Since the stored output would still be an ordered set of 3D vertices, existing file formats could still be used; the only condition for correct interpretation of the stored data would be that the receiving software impose the appropriate spline model.
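To make the shape of that optimisation concrete, here is a deliberately naive sketch under strong simplifying assumptions: greedy rather than globally optimal vertex removal, and a sampled one-sided Hausdorff distance as a stand-in for a true spline-to-spline distance. Every name here is hypothetical:

```python
import numpy as np

def spline_eval(vertices, n=8):
    """Densely sample the Catmull-Rom-style spline implied by `vertices`
    (endpoints handled by clamping the tangent stencil)."""
    v = np.vstack([vertices[:1], vertices, vertices[-1:]])
    t = np.linspace(0.0, 1.0, n, endpoint=False)[:, None]
    h = np.hstack([2*t**3 - 3*t**2 + 1, t**3 - 2*t**2 + t,
                   -2*t**3 + 3*t**2, t**3 - t**2])       # (n, 4) basis
    out = [h @ np.stack([v[i], 0.5 * (v[i + 1] - v[i - 1]),
                         v[i + 1], 0.5 * (v[i + 2] - v[i])])
           for i in range(1, len(v) - 2)]
    out.append(v[-2:-1])  # final vertex
    return np.concatenate(out)

def max_dev(ref, cand):
    """Worst distance from any reference sample to the nearest candidate
    sample (a sampled, one-sided Hausdorff distance)."""
    d = np.linalg.norm(ref[:, None] - cand[None, :], axis=2)
    return d.min(axis=1).max()

def greedy_prune(streamline, tol=0.1):
    """Greedily drop interior vertices while the spline inferred from the
    survivors stays within `tol` of the spline through all vertices."""
    ref = spline_eval(streamline)
    keep = list(range(len(streamline)))
    changed = True
    while changed and len(keep) > 2:
        changed = False
        for j in range(1, len(keep) - 1):
            trial = keep[:j] + keep[j + 1:]
            if max_dev(ref, spline_eval(streamline[trial])) < tol:
                keep.pop(j)
                changed = True
                break
    return streamline[keep]

t = np.linspace(0.0, np.pi, 30)
sl = np.stack([np.cos(t), np.sin(t), 0.2 * t], axis=1)
print(len(sl), "->", len(greedy_prune(sl, tol=0.05)))
```

As stated above, the pruned output remains an ordered set of 3D vertices, so it could be stored in any existing streamline format; correct reconstruction just requires the reader to apply the same spline model.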