Tractography Data Format

francopestilli commented 4 years ago

It would be terrific to start a conversation about an agreed-upon data format for tractography. @arokem @Garyfallidis

mrneont commented 4 years ago

Hi-

I am interested in this. I have worked on the tractography tools in AFNI (with Ziad Saad).

I imagine FSL developers would be interested, as would Frank Yeh @frankyeh of DSI-Studio and the developers of MRtrix3.

Thanks, Paul Taylor

francopestilli commented 4 years ago

Yep! We should have this as an open discussion.

frankyeh commented 4 years ago

Love to see a new format standard. TRK file has been given me a lot of headaches and limited a lot of possible extension. DSI Studio will surely support any open standard for tractography.

effigies commented 4 years ago

While I don't necessarily think that a new tractography data format must be constrained by the NiBabel API, @MarcCote did put a fair bit of time and thought into the streamlines API. It might be worth thinking through whether this API is sufficient or if it's missing something.

Part of my thinking is that once a sufficient API is settled on, we can turn that around to quickly prototype more-or-less efficient implementations.

For what it's worth, I recently explored the API a bit and tried to summarize a bit in the following:

https://github.com/effigies/nh2020-nibabel/blob/ef3addf947004ca8f5610f34e767a578c4934c09/NiBabel.py#L821-L911

MarcCote commented 4 years ago

I totally agree with @effigies. Adding more people to the discussion. @frheault @jchoude

francopestilli commented 4 years ago

@MarcCote @frheault @jchoude @frankyeh great! @effigies loading (partial data loading) and efficiency will be critical for the format.

mrneont commented 4 years ago

Trying to include a few other potentially interested people: @bjeurissen @jdtournier @neurolabusc (Not finding an obvious FSL-tracking contact via github ID-- could someone else please help with that?)

mrneont commented 4 years ago

@effigies -- would be great to try that API with a demo.

Some functionality we value is keeping track of tracts as bundles-- if we put in some N>2 targets, we often care about any pairwise connections amongst those as separate bundles, because we are basically using tractography to parcellate the WM skeleton. Does that labeling/identifying of groups of tracts exist there?

effigies commented 4 years ago

The FMRIB contacts I know are @eduff and @pauldmccarthy. They might be able to point us to the right people...

neurolabusc commented 4 years ago

Before creating a new format, it might be worth considering the existing formats and see if any could be enhanced or improved, similar to the way that NIfTI maintained Analyze compatibility while directly addressing the weaknesses. The popular formats seem to be:

BFloat and Bfloat.gz used by Camino.
MRI studio's DAT format.
PDB format is used by CINCH and DTI-Query.
TCK format is used by MRTrix
TRACT format used by AFNI.
TRK format is used by TrackVis as well as dsi studio's compressed TRK.GZ. A nice feature is that this format includes support for bundles as properties (e.g. distinct bundles of tracks) as well as scalars (e.g. different z-scores along the length of a single track).
Legacy VTK format (which have the filename extension .FIB or .VTK).

Unlike triangulated meshes, tractography can not benefit from indexing and stripping, so the existing formats all seem to be pretty similar (all describe node-to-node straight lines, not splines).

I concur with @mrneont that it is nice to have the ability to describe tracks as bundles. I think this makes TRK the most attractive choice (some other formats are not well documented, so they may support this feature).

Perhaps @frankyeh can expand on his frustrations with TRK. What are the limitations? Likewise, perhaps everyone (including @francopestilli who started this thread) can provide a wish list for desired features.

Garyfallidis commented 4 years ago

Hi all,

And thank you for bringing this up. I have to say this a recurrent topic. Every one or two years this re-emerges.

I suggest before you do anything else study what is already developed in the existing API in nibabel.

Marc-Alex and myself worked quite a bit to support all the basic needs. Accessing bundles fast, adding properties etc. Actually we have already implemented a fast version that can load/save tracks to npz (a numpy ) format. Which you can use if you have big data.

For me the main decision which requires feedback from the community is the formatting technology. Do you want to save the end result using json, hdf5, gITF or something else? If we can decide on that. Then we are set. The work to study previous formats is already mostly done at least on my side.

Nonetheless, see also a recent paper for a new format called Trako https://arxiv.org/pdf/2004.13630.pdf

frheault commented 4 years ago

It is important to mention that no matter the file format, the main problems when it comes to standard will remain. Most people have a lot of trouble with TRK because you can mess up the space or the header or both. But the same remains true for vtk/tck, one could always write the TCK wrong (data) and/or provide the wrong nifti as a reference for the transformation.

No matter the new format, the same difficulties will remain. There is a thousand ways to write a TRK wrong, but many write it wrong and read it wrong too and it can work in their software. I think I was added due to my contribution to Dipy (StatefulTractogram). I think that no matter the new format, as long as people can have header attributes such as:

Space (VOX, VOXMM, RASMM)
Origin (of voxel, CORNER, CENTER)
Affine (vox2rasmm or inverse)
dimensions (shape of the diffusion from which the tractogram was computed)
Voxel size (For verification with the affine)
Voxel order (For verification with the affine)

I think I will be happy. For example, in my own code I used @MarcCote API to write an HDF5 format in which the length, offset, and data of one or multiple 'tractograms' are saved. So I can easily read any of these tractograms (I use it for connectomics, also could be used for bundles) and one could achieve the same to read any streamlines in particular. But as long as the attribute listed earlier are available anything can be done after that.

Also, if a new format is added to dipy. If it is statefulTractogram friendly it can easily be converted back to other commonly supported formats (TCK, TRK, VTK/FIB, DPY). If these efforts are made to have more efficient reading for computation, I think there is no problem with supporting more format. If the goal is to reduce confusion in how to write/read the format, I believe that a new format would not/never help. The unstructured nature of tractogram (not grid-like) makes it harder since the header and data are not fixed together when it comes to the spatial coherence.

PS : I personally think TRK is fine, everything is in the header. The problem is the variety of ways people can write/read it wrong. Making support across tools and labs pretty difficult. However, I think a strict approach to reading/write in dipy was beneficial on the long term. Short term, sure maybe half of the user probably hate me on some level, but a think a strict TRK (or at least a TCK always with the nifti that generated it) is superior than a lot of format. Just in term of available info, not for large scale computation visualisation and fancy reading/writing.

arokem commented 4 years ago

At the danger of somewhat rerouting the conversation, I guess since this has come up, this might also be a good time to discuss whether it would be beneficial to "upstream" the StatefulTractogram that @frheault has implemented in DIPY into nibabel. It's been "incubating" in DIPY since April 2019 (https://github.com/dipy/dipy/pull/1812), with a few fixes and improvements along the way (e.g., https://github.com/dipy/dipy/pull/2013, https://github.com/dipy/dipy/pull/1997) that have followed rather closely the continuous improvements that have happened in nibabel. When Francois originally implemented this, we discussed the possibility of eventually moving the SFT object implementation here (see @effigies comment here about that: https://github.com/dipy/dipy/pull/1812#issuecomment-488664324). As someone who has been using SFT quite extensively in my own work, I find it rather helpful (where previously I was struggling with all the things that @frheault mentioned above). So I think that there could be a broader use for it. Just to echo @frheault's comment: a benefit of that would be that you could move between SFT-compliant formats with less fear of messing things up. I guess the question is one of timing and of potential future evolution of the SFT object. What are your thoughts, @frheault, (and others, of course)?

And to bring the discussion back to its start -- @francopestilli -- I am curious to hear: what needs are not currently addressed and prompted your original post here? Are you looking to store information that is not in the currently-supported formats? Or is there some limitation on performance that needs to be addressed?

arokem commented 4 years ago

Oh - just saw the mailing list messages and now understand where this originated (here: https://mail.python.org/pipermail/neuroimaging/2020-July/002161.html, and in other parts of that thread). Sorry: hard to stay on top of everything...

frankyeh commented 4 years ago

Oh - just saw the mailing list messages and now understand where this originated (here: https://mail.python.org/pipermail/neuroimaging/2020-July/002161.html, and in other parts of that thread). Sorry: hard to stay on top of everything...

The 10-million tractogram mentioned in that original thread is quite a challenge, and I am not sure if a new file format would be the answer. Emanuele already achieved 1 min loading time with TRK (pretty impressive).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

jdtournier commented 4 years ago

I'm looping in Rob Smith (@Lestropie) into this conversation, this is something we've discussed many times in the past.

I've not had a chance to look into all the details here, but here's just a few of my unsolicited thoughts on the matter, for what they're worth:

The issue of loading a 10M streamline tractogram into memory is in my opinion independent of the file format - it's about internal in-memory data representation, and as shown by the TRK handling mentioned above, different implementations can handle the same format very differently.
Simplicity: essential it it's to be accepted as a standard. It should be relatively easy to code up import/export routines from any language, without relying on external tooling. As also mentioned by @frheault, there's lots of ways of storing these data wrong, no matter the format, so it's important to minimise any unnecessary complexities in the format, and be explicit about the conventions used. For example, the tck format used in MRtrix stores vertices in world coordinates and has only 2 required elements in its header: a datatype specifier (almost invariably Float32LE), and an offset to the start of the binary data. I can't think of anything else that would be classed as necessary here (though of course in practice there's lots of additional useful information that we want to store in the header).
Space efficiency: these are likely to contain very large amounts of data, and these should take no more space than is strictly necessary to store the information. The type of geometry and/or data layout is known in advance, so I don't think it makes sense to try to use more generic container formats like VTK or HDF5 - these will likely require more space to describe the geometry / layout. Text formats are also likely to be inefficient from that point of view.
Load/store efficiency: loading large amounts of data will take even longer if the data need to be converted, especially to/from text. Ideally it should be possible to read() / write() the data into/out of memory in one go, and even better, memory-map it and access it directly. This implies storing in IEEE floating-point format, most likely little-endian since that's the native format on all CPUs in common use. We could discuss whether to store in single or double precision, but I don't expect there will be many applications where we need to store vertex locations with 64 bits of precision - in fact, I wouldn't be surprised if the discussion goes the other way, with the possibility of using 16 or 24 bit floats instead (though these would require conversion and could potentially slow down the IO).
Independence: I think it's critical that the format is standalone, and independent of any external reference. Having to supply the coordinate system for the tractogram by way of a user-supplied image would in my opinion massively expand the scope for mistakes. I don't mind so much if the necessary information is encoded in the header, as suggested by @frheault above - but I don't see that it adds a great deal to simply storing the data in world coordinates directly. I do appreciate that it probably matches the way data are processed in many packages, where everything in performed in voxel space. In MRtrix, everything is performed in real space, and the fODFs / eigenvectors are stored relative to world coordinates also, so there's no further conversion necessary - I appreciate not all packages work that way. And for full disclosure: we (MRtrix) would have a vested interest here since storing in anything other than world coordinates would probably mean more work for our applications.
Extensibility: we routinely add lots more information in our tractogram headers as the need arises, and I expect there will be many applications where the ability to store additional information more loosely will be useful. A standard format should allow for this, and also allow for additional entries in the header to become part of the official standard if & when their use becomes commonplace.

So that's my 2 cents on what I would like a standard tracrography format to look like. You'll note that I've more or less described the tck format, and yes, there's a fairly obvious conflict of interest here... 😁

However, there's clearly features that the tck format doesn't support that others do, though I've not yet felt the need to use them. The ability to group streamlines within a tractogram is interesting. I would personally find it simpler to define a folder hierarchy to encode this type of information: it uses standard filesystem semantics, allows human-readable names for the different groups, and allows each group to have its own header entries if required. Assuming the header is relatively compact, it also shouldn't take up a lot more storage than otherwise. And it allows applications to use simple load/store routines to handle single tractogram files, and trivially build on them to handle these more complex structures as the need arises. Others may (and no doubt will) disagree with this...

Another issue that hasn't been discussed so far is the possibility of storing additional per-streamline or per-vertex information. That's not currently something that can be done with the tck format, though it may be possible with others. This is actually the main topic of conversation within the MRtrix team. We currently store this type of information using separate files, both because our tck format isn't designed to handle it (probably the main reason, to be fair), but also because it avoids needless duplication in cases where several bits of information need to be stored (this was also one of the motivations for our fixel directory format). For example, if we want to store the sampled FA value for every vertex, we could store it in the file, but what happens if we also want to sample the MD, AD, RD, fODF amplitude, etc.? It's more efficient to store just the sampled values separately alongside the tractogram, than to duplicate the entire tractogram for each measure just so it can reside in the same file. Alternatively, we could allow the format to store multiple values per vertex, but then we'd need additional complexity in the header to encode which value corresponds to what - something that's inherently handled by the filesystem if these values are stored separately. And on top of that, a format that allows for these per-streamline and/or per-vertex information would necessarily be more complex, increasing the scope for incorrect implementations, etc. Again, this is a likely to be a topic where opinions vary widely (including within our own team), but my preference here is again to keep the file format simple and uncluttered, and rely on the filesystem to store/encode additional information: it's simpler, more flexible, leverages existing tools and concepts that everyone understands, and avoids the need for additional tools to produce, inspect and manipulate these more complex datasets.

OK, that's the end of my mind dump. Sorry if it's a bit long-winded...

eduff commented 4 years ago

Probably the best person to bring in regarding tractography formats from FMRIB would be Saad Jbabdi (or possibly Michiel Cottaar @MichielCottaar).

neurolabusc commented 4 years ago

I think @jdtournier did a nice job of describing the tradeoffs, and also agree that IEEE-754 float16 (e.g. GLhalf) probably provides more than sufficient precision, which would improve space and load/store efficiency. Thanks @Garyfallidis for noting the Trako format. It clearly achieves good space and load efficiency. It seems weak on the simplicity and store efficiency metrics - on my machine (which has a lot of Python packages installed) the example code required installing an additional 110Mb of additional Python packages, and the current JavaScript code only decodes data. So, at the moment a promising proof of concept, but not without tradeoffs. @jdtournier also makes an interesting case that perhaps scalars should be separate files (e.g. NIfTI volumes) from the tractography file).

Garyfallidis commented 4 years ago

@jdtournier the support of tck is already available in nibabel and dipy. If your claim is to just use tck then the answer is that many labs are not satisfied with the tck format. If they were fine then we would just use tck.

The effort here is to find a format that will be useful to most software tools. Nonetheless, if you look into the current implementation you will see that the tractograms are always loaded in world coordinates. But the advantage here is that you could have those stored in a different original space in the format. As about storing other metrics I think we still need that information because a) many labs use such a feature, b) if you store the data on other files then you always have to interpolate and perhaps the interpolation used is not trivial. Also you could have metrics that are not related to standard maps such as FA etc. You could have for example curvature saved for each point of the streamline. Would you prefer curvature being saved as a nifti file? That would not make sense right?

Garyfallidis commented 4 years ago

My suggestion to move forward is that @frheault who has studied multiple file formats and found similarities and differences writes down the specifications of the new format and send it over to the different labs and tools for approval and suggestions. It is important to show that in nibabel we have done the required work to study all or at least most that is out there and that the initial effort is coming out with some consensus of some sort. I hope @frheault that you will accept to lead such a task. And also thank you for your tremendous effort to make some sense in this world of tracks. Of course we will need the help of all of us especially @effigies, @matthew-brett and @MarcCote. But I think you are the right person to finally get this done and move on happily as a community.

neurolabusc commented 4 years ago

@Garyfallidis, I agree with your view, that formats face a Darwinian selection, and therefore popular formats are filling a niche. However, your comment that If your claim is to just use tck then the answer is that many labs are not satisfied with the tck format. If they were fine then we would just use tck is the fallacy of the converse. Just because popular formats are useful does not mean that unpopular formats are not useful. Consider the as-yet-not-created format advocated by many in this thread: it is currently used by no one, yet that does not mean it can not fill a niche. It could be that people simply use an inferior format because their tool does not support a better format, the better format is not well documented, or they are not aware of the advantages of a better format. I think we want a discussion of the technical merits of the available formats and the desired features for a format. @jdtournier provides a nice list of metrics to select between formats.

@jdtournier the challenge I have with tck is that I can not find any documentation for it. My support for this format was based on porting the Matlab read and write routines. It is unclear if these fully exploit the format as implemented by mrview and other MRTrix tools.

neurolabusc commented 4 years ago

@jdtournier per your comment Another issue that hasn't been discussed so far is the possibility of storing additional per-streamline or per-vertex information, in TRK-speak these are properties (per-streamline) and scalars (per-vertex). Several comments have noted this as a benefit for the TRK-format. I take your point that voxelwise images (NIfTI, MIF) can provide an alternative method to compute many per-vertex measures, but also @Garyfallidis concern that these can not encode all the measures we might want (e.g. curvature).

Maybe I am naive, but when I explored TrackVis, I thought there was a way to save TRK files that would map MNI world space without knowing the dimensions of the corresponding voxel grid:

vox_to_ras = [1 0 0 0.5; 0 1 0 0.5; 0 0 1 0.5; 0 0 0 1]
voxel_order = 'RAS'
image_orientation_patient = [1 0 0; 0 1 0]
invert_x = 0
invert_y = 0
invert_z = 0
swap_xy = 0
swap_yz = 0
swap_zx = 0

As I recall, I tried this with TrackVis with several artificial datasets that tested the alignment, and this seemed like an unambiguous way to map files nicely.

From the discussion so far, I still see TRK as the leading format available using the metrics of @jdtournier. I concur with @frheault that regardless of format, one of the core issues is explicitly defining the spatial transform.

So my question is, what problems do people have with TRK, and what new features does the field need? Can improved compliance, documentation and perhaps tweaks allow TRK to fulfill these needs?

Garyfallidis commented 4 years ago

@neurolabusc the main problem is speed for the TRK. It take long time to load/save big files. But there are also others. For example limitations on what parameters can be saved etc. @MarcCote and @frheault can you explain?

Garyfallidis commented 4 years ago

Another issue is accessing specific parts of the file. Currently you there is no support for fast access of specific bundles or parts of the tractogram. Another issue is memory management. The trk does not have support for memory mapping or similar. Some of these files are getting too large to load fully in memory and for some applications it is better to keep them in a memory map.

francopestilli commented 4 years ago

Hi Folks. I support the comments @Garyfallidis reported above. As the size of the tractography increase, we need to use a file format that allows partial loading of the data (say percentages of the streamlines).

jdtournier commented 4 years ago

@jdtournier the support of tck is already available in nibabel and dipy. If your claim is to just use tck then the answer is that many labs are not satisfied with the tck format. If they were fine then we would just use tck.

OK, obviously my vague attempts at humour have not gone down well. The main point of my message was to provide a list of the criteria that I would consider important for a standard file format. They happen to be mostly embodied in the tck format, perhaps unsurprisingly, and I'm being upfront about the fact that this is likely to be perceived as a conflict of interest - which clearly it has anyway.

I'm not arguing that tck should become the standard, and clearly the fact that there's a discussion about this means that at least some people don't think it should be either. That's fine, but since I've been invited into the discussion, I thought I'd to state my point of view as what such a format should look like. And yes, I have a problem in articulating that without looking like I'm arguing for the tck format, precisely because the considerations that went into its original design 15 years ago are still in my opinion relevant today.

Nonetheless, if you look into the current implementation you will see that the tractograms are always loaded in world coordinates.

But that's a matter of the software implementation, not the file format, right? Perhaps I'm getting confused here, but if we're discussing a new standard file format for tractography, then it should be independent of any specific software implementation or API. This discussion is taking place on the nibabel repo, which is perhaps why we're getting our wires mixed up. I don't wish to belittle the massive efforts that have gone into this project, but I'd understood this discussion to be project-independent.

But the advantage here is that you could have those stored in a different original space in the format.

I understand that, and I can see the appeal. I can also see the overhead this imposes on implementations to support multiple ways of storing otherwise equivalent information. This is why I would argue, on the grounds of simplicity, that a standard file format should adopt a single, standard coordinate system. Otherwise we'll most likely end up with fragmentation in what the different packages support: some will only handle one type of coordinate system because they haven't been updated to support the others, and will hence produce files that other packages won't be able to handle because they only support a different coordinate system. We could of course mandate that to be compliant, implementations should support all allowed coordinate systems, but I don't think this is necessarily how things would work out in practice. And we can provide tools to handle conversions between these so that these different tools can interoperate regardless, but I'm not sure this would be a massive step forward compared to the current situation.

On the other hand, I appreciate that different projects use different coordinate systems internally, and that therefore picking any one coordinate system as the standard will necessarily place some projects at a disadvantage. I don't see a way around this, other than by your suggestion of allowing the coordinate system to be specified within the format. I don't like the idea, because this means we'd effectively be specifying multiple formats, albeit within the same container. But maybe there is no other way around this.

As about storing other metrics I think we still need that information because a) many labs use such a feature, b) if you store the data on other files then you always have to interpolate and perhaps the interpolation used is not trivial.

OK, there's a misunderstanding here as to what I was talking about. First off, no argument: the many labs that need these features include ours, and we routinely make use of such information. But we don't store it as regular 3D images, that would make no sense in anything but the simplest cases. It wouldn't be appropriate for fODF amplitude, or for any other directional measure, or curvature, as you mention.

What I'm suggesting is that the information is stored as separate files that simply contain the associated per-vertex values, with one-to-one correspondence with the vertices in the main tractography file, in the same order. This is what we refer to in MRtrix as a 'track scalar file' - essentially just a long list of numbers, with the same number of entries as there are streamline vertices. We routinely use them to encode per-vertex p-value, effect size, t-value, etc. when displaying the results of our group-wise analyses, for example.

We also use separate files for per-streamline values (used extensively to store the weights for SIFT2), and these are also just a long list of numbers, one per streamline, in the same order as stored in the main file - and in this case, stored simply as ASCII text.

I'm not sure the specific format we've adopted to store these values is necessarily right or optimal in any sense, I'm only talking about the principle of storing these associated data in separate files, for the reasons I've outlined in my previous post: in my opinion, it's more space-efficient, more transparent, and more flexible than trying to store everything in one file.

I should add that storing the data this way does introduce other limitations, notably if the main tractography files need to be edited in some way (e.g to extract tracts of interest from a whole-brain tractogram). This then requires special handling to ensure all the relevant associated files are kept consistent with the newly-produced tractography file. This type of consistency issue is a common problem when storing data across separate files, and I'm not sure I've got a good answer here.

In any case, I've set out my point of view, and I look forward to hearing other opinions on the matter.

frheault commented 4 years ago

@neurolabusc I think the problem of TRK is related to efficiency when it comes to large datasets. Selecting only a subset is not optimal (especially if you want a random one), reading is slow, controlling the size of the float (float16/float32/float64) is not possible. When doing connectomics it is impossible to have a hierarchical file that allows the saving of streamlines connecting pair of regions (same logic for bundles), and a personal one is that the header is too complex for most people.

@Garyfallidis Despite all the flaws of tck/trk/vtk people have been using it for more than a decade, I think a first iteration should be as simple as possible. A hierarchical hdf5, readable by chunk using data/offset/length (you read offset/length and then know what data to read, then you reconstruct streamlines as polyline), that can append/delete data in-place, support data_per_streamline and data_per_point (and data_per_group if it is a hierarchical hdf5) with a statefulTractogram compliant header, with a strict saving/loading routine to prevent error.

@jdtournier I don't know if you are familiar with the data/offset/length approach in the ArraySequence of Nibabel. But it is a very simple way to store streamlines in 3 arrays of shape NBR_POINTx3, NBR_STREAMLINE, NBR_STREAMLINE, which I have used in the past with memmap and hdf5 to read quickly specific chunks or do multiprocessing with shared memory. Reconstructing it into streamlines is efficient since the point data is mostly contiguous (depending on the chunk size)

Bonus, I think hdf5 can be specified its own datatype for the array, so using float16 could be achieved and so reducing the size of file. Also matlab and c++ have great hdf5 libraries to help with the reading.

Finally, I agree that storing metrics per point would make an even bigger tractogram, but allowing the use of data per point and per streamline will likely facilitate the live of a few while the other can simply do it their way. I also agree that the way data should be written on disk should be world space (rasmm) as tck, that should be the default. But have the info to convert to tck/trk easily an so on. Leaving compatibility intact for a lot of people.

Your list Simplicity, Space efficiency, Load/store efficiency, Independence, Extensibility is crucial to think about. I think the header would be a much more simple than trk, but slightly more info than tck. I would go for the 4-5 attributes I mentioned earlier, that would be a sweet spot between Simplicity and Independence. As for Extensibility, since hdf5 is basically like a gigantic hierarchical dictionary as long as the mandatory keys are there. Adding more data could be done easily, more header info or even keeping track of processing would be possible (if wanted) like in .mnc file.

However, except for the switch to float16, I think reading/writing is kind of bound to its current limit. Supporting chunk read/write or on-the-fly is nice, but that would not change the speed of writing/reading of a whole tractogram.

jdtournier commented 4 years ago

@jdtournier the challenge I have with tck is that I can not find any documentation for it.

That's unfortunate, it's documented here. If you'd already come across this page but found it insufficient, please let me know what needs fixing!

neurolabusc commented 4 years ago

@Garyfallidis I agree TRK is inherently slow to read. Interleaving the integer "Number of points in this track" in the array of float vertices is a poor design. Much better performance could be achieved if the integers were stored sequentially in one array and the vertices were stored in their own array. One could load the vertices directly to a VBO. This would also allow fast traversal of the file, addressing your second criticism. Both would improve the Load efficiency metric.

@frheault adding support for float16 would improve Space efficiency and Load efficiency metrics. Not sure the use case for float64 for vertices, but would be nice for scalars. I also take your point that the header and spatial transforms could be simplified, improving the Simplicity metric.

While hdf5 has some nice wrappers for some languages, the format itself rates very poorly on the Simplicity metric. I think there are clear criticisms of the complexity. This would introduce the some of the same complications as TRAKO, without the space efficiency benefits of TRAKO. It is odd to criticise the TRK format as complex when it is simply described on a short web page, and then advocate the HDF5 format.

@jdtournier thanks for the Documentation. So the Matlab read/write reveal the full capability. TCK is a minimalistic, elegant format that certainly hits your Simplicity metric, but I can see why some users feel it is too limited for their uses.

This sounds like real progress is being made on the features that are desired, and worth the cost of implementing a new format.

francopestilli commented 4 years ago

One additional comment here @MarcCote , @frheault and @jchoude

what is the status of this work? https://www.sciencedirect.com/science/article/abs/pii/S1053811914010635

Should we discuss that here?

frheault commented 4 years ago

@neurolabusc I think the 'complexity' of trk is from the fact that there is a thousand ways to mess up the spatial transformation. Half of the trk I see have an identity affine rather than the affine of the nifti it was generated from. And numerous elements of the header are simply ignored in various software. But I personally think it is fairly simple ONCE you learn how to write the header properly.

But conceptually and the way data is written, yes trk is fairly simple. tck is even simpler. The suggestion for an hdf5 is, of course, a lot more complex, but this is required (more complexity, not hdf5 per se) if we want to support chunk/partial reading/saving and the possibility to split a tractogram in subcomponent (connectomics, bundles, clusters, etc.) I would never use an hdf5 to replace trk/tck if it was doing a similar thing.

But I think it is necessary for the added functionality. An hdf5 or some kind of hierarchical memmap (I need to investigate that further) is required (I believe) for the requested partial loading/saving and subcomponent division.

@francopestilli 90% of the compression was achieved using linearization, which is compatible with any format supporting polyline (trk/tck/vtk). I think it is not necessary to include fancy compressing in the file format when 90% of the reduction comes from remove (almost) colinear points. This was mentioned in https://www.frontiersin.org/articles/10.3389/fninf.2017.00042/full

I am personally sold to the trk (or the combo tck/nifti), but to be fair in our lab we never go above 10M streamlines (that are compressed, so they take up as much space as 1M) and we do not do visualisation until bundles (or at least a lot of filtering), so this was not a issue. Until connectomics, which force me to create a hdf5 like I described earlier.

neurolabusc commented 4 years ago

@frheault your suggestion for features seems compelling:

Improve spatial transforms of TRK
Improve extensibility (including the desired but limited features of TRK, e.g. the property_name and scalar_name fields are both limited to just 10 items).

I would strongly discourage the use of HDF5. While the format might seem simple with the wrapper you are using, it will hinder adoption to other environments. For performance, I would keep the raw data format as close to the GPU internal storage as possible, so to load the whole data you can bind it straight to a VBO. HDF5 does not improves chunk/partial reading versus a clean datatype that can be read directly from disk to memory. The key: value pairs of TCK provide a nice way to provide Extensibility without resorting to HDF5. I find the criticism by Cyrille Rossant compelling because their team embraced HDF5 for years. A specification that requires 150 pages provides a lot of opportunity for different implementations. I have spent too much of my career supporting interpretations of the complicated DICOM standard to feel comfortable with such a complicated standard for such a simple format where there seems to be so much agreement regarding the features we want.

I would advocate a specification like my mz3 format, which tries to emulate the minimalist style of Paul Bourke:

Entire format described in a single web page.
Format disk memory maps 1-to-1 to data storage of GPU and statistics.
Example validation files in the format that demonstrate the features of the format.
Example code in popular languages, e.g. JavaScript, Python (Blender Plugin) and Matlab that demonstrate the format.
Links to open source projects in other languages that support the format.

jdtournier commented 4 years ago

TCK is a minimalistic, elegant format that certainly hits your Simplicity metric, but I can see why some users feel it is too limited for their uses.

🤣 yes, it could be argued that I value Simplicity a bit too much...

@jdtournier I don't know if you are familiar with the data/offset/length approach in the ArraySequence of Nibabel. But it is a very simple way to store streamlines in 3 arrays of shape NBR_POINTx3, NBR_STREAMLINE, NBR_STREAMLINE, which I have used in the past with memmap and hdf5 to read quickly specific chunks or do multiprocessing with shared memory. Reconstructing it into streamlines is efficient since the point data is mostly contiguous (depending on the chunk size)

I wasn't, but that sounds like a perfectly sensible way to arrange the data for processing. Though I'm not sure it needs to be part of the file format itself. But it's certainly appealing for random access, etc. And as @neurolabusc mentions:

Much better performance could be achieved if the integers were stored sequentially in one array and the vertices were stored in their own array. One could load the vertices directly to a VBO.

This would be a distinct advantage for rapid visualisation.

However, there is one issue with this way of storing streamlines on file: it's great for reading, but not so appealing for writing, particularly when appending to the data on the fly, since you need to grow multiple arrays concurrently. So the application will need to know in advance how many streamlines it will be generating, or it will need to hold all the data in memory until that information is known before committing to disk, or come up with some other mechanism to deal with this. This may not affect all implementations equally: in MRtrix, most streamline processing (including the tracking process itself) uses a streaming approach, we never (or very rarely) need to hold all of the data in RAM concurrently. So streamlines data are committed to file fairly regularly (in >16MB chunks by default), without necessarily knowing how many will eventually end up in the file. This allows these applications to process very large files with a small memory footprint (it also brings the advantage that processing can be interrupted without invalidating the data that has been committed so far). It would be difficult to achieve this if multiple arrays need to be appended to concurrently.

But I get the fact that this would allow rapid random access to arbitrary streamlines without the need to read a lot of data from (potentially very slow) storage unnecessarily. I'd like to find a solution that allows both random access and the kind of stream processing we routinely use - but I'm not sure it's going to be simple to achieve without complicating the format more than it needs to be...

effigies commented 4 years ago

Not doing diffusion, this is all pretty abstract to me, but I agree that random access + streaming appends is going to be hard to achieve in a single file format without some kind of allocation table or B-tree, which would require a coordination layer along the lines of libhdf5. I see two approaches that stop short of HDF5:

1) A good random access format with optional streamed frames appended. This could be defragmented offline, to ensure efficient access to the newly appended data.

2) Two complementary formats. The metadata would be identical, but the data layout would be optimized for reading or streaming.

Both of these would define more of an object protocol and have two ways to fetch a given piece of data. In case (1), you can quickly determine whether you are able to look up the requested data in a large, initial frame or follow a chain of appended frames. In case (2) the fetch strategy depends on which format you're looking at.

effigies commented 4 years ago

On reflection, I'lI add a variation on option 1:

Split the data and header into two separate files, append frame headers to the header file and frame data to the data file. The need to defragment would probably remain, but the headers would not be intersperse within the data array. And accessing data might require reading a chain of frame headers, but you're likely to be able to do it with fewer cache misses.

frheault commented 4 years ago

Question for people using Matlab or c++ in general, is anyone aware of libraries in their own language that 'easily' support the Advanced Scientific Data Format (ASDF, https://asdf.readthedocs.io/en/2.7.0/)?

I think for c++ it is ok (https://github.com/asdf-format/asdf-cpp), but there is nothing in Matlab directly. It is a file format mentioned by Cyrille Rossant on his blog as a superior alternative to hdf5, and seems to be considered an improvement over older iteration of the similar framework.

It supports the dictionary-like structure (easy to extend using keys/tags, and fairly easy to use), allows for hierarchical datasets (if desired), allows for reading by chunk (memmap), allows for writing on the fly (using Stream).

I have not compared in-depth, but it does seem to overcome some limitations of hdf5 mentioned earlier while providing an easy framework to have an equivalent to trk/tck that can be scaled to large datasets or divided into hierarchical subcomponent. The header/metadata/data can be read independently of course.

Before doing in-depth comparison I was wondering if this kind framework made sense to people. Without being that one specifically, I wanted to know if this was the kind of framework I should look for. In less than an hour, I prototyped a on-the-fly writer for streamline, random individual streamline getter, hierarchical division (I save 30 bundles into a single file and then re-saved them separately)

Of course, a pretty significant amount of work would be needed to wrap such a framework for tractography, since this was made for another field (https://www.sciencedirect.com/science/article/pii/S2213133715000645), but I think such framework would be interesting with the majority of the work already done.

PS : For the one familiar with the StatefulTractogram, this would easily be compliant with what we are used to. However, in the case of hierarchical division a wrapper would have to know how to "reconstruct" the component before doing the processing (same logic for a conversion to tck/trk/vtk). Fairly easy to plan, just important to know.

neurolabusc commented 4 years ago

@effigies I do worry about file formats that use multiple files (e.g. file.hdr/file.img). It makes it easy for a user to accidentally lose one file from a pair or rename them so the association between files is lost. A further issue is with macOS (and perhaps future operating systems that develop sandboxing). With macOS when a user selects the file 'file.hdr' via drag-and-drop or the open file dialog, the application does not necessarily have the permission to read 'file.img', as the user did not explicitly associate that file with the invoking application (as defined by the File System Access). This is a concern I also raised with the BIDS sidecars. However, the BIDS sidecar leveraged the existing NIfTI format. Here we are discussing a unique format from scratch. While it is a clever idea for a program to write multiple temporary files while it generates tracks, so each can be written to independently, I am would not be in support of this as a format (and also not sure if there is a disk contention penalty). In my proposed format (below), if a developer wanted to pursue this approach, they would simply concatenate the files when done with streamline creation.

@frheault and @Garyfallidis can you share some datasets (e.g. Google Drive, DropBox, etc) that illustrate the slow nature of TRK. I suggest we quantify the penalty. If the penalty for reading a file where the indices are interleaved with the vertices is small relative to when they are blocked, it would provide support for @jdtournier method to allow fast writes where the number of streamlines is unknown. I wonder if this penalty is larger for languages like Matlab and Python relative to compiled languages.

I recognize there may be a benefit for storing indices in the same stream as vertices for writing during streamline generation. However, I would suggest that while this might be useful as an internal format for stream generation (e.g. TCK), we would still want a format that optimizes fast reading and random access. My sense is streamline generation will be done once when the file is created. On the other hand, fast reads will benefit subsequent reads for the many other steps applied to the file (visualization, adding properties, adding scalars, computing statistics). My vote would be for a format that look like this:

 [extensible header]
 [streamline lengths NBR_STREAMLINE integer]
 [vertex position NBR_POINTx3 float ]
 [(optional) per-streamline property 0 NBR_STREAMLINE float or integer]
 [(optional) per-streamline property 1 NBR_STREAMLINE float or integer]
...
 [(optional) per-vertex scalar 0 NBR_POINT float or integer]
 [(optional) per-vertex scalar 1 NBR_POINT float or integer]
...

effigies commented 4 years ago

@neurolabusc To be clear, option 3 was not suggesting many files; just two. The header file would be a concatenation of frame headers and provide indices into the data file, which can also be appended to. But your point is taken with the brittleness of multi-file approaches.

I agree that a benchmark dataset would be very useful for focusing the discussion.

frheault commented 4 years ago

@neurolabusc a file format written in such a way makes it impossible to read a chunk (except at the beginning) or to allows for hierarchical division.

I think that trk and tck will remain vastly use for a long time, for most use case I believe their functionality and speed is more than enough. However, I think that the 'fancy' big data feature mentioned at the beginning of the thread would require a new way of writing tractography data. Outside of these niche applications, I think trk/tck will prevail another decade 😄

neurolabusc commented 4 years ago

@frheault the format I propose has fixed size for each section, so you have fast random access to read data as blocks if the desired data is saved contiguously. This fully exploits the caching of modern IO. Consider a volume with 100 streamlines, and property 0 defines fibers 0-33 to be part of bundle 1, 33-40 to be part of bundle 2, etc. You can now read these. contiguous segments from the vertex position and per-vertex scalars.

Perhaps I am with @jdtournier in valuing simplicity too much. It sounds like what you desire exists in TRAKO, it provides fast reads and efficient storage, leveraging existing well-tested, open source, clever but necessarily complex formats (Draco compression, glTF). What are the technical reasons for preferring HDF5 over TRAKO? My own view is that TRK is very close to what people said they want, so we should design something that is similarly simple but addressing its weaknesses.

Can you elaborate on the 'fancy' big data feature mentioned at the beginning of the thread. Based on all the comments so far, I was thinking we could use something similar to an extensible TRK with an improved description of the spatial coordinates. My assumption was largely based on your previous comment I personally think TRK is fine, everything is in the header. I am not trying to be argumentative, I just have not heard the requirement that requires complicated when simple seems to address everyones concerns.

bjeurissen commented 4 years ago

I am a huge proponent of storing the tractogram information in world coordinates.

I really feel a tractogram should be meaningful by itself without having to rely on a corresponding reference image.

Loading an anatomical data set and a tractogram, only to find out that the tracks were saved in voxel coordinates and can't really be used without the correct reference image, can really ruin my day :)

It is so convenient to have a single tractogram file that has no bearing with any reference image and load it on top of whatever contrast you want: anatomical, diffusion, MRI, cropped versions of your images etc...

Any format that stores tracks in voxel-coordinates will require more files (reference + tractogram + target) to accomplish that unless that format would also store the reference header info inside the tractogram. But that kind of beats the purpose and a much more simple and elegant solution would be to store the tractogram directly using world coordinates (in which case you only need tractogram + target).

For software that relies on the availability of streamlines in voxel coordinates, those can always easily be obtained by moving the world coordinates (from the tractogram) through the world-to-voxel transform of the target (which is always available).

Garyfallidis commented 4 years ago

@bjeurissen and all it is easy to see why this would be a bad idea even in world coordinates. Just make a 3d viewer and try to change the tractogram from neurological to radiological convention coordinate system in world coordinates. Then you will see that you really need information from the reference image to move from one coordinate system to the other. That could be easily be saved in the track format. Bat if you do not have that information you simply cannot do the transformation correctly.

matthew-brett commented 4 years ago

@Garyfallidis - I'm never quite sure what people mean by 'neurological' and 'radiological' - https://nipy.org/nibabel/neuro_radio_conventions.html - can you explain more?

bjeurissen commented 4 years ago

@bjeurissen and all it is easy to see why this would be a bad idea even in world coordinates. Just make a 3d viewer and try to change the tractogram from neurological to radiological convention coordinate system in world coordinates. Then you will see that you really need information from the reference image to move from one coordinate system to the other. That could be easily be saved in the track format. Bat if you do not have that information you simply cannot do the transformation correctly.

I fail to see why this would be a problem... Those kind of problems are exactly those that would stem from using voxel coordinates instead of world coordinates.

Garyfallidis commented 4 years ago

Ben, I mean that you need to know the volume_sizes to make these transformations from RAS to LAS (or LPS) even in world coordinates. Otherwise the transformations will not be accurate.

I hope Ben when we say world coordinates we mean the same thing. In world coordinates the origin is somehow in the center of the brain. Do you agree? Now the question is do you keep volume size either in world or native coordinates or you completely discard this information?

Also we should think about the issues with interpolation between native and world coordinates. Some users may want to see their data at their original coordinates without being interpolated. Especially for methods development this make sense especially when you have datasets with anisotropic voxel size.

Garyfallidis commented 4 years ago

Let me say explain that if the origin in world coordinates it is not exactly on the center of the image which it will not be, you need to volume dimensions to translate correctly the image after any flip.

jdtournier commented 4 years ago

I think there's confusion with what is meant by various terms. Even radiological/neurological is difficult to nail down exactly - for instance, in FSL, they often use the term to denote a left or right handed coordinate system respectively (I think I've got them the right way round...), which is basically what the qfac term in NIfTI encodes (the sign of the determinant of the transform). It's clear enough when talking about LAS or RAS images, but it gets messier (and much less intuitive) when talking about images acquired as sagittal or coronal slice stacks. What is clear is that a world->voxel transformation is always relative to the image under consideration (whether the latter is stored 'radiological' or 'neurological'), and for that you will indeed always need the information about the reference image, including as you say the image volume sizes.

But as far as I am concerned, the terms have no bearing on world coordinates: these are well-defined and completely independent of neurological/radiological viewing conventions. The way I see it, radiological/neurological refers to the orientation of the camera relative to the object being shown, and as such is a property of the software being used to display the data. For neurological, the camera looks at the subject from behind (the subject's own point of view) so that patient left is camera left. For radiological, the camera looks at the subject from the front (the physician's point of view). The main point here is that the position of the camera (i.e. radiological/neurological) should be considered as entirely independent of the convention used to store the data. In fact, I would go further and suggest that the terms radiological/neurological should not be used in the context of data storage - they introduce too much scope for confusion (e.g. what does it mean for an image acquired as a stack of sagittal slices if it's stored as 'neurological' or 'radiological'? How come I can display a 'neurological' image assuming a radiological convention? And so on - these have all come up in discussions I've had with users when trying to explain these things).

So in short, I don't see any circumstances where a transformation between radiological and neurological world coordinates should be considered a legitimate operation. But I might be missing something here, please clarify if I've got the wrong end of the stick...

matthew-brett commented 4 years ago

Is there really a good case for enforcing that the points be stored in real coordinates? This could be inefficient in space, if, for example, an int16 could store the voxel coordinates exactly. What is the argument against at least allowing storage as voxel coordinates, with an associated affine, and possibly a bounding box?

jdtournier commented 4 years ago

I'd like to second @neurolabusc's rough outline for a format in this post - this is more or less what I would have in mind here. It's easy enough to explain the format directly, without relying on 3rd party libraries to handle its internals. It can readily be memory-mapped (though I would additionally take steps to ensure word alignment if possible), and if the initial header/index sections are sufficient, allows for rapid random-access.

I would add a couple of changes though:

I'd store the offsets to the streamlines data, rather than their lengths. Lengths can trivially be worked out from consecutive offsets, but offsets can only be worked out by a cumulative sum.
I'd make sure there was clear indications about what additional information was present in the data set: what type (per-streamline or per vertex), and a human-readable label.

On that last point: the additional data information could be stored in the initial header, but I would propose as an alternative that all the information about the additional data (as well as the additional data themselves) be stored as as semi-independent chunks appended to the file. Each of these chunks would consist of:

a type indicator (per-vertex/streamline)
its human-readable label
the data array.

This would allow 'simple' readers to stop as soon as they've read all the streamlines with no modifications. More advanced readers would only need to check for the presence of additional chunks if the file size is less than the offset to (the end of) the last streamline, and could then trivially enumerate and handle the additional data - all with relatively minimal IO since there should be enough information in each chunk's (small) header to seek to the next one, with no need to load the data themselves. This would for instance allow appending additional bits of information to an existing tractography file with no modification to its original contents (a huge plus in my opinion).

It still wouldn't allow data streaming, but I really can't see how we can marry streaming with random access without making things way more complicated than is reasonable. I think we'll need to come up with a different standard of our own for streaming. In case anyone is interested, what we're currently discussing within the MRtrix team is the possibility to stream data of this nature through Unix pipes, which means we'd need to be able to pass an individual streamline's per-streamline and per-vertex values concurrently (so the current format suggestion wouldn't work at all). But realistically, that constitutes an IPC protocol rather than a file format as such, so it's best left out of this discussion...

jdtournier commented 4 years ago

Is there really a good case for enforcing that the points be stored in real coordinates? This could be inefficient in space, if, for example, an int16 could store the voxel coordinates exactly. What is the argument against at least allowing storage as voxel coordinates, with an associated affine, and possibly a bounding box?

Maybe I'm getting the wrong end of the stick here, but I've not come across any modern tractography package that stores streamline vertex positions as integer voxel indices (?). Even if stored in voxel coordinates, everyone uses floating-point anyway, right...?

nipy / nibabel

Tractography Data Format #942