Very slow and memory hungry tractogram loader (TRK) #943


emanuele commented 4 years ago

At least for tractograms in TRK (TrackVis) format with a large number of streamlines (>1 million), loading streamlines with dipy.io.streamline.load_tractogram() is extremely slow and requires an unusually large amount of memory. Example: for a tractogram of 10 million streamlines, loading takes:

grlee77 commented 4 years ago

> if we previously save the tractogram as a numpy array (numpy.save()) and then load it with numpy.load(). But in this case, the streamlines must be resampled to a fixed number of points.

The following blog post has a suggestion for how to store ragged arrays (aka jagged arrays), that should fit the case where not all streamlines have the same number of points. It basically would involve concatenating all streamlines along the dimension that varies in length and recording the offsets to each streamline in a separate index list. These two arrays (concatenated and offsets) can then be stored in a single .npz file via np.savez. There is a concrete example in the following blog post: https://tonysyu.github.io/ragged-arrays.html#.XyRsJBF7kr4
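
In rough terms, it would look like this (a minimal sketch of the blog post's idea; the file name and array keys are illustrative, not an existing API):

import numpy as np

# Three ragged "streamlines" with different numbers of points.
streamlines = [np.random.rand(n, 3).astype(np.float32) for n in (10, 25, 7)]

# Concatenate along the ragged axis and record where each streamline starts.
data = np.concatenate(streamlines, axis=0)
offsets = np.cumsum([0] + [len(s) for s in streamlines])  # n_streamlines + 1 entries

np.savez('streamlines.npz', data=data, offsets=offsets)

# Reload and slice out a single streamline without touching the others.
with np.load('streamlines.npz') as npz:
    data, offsets = npz['data'], npz['offsets']
    second = data[offsets[1]:offsets[2]]  # shape (25, 3)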

If the overall memory use becomes a concern, it would be possible to use xarray + Dask to allow indexing into the concatenated array without having to load the whole thing in memory at once.
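
For instance (a sketch using Dask directly for brevity, and assuming the concatenated points were saved as a plain .npy file, which supports memory mapping, rather than inside an .npz archive; 'streamlines_data.npy' is a hypothetical name):

import numpy as np
import dask.array as da

# Memory-map the concatenated points; nothing is read into RAM yet.
points = np.load('streamlines_data.npy', mmap_mode='r')

# Wrap in a Dask array so slicing and reductions run chunk by chunk.
lazy = da.from_array(points, chunks=(1_000_000, 3))
subset = lazy[10:35].compute()  # reads only the rows backing this slice
mean = lazy.mean().compute()    # streams through the file in chunks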

effigies commented 4 years ago

We currently have the ArrayProxy construct, which just records the offset, dtype and shape, and will pull data when requested. By a similar principle we could have an ArrayProxySequence which just pins multiple ArrayProxys to different offsets in the file, or an ArraySequenceProxy, which indexes the entire file and acts like an ArraySequence.
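
As a rough illustration of the proxy idea (a hypothetical sketch, not NiBabel's actual ArrayProxy API):

import numpy as np

class StreamlineProxy:
    """Record where one streamline lives in the file; read it only on demand."""
    def __init__(self, filename, offset, nb_points, dtype=np.float32):
        self.filename = filename
        self.offset = offset          # byte offset of this streamline's points
        self.shape = (nb_points, 3)   # (points, xyz)
        self.dtype = np.dtype(dtype)

    def get_data(self):
        with open(self.filename, 'rb') as f:
            f.seek(self.offset)
            nbytes = int(np.prod(self.shape)) * self.dtype.itemsize
            return np.frombuffer(f.read(nbytes), dtype=self.dtype).reshape(self.shape)

An ArrayProxySequence would then hold one such proxy per streamline, deferring all I/O until a streamline is actually indexed.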

soichih commented 4 years ago

I am also seeing very slow streamline loading performance for TCK files. It looks like this part of the code is severely CPU-bound:

https://github.com/nipy/nibabel/blob/master/nibabel/streamlines/array_sequence.py#L32

emanuele commented 3 years ago

Since I was asked a couple of times, I added a link to download the test file/tractogram to the GitHub repo of our fast TRK loader (https://github.com/emanuele/load_trk.git): https://nilab.cimec.unitn.it/people/olivetti/data/sub-100206_var-FNAL_tract.trk . It may be useful for benchmarking. And here is another, much larger, test tractogram (10M streamlines): https://nilab.cimec.unitn.it/people/olivetti/data/sub-599469_var-10M_tract.trk

MarcCote commented 3 years ago

Finally got around to finishing my testing. I can't reproduce the timing you originally posted, @emanuele. Here's the simple benchmark script I'm using:

import nibabel as nib

# filename = 'sub-100206_var-FNAL_tract.trk'
filename = 'sub-599469_var-10M_tract.trk'

# Load the whole tractogram; summing the raw data forces it to be read
# and gives a checksum to verify both code paths load identical data.
S = nib.streamlines.load(filename)
print("Nb. of streamlines:", len(S.streamlines))
print("Checksum:", S.streamlines._data.sum())

Here's my timing for the 10M-streamline file you shared (4:25 min, RAM peak: 3.41 GB):

/usr/bin/time -v python bench_trk.py
Nb. of streamlines: 10000000
Checksum: -1326157800.0
        User time (seconds): 260.52
        System time (seconds): 4.31
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:25.10
        Maximum resident set size (kbytes): 3576648

And with my new patch (see #1000), I get (1:40 min, RAM peak: 5.72 GB):

/usr/bin/time -v python bench_trk.py
Nb. of streamlines: 10000000
Checksum: -1326157800.0
        User time (seconds): 99.87
        System time (seconds): 3.56
        Percent of CPU this job got: 102%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:40.66
        Maximum resident set size (kbytes): 6001400

I'm not sure why the RAM peak is higher with the patch, since it is supposed to do the affine transformation in place :-/. However, we can easily chunk it to avoid surges in RAM usage.
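
Something along these lines would work (a minimal sketch of the chunking idea, not the actual patch; the function name is illustrative):

import numpy as np

def apply_affine_chunked(points, affine, chunk_size=1_000_000):
    """Apply a 4x4 affine to an (N, 3) point array, writing results back
    into the input so the temporary never exceeds one chunk."""
    rot = affine[:3, :3].T  # transposed so row vectors can right-multiply
    trans = affine[:3, 3]
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        chunk[:] = chunk @ rot + trans  # per-chunk temporary only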

I do like your two-pass approach that first maps out the streamlines, then loads only the streamlines needed. I could see this being integrated into the LazyTractogram.
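
In outline, the two-pass idea looks like this (a sketch assuming a simplified TRK-like layout with a 1000-byte header and no per-point scalars or per-streamline properties):

import numpy as np

def index_streamlines(filename, header_size=1000):
    """First pass: record the byte offset and point count of each streamline."""
    index = []
    with open(filename, 'rb') as f:
        f.seek(header_size)
        while True:
            raw = f.read(4)
            if len(raw) < 4:
                break
            n = int(np.frombuffer(raw, dtype='<i4')[0])
            index.append((f.tell(), n))
            f.seek(n * 3 * 4, 1)  # skip this streamline's float32 xyz data
    return index

def load_streamline(filename, offset, n):
    """Second pass: load a single streamline on demand."""
    with open(filename, 'rb') as f:
        f.seek(offset)
        return np.frombuffer(f.read(n * 3 * 4), dtype='<f4').reshape(n, 3)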

neurolabusc commented 2 years ago

@frheault is there any way that @emanuele's optimization could be made into a pull request for nibabel? I know there is work on a new format, but TRK remains popular. Beyond performance, the code also appears to handle the TRK format better.

MarcCote commented 2 years ago

@neurolabusc @emanuele's code relies on NiBabel to read the TRK header. Regarding speed, with PR #1000 it should be comparable.

neurolabusc commented 2 years ago

@MarcCote can I suggest you contact Steven Jay Granger, who reported that he was unable to load a TRK file created by DSI Studio using DIPY, though he was able to load it with @emanuele's patch. Even if the latest release has improved speed, providing full compatibility with this popular format is important.

MarcCote commented 2 years ago

@neurolabusc yes, I replied on the Discourse thread.