nipy / nibabel

Python package to access a cacophony of neuro-imaging file formats
http://nipy.org/nibabel/

Update slicer docs for indexed-gzip & keep_file_open #1058

Open ecc521 opened 3 years ago

ecc521 commented 3 years ago

I have some rather large gzipped NIfTI files that I need to read without first buffering them in memory, i.e. by reading them slice by slice.

When loading the image via nibabel.load, slicer and dataobj appear to re-read the file from the beginning on every access, so the total read time grows quadratically with the number of slices taken.
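To make the quadratic claim concrete, here is a rough cost model. It assumes that every slice read decompresses the stream from the start of the file up to the end of the requested slice, and that all slices have the same size. The function name and the numbers are illustrative assumptions, not measurements of nibabel.

```python
def total_bytes_decompressed(n_slices, slice_bytes, restart_each_time):
    """Total bytes decompressed when reading n_slices slices in order."""
    if restart_each_time:
        # Slice k (0-based) needs everything up to its end: (k + 1) slices.
        return sum((k + 1) * slice_bytes for k in range(n_slices))
    # A stream that keeps its position decompresses each slice only once.
    return n_slices * slice_bytes


# Hypothetical volume: 100 slices of 1 MB each.
restart = total_bytes_decompressed(100, 1_000_000, restart_each_time=True)
forward = total_bytes_decompressed(100, 1_000_000, restart_each_time=False)
print(restart / forward)  # how much extra work restarting costs
```

Under this model the restart strategy does n(n+1)/2 slices' worth of work, while the forward-only strategy does n.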

Since I'm only interested in proceeding forward through the file, it would seem that the time complexity here should be linear - indeed, linear time complexity can be obtained by updating the code from an old question:

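from gzip import GzipFile
from nibabel import FileHolder, Nifti1Image

# niftipath is the path to the gzipped NIfTI file.
# One GzipFile is shared by header and image, so it is opened only once.
fh = FileHolder(fileobj=GzipFile(niftipath))
img = Nifti1Image.from_file_map({'header': fh, 'image': fh})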

In this case, since GzipFile preserves the current decompression state, proceeding strictly forward in the file works at the expected speed (much faster).
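The behavior described here can be seen with the standard library alone. The sketch below uses an in-memory payload as a stand-in for a gzipped image file; the helper name read_forward and the payload are illustrative assumptions. Consecutive reads on one open GzipFile continue from the current position in the decompressed stream, with no seek back to the start.

```python
import gzip
import io


def read_forward(gz, chunk_size, n_chunks):
    """Read n_chunks consecutive chunks from an open GzipFile.

    No seek is issued, so each read continues from where the
    previous one stopped in the decompressed stream.
    """
    return [gz.read(chunk_size) for _ in range(n_chunks)]


# In-memory stand-in for a gzipped image file (illustrative data only).
payload = bytes(range(256)) * 64
gz = gzip.GzipFile(fileobj=io.BytesIO(gzip.compress(payload)))

chunks = read_forward(gz, chunk_size=1024, n_chunks=4)
print(gz.tell())  # position in the decompressed stream after the reads
```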

Is there a way to obtain this same slicing performance using the nibabel.load() API (for example, by passing in a GzipFile)? This would be greatly preferable, as it abstracts away file formats.

effigies commented 3 years ago

If you install the indexed-gzip package, you should get performance improvements for free.

ecc521 commented 3 years ago

Thanks, @effigies! Looking at the indexed-gzip docs, I was able to find the flag: keep_file_open=True. While indexed-gzip alone does help, that flag is all that is needed for this use case.

Still confused as to why keep_file_open is off by default, but enabling it seems to be a solution.
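For anyone landing here later, a minimal sketch of the forward-only slicing pattern with keep_file_open enabled might look like the following. The helper names iter_slices and load_slices are hypothetical, and iterating over the last axis is an assumption about the access pattern, not something nibabel requires.

```python
def iter_slices(img):
    """Yield slices along the last axis of img, first to last."""
    for k in range(img.shape[-1]):
        # Indexing dataobj asks the array proxy for just this slice.
        yield img.dataobj[..., k]


def load_slices(path):
    """Open path with nibabel and iterate over it slice by slice."""
    import nibabel as nib  # imported here so iter_slices works without it

    # keep_file_open=True asks nibabel to reuse one file handle across reads.
    img = nib.load(path, keep_file_open=True)
    yield from iter_slices(img)
```

A call such as `for sl in load_slices(path): ...` then walks the volume front to back without loading the whole array.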

Unless the defaults or the relevant documentation need to be revisited to make keep_file_open/indexed-gzip more visible (the slicer section does not mention either), I'm good to close this.

effigies commented 3 years ago

With indexed-gzip installed, you should not need to set keep_file_open to get almost identical performance.

The reason it's off by default is that, when working with many files, you can exhaust file handle quotas, and the lifetimes of file handles are difficult to reason about.
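The trade-off can be illustrated with two toy readers. Both class names are hypothetical and neither reflects how nibabel is implemented. The first opens and closes the file on every read, so it never holds a handle between calls; the second holds one handle per object until close() is called, which is faster for repeated reads but means N live objects hold N handles.

```python
class ReopeningReader:
    """Opens the file for each read and closes it right away."""

    def __init__(self, path):
        self.path = path

    def read_at(self, offset, size):
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(size)


class PersistentReader:
    """Holds one handle open until close() is called."""

    def __init__(self, path):
        self._f = open(path, "rb")

    def read_at(self, offset, size):
        self._f.seek(offset)
        return self._f.read(size)

    def close(self):
        self._f.close()
```

With the second style, forgetting close() on many objects is how a process runs into its open-file limit.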

Definitely good to update the docs.