nipy / nitransforms

a standalone fork of nipy/nibabel#656
https://nitransforms.readthedocs.io
MIT License

Memory issues when resampling large 4D files #173

Closed: oesteban closed this issue 8 months ago

oesteban commented 1 year ago

Resampling through transform mappings typically gets killed (out of memory) in standard settings when files are large (we've seen it with DWIs of ~150 volumes).

oesteban commented 8 months ago

@effigies - would you agree that:

  1. We want to open only one volume of the reference image and then make sure it is cleared out from memory
  2. We want to open the moving image with memory mapping
  3. We want to resample volume-by-volume (or by N-length 4D chunks), so that the full file is not required in memory at once.
oesteban commented 8 months ago

For 1, I think opening with memory mapping should be sufficient, since we don't access the data array, correct?

https://github.com/nipy/nitransforms/blob/1674e86a73595356eb6a775fd5b5c612952482a0/nitransforms/linear.py#L440

For 2, the moving image is loaded here:

https://github.com/nipy/nitransforms/blob/1674e86a73595356eb6a775fd5b5c612952482a0/nitransforms/linear.py#L447

and then this is the relevant loop, where we should make sure only one volume is loaded into memory at a time:

https://github.com/nipy/nitransforms/blob/1674e86a73595356eb6a775fd5b5c612952482a0/nitransforms/linear.py#L466-L476

Does that make sense?

effigies commented 8 months ago

1. You do not need to load a volume of the reference image at all, unless you're using a mask to narrow the field of view. All you need is the shape, affine, and dtype.

2. If you switch to using `dataobj`, then it doesn't matter whether you have a mmap or not. You can create an iterator that returns one volume at a time:

```python
(dataobj[:, :, :, i].astype(dtype, copy=False) for i in range(dataobj.shape[3]))
```

This could be made into a generator function if you need more flexibility.
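
For instance, a generator-function version might look like this (a sketch; `iter_volumes` and its signature are made up for illustration):

```python
import numpy as np

def iter_volumes(dataobj, dtype=np.float32):
    """Yield one 3D volume at a time from a 4D array proxy.

    Slicing the proxy reads only the requested volume from disk,
    so peak memory stays at a single volume.
    """
    for i in range(dataobj.shape[3]):
        yield np.asanyarray(dataobj[:, :, :, i]).astype(dtype, copy=False)
```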

3. Volume-by-volume definitely makes sense. Let the thread manager choose how many to load, and then set your thread count based on available cores and memory.

oesteban commented 8 months ago

Thanks SO MUCH :)

effigies commented 8 months ago

I think you have a serious inefficiency here: https://github.com/nipy/nitransforms/blob/1674e86a73595356eb6a775fd5b5c612952482a0/nitransforms/linear.py#L464-L478

You're creating a bunch of individual volumes, and then concatenating them in C order. There are going to be unnecessary copies under the hood, and another one when saving to NIfTI.

In fMRIPrep, I've moved to

    # Order F ensures individual volumes are contiguous in memory
    # Also matches NIfTI, making final save more efficient
    out_array = np.zeros(coordinates.shape[1:] + data.shape[-1:], dtype=output_dtype, order='F')

    tasks = [
        asyncio.create_task(
            worker(
                partial(
                    resample_vol,
                    data=volume,
                    coordinates=coordinates,
                    pe_info=pe_info[volid],
                    hmc_xfm=hmc_xfms[volid] if hmc_xfms else None,
                    fmap_hz=fmap_hz,
                    output=out_array[..., volid],
                    order=order,
                    mode=mode,
                    cval=cval,
                    prefilter=prefilter,
                ),
                semaphore,
            )
        )
        for volid, volume in enumerate(np.rollaxis(data, -1, 0))
    ]

The equivalent here would be:

            resampled = np.zeros(coordinates.shape[1:] + data.shape[-1:], dtype=output_dtype, order='F')
            for t in range(data.shape[-1]):
                ndi.map_coordinates(
                    data[..., t],
                    targets[t, ..., : _ref.ndim].T,
                    output=resampled[..., t],
                    order=order,
                    mode=mode,
                    cval=cval,
                    prefilter=prefilter,
                )
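
A quick check of why the `order='F'` allocation matters here (assumption: the per-volume `output` views should be contiguous, and NIfTI stores volumes contiguously on disk):

```python
import numpy as np

c_arr = np.zeros((76, 102, 60, 10), order='C')
f_arr = np.zeros((76, 102, 60, 10), order='F')

# In C order the time axis varies fastest, so one volume is a strided view;
# in F (Fortran) order each volume is a single contiguous block, which is
# also the NIfTI on-disk layout, so the final save avoids a reordering copy.
print(c_arr[..., 0].flags['C_CONTIGUOUS'])  # False
print(f_arr[..., 0].flags['F_CONTIGUOUS'])  # True
```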
oesteban commented 8 months ago

Okay, I'm testing your suggestion, and before even getting there, we hit these lines:

https://github.com/nipy/nitransforms/blob/1674e86a73595356eb6a775fd5b5c612952482a0/nitransforms/linear.py#L452-L455

These push RSS from ~10GB to ~30GB.

effigies commented 8 months ago

I don't fully understand that line. We don't use it in sdcflows/fmriprep. We would just use:

```python
targets = nt.base.SpatialReference.factory(spatialimage).ndcoords.astype('f4').T
```
effigies commented 8 months ago

Ah, hold on. Nevermind, I see the equivalents. Need to think it through a bit.

oesteban commented 8 months ago

I think that's somehow equivalent to `_ref.ndcoords.T` on line 452.

Lines 453-455 convert those physical coordinates into voxel coordinates, seemingly in a very inefficient way.

effigies commented 8 months ago

What's the dimensionality of your target image?

effigies commented 8 months ago

And here's the equivalent of what we have in fMRIPrep:

```python
ycoords = self.map(_ref.ndcoords.astype('f4').T)
ras2vox = ~nt.Affine(spatialimage.affine)
targets = ras2vox.map(ycoords)
```

If switching to float32 and Affine.map don't help, then you're just working with much bigger data. In which case it might make sense to start chunking arrays by slice as well as volume.
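
Chunking by blocks of points might be sketched like this (the function name and chunk size are made up):

```python
import numpy as np
from scipy import ndimage as ndi

def map_in_chunks(volume, vox_coords, chunk_size=100_000, order=1):
    """Interpolate a single 3D volume at vox_coords (3, n_points),
    processing a bounded number of points at a time."""
    # Note: for order > 1 you would spline-prefilter `volume` once up front
    # rather than letting each chunk re-filter it.
    n_points = vox_coords.shape[1]
    out = np.empty(n_points, dtype=np.float32)
    for start in range(0, n_points, chunk_size):
        sel = slice(start, start + chunk_size)
        ndi.map_coordinates(volume, vox_coords[:, sel], output=out[sel], order=order)
    return out
```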

oesteban commented 8 months ago

```python
>>> _ref.shape
(76, 102, 60)
```

Not so big, although it's 750 timepoints, so

```python
>>> targets.shape
(750, 465120, 3)
```
effigies commented 8 months ago

Timepoints shouldn't matter at this point. You're just calculating the spatial coordinates, not interpolating values. Even at float64, targets should only be ~10 MB, so something's going wildly wrong.

I would skip the index() thing anyway. I don't think it buys you any clarity.

oesteban commented 8 months ago

Okay, timepoints do matter because here:

```python
targets = ras2vox.map(ycoords)
```

you have to ravel the first two dimensions of targets.
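
For scale, with the shapes above (quick arithmetic; float64 assumed):

```python
import numpy as np

n_points = 76 * 102 * 60          # 465120 points in the reference grid
per_volume = n_points * 3 * np.dtype('f8').itemsize
stacked = 750 * per_volume        # one coordinate set per timepoint

print(per_volume / 1e6)   # ~11.2 MB for a single volume's coordinates
print(stacked / 1e9)      # ~8.4 GB once raveled across 750 timepoints
```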

oesteban commented 8 months ago

Looks like cramming all the coordinates into a single mapping is really not memory-efficient.

oesteban commented 8 months ago

Okay, that was the culprit (the inefficiency you pointed out was probably also blowing things up). Sending a PR soon.