matthew-brett / xibabel

Piloting a new image object for neuroimaging based on XArray
BSD 2-Clause "Simplified" License

chunked performance #16

Open ivanov opened 5 months ago

ivanov commented 5 months ago

continued exploration and observations, from notes that began in #12

default automatic chunking not deterministic (and not performant)

One of the things we wanted to understand was how the default auto-chunking performs. In the figures in #12 I was passing in a dictionary that composed with whatever chunking was already set up automatically, and unfortunately what I found is that the default auto-chunking is not deterministic. It would always automatically chunk the last ("time") dimension, but the chunk size varied between 61 and 257, usually going down by 10-20 on consecutive runs. Also, using just this chunking strategy, without any additionally specified chunking, typically yielded a slower run than the nibabel approach. For example, whereas the complete nibabel approach took 2.6 seconds using about 8 GiB, the default auto-chunking approach took 3.1 seconds and still used about 8 GiB.
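As a sketch of how to see what the auto-chunking actually chose on a given run (the file and variable names here are made up, not the real test data):

import xarray as xr

# Open lazily and let dask pick chunk sizes; the sizes it picks can vary
# from run to run, which is the non-determinism described above.
ds = xr.open_dataset("sub-1_bold.nc", chunks="auto")
print(ds["bold"].chunks)  # shows the chunking chosen for each dimension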

splitting up "reading" and "compute" for the original nibabel code

Another thing I worried about was that in the earlier plots I wasn't being fair to the nibabel code, as it was reading the file, reshaping it, and then running the computation. So I separated these steps for the nibabel case and was surprised to find that just "reading" was already using most of the time and the entire 8 GiB! I had wrongly assumed that the memory growth was coming from the matrix multiplication, but it turns out that half of the memory usage comes from just

# Reshape to time by (i * j * k) 
vols_by_voxels = np.reshape(data, (-1, n_vols)).T

Here are 9 runs where I perform the reshape and save all of the data, and one run where I save all of the same data but without performing the reshape:

[figure: resident memory traces for the 9 runs with the reshape and the 1 run without]
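As a side note, a rough way to see where the resident memory goes is to print the process RSS around each step; the array below is a random stand-in for the real data, so the numbers are only illustrative:

import os

import numpy as np
import psutil

proc = psutil.Process(os.getpid())

def rss_gib():
    return proc.memory_info().rss / 2**30

data = np.random.rand(64, 64, 30, 173)  # stand-in for the loaded image data
n_vols = data.shape[-1]
print(f"after 'read':  {rss_gib():.2f} GiB")

# Reshape to time by (i * j * k), as in the snippet above
vols_by_voxels = np.reshape(data, (-1, n_vols)).T
print(f"after reshape: {rss_gib():.2f} GiB")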

new "compute" only nibabel figure

Doing the reshaping as part of the "reading" code leaves us with 1 second to read the reshaped data from disk and do the linear algebra. This can now be used to make an apples-to-apples comparison with the chunked, "compute"-only xibabel code.

[figure: compute-only nibabel run, showing duration and resident memory]

can we get anywhere with chunking?

In this new regime, some of the chunking strategies we used initially no longer outperform the original nibabel computation in terms of time, though they still improve on its memory usage. The two best chunking approaches we had previously ended up using only 2 GiB of memory, instead of the 4 GiB nibabel uses, but they take a bit over a second, so they actually end up slower than nibabel.

To get around the non-determinism of automatic chunking, I ended up specifying a chunk size of -1 -- which says "do not chunk in this dimension" -- for all but one dimension, and then ran some experiments with small chunk sizes in just that one dimension. Chunking only in the "time" dimension did not perform well. All of the performant strategies below include {"time": -1}, which I omitted from the plot labels. I've also kept the resident memory trace for the nibabel runs for continuity with the previous figure, but only report the total duration and maximum memory usage for the various chunking strategies.

[figure: resident memory trace for the nibabel runs, with duration and peak memory for each chunking strategy]

| Chunks | Running time | Max memory |
| --- | --- | --- |
| {'i': -1, 'j': 3, 'k': -1, 'time': -1} | 0.46 s (SD=0.04) | 1.43 GiB (SD=0.09) |
| {'i': -1, 'j': -1, 'k': 1, 'time': -1} | 0.49 s (SD=0.06) | 1.56 GiB (SD=0.05) |
| {'i': 3, 'j': -1, 'k': -1, 'time': -1} | 0.50 s (SD=0.09) | 1.59 GiB (SD=0.05) |
| {'i': -1, 'j': 4, 'k': -1, 'time': -1} | 0.51 s (SD=0.08) | 1.63 GiB (SD=0.06) |
| {'i': 2, 'j': -1, 'k': -1, 'time': -1} | 0.58 s (SD=0.07) | 2.00 GiB (SD=0.05) |
| {'i': 1, 'j': -1, 'k': -1, 'time': -1} | 0.60 s (SD=0.08) | 2.14 GiB (SD=0.09) |
| {'i': -1, 'j': 2, 'k': -1, 'time': -1} | 0.74 s (SD=0.08) | 2.16 GiB (SD=0.06) |
| {'i': -1, 'j': 1, 'k': -1, 'time': -1} | 0.78 s (SD=0.08) | 2.43 GiB (SD=0.06) |
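For reference, a minimal, self-contained sketch of what specifying such a chunking looks like with plain xarray + dask; the array shape and the mean reduction are stand-ins for the real image and computation:

import numpy as np
import xarray as xr

# Stand-in for the BOLD image; in practice this would come from the xibabel loader.
bold = xr.DataArray(
    np.random.rand(64, 64, 30, 173).astype(np.float32),
    dims=("i", "j", "k", "time"),
)

# -1 means "keep this dimension as a single whole chunk"; only "j" is split.
chunked = bold.chunk({"i": -1, "j": 3, "k": -1, "time": -1})
print(chunked.chunks)

# The computation then runs one (64, 3, 30, 173) block at a time,
# which is what keeps peak memory low.
result = chunked.mean("time").compute()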
matthew-brett commented 5 months ago

> I wrongly assumed that the memory growth was coming from matrix multiplication, but it turns out that half of the memory usage is coming from just [the np.reshape call]

Maybe this comes from the proxy image in Nibabel: https://nipy.org/nibabel/images_and_memory.html

In that case, the initial np.reshape will implicitly call np.array on the image data proxy, which will then cause the image data to be read from disk into memory. So my prediction would be that it's np.array(data) that's causing the memory usage, and that np.reshape itself has very little extra effect on memory or CPU time.
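A rough sketch of that proxy behaviour (the filename is illustrative):

import numpy as np
import nibabel as nib

img = nib.load("sub-1_bold.nii.gz")  # reads the header only; data stays on disk
proxy = img.dataobj                  # an ArrayProxy; still no array in memory
data = np.asarray(proxy)             # this is where the gigabytes get read
# The reshape itself is then a view on the in-memory array, so nearly free.
vols_by_voxels = np.reshape(data, (-1, data.shape[-1])).T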

This, in turn, I suppose brings up the question of what "fair" means for Nibabel. Nibabel, like Xibabel / Xarray + Dask, will try to defer reads from disk - it's just that it will do it in a cruder and less configurable way.

matthew-brett commented 5 months ago

For the time chunking - yes - that makes sense - I suppose the matrix multiply needs whole voxel time-course rows to work with, so it needs to load across time - whereas the other dimensions all allow smaller blocks of voxels. We need to think about some optional chunking strategy that may depend on the operation. For example - time slicing will likely benefit tasks like motion correction, for which you need to pull out individual volumes. I guess we could simulate this kind of processing with:

for tp in ximg['time']:
    arr = np.array(ximg.sel(time=tp))

Maybe we can define processing steps with their associated chunking strategies.
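One possible shape for that idea, purely as a sketch (the step names and helper below are made up, not an agreed design):

# Map each processing step to the chunking that suits its access pattern.
CHUNKS_BY_STEP = {
    "glm": {"i": -1, "j": 3, "k": -1, "time": -1},                # needs whole time courses
    "motion_correction": {"i": -1, "j": -1, "k": -1, "time": 1},  # needs whole volumes
}

def run_step(ximg, step, func):
    # Rechunk for the given step, then apply ``func`` to the rechunked image.
    return func(ximg.chunk(CHUNKS_BY_STEP[step]))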