matthew-brett / xibabel

Piloting a new image object for neuroimaging based on XArray
BSD 2-Clause "Simplified" License

proper chunked data loading #12

Closed: ivanov closed this 5 months ago

ivanov commented 5 months ago

open_dataarray does not try to read all of the data up front, whereas load_dataarray
does. That alone was still insufficient to reduce the memory footprint: the chunks='auto' argument is also required so that the dask array reading respects and preserves the engine's (zarr) on-disk chunking.
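
A minimal sketch of the call pattern this describes (the store path, variable name, and reduction are illustrative, not taken from this PR):

```python
import xarray as xr

# Open the on-disk (zarr) store lazily: only metadata is read here, and
# chunks="auto" lets dask pick chunks that respect the store's own
# on-disk chunking instead of pulling the whole array into memory.
da = xr.open_dataarray("example.ximg.zarr", engine="zarr", chunks="auto")

# Computation stays lazy until .compute() (or .load()) is called.
mean_img = da.mean(dim="time").compute()
```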

There are also other minor improvements.

codecov[bot] commented 5 months ago

Codecov Report

Attention: Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 76.26%. Comparing base (3509316) to head (489c1f7).

| Files | Patch % | Lines |
| --- | --- | --- |
| src/xibabel/loaders.py | 66.66% | 0 Missing and 1 partial :warning: |

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##             main      #12     +/-  ##
==========================================
- Coverage   76.40%   76.26%   -0.15%
==========================================
  Files           9        9
  Lines         534      535       +1
  Branches       74       75       +1
==========================================
  Hits          408      408
  Misses        110      110
- Partials       16       17       +1
```

:umbrella: View full report in Codecov by Sentry.

matthew-brett commented 5 months ago

Have you installed pre-commit? (I didn't until yesterday).

Any metrics for what difference this makes?

ivanov commented 5 months ago

Sorry about the ruff error; I'll try to be more mindful of it.

> Any metrics for what difference this makes?

Yes: for a 3.7 GB image, the non-pathological chunking uses only about 2 GB of RAM when running on my 12-core / 16-thread machine (i5-1340P):

[screenshot: memory usage, multi-core run]

Memory use drops much further if you limit the number of cores used. Here's the same computational workload on a single core (it takes longer, but is far more memory efficient):

[screenshot: memory usage, single-core run]

However, in the single-CPU case, the sharding into chunks still uses more memory, though only about 1.5 GB. I'm running the analysis for the multi-CPU case now.
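
For reference, here is one way such a single-core limit can be imposed with dask (a sketch only; the exact mechanism used for these measurements isn't shown here, and `da` stands for the lazily opened DataArray from the earlier sketch):

```python
import dask

# Run one computation on the single-threaded ("synchronous") scheduler:
# slower in wall-clock time, but with a much smaller peak memory footprint.
with dask.config.set(scheduler="synchronous"):
    mean_img = da.mean(dim="time").compute()

# Alternatively, keep the threaded scheduler but cap the worker count.
mean_img = da.mean(dim="time").compute(scheduler="threads", num_workers=1)
```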

matthew-brett commented 5 months ago

Nice - is that really true, that the chunked i=16 value is fastest and uses least memory?

ivanov commented 5 months ago

> Nice - is that really true, that the chunked i=16 value is fastest and uses least memory?

All of the reports are for a single run, so there are no error bars here, and I have not systematically explored the chunking space. Another caveat is that I'm pretty sure the chunking dictionary listed here got applied on top of any automatic chunking dask already did on initial array creation, which had already made chunks along the "time" dimension. So the chunking I specified got composed with that initial chunking:

[screenshot: dask chunk layout]
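
To illustrate the composition (a sketch; the dimension name "i" echoes the i=16 chunking discussed above, and `da` is again the lazily opened array):

```python
# da was opened with chunks="auto", so it already has chunks, e.g. along
# "time". Rechunking only a named dimension composes with that:
rechunked = da.chunk({"i": 16})

# Dimensions not named in the dict keep their existing chunking, so the
# result carries both the automatic "time" chunks and the new "i" chunks.
print(rechunked.chunks)
```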

Here is the "overhead" of running the sharding itself. Surprisingly, though perhaps not in retrospect, running this on a single CPU is more efficient than trying to do it without that limit.

[screenshots: sharding overhead reports]