hyperspy / rosettasciio

Python library for reading and writing scientific data formats
https://hyperspy.org/rosettasciio
GNU General Public License v3.0

Memory spike when loading .mib data lazily #266

Closed: emichr closed this issue 4 months ago

emichr commented 4 months ago

Describe the bug

When loading large .mib data with hs.load("data.mib", lazy=True), the RSS memory spikes to more than the size of the dataset. For instance, when loading a 31.3 GB dataset without specifying any chunking, the RSS memory spikes at about 95 GB. With different chunking the memory spike changes, but it is still a problem (e.g. RSS spikes at 73 GB with (64, 64, 64, 64) chunking of a (800, 320|256, 256) uint16 dataset). This figure shows the RSS memory usage as a function of time when running hs.load(..., lazy=True, chunks=(64, 64, 64, 64)): [figure]

To Reproduce

Steps to reproduce the behavior:

signal = hs.load("data.mib", lazy=True)
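A minimal sketch (not from the original report, assuming psutil is installed) of how the RSS growth can be observed around the load call; the transient peak is better captured by sampling RSS continuously (e.g. with memory_profiler's mprof), but the before/after difference already shows the problem:

import os
import psutil
import hyperspy.api as hs

process = psutil.Process(os.getpid())
print(f"RSS before load: {process.memory_info().rss / 1e9:.1f} GB")
# chunking taken from the description above
signal = hs.load("data.mib", lazy=True, chunks=(64, 64, 64, 64))
print(f"RSS after load:  {process.memory_info().rss / 1e9:.1f} GB")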

Expected behavior

The RSS memory requirement shouldn't exceed the dataset size on disk, and should be much lower when loaded lazily.

Python environment:

Additional context

The problem was encountered on a HPC cluster, but I assume the problem will persist on other machines.

CSSFrancis commented 4 months ago

@emichr can you try the same thing with chunks of (16,16,256,256)?

emichr commented 4 months ago

@CSSFrancis The peak got a little lower, but it is still significant (the load time is also significant): [figure: RSS memory usage over time with (16, 16, 256, 256) chunking]

I just tested on a laptop and there are no issues, so it might be an issue with the cluster (or the environment, of course). The environment .yml file from the HPC looks like this: pyxem_hpc.txt

It might be an issue better taken up with the particular HPC infrastructure, but I guess other users might run into it down the line.

ericpre commented 4 months ago

Could it be that memmap isn't playing well with the distributed scheduler, and that the mib reader needs the same treatment as https://github.com/hyperspy/rosettasciio/pull/162?

@emichr, just to check: is this using the distributed scheduler?

CSSFrancis commented 4 months ago

@emichr One thing you might try is to first load the data without the distributed scheduler and then save it as a zarr ZipStore file. That can be read with a distributed scheduler, but not written with one. I tend to store my data as a zip store when I can (similar to how hdf5 keeps everything in a single file); even if it is a little slower, it might be more convenient in this case.

import zarr
import hyperspy.api as hs

# Load without the distributed scheduler, then write to a zip-based zarr store
s = hs.load("../../Downloads/FeAl_stripes.hspy", lazy=True)
s_zip2 = zarr.ZipStore("FeAl_stripes.zspy")
s.save(s_zip2)

Or you can always just zip the data after as per the note in the documentation: https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.ZipStore
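A rough sketch (not from the thread) of that "zip it afterwards" alternative: write a normal directory-based .zspy first, then pack it into an uncompressed zip archive. The file names here are illustrative.

import os
import zipfile

store_dir = "FeAl_stripes.zspy"  # directory-based zarr store written by a plain s.save()
with zipfile.ZipFile("FeAl_stripes_zipped.zspy", "w", compression=zipfile.ZIP_STORED) as zf:
    for root, _, files in os.walk(store_dir):
        for name in files:
            path = os.path.join(root, name)
            # keep paths relative to the store root so the archive layout matches the directory store
            zf.write(path, os.path.relpath(path, store_dir))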

magnunor commented 4 months ago

Testing this on my own computer, I see exactly the same memory issue as @emichr.

Doing this uses a lot of memory:

from rsciio.quantumdetector._api import file_reader
data = file_reader("005_test.mib", lazy=True) # data is a dask array

Doing this is fast and uses almost no memory:

from rsciio.quantumdetector._api import load_mib_data
data = load_mib_data("005_test.mib", lazy=True) # data is a numpy memmap

Trying to make it into a dask array is slow and uses a lot of memory:

import dask.array as da
dask_array = da.from_array(data)

I tested this with dask version 2024.5.0.

CSSFrancis commented 4 months ago

Huh... That shouldn't be the case unless somehow the da.from_array function is converting the memmap array into a numpy array.

I wonder if this is an upstream bug in dask. @magnunor, do you know if this worked previously?

magnunor commented 4 months ago

Seems to be due to a change in dask: testing different dask versions, the memory spike does not occur with 2024.1.1 but does with 2024.2.0 and later.

Ergo, it seems like this was introduced in dask version 2024.2.0.

emichr commented 4 months ago

Could it be that memmap isn't playing well with the distributed scheduler, and that the mib reader needs the same treatment as #162?

@emichr, just to check: is this using the distributed scheduler?

@ericpre I assume that I am, as I haven't changed any of the defaults in hyperspy or pyxem and just use those packages "out of the box", so to speak. I'm not sure how I could check this more thoroughly, though; I'm not very familiar with dask, to be frank.

magnunor commented 4 months ago

I'm making a minimal working example for posting it on the dask github, as they might know a bit more about how to resolve this.
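For reference, a guess at what such a minimal example (independent of rosettasciio) might look like: wrap a plain numpy memmap with da.from_array and watch the RSS. The file name and shape are illustrative; the file is only created to have something to map, so shrink the shape for quick tests.

import numpy as np
import dask.array as da

shape = (800, 320, 256, 256)  # roughly the dataset shape from this issue
np.memmap("test_data.bin", mode="w+", dtype=np.uint16, shape=shape).flush()  # create a test file

data = np.memmap("test_data.bin", mode="r", dtype=np.uint16, shape=shape)
dask_array = da.from_array(data, chunks=(64, 64, 64, 64))  # RSS reportedly spikes here with dask >= 2024.2.0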

magnunor commented 4 months ago

I made an issue about this on the dask github: https://github.com/dask/dask/issues/11152

magnunor commented 4 months ago

I tested this on a Windows and a Linux computer: same issue on both.

magnunor commented 4 months ago

@sivborg found the dask commit which introduced the issue: https://github.com/dask/dask/issues/11152#issuecomment-2137561906

So for now, a temporary fix is to downgrade dask to 2024.1.1. For Anaconda:

conda install dask=2024.1.1 -c conda-forge
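For pip-based environments (not mentioned in the thread), the equivalent pin would be along the lines of:

pip install "dask==2024.1.1" "distributed==2024.1.1"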

ericpre commented 4 months ago

Could it be that memmap isn't playing well with the distributed scheduler, and that the mib reader needs the same treatment as #162? @emichr, just to check: is this using the distributed scheduler?

@ericpre I assume that I am, as I haven't changed any of the defaults in hyperspy or pyxem and just use those packages "out of the box", so to speak. I'm not sure how I could check this more thoroughly, though; I'm not very familiar with dask, to be frank.

Even if this seems irrelevant here, for completeness I will elaborate on my previous comment: since you mentioned that this was happening on a cluster, I assumed that you were using the distributed scheduler - otherwise, it will not scale! This is briefly mentioned in the user guide, but there are more details in the dask documentation.
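As an illustration (not from the thread) of what using the distributed scheduler explicitly looks like, and one way to check that it is active, a minimal sketch; on an HPC system the Client would typically wrap a dask-jobqueue or dask-mpi cluster rather than the default local one:

from dask.distributed import Client
import hyperspy.api as hs

# Creating a Client makes dask use the distributed scheduler by default.
client = Client()
print(client)  # shows the scheduler address and the number of workers

s = hs.load("data.mib", lazy=True)
result = s.sum()   # any lazy operation
result.compute()   # the computation now runs on the distributed workers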

CSSFrancis commented 4 months ago

Just adding on to that: if you wanted to load .mib files using memmap and the distributed backend, you would have to adjust how the data is loaded. This part of the dask documentation describes how to do that: https://docs.dask.org/en/latest/array-creation.html?highlight=memmap#memory-mapping
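Roughly, the pattern described there is to open the memmap inside each task, so every worker maps the file itself instead of receiving a pickled memmap from the client. A sketch under assumed parameters (file name, shape, dtype and header offset are placeholders, not the actual mib layout):

import dask
import dask.array as da
import numpy as np

filename = "data.mib"           # placeholder
shape = (800, 320, 256, 256)    # placeholder navigation/signal shape
dtype = np.uint16
offset = 0                      # placeholder: size of any file header in bytes

def load_block(fn, shape, dtype, offset, nav_slice):
    # The memmap is created on the worker; only the requested slice is read.
    data = np.memmap(fn, mode="r", shape=shape, dtype=dtype, offset=offset)
    return np.asarray(data[nav_slice])

chunk_rows = 64
blocks = []
for i in range(0, shape[0], chunk_rows):
    stop = min(i + chunk_rows, shape[0])
    delayed_block = dask.delayed(load_block)(filename, shape, dtype, offset, slice(i, stop))
    blocks.append(da.from_delayed(delayed_block, shape=(stop - i,) + shape[1:], dtype=dtype))

dask_array = da.concatenate(blocks, axis=0)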

ericpre commented 4 months ago

Yesterday, I had a quick go at implementing https://github.com/hyperspy/rosettasciio/pull/162 for the mib reader - it has been in the back of my mind that this is something that needs to be done. Anyway, I got stuck with the structured dtype used in the mib reader, because it messed up the chunks and shape... @CSSFrancis, if I make a PR of my WIP on this, are you happy to continue it? Considering that you have a better understanding of this than me, you may be able to finish it more easily! 😄
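To illustrate the kind of complication being referred to (this is a generic illustration, not the actual dtype used by the mib reader): with a structured dtype, each record bundles a per-frame header with the frame payload, so the array is 1D over frames and its shape and chunk sizes no longer map directly onto a plain 4D dataset.

import numpy as np

# Hypothetical per-frame record: a raw header followed by a detector frame.
frame_dtype = np.dtype([("header", np.uint8, (384,)),
                        ("data", np.uint16, (256, 256))])

frames = np.zeros(10, dtype=frame_dtype)  # a 1D array of records, not a 4D image stack
images = frames["data"]                   # selecting the field recovers a (10, 256, 256) view
print(frames.shape, images.shape)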

CSSFrancis commented 4 months ago

@ericpre Yeah, I can do that (maybe not for a couple of weeks). I've done something similar in #11 with the segmented detector, but that file format isn't really used anymore, so I kind of moved on from it.

magnunor commented 4 months ago

This seems to have been fixed in the current development version of dask: https://github.com/dask/dask/pull/11161

Not sure when the next dask release will be.

ericpre commented 4 months ago

Dask usually releases every two weeks on a Friday. The next one should be on the 14th of June.

ericpre commented 4 months ago

I checked quickly and this is fixed using the main branch of dask.