hyperspy / rosettasciio

Python library for reading and writing scientific data formats
https://hyperspy.org/rosettasciio
GNU General Public License v3.0

Kernel crashing while saving lazy signal from ripple file #329

Open jeinsle opened 3 weeks ago

jeinsle commented 3 weeks ago

Describe the bug

I have loaded a ripple file which is 366 GB in size as a lazy signal. When I try saving using the command data.save(), it starts the process, but usually after getting through about 20% of the data the kernel crashes.

I have tried chunk sizes of 1.1 GB, 785 MB, 384 MB and 96 MB. With the 96 MB chunk size I have also tried using the command dask.config.set(scheduler='single-threaded'). All of these result in the kernel crashing.

To Reproduce

Steps to reproduce the behavior:

import hyperspy.api as hs
import exspy as ex
import numpy as np
import dask
import dask.array as da
from dask.distributed import Client, LocalCluster
import h5py

# load the actual data
sands_ref_eds = hs.load('Montaged_Map_Data-Sand_LA_EDS_500x_Montage_Data.rpl', lazy=True).T
sands_ref_eds

# rechunk data to be ~96 MB/chunk
tile_size = np.array([512, 384])
print(tile_size)
rechunk_size = (int(tile_size[1] * 0.25), int(tile_size[0] * 0.25))
sands_ref_eds.rechunk(nav_chunks=rechunk_size)
sands_ref_eds

# save the data
dask.config.set(scheduler='single-threaded')
print('saving ', 'ref_sands_calibrated_eds_map')
sands_ref_eds.save('ref_sands_calibrated_eds_map_r_005', chunks=sands_ref_eds.data.chunksize)
print('saved')

Expected behavior

The file should save as an .hspy file.

Python environment:

Additional context

Note: immediately saving the data as a hyperspy file does not result in this crash. However, I still need to test the behavior of saving a file once I add in metadata etc. for this file.

CSSFrancis commented 3 weeks ago

@jeinsle One thing to keep track of is the initial chunk configuration. I think that currently we don't allow the chunk size to be passed for memmapped datasets, which can cause issues such as this.

Tracking through HyperSpy:

https://github.com/hyperspy/hyperspy/blob/8bc57e1a809668d54da8d4f355ab0b18520fa4ad/hyperspy/io.py#L632 --> https://github.com/hyperspy/hyperspy/blob/RELEASE_next_minor/hyperspy/_signals/lazy.py#L424 --> https://github.com/hyperspy/hyperspy/blob/8bc57e1a809668d54da8d4f355ab0b18520fa4ad/hyperspy/_signals/lazy.py#L440

We should allow a "chunks" parameter to explicitly dictate how the chunks are formed for binary files. More than likely your data is being loaded with a less-than-ideal chunking pattern, and once you do the transpose hyperspy will sometimes try to force your data into different chunks so that things like the map function work.

Maybe we can try and figure out a good workaround and then we can think about the right thing to do.

This is untested so you might need to edit this workflow a bit...

from rsciio.ripple import file_reader
import dask.array as da
import hyperspy.api as hs
file_dict = file_reader('Montaged_Map_Data-Sand_LA_EDS_500x_Montage_Data.rpl', lazy=True)[0]  # rsciio readers return a list of dicts

lazy_data = da.from_array(file_dict.pop("data"), chunks=(-1, 100, 1))  # You might need to play around with this...
s = hs.signals.Signal1D(data=lazy_data, **file_dict).T

s.save('Montaged_Map_Data-Sand_LA_EDS_500x_Montage_Data.zspy')  # I like .zspy for larger files usually...

s = hs.load("Montaged_Map_Data-Sand_LA_EDS_500x_Montage_Data.zspy")

For future consideration, we can add the ability to do distributed loading so you can use the dask dashboard, which is helpful for debugging these types of things. Is this a typical dataset size for you? Are you interested in taking larger datasets? If so, we can spend a bit of time streamlining/optimizing this workflow.
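In the meantime, a minimal sketch of starting a local cluster just to get the dashboard for monitoring (untested; worker counts and memory limit are placeholders, and note that the current memmap-based loading may not play well with the distributed scheduler yet):

from dask.distributed import Client, LocalCluster

# Start a local cluster and open the printed dashboard link in a browser
# to watch task progress and memory use while the save runs.
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit='8GB')
client = Client(cluster)
print(client.dashboard_link)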

jeinsle commented 3 weeks ago

@CSSFrancis thanks for the reply

  1. Not sure I fully follow the argument here, but I think I am seeing this now, as opposed to earlier this summer when I was creating a separate generic HDF5 file and saving the dask array there, rather than letting hyperspy handle most of the work.

  2. The initial load for this dataset would be: (screenshot)

At the moment I have successfully run the save command, but it took about 1.5 hours to save, as I think it was running in a single-threaded state. It would be nice if this could be sped up... I think it should be possible given that we are using dask arrays.

  3. For working on the data we would rather look at something like this: (screenshot)

What I am trying to do with the rechunk here is preserve the energy axis on the data (see the sketch after this list). This summer I was working with a different big dataset and did not seem to have as much bother, and could actually run the chunk size closer to 1 GB (roughly 4x in the x and y directions) thanks to the RAM and number of processors. For various reasons my current dataset presents new problems.

  4. I think more dask will only help. This is a fairly common dataset size for my group now, as we are getting full thin-section EDS maps that we want to process.

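For reference, a minimal sketch of a rechunk that keeps the full energy axis in a single chunk (untested; the navigation chunk sizes are just the values from the reproduction snippet above):

# Rechunk only the navigation dimensions; sig_chunks=-1 keeps the whole
# energy axis in one chunk so individual spectra are never split.
sands_ref_eds.rechunk(nav_chunks=(96, 128), sig_chunks=-1)
print(sands_ref_eds.data.chunksize)
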
CSSFrancis commented 3 weeks ago

Just another potential thought, but dask had a bug at the beginning of the year which would also affect the ripple file loader, so make sure you have the most recent version of dask.

I have some other comments on the chunking but have a meeting in a bit so I'll come back to this.

https://github.com/hyperspy/rosettasciio/issues/266
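To check which dask version is installed, e.g.:

import dask
print(dask.__version__)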

jeinsle commented 3 weeks ago

@CSSFrancis it might make sense to organise a call and chat about dask stuff, as I think there are a few items that I do not know what to call, and I think I have missed something in my slapdash reading of the docs. As noted, saving immediately on opening helps, but now I am running into problems when I try to scale the data and prepare for some PCA/clustering pipelining.

ericpre commented 3 weeks ago

> At the moment I have successfully run the save command, but it took about 1.5 hours to save, as I think it was running in a single-threaded state. It would be nice if this could be sped up... I think it should be possible given that we are using dask arrays.

This is not surprising; this is most likely a limitation with h5py. With zspy, it will be much faster.
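For example, a minimal sketch of an open-and-save-to-zspy round trip (untested; the file names follow the snippets above and the Blosc compressor choice is just an illustration, the default is fine too):

import hyperspy.api as hs
from numcodecs import Blosc

s = hs.load('Montaged_Map_Data-Sand_LA_EDS_500x_Montage_Data.rpl', lazy=True).T
# zarr can write chunks in parallel, unlike the single-threaded h5py writer
s.save('ref_sands_calibrated_eds_map.zspy', compressor=Blosc(cname='zstd', clevel=3))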

@CSSFrancis, you are right, it would be good to add the chunks and distributed keywords to the ripple reader.

jeinsle commented 3 weeks ago

@ericpre this is good to know, but really the time comment was more on the speed than the root cause, as I was working with some rather big datasets this summer and not running into the kernel crashing... it is just odd. But I am going to do a test now of opening and immediately saving with zspy, then merging in some metadata, and see if that works better.

ericpre commented 3 weeks ago

> @ericpre this is good to know, but really the time comment was more on the speed than the root cause, as I was working with some rather big datasets this summer and not running into the kernel crashing... it is just odd.

Are you saying the same process wasn't crashing this summer but it is now? As @CSSFrancis mentioned above, it could be due to a regression in dask? Reading #266, the dask versions with the bug are between 2024.2.0 and 2024.6.0.

If you want to save some metadata without having to rewrite the whole file, you can use write_dataset=False to avoid having to overwrite the dataset.
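A minimal sketch of that (untested), assuming the dataset was already written to the zspy file from the workaround above:

import hyperspy.api as hs

s = hs.load('Montaged_Map_Data-Sand_LA_EDS_500x_Montage_Data.zspy', lazy=True)
s.metadata.General.title = 'Sand LA EDS montage'  # hypothetical metadata edit
# With write_dataset=False only the metadata/axes/etc. are rewritten;
# the large data array in the existing file is left untouched.
s.save('Montaged_Map_Data-Sand_LA_EDS_500x_Montage_Data.zspy', overwrite=True, write_dataset=False)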

CSSFrancis commented 3 weeks ago

> @CSSFrancis it might make sense to organise a call and chat about dask stuff, as I think there are a few items that I do not know what to call, and I think I have missed something in my slapdash reading of the docs. As noted, saving immediately on opening helps, but now I am running into problems when I try to scale the data and prepare for some PCA/clustering pipelining.

Sure, I'd be willing to set up a video call; it might be good to discuss a couple of things. I'm not overly surprised that the PCA/clustering doesn't perform ideally with 300+ GB. The dask code for running PCA is a bit less efficient than it could be; I think the dask-ml function might work faster.

One thing to consider fairly seriously is whether you need to run PCA on the entire dataset, or whether you could run it on a subset and then apply that to the entire dataset. As dask-ml puts it, "Not everyone needs scalable ML"; tools like sampling can be effective.
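A rough sketch of that idea in hyperspy terms (untested; the subsampling stride and number of components are arbitrary, and algorithm='PCA' should dispatch to dask-ml for lazy signals if I remember correctly):

# Fit the decomposition on a spatial subsample rather than every pixel
subset = sands_ref_eds.inav[::4, ::4]
subset.decomposition(algorithm='PCA', output_dimension=16)
subset.plot_explained_variance_ratio()

Applying the components learned on the subset back to the full dataset is the part worth discussing on a call.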

jeinsle commented 3 weeks ago

> @ericpre this is good to know, but really the time comment was more on the speed than the root cause, as I was working with some rather big datasets this summer and not running into the kernel crashing... it is just odd.

> Are you saying the same process wasn't crashing this summer but it is now? As @CSSFrancis mentioned above, it could be due to a regression in dask? Reading #266, the dask versions with the bug are between 2024.2.0 and 2024.6.0.

> If you want to save some metadata without having to rewrite the whole file, you can use write_dataset=False to avoid having to overwrite the dataset.

Hi @ericpre, yeah, I had a slightly larger map this summer that I managed to just load and save as an .hspy file without any bother. I even managed to sum along the signal axis and then save that output (which was significantly smaller).

That said, last night I tried resaving the data using zspy like this:

from datetime import datetime

sands_ref_eds = hs.load('Montaged_Map_Data-Sand_LA_EDS_500x_Montage_Data.rpl', lazy=True).T
start = datetime.now()

sands_ref_eds.save('ref_sands_calibrated_eds_map_r_007.zspy')

finish = datetime.now()
print('run time =', finish - start)

As you can see from this screenshot, it did run and I can reload the file, but the kernel had managed to crash (I cleared that message when I got in).

(screenshot)

So I am not really sure what is happening here. I am even more confused, as I had not updated my environment until yesterday to address the possible dask version issue (I was previously on a non-recommended dask version).

Question: does the write_dataset=False keyword also write the axes scaling data (i.e. copy over the axes manager)?

jeinsle commented 3 weeks ago

> @CSSFrancis it might make sense to organise a call and chat about dask stuff, as I think there are a few items that I do not know what to call, and I think I have missed something in my slapdash reading of the docs. As noted, saving immediately on opening helps, but now I am running into problems when I try to scale the data and prepare for some PCA/clustering pipelining.

> Sure, I'd be willing to set up a video call; it might be good to discuss a couple of things. I'm not overly surprised that the PCA/clustering doesn't perform ideally with 300+ GB. The dask code for running PCA is a bit less efficient than it could be; I think the dask-ml function might work faster.

> One thing to consider fairly seriously is whether you need to run PCA on the entire dataset, or whether you could run it on a subset and then apply that to the entire dataset. As dask-ml puts it, "Not everyone needs scalable ML"; tools like sampling can be effective.

@CSSFrancis to clarify, the comment on my ML pipeline was more of a "this is where I am going". I agree that some sampling might be in order; the question becomes how best to sample when a dataset is super heterogeneous. For this one I have some ideas that will need to leverage what I built this summer. Regardless, the issue is still related to how best to convert a large ripple file into some kind of dask array.

Note: part of the reason this montage ripple file is so big is that Oxford Instruments uses a particularly nonsensical naming convention for the individual tiles, which makes it hard to export each tile of the dataset independently.

ericpre commented 3 weeks ago

Almost 3h still sounds one order of magnitude too slow...

> Question: does the write_dataset=False keyword also write the axes scaling data (i.e. copy over the axes manager)?

This should be clarified in the docstring: write_dataset only controls writing the numpy or dask array; everything else, including the axes_manager, will be overwritten.

jeinsle commented 3 weeks ago

@ericpre agreed, this is actually slower than what I got using the .hspy extension.

So what I can get from the ripple file is essentially the dask array with no metadata. We have written a small script which rips the data out of the h5oina file and then maps it to the hyperspy keys. It has been at this step where things have gone sideways.
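For what it's worth, a rough sketch of the kind of mapping script meant here (untested; the h5oina group/dataset names and the file name are hypothetical placeholders, and the axis order and units may need checking):

import h5py

# Hypothetical h5oina layout -- adjust the paths to the real file
with h5py.File('Sand_LA_EDS_500x_Montage.h5oina', 'r') as f:
    header = f['1/EDS/Header']
    x_step = header['X Step'][()]            # assumed step per pixel
    y_step = header['Y Step'][()]
    ch_width = header['Channel Width'][()]   # assumed eV per channel
    start_ch = header['Start Channel'][()]

# Map the calibration onto the signal loaded from the ripple file
nav_x, nav_y = sands_ref_eds.axes_manager.navigation_axes
nav_x.scale, nav_x.units = x_step, 'um'
nav_y.scale, nav_y.units = y_step, 'um'
energy = sands_ref_eds.axes_manager.signal_axes[0]
energy.scale = ch_width / 1000               # keV per channel
energy.offset = start_ch / 1000
energy.name, energy.units = 'Energy', 'keV'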

jeinsle commented 3 weeks ago

> Almost 3h still sounds one order of magnitude too slow...

> Question: does the write_dataset=False keyword also write the axes scaling data (i.e. copy over the axes manager)?

> This should be clarified in the docstring: write_dataset only controls writing the numpy or dask array; everything else, including the axes_manager, will be overwritten.

@ericpre yesterday I tried resaving the file with the automatic chunk size (i.e. essentially a stack of quasi energy-filtered images), which is how it finally did successfully save. However, this time I then added in all the metadata and axes manager information and saved as a new file. This time it took over 5 hours, for just adding in less than a megabyte's worth of information? (screenshot)

Any thoughts? This seems to have significantly slowed down after updating dask as recommended above.