tlambert03 / nd2

Full-featured nd2 (Nikon NIS Elements) file reader for Python. Outputs to numpy, dask, and xarray. Exhaustive metadata extraction.
https://tlambert03.github.io/nd2
BSD 3-Clause "New" or "Revised" License

issues with dask arrays returned by nd2: depend on ND2File being open, can't be pickled #19

Closed: VolkerH closed this issue 2 years ago

VolkerH commented 2 years ago

Hi Talley,

a couple of observations regarding the dask arrays returned by nd2. None of this is terribly surprising (so you are probably aware of it), and I am not expecting that it can really be fixed. But maybe a few lines in the readme could save users some headaches.

keep ND2File object in open() state when working with dask array

For my use case I created a little helper function (pseudo-code):

import nd2

def provide_tiles_as_daskarray_from_nd2(filepath):
    with nd2.ND2File(filepath) as f:
        arr = f.to_dask()
        # do a few more things on the array: cropping, sorting dimensions
        return arr  # note: f is closed as soon as we leave the with block

I have similar functions for other file types, e.g. something like provide_tiles_as_daskarray_from_folder_of_imgs(filepath).

At first sight, this works. I used this function from a Jupyter notebook and it seemed to return a working dask array, until you do anything that forces a compute(), at which point the Python kernel crashes without a traceback. The issue is that f is closed when returning out of the context manager. The same thing happens if you don't use a context manager and call f.close() before returning. Again, not terribly surprising. When running in the VS Code debugger I actually got a traceback and saw that there is a segmentation fault when the nd2 library tries to read data.

My workaround is to not close f, which is a bit dirty, as there will be open file handles hanging around, possibly causing memory leaks.
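
Concretely, a keep-it-open version looks roughly like this (sketch only; the file name and helper name are made up):

import nd2

def provide_tiles_as_daskarray_keep_open(filepath):
    f = nd2.ND2File(filepath)   # deliberately not closed here
    arr = f.to_dask()
    # do a few more things on the array: cropping, sorting dimensions
    return arr, f               # hand the handle back so the caller can close it later

arr, f = provide_tiles_as_daskarray_keep_open("samplefile.nd2")
arr[..., :10].compute()         # fine while f is still open
f.close()                       # only safe once all computation is done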

cloudpickle

The dask arrays returned by nd2 can't be serialized with cloudpickle because the ND2File object is not picklable:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-8a76360da592> in <module>
----> 1 cloudpickle.dump(array,f)

~/miniconda3/envs/napari_latest/lib/python3.9/site-packages/cloudpickle/cloudpickle_fast.py in dump(obj, file, protocol, buffer_callback)
     53         compatibility with older versions of Python.
     54         """
---> 55         CloudPickler(
     56             file, protocol=protocol, buffer_callback=buffer_callback
     57         ).dump(obj)

~/miniconda3/envs/napari_latest/lib/python3.9/site-packages/cloudpickle/cloudpickle_fast.py in dump(self, obj)
    561     def dump(self, obj):
    562         try:
--> 563             return Pickler.dump(self, obj)
    564         except RuntimeError as e:
    565             if "recursion" in e.args[0]:

~/miniconda3/envs/napari_latest/lib/python3.9/site-packages/nd2/_sdk/latest.cpython-39-x86_64-linux-gnu.so in nd2._sdk.latest.ND2Reader.__reduce_cython__()

TypeError: no default __reduce__ due to non-trivial __cinit__

Again, not terribly surprising. However, the implication is (I haven't tried it, but I am fairly certain this is the case) that the dask arrays returned by nd2 cannot be used in a dask.distributed context, where the pickled array needs to be sent to the various worker processes.
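
For example, I would expect something along these lines to fail with the same TypeError (just a sketch, not actually run):

from dask.distributed import Client

client = Client()                                    # spawns local worker processes
arr = provide_tiles_as_daskarray_from_nd2("samplefile.nd2")
future = client.compute(arr)                         # the graph has to be pickled to reach the workers
result = future.result()                             # expected to hit the TypeError above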

A workaround for both would be to create a temporary .zarr file (pseudo-code):

import dask.array
import nd2

def provide_tiles_as_daskarray_from_nd2(filepath):
    with nd2.ND2File(filepath) as f:
        arr = f.to_dask()
        # do a few more things on the array: cropping, sorting dimensions
        arr.to_zarr("_tmpfile.zarr")  # materialise while the nd2 file is still open
    # leaving the with block closes the file; the zarr store no longer depends on it
    return dask.array.from_zarr("_tmpfile.zarr")

with the obvious drawback of duplicating the required storage space.
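
A slightly tidier variant of the same workaround (sketch, untested) writes to a temporary directory instead of a hard-coded name:

import tempfile
from pathlib import Path

import dask.array as da
import nd2

def daskarray_via_zarr(filepath):
    store = Path(tempfile.mkdtemp()) / "tiles.zarr"   # caller should clean this up eventually
    with nd2.ND2File(filepath) as f:
        f.to_dask().to_zarr(str(store))               # materialise while the nd2 file is open
    return da.from_zarr(str(store))                   # no longer tied to the ND2File handle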

I don't know whether there is much that can be done about this, but it may be useful to add a note of caution in the Readme.

VolkerH commented 2 years ago

Actually, I just looked a bit into un/pickling, and it seems that adding these two methods to the ND2File class solves the pickling problem:

    def __getstate__(self):
        # drop the reference to the unpicklable Cython reader before pickling
        state = self.__dict__.copy()
        del state['_rdr']
        return state

    def __setstate__(self, d):
        print("I'm being unpickled with these values: " + repr(d))
        self.__dict__ = d
        # recreate the reader from the stored file path and restore the open/closed state
        self._rdr = get_reader(self._path)
        if not self._closed:
            self.open()

I can now pickle the dask arrays and when I unpickle them I can run compute().
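
In other words, with those two methods in place, roughly the following round trip works (sketch; file name made up):

import pickle

import nd2

f = nd2.ND2File("samplefile.nd2")
arr = f.to_dask()

restored = pickle.loads(pickle.dumps(arr))   # __getstate__ drops the Cython reader ...
restored[..., :10].compute()                 # ... and __setstate__ re-opens it from the path
f.close()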

I will submit a PR.

tlambert03 commented 2 years ago

keep ND2File object in open() state when working with dask array

yeah, this is a tricky one... I actually wrote a wrapper over at aicsimageio for just this purpose: https://github.com/AllenCellModeling/aicsimageio/blob/main/aicsimageio/utils/dask_proxy.py

could use it here as well?

Actually, I just looked a bit into un/pickling, and it seems that adding these two methods to the ND2File class solves the pickling problem:

awesome thank you!

VolkerH commented 2 years ago

yeah, this is a tricky one... I actually wrote a wrapper over at aicsimageio for just this purpose: https://github.com/AllenCellModeling/aicsimageio/blob/main/aicsimageio/utils/dask_proxy.py

could use it here as well?

Ugh. I would have to do a bit more reading to fully understand how that proxy works, so I'm not sure whether it would work here. However, as aicsimageio also incorporates nd2, maybe I should just use aicsimageio instead of nd2 (I'd prefer not to have that dependency, though).

tlambert03 commented 2 years ago

the gist of that proxy is essentially these two methods:

    def compute(self, **kwargs: Any) -> np.ndarray:
        with self.__wrapped__._ctx_:
            return self.__wrapped__.compute(**kwargs)

    def __array__(self, dtype: str = None, **kwargs: Any) -> np.ndarray:
        with self.__wrapped__._ctx_:
            return self.__wrapped__.__array__(dtype, **kwargs)

that is, it's a dask array that, whenever you try to call compute or np.asarray, will re-open the underlying file (with self.__wrapped__._ctx_: is essentially just with ND2File(...):)
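
For reference, a generic version of that idea could look something like this (a minimal sketch using wrapt, not the actual aicsimageio code; ctx stands for any context manager that re-opens the file on entry and closes it on exit):

import wrapt

class FileBackedDaskArray(wrapt.ObjectProxy):
    """Dask array proxy that re-opens its source file around any materialisation."""

    def __init__(self, wrapped, ctx):
        super().__init__(wrapped)
        # wrapt stores proxy-local attributes under the _self_ prefix
        self._self_ctx = ctx

    def compute(self, **kwargs):
        with self._self_ctx:                  # e.g. a (re-)openable ND2File
            return self.__wrapped__.compute(**kwargs)

    def __array__(self, dtype=None, **kwargs):
        with self._self_ctx:
            return self.__wrapped__.__array__(dtype, **kwargs)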

It looks a tad risky at first, but I haven't run into any issues with it yet. In any case, I suspect the issue of trying to use a dask array after closing the file is far more common than whatever hidden issues there are with this proxy. I'm inclined to try it.

VolkerH commented 2 years ago

It looks a tad risky at first, but I haven't run into any issues with it yet. In any case, I suspect the issue of trying to use a dask array after closing the file is far more common than whatever hidden issues there are with this proxy.

I agree. Also, if you do try, you will currently just get a crash, most likely without any usable error message.

So if I understand you correctly, the idea would be that ND2File.to_dask() returns such an ObjectProxy instead of the dask.array?

tlambert03 commented 2 years ago

So if I understand you correctly, the idea would be that ND2File.to_dask() returns such an ObjectProxy instead of the dask.array?

yeah, exactly.

tlambert03 commented 2 years ago

I think #27 and #26 have improved the situation here. Closing until we get another reproducible bug.

I tried to get the segfault you showed on my Ubuntu box, but am still failing to reproduce it (even running it a few hundred times in case it's a race condition)... but let's keep an eye out for it.
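
(For illustration only, a stress loop along these lines, using one of the test files from the repo; not necessarily the exact script I ran:)

import nd2

# open, convert, close, then compute, repeated to catch an intermittent segfault
for _ in range(300):
    with nd2.ND2File("tests/data/dims_p1z5t3c2y32x32.nd2") as f:
        d = f.to_dask()
    d[..., :10].compute()   # computing after close is the pattern reported to crash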

VolkerH commented 2 years ago

Hi @tlambert03,

I didn't get around to it yesterday, but I installed the latest version, including #27, from the main branch just now. Unfortunately, on my machine this has not been an improvement: the segfault when returning the dask array from the context manager still persists.

In addition, I now also get dying kernels (probably a segfault, but I run this in a notebook, so I don't see one) when I run some more complex computations on the dask array; these worked previously, as long as I avoided returning from a context manager.

I will try and provide a reproducible example, but probably won't get around to it today. For the time being I'm holding on to an older state of the repo (https://github.com/VolkerH/nd2/commit/294e07ab858a8e1001f4240a23b686b191aa71fc) where everything except returning from the context manager works for me.

tlambert03 commented 2 years ago

ugh šŸ˜” i'm sorry. this is a really frustrating one.

I can't get it to segfault on any of my computers (tried mac, win, ubuntu20)... maybe it's a race condition that depends on dataset size?

It's certainly an issue though, since it looks like aicsimageio is also having CI issues with the new release. Will keep digging... but let me know if you hit on anything (and send files if necessary)

VolkerH commented 2 years ago

Will do. The newly observed segfault happens when I perform map_blocks on the dask array; presumably things are happening concurrently for different blocks. I hope to create a minimal example by the end of the week.

VolkerH commented 2 years ago

BTW, no need to be sorry and thanks for all the great support so far !

VolkerH commented 2 years ago

Ok, I started debugging this. The problem appears to be that the pickling doesn't work correctly if there are additional operations in the compute graph. I don't have a stand-alone example yet, but maybe the following already gives you enough information:

I have the following function that loads an nd2 file and re-sorts the dimensions (using np.einsum) before returning the array:

from os import PathLike
from pathlib import Path

import dask.array as da
from nd2 import ND2File

def load_tiles_nd2(file_path: PathLike) -> da.Array:
    file_path = Path(file_path)
    if file_path.suffix.lower() != ".nd2":
        raise ValueError("Can only read nd2 files.")
    f = ND2File(file_path)
    tile_array = f.to_dask()
    dim_order_tuple = f.sizes.keys()
    dim_order = "".join(dim_order_tuple).lower()
    # to_tczyx is my own helper; the following is basically np.einsum(f"{dim_order}->pcyx")
    ordered_tile_array = to_tczyx(tile_array, in_order=dim_order, out_order="pcyx")
    return ordered_tile_array

So when I use this as in

arr = load_tiles_nd2("samplefile.nd2")
arr[..., :10].compute()

there is no problem.

However, when I do

tmp = load_tiles_nd2("samplefile.nd2")
arr = pickle.loads(pickle.dumps(tmp))
arr[..., :10].compute()

I get a segmentation fault.

When I look at the dask graph of the ResourceBackedArray that hasn't been pickled, it looks like this:

[image: dask task graph of the array before pickling]

In contrast, the dask task graph for the ResourceBackedArray that has been pickled and regenerated by unpickling looks like this:

[image: dask task graph of the array after pickling and unpickling]

So layers 0 to 2 lose important attributes such as .shape and .chunksize when pickling/unpickling.
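
One way to compare the two graphs programmatically (rough sketch; tmp is the array returned by load_tiles_nd2 above):

import pickle

roundtripped = pickle.loads(pickle.dumps(tmp))

# .dask is the HighLevelGraph backing the array; .layers maps layer names to Layer objects
for name, layer in tmp.dask.layers.items():
    other = roundtripped.dask.layers[name]
    print(name, type(layer).__name__, "->", type(other).__name__)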

VolkerH commented 2 years ago

Arghh. Now I can't reliably reproduce the segfault with a simple [..., :10].compute() anymore; I can only reproduce it with a more complicated map_blocks. I will try to narrow it down further, but maybe the differences in the compute graph layers are relevant.

VolkerH commented 2 years ago

So the segfault seems to depend on the slicing.

tlambert03 commented 2 years ago

Thanks for digging! By the way, I got a good response on SO about the general problem of adding tasks to a dask array graph, if you're curious. Might try it here:

https://stackoverflow.com/a/70012999

VolkerH commented 2 years ago

Thanks, interesting, I didn't even think about graph optimization steps. I sent you a screen recording of a debug session in action through the napari zulip chat. Unfortunately I can't attach .webm files to GitHub.

tlambert03 commented 2 years ago

I think this has been solved over the iterations. Currently:

In [1]: import nd2

In [2]: f = nd2.ND2File('tests/data/dims_p1z5t3c2y32x32.nd2')

In [3]: d = f.to_dask()

In [4]: f.close()

In [5]: import cloudpickle

In [6]: p = cloudpickle.dumps(d)  # no problem