VolkerH closed this issue 2 years ago.
Actually, I just looked a bit into un/pickling and it seems that adding these two methods to the `ND2File` class solves the pickling problem:
```python
def __getstate__(self):
    state = self.__dict__.copy()
    del state['_rdr']
    return state

def __setstate__(self, d):
    print("I'm being unpickled with these values: " + repr(d))
    self.__dict__ = d
    self._rdr = get_reader(self._path)
    if not self._closed:
        self.open()
```
I can now pickle the dask arrays, and when I unpickle them I can run `compute()`.
I will submit a PR.
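For illustration, here is a self-contained sketch of this `__getstate__`/`__setstate__` pattern, using a hypothetical stand-in reader rather than the real nd2 internals:

```python
import pickle


class FakeReader:
    """Hypothetical stand-in for the unpicklable C-level SDK reader."""
    def __init__(self, path):
        self.path = path


def get_reader(path):
    return FakeReader(path)


class File:
    """Sketch of the pattern above: drop the reader when pickling,
    rebuild it from the stored path when unpickling."""
    def __init__(self, path):
        self._path = path
        self._closed = False
        self._rdr = get_reader(path)

    def open(self):
        self._rdr = get_reader(self._path)
        self._closed = False

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['_rdr']  # the reader holds an OS-level handle, not picklable
        return state

    def __setstate__(self, d):
        self.__dict__ = d
        self._rdr = get_reader(self._path)  # rebuild from the stored path
        if not self._closed:
            self.open()


f = File("sample.nd2")
g = pickle.loads(pickle.dumps(f))
print(g._rdr.path)  # -> sample.nd2
```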
> keep ND2File object in open() state when working with dask array
yeah, this is a tricky one... I actually wrote a wrapper over at aicsimageio for just this purpose: https://github.com/AllenCellModeling/aicsimageio/blob/main/aicsimageio/utils/dask_proxy.py
could use it here as well?
> Actually, I just looked a bit into un/pickling and it seems that adding these two methods to the ND2File class solves the pickling problem:
awesome thank you!
> yeah, this is a tricky one... I actually wrote a wrapper over at aicsimageio for just this purpose: https://github.com/AllenCellModeling/aicsimageio/blob/main/aicsimageio/utils/dask_proxy.py
> could use it here as well?
Ugh. I would have to do a bit more reading to fully understand how that proxy works, so I'm not sure whether it would work here. However, as aicsimageio also incorporates nd2, maybe I should just use aicsimageio instead of nd2 (I'd prefer not to have that dependency though).
the gist of that proxy is essentially these two lines:
```python
def compute(self, **kwargs: Any) -> np.ndarray:
    with self.__wrapped__._ctx_:
        return self.__wrapped__.compute(**kwargs)

def __array__(self, dtype: str = None, **kwargs: Any) -> np.ndarray:
    with self.__wrapped__._ctx_:
        return self.__wrapped__.__array__(dtype, **kwargs)
```
that is, it's a dask array that, whenever you try to call `compute` or `np.asarray`, will re-open the underlying file (`with self.__wrapped__._ctx_` is essentially just `with ND2File()...`)
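A minimal, self-contained sketch of that idea (a plain class instead of a wrapt `ObjectProxy`, and a hypothetical `ReopeningFile` standing in for the real file object):

```python
import numpy as np
import dask
import dask.array as da


class ReopeningFile:
    """Hypothetical file object: data is only readable while open."""
    def __init__(self, data):
        self._data = data
        self.is_open = False

    def __enter__(self):
        self.is_open = True
        return self

    def __exit__(self, *exc):
        self.is_open = False

    def read_row(self, i):
        if not self.is_open:
            raise RuntimeError("file is closed")
        return self._data[i]


class ResourceBackedArray:
    """Minimal stand-in for the dask_proxy idea: every materialization
    re-enters the file's context manager first."""
    def __init__(self, arr, ctx):
        self._arr = arr
        self._ctx = ctx

    def compute(self, **kwargs):
        with self._ctx:  # reopen the file for the duration of compute
            return self._arr.compute(**kwargs)

    def __array__(self, dtype=None):
        with self._ctx:
            return np.asarray(self._arr.compute(), dtype=dtype)


f = ReopeningFile(np.arange(6, dtype=float).reshape(2, 3))
rows = [da.from_delayed(dask.delayed(f.read_row)(i), shape=(3,), dtype=float)
        for i in range(2)]
lazy = da.stack(rows)
proxy = ResourceBackedArray(lazy, f)
# lazy.compute() here would raise RuntimeError, since the file is closed;
out = proxy.compute()  # the proxy reopens it first
print(out.sum())  # -> 15.0
```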
It looks a tad bit risky at first, but I haven't run into any issues with it yet. In any case, I suspect the issue of trying to use a dask array after closing the file is far more common than whatever hidden issues there are with this proxy. I'm inclined to try it
> It looks a tad bit risky at first, but I haven't run into any issues with it yet. In any case, I suspect the issue of trying to use a dask array after closing the file is far more common than whatever hidden issues there are with this proxy.
I agree. Also, if you do try, you will currently just see a crash, most likely without any usable error message.
So if I understand you correctly, the idea would be that `ND2File.to_dask()` returns such an ObjectProxy instead of the dask.array?
> So if I understand you correctly, the idea would be that `ND2File.to_dask()` returns such an ObjectProxy instead of the dask.array?
yeah, exactly.
I think #27 and #26 have improved the situation here. closing until we get another reproducible bug.
I tried on my ubuntu box to get the segfault you showed, but am still failing (even running it a few hundred times in case of race condition)... but let's keep an eye out for it
Hi @tlambert03,
didn't get around to it yesterday but installed the latest version including #27 from the main branch just now. Unfortunately, on my machine this has not been an improvement. The segfault issue when returning the dask array from context manager still persists.
In addition, I now also get dying kernels (probably segfault, but this is something I run in a notebook) when I run some more complex computation on the dask array that worked previously when avoiding returning from a context manager.
I will try and provide a reproducible example, but probably won't get around to it today. For the time being I'm holding on to an older state of the repo (https://github.com/VolkerH/nd2/commit/294e07ab858a8e1001f4240a23b686b191aa71fc) where everything except returning from the context manager works for me.
ugh, i'm sorry. this is a really frustrating one.
I can't get it to segfault on any of my computers (tried mac, win, ubuntu20)... maybe it's a race condition that depends on dataset size?
It's certainly an issue though, since it looks like aicsimageio is also having CI issues with the new release. Will keep digging... but let me know if you hit on anything (and send files if necessary)
Will do. The newly observed segfault happens when I perform `map_blocks` on the dask array, presumably things are happening concurrently for different blocks. I hope to create a minimal example by the end of the week.
BTW, no need to be sorry, and thanks for all the great support so far!
Ok, I started debugging this. The problem appears to be that the pickling doesn't work correctly if there are additional operations in the compute graph. I don't have a stand-alone example yet, but maybe the following already gives you enough information:
I have the following function that loads an nd2 file and performs some re-sorting of dimensions (using `np.einsum`) before returning it:
```python
def load_tiles_nd2(file_path: PathLike) -> da.Array:
    file_path = Path(file_path)
    if file_path.suffix.lower() != ".nd2":
        raise ValueError("Can only read nd2 files.")
    f = ND2File(file_path)
    tile_array = f.to_dask()
    dim_order_tuple = f.sizes.keys()
    dim_order = "".join(dim_order_tuple).lower()
    # the following is basically np.einsum(f"{dim_order}->pcyx")
    ordered_tile_array = to_tczyx(tile_array, in_order=dim_order, out_order="pcyx")
    return ordered_tile_array
```
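The `to_tczyx` helper isn't shown here; for a pure reordering of labeled axes it is presumably equivalent to a transpose, along these lines (illustrative sketch, hypothetical name and axis labels):

```python
import numpy as np


def reorder_axes(arr, in_order: str, out_order: str):
    # equivalent to np.einsum(f"{in_order}->{out_order}", arr)
    # for pure axis permutations
    return arr.transpose([in_order.index(ax) for ax in out_order])


x = np.zeros((2, 3, 4, 5))  # axes labeled "pycx" (made-up example)
y = reorder_axes(x, "pycx", "pcyx")
print(y.shape)  # -> (2, 4, 3, 5)
```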
So when I use this as in

```python
arr = load_tiles_nd2("samplefile.nd2")
arr[..., :10].compute()
```

there is no problem.
However, when I do

```python
tmp = load_tiles_nd2("samplefile.nd2")
arr = pickle.loads(pickle.dumps(tmp))
arr[..., :10].compute()
```

I get a segmentation fault.
The dask task graph of the `ResourceBackedArray` that hasn't been pickled looks like this:
In contrast, the dask task graph of the `ResourceBackedArray` that has been pickled and regenerated by unpickling looks like this:
So the layers `0` to `2` lose important attributes such as `.shape` and `.chunksize` when pickling/unpickling.
Arghh. Now I can't reliably reproduce the segfault with a simple `[..., :10].compute()` anymore; I can only reproduce it with a more complicated `map_blocks`. Will try to narrow it down further, but maybe the differences in the compute graph layers are relevant.
So the segfault seems to depend on the slicing. `tile_arr[..., :10].compute()` works; this slice covers multiple source chunks, if I am not mistaken. `tile_arr[0, 0, ...].compute()` causes a segfault; this slice only covers a single source chunk.

Thanks for digging! By the way, I got a good response on SO about the general problem of adding tasks to a dask array graph, if you're curious; might try it here:
Thanks, interesting, I didn't even think about graph optimization steps.
I sent you a screen recording of a debug session in action through the napari zulip chat. Unfortunately, I can't attach `.webm` files to GitHub.
I think this has been solved over the iterations. Currently:

```python
In [1]: import nd2
In [2]: f = nd2.ND2File('tests/data/dims_p1z5t3c2y32x32.nd2')
In [3]: d = f.to_dask()
In [4]: f.close()
In [5]: import cloudpickle
In [6]: p = cloudpickle.dumps(d)  # no problem
```
Hi Talley,

a couple of observations regarding the dask arrays returned by `nd2`. None of this is terribly surprising (so you are probably aware), and I am not expecting this can really be fixed. But maybe some lines in the readme file could save users some headaches.

### keep ND2File object in open() state when working with dask array
For my use case I created a little helper function (pseudo-code):
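The original pseudo-code was not preserved here; the following is a hedged reconstruction of the pattern being described, with a plain stand-in resource instead of `ND2File` (which segfaults rather than raising):

```python
import numpy as np
import dask
import dask.array as da


class Resource:
    """Hypothetical stand-in for ND2File: readable only while open."""
    def __init__(self):
        self._open = True

    def close(self):
        self._open = False

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()

    def read(self):
        if not self._open:
            raise RuntimeError("read after close")  # nd2 segfaults instead
        return np.ones(4)


def provide_daskarray():
    # mirrors the helper described above: build a lazy array inside a
    # context manager, then return it after the file has been closed
    with Resource() as r:
        return da.from_delayed(dask.delayed(r.read)(), shape=(4,), dtype=float)


arr = provide_daskarray()
try:
    arr.compute()
except RuntimeError as e:
    print(e)  # -> read after close
```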
I have similar functions for other file types, e.g. something like `provide_tiles_as_daskarray_from_folder_of_imgs(filepath)`.

On first sight, this works: I used this function from a Jupyter notebook and it seemed to return a working dask array. Until you do anything that forces a `compute()`, in which case the python kernel crashes without a traceback. The issue is that `f` is closed when returning out of the context manager. The same thing happens if you don't use a context manager and call `f.close()` before returning. Again, not terribly surprising. When running in the VS Code debugger I actually saw a traceback showing a segmentation fault when the nd2 library tries to read data.

My workaround is to not close `f`, which is a bit dirty, as there will be open file handles hanging around, possibly causing memory leaks.

### cloudpickle
The dask arrays returned by `nd2` can't be serialized using cloudpickle, due to the ND2File object not being pickleable. Again, not terribly surprising. However, the implication is (I haven't tried it, but am fairly certain this is the case) that the dask arrays returned by nd2 cannot be used in a `dask.distributed` context, where the pickled array needs to be sent to the various worker processes.

A workaround for both would be to create a temporary .zarr file (pseudo-code):
with the obvious drawback of duplicating the required storage space.
I don't know whether there is much that can be done about this, but it may be useful to add a note of caution in the Readme.