saeyslab / napari-sparrow

https://sparrow-pipeline.readthedocs.io/en/latest/

dask: make allocation possible #101

Closed: lopollar closed this issue 1 year ago

lopollar commented 1 year ago

Currently we read the CSV into pandas, but that doesn't scale to big data. Ideally everything would run in dask. However:

```python
df["cells"] = masks[
    df["global_y"].values.astype(int) - ic.data.attrs["coords"].y0,
    df["global_x"].values.astype(int) - ic.data.attrs["coords"].x0,
]
```

doesn't work in dask, because the fancy-indexing result is a NumPy array, and that can't be assigned directly as a column of a dask DataFrame.
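For reference, the pointwise lookup the snippet performs is fine in plain pandas/NumPy. A minimal sketch on toy data (names mirror the issue; the toy `masks`, offsets, and coordinates are made up for illustration):

```python
import numpy as np
import pandas as pd

masks = np.array([[0, 1],
                  [2, 3]])  # toy label image standing in for the segmentation masks
y0, x0 = 0, 0               # toy offsets standing in for ic.data.attrs["coords"]

df = pd.DataFrame({"global_y": [0.0, 1.0], "global_x": [1.0, 0.0]})

# NumPy fancy indexing with two integer arrays does a pointwise lookup:
# one mask label per (y, x) coordinate pair. The result is a plain ndarray,
# which pandas happily takes as a new column.
df["cells"] = masks[df["global_y"].values.astype(int) - y0,
                    df["global_x"].values.astype(int) - x0]
```

With a dask DataFrame the same assignment fails, because a plain ndarray carries no partition/index information that dask could align with.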

```python
df["cells"] = dd.from_array(masks[
    df["y"].values.astype(int) - ic.data.attrs["coords"].y0,
    df["x"].values.astype(int) - ic.data.attrs["coords"].x0,
])
```

doesn't work because the index of the new column doesn't align with the DataFrame's divisions,

```python
df["cells"] = dd.from_array(masks[
    df["y"].values.astype(int) - ic.data.attrs["coords"].y0,
    df["x"].values.astype(int) - ic.data.attrs["coords"].x0,
]).reset_index().set_index("index")
```

also doesn't work,

and dask arrays don't support plain fancy indexing with multiple integer arrays at once, so this doesn't work either:

```python
dask.array.from_array(masks)[
    df["y"].values.astype(int) - ic.data.attrs["coords"].y0,
    df["x"].values.astype(int) - ic.data.attrs["coords"].x0,
]
```
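For what it's worth, dask arrays do expose pointwise fancy indexing through the `.vindex` accessor, which is exactly the "one element per coordinate pair" lookup needed here. A minimal sketch on toy data (the toy array stands in for the real `masks`):

```python
import numpy as np
import dask.array as da

arr = np.arange(9).reshape(3, 3)        # toy label image
labels = da.from_array(arr, chunks=2)   # chunked like the real masks would be

# .vindex does pointwise (coordinate-wise) indexing: it picks
# arr[0, 1] and arr[1, 2], not an outer product of the two index lists.
vals = labels.vindex[[0, 1], [1, 2]].compute()
```

This gives back the labels at the two coordinates, although the result is still a dask/NumPy array, so the column-assignment problem above remains.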

My current solution is to call `compute()` right before this step, but I am afraid this will still overload memory on the full Vizgen dataset.

lopollar commented 1 year ago

This was fixed by @ArneDefauw.