Closed by lopollar 1 year ago
Currently, we read the CSV into a pandas DataFrame, but this doesn't scale to big data. Ideally everything would run in dask; however:
df["cells"] = masks[ df['global_y'].values.astype(int) - ic.data.attrs["coords"].y0, df['global_x'].values.astype(int) - ic.data.attrs["coords"].x0, ]
doesn't work in dask, because the result is a NumPy array, which can't be assigned directly as a column of a dask DataFrame.
df['cells']=dd.from_array(masks[ df['y'].values.astype(int) - ic.data.attrs["coords"].y0, df['x'].values.astype(int) - ic.data.attrs["coords"].x0 ])
This doesn't work because of indexing issues.
df['cells']=dd.from_array(masks[ df['y'].values.astype(int) - ic.data.attrs["coords"].y0, df['x'].values.astype(int) - ic.data.attrs["coords"].x0 ]).reset_index().set_index('index')
This also doesn't work.
Dask arrays themselves don't support indexing with several index arrays at once, so this doesn't work either:
dask.array.from_array(masks)[ df['y'].values.astype(int) - ic.data.attrs["coords"].y0, df['x'].values.astype(int) - ic.data.attrs["coords"].x0 ]
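For reference, dask arrays do expose `vindex` for point-wise lookups of this kind. The following is only a minimal sketch on hypothetical toy data (the small `masks` array and the `y`/`x` coordinates are made up), and it assumes the coordinates are already available as in-memory NumPy arrays, which is not the case when they come from a lazy dask DataFrame:

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in for the segmentation mask (one label per pixel).
masks = np.arange(20).reshape(4, 5)

# Hypothetical pixel coordinates, already shifted by y0 / x0.
y = np.array([0, 2, 3])
x = np.array([1, 4, 0])

lazy_masks = da.from_array(masks, chunks=(2, 5))

# Plain lazy_masks[y, x] is the form reported above as unsupported;
# vindex performs the point-wise lookup (y[i], x[i]) instead.
cells = lazy_masks.vindex[y, x]

result = cells.compute()
print(result)
```

This keeps the mask lazy, but the index arrays still have to be materialised first, so on its own it does not remove the need to compute the coordinates.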
My current workaround is to call compute() right before this step, but I am afraid this will still exhaust memory on the whole Vizgen dataset.
This is fixed by @ArneDefauw