ratt-ru / dask-ms

Implementation of a dask/xarray dataset backed by a CASA MS
https://dask-ms.readthedocs.io

Investigate use of dask Array meta attributes to handle lists internally #42

Closed sjperkins closed 4 years ago

sjperkins commented 5 years ago

In the case of non-numeric data (primarily strings), python-casacore returns lists of objects from getcol-type commands and expects lists in putcol-type commands.

This conflicts a little with dask, as it expects ndarrays to represent the data in each chunk. Historically, simply using lists to represent chunk data would cause dask's internal array stitching operations to break. In the getcol case, this can be worked around by casting the list of objects to an ndarray of objects.
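As a rough illustration of the getcol workaround, the sketch below casts a flat list of strings to a 1D ndarray of objects. The helper name and the sample values are made up for the example; the list simply stands in for whatever a getcol call on a string column might return.

```python
import numpy as np


def list_to_object_array(values):
    """Cast a flat list of Python objects to a 1D ndarray of objects.

    Hypothetical helper, for illustration only.
    """
    # dtype=object keeps the original Python str objects instead of
    # letting NumPy build a fixed-width unicode array.
    return np.asarray(values, dtype=object)


# Stand-in for the list a getcol call on a string column might return.
names = ["ANT1", "ANT2", "ANT3"]
chunk = list_to_object_array(names)
```

The resulting array has an ndarray interface (shape, dtype, slicing), which is what dask expects of the data in each chunk.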

In the putcol case, one needs to convert dask's ndarrays of objects back to a list with data.tolist(). Without this conversion, python-casacore can segfault. This works (for 1D lists at least) and so we have a workaround.
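The putcol direction can be sketched the same way. The helper name and the column name in the commented-out call are assumptions for the example, and only the 1D case is shown, in line with the caveat above.

```python
import numpy as np


def object_array_to_list(chunk):
    """Convert a 1D ndarray of objects back to a plain Python list.

    Hypothetical helper, for illustration only.
    """
    return chunk.tolist()


# A chunk as dask would hand it over: an ndarray of Python str objects.
chunk = np.asarray(["ANT1", "ANT2"], dtype=object)
payload = object_array_to_list(chunk)

# The list, not the ndarray, would then be passed on, e.g.:
# table.putcol("NAME", payload)  # illustrative call site, not executed here
```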

dask 2.0.0 added a meta attribute to dask Arrays, which contains metadata describing the type and dimensionality of the data representing each chunk (see https://github.com/dask/dask/issues/4070). In future, it may be possible to use this to properly handle lists as chunks.
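For reference, a small sketch of inspecting that metadata on an array of objects. It assumes dask >= 2.0 and reads the private `_meta` attribute, which is not a stable public interface and may change between releases.

```python
import numpy as np
import dask.array as da

# A dask array built from an ndarray of Python str objects, two per chunk.
strings = np.asarray(["a", "b", "c", "d"], dtype=object)
darr = da.from_array(strings, chunks=2)

# _meta is a private attribute (assumption: present in dask >= 2.0).
# It is an empty array-like with the same type, dtype and number of
# dimensions as the data held by each chunk.
meta = darr._meta
print(type(meta), meta.dtype, meta.ndim)
```

Here the meta is itself an ndarray, so it describes array-like chunks rather than plain lists.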

sjperkins commented 5 years ago

This issue serves to describe the problem, the current workarounds and possible proper fixes in future.

sjperkins commented 4 years ago

Not possible with lists