ratt-ru / dask-ms

Implementation of a dask/xarray dataset backed by a CASA MS
https://dask-ms.readthedocs.io
Other

Issue 176 #194

Closed JSKenyon closed 2 years ago

JSKenyon commented 2 years ago

This PR adds support for index_columns during conversion from measurement set to parquet/zarr. This means that output data can be written with a different ordering than the input data. This is not necessarily efficient, as it uses TAQL and fragmented reads to accomplish the reordering.
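To illustrate why reordering can lead to fragmented reads, here is a minimal sketch in plain Python. It does not use dask-ms or TAQL; the rows, the column choice (TIME, ANTENNA1, ANTENNA2) and the helper names `sorted_row_ids` and `contiguous_runs` are all hypothetical and exist only for this example.

```python
# Hypothetical rows of (TIME, ANTENNA1, ANTENNA2); not real measurement set data.
rows = [
    (2.0, 0, 1),
    (1.0, 0, 2),
    (2.0, 0, 2),
    (1.0, 0, 1),
    (1.0, 1, 2),
    (2.0, 1, 2),
]


def sorted_row_ids(rows, key_positions):
    """Return input row numbers in the order obtained by sorting on the key columns."""
    return sorted(
        range(len(rows)),
        key=lambda i: tuple(rows[i][k] for k in key_positions),
    )


def contiguous_runs(row_ids):
    """Group row ids into runs of consecutive rows; each run is one contiguous read."""
    runs = []
    for r in row_ids:
        if runs and r == runs[-1][-1] + 1:
            runs[-1].append(r)
        else:
            runs.append([r])
    return runs


# Index on TIME, then ANTENNA1, then ANTENNA2.
order = sorted_row_ids(rows, key_positions=(0, 1, 2))
runs = contiguous_runs(order)

print(order)      # input row numbers in output order
print(len(runs))  # number of separate contiguous reads needed
```

When the input is already sorted on the index columns, the whole table is a single run. The more the requested ordering differs from the on-disk ordering, the more small reads are needed.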

In addition to exposing this feature, this PR fixes incorrect conversion of boolean values when using parquet. The bug arose because pyarrow represents boolean values as bits whereas Python (via NumPy) represents them as bytes. The fix in this PR works but is a prelude to a slight rewrite of the arrow extensions. Currently, the extensions use from_buffer to convert from arrow to numpy. This relies on memory layout (hence the aforementioned bug) and may be unnecessary. A future PR will instead use array.storage.flatten in conjunction with to_numpy and reshape to move between representations. This should be less vulnerable to discrepancies in memory layout.
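The layout mismatch can be shown without pyarrow. The sketch below packs booleans one bit per value, least-significant bit first, which is how Arrow lays out boolean data, and compares that with a one-byte-per-value layout. The helper `pack_bits_lsb` is hypothetical and is not part of dask-ms or pyarrow.

```python
def pack_bits_lsb(values):
    """Pack booleans into bytes, least-significant bit first (Arrow-style layout)."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


flags = [True, False, True, True, False, False, False, False, True, True]

packed = pack_bits_lsb(flags)  # one bit per value
bytewise = bytes(flags)        # one byte per value, as a NumPy bool array stores them

# Treating the bit-packed buffer as one boolean per byte gives the wrong
# length and the wrong values.
naive = [b != 0 for b in packed]

print(len(packed), len(bytewise))
print(naive)
```

Ten values occupy two bytes when bit-packed and ten bytes when stored bytewise, so any conversion that assumes the two layouts match will misread the data.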

JSKenyon commented 2 years ago

I changed my mind - I imagine that the use of buffers is future-proofing against ragged data. I figured out how to use buffers with booleans and have implemented a fix. Note that this is still not a zero-copy operation in the case of booleans, as we have to move from bit to byte representation.
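One way to go from a bit-packed buffer to a bytewise boolean array is sketched below. This is not necessarily the fix implemented in this PR. The function name `bits_to_bool` and its `length` and `offset` parameters are assumptions made for the example, and the input is a hand-built bytes object rather than a real Arrow buffer.

```python
import numpy as np


def bits_to_bool(buf, length, offset=0):
    """Expand an LSB-first bit-packed buffer into a NumPy bool array.

    This always allocates new memory: one byte per value instead of one bit.
    """
    packed = np.frombuffer(buf, dtype=np.uint8)
    # Arrow packs booleans least-significant bit first, hence bitorder="little".
    bits = np.unpackbits(packed, bitorder="little")
    return bits[offset:offset + length].astype(bool)


# Hand-built buffer holding ten bit-packed values (hypothetical data).
packed = bytes([0x0D, 0x03])
result = bits_to_bool(packed, length=10)

print(result)
```

With pyarrow, the buffer for a primitive boolean array would presumably come from `array.buffers()[1]`, with the array's `offset` and length passed through. Extension and nested arrays carry additional buffers, so that part is left as an assumption here.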