``arcae`` implements a limited subset of functionality from the more mature python-casacore_ package. It bypasses some existing limitations in python-casacore to provide safe, multi-threaded access to CASA formats, thereby enabling export into newer cloud-native formats such as Apache Arrow and Zarr.

casacore and the python-casacore_ Python bindings provide access to the CASA Table Data System (CTDS) and Measurement Sets created within this system. The CTDS, as of casacore 3.5.0, is subject to the following limitations:

* Access from multiple threads is unsafe.
* python-casacore doesn't drop the Global Interpreter Lock.

Resolving these concerns is potentially a major effort, involving invasive changes across the CTDS system.
In the time since the CTDS was developed, newer, open-source formats such as Apache Arrow and Zarr have been developed that are suitable for representing Radio Astronomy data.

Arrow supports both 1D arrays and nested structures:

* Fixed-shape multi-dimensional data is represented with nested `FixedSizeListArrays <fixed_size_list_layout_>`_.
* Variably shaped multi-dimensional data is represented with nested `ListArrays <variable_size_list_layout_>`_.
* Complex values are represented as an extra `FixedSizeListArray <fixed_size_list_layout_>`_ nesting of two floats.
* It is not trivially possible to convert between these structures and NumPy arrays with ``to_numpy`` calls on Arrow Arrays, but it is relatively trivial to reinterpret the underlying data buffers from either API. This is done transparently in the ``getcol`` and ``putcol`` functions (see usage below).

Going forward, `FixedShapeTensorArray <fixed_shape_tensor_array_>`_ and `VariableShapeTensorArray <variable_shape_tensor_array_>`_ will provide more ergonomic structures for representing multi-dimensional data. First-class support for complex values in Apache Arrow will require implementing a `C++ extension type <cpp_extension_type_>`_ within Arrow itself.

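
The buffer reinterpretation mentioned above can be illustrated with NumPy alone. This is a minimal sketch, not arcae's implementation: it assumes a ``complex64`` column and shows that the same bytes can be viewed as (real, imaginary) pairs of ``float32``, which is the shape of the "nesting of two floats" layout, and viewed back again without copying. The variable names are hypothetical.

```python
import numpy as np

# Three complex visibilities, standing in for values from a complex column.
vis = np.array([1 + 2j, 3 - 4j, 0.5 + 0j], dtype=np.complex64)

# Reinterpret the same bytes as float32 (real, imag) pairs: shape (3,) -> (3, 2).
# No data is copied; only the dtype and shape of the view change.
pairs = vis.view(np.float32).reshape(-1, 2)

# Reinterpret the pairs as complex values again, still without copying.
roundtrip = pairs.reshape(-1).view(np.complex64)

# Both views share memory with the original array.
assert np.shares_memory(vis, pairs)
assert np.array_equal(roundtrip, vis)
```

Because both directions are views, a write through ``pairs`` is visible in ``vis``, which is why this kind of reinterpretation is cheap compared with a converting copy.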
Some other edge cases have not yet been implemented, but could be with some thought.

Binary wheels are provided for Linux and macOS on both x86_64 and arm64 architectures:

.. code-block:: bash

    $ pip install arcae

Example usage with Arrow Tables:

.. code-block:: python

    import json
    from pprint import pprint

    import arcae
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Obtain (partial) Apache Arrow Table from a CASA Table
    casa_table = arcae.table("/path/to/measurementset.ms")
    arrow_table = casa_table.to_arrow()  # read entire table
    arrow_table = casa_table.to_arrow(index=(slice(10, 20),))  # read rows 10 to 19
    assert isinstance(arrow_table, pa.Table)

    # Print JSON-encoded Table and Column keywords
    pprint(json.loads(arrow_table.schema.metadata[b"__arcae_metadata__"]))
    pprint(json.loads(arrow_table.schema.field("DATA").metadata[b"__arcae_metadata__"]))

    # Write the Arrow Table to a Parquet file
    pq.write_table(arrow_table, "measurementset.parquet")

Some reading and writing functionality from python-casacore_ is replicated,
with added support for some `NumPy Advanced Indexing <numpy_advanced_indexing_>`_.

.. code-block:: python

    import arcae

    casa_table = arcae.table("/path/to/measurementset.ms", readonly=False)
    # Get rows 10 and 2, channels 16 up to (but not including) 32, and all correlations
    data = casa_table.getcol("DATA", index=([10, 2], slice(16, 32), None))
    # Write some modified data back
    casa_table.putcol("DATA", data + 1*1j, index=([10, 2], slice(16, 32), None))

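
To make the shape of such a selection concrete, the sketch below applies the equivalent index to a plain NumPy array rather than to a CASA table. It is an illustration under assumptions: the ``(row, channel, correlation)`` dimensions are made up, and the ``None`` entry in the arcae index, which the comment above describes as "all correlations", is written as ``:`` because ``None`` means something different in NumPy itself.

```python
import numpy as np

# Stand-in for a DATA column with hypothetical dimensions
# (row, channel, correlation) = (20, 64, 4).
nrow, nchan, ncorr = 20, 64, 4
data = np.arange(nrow * nchan * ncorr, dtype=np.float64).reshape(nrow, nchan, ncorr)

# Rows 10 and 2 (in that order), channels 16..31, all correlations.
rows = [10, 2]
selection = data[rows, 16:32, :]

# Write modified values back to the same positions of a copy.
updated = data.copy()
updated[rows, 16:32, :] = selection + 1
```

The result has shape ``(2, 16, 4)``, and the row order follows the index list, so ``selection[0]`` holds row 10 and ``selection[1]`` holds row 2.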
See the test cases for further use cases.

Install the ``applications`` optional extra.

.. code-block:: bash

    pip install arcae[applications]

Then, an export script is available:

.. code-block:: bash

    $ arcae export /path/to/the.ms --nrow 50000
    $ tree output.arrow/
    output.arrow/
    ├── ANTENNA
    │   └── data0.parquet
    ├── DATA_DESCRIPTION
    │   └── data0.parquet
    ├── FEED
    │   └── data0.parquet
    ├── FIELD
    │   └── data0.parquet
    ├── MAIN
    │   └── FIELD_ID=0
    │       └── PROCESSOR_ID=0
    │           ├── DATA_DESC_ID=0
    │           │   ├── data0.parquet
    │           │   ├── data1.parquet
    │           │   ├── data2.parquet
    │           │   └── data3.parquet
    │           ├── DATA_DESC_ID=1
    │           │   ├── data0.parquet
    │           │   ├── data1.parquet
    │           │   ├── data2.parquet
    │           │   └── data3.parquet
    │           ├── DATA_DESC_ID=2
    │           │   ├── data0.parquet
    │           │   ├── data1.parquet
    │           │   ├── data2.parquet
    │           │   └── data3.parquet
    │           └── DATA_DESC_ID=3
    │               ├── data0.parquet
    │               ├── data1.parquet
    │               ├── data2.parquet
    │               └── data3.parquet
    ├── OBSERVATION
    │   └── data0.parquet

This data can be loaded into an Arrow Dataset:

.. code-block:: python

    >>> import pyarrow as pa
    >>> import pyarrow.dataset as pad
    >>> main_ds = pad.dataset("output.arrow/MAIN")
    >>> spw_ds = pad.dataset("output.arrow/SPECTRAL_WINDOW")

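
The directories under ``MAIN`` in the export tree are named ``KEY=value``, which resembles Hive-style partitioning. The sketch below uses only the standard library to show how such a path encodes column values. It assumes integer partition values, as in the tree above, and ``partition_values`` is a hypothetical helper, not part of arcae or pyarrow.

```python
from pathlib import PurePosixPath


def partition_values(relative_path):
    """Return {column: value} parsed from KEY=value directory names."""
    values = {}
    # Only directory components are inspected; the file name is skipped.
    for part in PurePosixPath(relative_path).parent.parts:
        key, sep, value = part.partition("=")
        if sep:
            # Assumption: partition values are integers (e.g. DATA_DESC_ID=2).
            values[key] = int(value)
    return values


example = "MAIN/FIELD_ID=0/PROCESSOR_ID=0/DATA_DESC_ID=2/data1.parquet"
parsed = partition_values(example)
print(parsed)
```

``pyarrow.dataset.dataset`` accepts a ``partitioning="hive"`` argument that performs this kind of discovery and exposes the keys as columns; whether the export relies on exactly that convention is an assumption here.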

Noun: arca f (genitive arcae); first declension

A chest, box, coffer, safe (safe place for storing items, or anything of a similar shape)

Pronounced: `ar-ki <arcae_pronounce_>`_.

.. _python-casacore: https://github.com/casacore/python-casacore
.. _fixed_size_list_layout: https://arrow.apache.org/docs/format/Columnar.html#fixed-size-list-layout
.. _variable_size_list_layout: https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
.. _fixed_shape_tensor_array: https://arrow.apache.org/docs/python/generated/pyarrow.FixedShapeTensorArray.html
.. _variable_shape_tensor_array: https://github.com/apache/arrow/pull/38008
.. _numpy_advanced_indexing: https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing
.. _cpp_extension_type: https://arrow.apache.org/docs/cpp/api/datatype.html#extension-types
.. _arcae_pronounce: https://translate.google.com/?sl=la&tl=en&text=arcae%0A&op=translate