Closed JSKenyon closed 2 years ago
This PR adds support for index_columns
during conversion from measurement set to parquet/zarr. This means that output data can be written with a different ordering than input data. This is not necessarily efficient as it will use TAQL and fragmented reads to accomplish the reordering.
In addition to exposing this feature, this PR fixes incorrect conversion of boolean values when using parquet. This is because pyarrow represents boolean values as bits whereas python represents them as bytes. The fix in this PR works but is a prelude to a slight rewrite of the arrow extensions. Currently, extensions use the from_buffer
to convert from arrow to numpy. This relies on memory layout (hence the aforementioned bug) and may be unnecessary. A future PR will instead use array.storage.flatten
in conjunction with to_numpy
and reshape
to move between representations. This should be less vulnerable to discrepancies in memory layout.
I changed my mind - I imagine that the use of buffers if future-proofing against ragged data. I figured out how to use buffers with booleans and have implemented a fix. Note that this is still not a zero-copy operation in the case of booleans as we have to move from bit to byte representation.
[ ] Tests added / passed
If the pep8 tests fail, the quickest way to correct this is to run
autopep8
and thenflake8
andpycodestyle
to fix the remaining issues.[ ] Fully documented, including
HISTORY.rst
for all changes and one of thedocs/*-api.rst
files for new APITo build the docs locally: