vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Fix: reliable export hdf5 with small chunks size #2280

Closed JovanVeljanoski closed 1 year ago

JovanVeljanoski commented 1 year ago

This fixes some issues we've encountered when exporting some types of data to hdf5. The unit-tests have been scaled down to reproduce the issues with minimal data, but the same issues occur with the default_chunk_size on larger datasets.

I suspect this is due to a combination of the chunk_size and the number of missing/masked values in a particular chunk. I do not know the exact origin of the errors at this point, so the tests have rather generic names.
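For context, a minimal sketch of the kind of setup being described: a column with masked values exported to hdf5 with a chunk_size much smaller than the column length. The specific values, file name, and chunk size here are illustrative, not taken from the PR, and this assumes `export_hdf5` accepts a `chunk_size` argument as the PR text implies.

```python
import numpy as np
import vaex

# Column with a mix of valid and masked (missing) values.
values = np.arange(100, dtype=np.float64)
mask = values % 7 == 0  # mark some entries as missing
x = np.ma.array(values, mask=mask)

df = vaex.from_arrays(x=x)

# A chunk_size smaller than the column length forces the export to go
# through multiple chunks, some of which contain masked values.
df.export_hdf5("masked_small_chunks.hdf5", chunk_size=13)  # hypothetical chunk_size
```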

Each of the unit-tests raises a different error, which is why I created separate tests rather than a single test in which the amount of data is varied.

Exporting the same data under the same conditions (i.e. the same chunk_size) to the arrow or parquet formats works just fine.
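By way of comparison, reusing the `df` from the sketch above, the equivalent arrow/parquet exports would look like this (again assuming these export methods take a `chunk_size` argument; file names and sizes are illustrative only):

```python
# Same data and the same small chunk_size, but written to parquet/arrow
# instead of hdf5; per the PR description these exports succeed.
df.export_parquet("masked_small_chunks.parquet", chunk_size=13)
df.export_arrow("masked_small_chunks.arrow", chunk_size=13)
```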

Checklist: