Open grafail opened 2 years ago
Thanks - good catch! Let's see if we can improve it.
PRs are welcome of course!
Hi, I also have a use case with a lot of columns. I tried to reproduce this issue in my environment and observed that, starting with vaex 4.14, even arrow and parquet exports are much slower.
I used Python 3.9.18 on Ubuntu 22.04 and Windows 10, and installed vaex with conda from the conda-forge channel.
With vaex 4.13, only HDF5 export is slow. (Sorry for pasting 4.12.0 results; I copied from the wrong terminal.)
In [1]: import vaex
In [2]: vaex.__version__
Out[2]:
{'vaex-core': '4.12.0',
'vaex-viz': '0.5.4',
'vaex-hdf5': '0.12.3',
'vaex-server': '0.8.1',
'vaex-astro': '0.9.3',
'vaex-jupyter': '0.8.1',
'vaex-ml': '0.18.1'}
In [3]: df = vaex.open("/tmp/test_file.csv")
In [4]: df.export_arrow("test_file.arrow", progress=True)
export(arrow) [########################################] 100.00% elapsed time : 0.24s = 0.0m = 0.0h
In [5]: df.export_parquet("test_file.parquet", progress=True)
export(arrow) [########################################] 100.00% elapsed time : 0.31s = 0.0m = 0.0h
In [6]: df.export_hdf5("test_file.hdf5", progress=True)
export(hdf5) [########################################] 100.00% elapsed time : 140.48s = 2.3m = 0.0h
But with vaex 4.14, arrow and parquet exports show a significant slowdown.
In [1]: import vaex
In [2]: vaex.__version__
Out[2]:
{'vaex-core': '4.14.0',
'vaex-viz': '0.5.4',
'vaex-hdf5': '0.13.0',
'vaex-server': '0.8.1',
'vaex-astro': '0.9.3',
'vaex-jupyter': '0.8.1',
'vaex-ml': '0.18.1'}
In [3]: df = vaex.open("/tmp/test_file.csv")
In [4]: df.export_arrow("test_file.arrow", progress=True)
export(arrow) [########################################] 100.00% elapsed time : 76.80s = 1.3m = 0.0h
In [5]: df.export_parquet("test_file.parquet", progress=True)
export(arrow) [########################################] 100.00% elapsed time : 79.64s = 1.3m = 0.0h
In [6]: df.export_hdf5("test_file.hdf5", progress=True)
export(hdf5) [########################################] 100.00% elapsed time : 274.33s = 4.6m = 0.1h
This means that we can't work around the slow HDF5 export of wide dataframes by using arrow or parquet. I would love to see this resolved because vaex seems like a good option for my use case.
Thanks,
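For anyone who wants to repeat the comparison above across vaex versions without reading the progress bars, here is a minimal timing sketch. The helper name `time_calls` is made up for illustration and is not part of vaex; it only wraps whatever callables you pass in.

```python
import time


def time_calls(calls):
    """Run each zero-argument callable once and return {label: elapsed seconds}.

    `calls` maps a label (e.g. "arrow") to a callable that performs one export.
    This is a hypothetical helper, not a vaex API.
    """
    timings = {}
    for label, call in calls.items():
        start = time.perf_counter()
        call()
        timings[label] = time.perf_counter() - start
    return timings
```

With a dataframe opened as in the sessions above, this could be called as `time_calls({"arrow": lambda: df.export_arrow("test_file.arrow"), "parquet": lambda: df.export_parquet("test_file.parquet"), "hdf5": lambda: df.export_hdf5("test_file.hdf5")})` in each environment, and the resulting dictionaries compared.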
Description
Files with a large number of columns seem to freeze while being converted to HDF5. On a test file with 5000 columns and 10 rows, conversion to arrow takes 0.19s and conversion to parquet 0.25s, while the HDF5 conversion progresses very slowly.
It seems most of the delay originates from this line: https://github.com/vaexio/vaex/blob/633970528cb5091ef376dbca2e4721cd42525419/packages/vaex-hdf5/vaex/hdf5/writer.py#L73
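One way to check where the time goes is to run the export under the standard-library profiler. The sketch below is only an illustration: `profile_call` is a made-up helper name, and it assumes the slow path shows up in cumulative time.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report), where report is the text of the `top`
    entries sorted by cumulative time. Hypothetical helper, not a vaex API.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

For example, `_, report = profile_call(df.export_hdf5, "test_file.hdf5")` followed by `print(report)` should list the functions that dominate the export, which can then be compared against the writer line linked above.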
Software information
Vaex version (import vaex; vaex.__version__):
Additional information
I have uploaded a dataset to help reproduce this issue: test_file.csv
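If the attached file is not at hand, a comparable input can be generated locally. This is a rough sketch that only matches the shape described (5000 columns, 10 rows); the column names and random float values are assumptions, not the contents of the uploaded test_file.csv.

```python
import csv
import random


def write_wide_csv(path, n_columns=5000, n_rows=10, seed=0):
    """Write a CSV with n_columns float columns and n_rows data rows.

    Column names (col_0, col_1, ...) and values are made up for illustration.
    Returns the header so callers can check the column count.
    """
    rng = random.Random(seed)
    header = [f"col_{i}" for i in range(n_columns)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for _ in range(n_rows):
            writer.writerow([rng.random() for _ in range(n_columns)])
    return header
```

Calling `write_wide_csv("/tmp/test_file.csv")` produces a file that can then be opened with `vaex.open("/tmp/test_file.csv")` and exported as in the sessions above. Timings on generated data may differ from those on the original dataset.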