vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Slow HDF5 conversion of file with large number of columns #2154

Open grafail opened 2 years ago

grafail commented 2 years ago

Description: Files with a large number of columns appear to freeze while converting to HDF5. On a test file with 5000 columns and 10 rows, conversion to Arrow takes 0.19s and conversion to Parquet 0.25s, while HDF5 progresses very slowly.

import vaex

df = vaex.open("test_file.csv")
df.export_arrow("test_file.arrow", progress=True)
df.export_parquet("test_file.parquet", progress=True)
df.export_hdf5("test_file.hdf5", progress=True)
export(arrow) [########################################] 100.00% elapsed time  :     0.20s =  0.0m =  0.0h
export(arrow) [########################################] 100.00% elapsed time  :     0.24s =  0.0m =  0.0h
export(hdf5) [###############-------------------------] 38.01% estimated time:    49.82s =  0.8m =  0.0h
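For reference, a wide test file with the shape described above can be generated with a short stdlib script (the column names and cell values here are arbitrary placeholders, not the contents of the attached dataset):

```python
import csv

# Generate a CSV with 5000 columns and 10 rows, matching the shape of the
# test case described above. Values are arbitrary integers.
n_cols, n_rows = 5000, 10
with open("test_file.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([f"col_{i}" for i in range(n_cols)])
    for r in range(n_rows):
        writer.writerow([r * n_cols + c for c in range(n_cols)])
```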

Most of the delay seems to originate from this line: https://github.com/vaexio/vaex/blob/633970528cb5091ef376dbca2e4721cd42525419/packages/vaex-hdf5/vaex/hdf5/writer.py#L73
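A back-of-envelope check (an assumption, not a measured profile): if the HDF5 writer pays a roughly fixed setup cost per column, total export time grows linearly with column count, which would explain why a 10-row file can take close to a minute. Taking the progress bar's estimate above as the full-export time:

```python
# Assumption: the ~50s estimated export time from the progress bar above is
# dominated by a fixed per-column cost rather than by row data volume.
n_columns = 5000
total_s = 49.82
per_column_ms = total_s / n_columns * 1000
print(f"~{per_column_ms:.1f} ms per column")
```

About 10 ms per column would be negligible for a typical tall-and-narrow dataframe, but it adds up to nearly a minute at 5000 columns.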

Software information

Additional information: I have uploaded a dataset to help reproduce this issue: test_file.csv

JovanVeljanoski commented 2 years ago

Thanks - good catch! Let's see if we can improve it.

PRs are welcome of course!

ttk-kstn commented 1 year ago

Hi, I also have a use case with many columns. I tried to reproduce this issue in my environment and observed that, as of vaex 4.14, even the Arrow and Parquet exports are much slower.

I used Python 3.9.18 on Ubuntu 22.04 and Windows 10, and installed vaex with conda from the conda-forge channel.

With vaex 4.13, only the HDF5 export is slow. (Sorry for pasting 4.12.0 results; I copied from the wrong terminal.)

In [1]: import vaex

In [2]: vaex.__version__
Out[2]:
{'vaex-core': '4.12.0',
 'vaex-viz': '0.5.4',
 'vaex-hdf5': '0.12.3',
 'vaex-server': '0.8.1',
 'vaex-astro': '0.9.3',
 'vaex-jupyter': '0.8.1',
 'vaex-ml': '0.18.1'}

In [3]: df = vaex.open("/tmp/test_file.csv")

In [4]: df.export_arrow("test_file.arrow", progress=True)
export(arrow) [########################################] 100.00% elapsed time  :     0.24s =  0.0m =  0.0h

In [5]: df.export_parquet("test_file.parquet", progress=True)
export(arrow) [########################################] 100.00% elapsed time  :     0.31s =  0.0m =  0.0h

In [6]: df.export_hdf5("test_file.hdf5", progress=True)
export(hdf5) [########################################] 100.00% elapsed time  :   140.48s =  2.3m =  0.0h

But with vaex 4.14, the Arrow and Parquet exports show a significant slowdown.


In [1]: import vaex

In [2]: vaex.__version__
Out[2]:
{'vaex-core': '4.14.0',
 'vaex-viz': '0.5.4',
 'vaex-hdf5': '0.13.0',
 'vaex-server': '0.8.1',
 'vaex-astro': '0.9.3',
 'vaex-jupyter': '0.8.1',
 'vaex-ml': '0.18.1'}

In [3]: df = vaex.open("/tmp/test_file.csv")

In [4]: df.export_arrow("test_file.arrow", progress=True)
export(arrow) [########################################] 100.00% elapsed time  :    76.80s =  1.3m =  0.0h

In [5]: df.export_parquet("test_file.parquet", progress=True)
export(arrow) [########################################] 100.00% elapsed time  :    79.64s =  1.3m =  0.0h

In [6]: df.export_hdf5("test_file.hdf5", progress=True)
export(hdf5) [########################################] 100.00% elapsed time  :   274.33s =  4.6m =  0.1h
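Computing the regression factors from the two runs above (seconds, 4.13 vs 4.14) makes the scale of the change clear:

```python
# Slowdown factors between the vaex 4.13 and 4.14 runs reported above.
timings = {
    "arrow":   (0.24, 76.80),
    "parquet": (0.31, 79.64),
    "hdf5":    (140.48, 274.33),
}
for fmt, (t_413, t_414) in timings.items():
    print(f"{fmt}: {t_414 / t_413:.0f}x slower in 4.14")
```

So HDF5 roughly doubles, while the columnar formats slow down by two to three hundred times on this file.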

This means we can't work around the slow HDF5 export of wide dataframes by exporting to Arrow or Parquet instead. I would love to see this resolved, since vaex otherwise looks like a good fit for my use case.

Thanks,