vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Export to HDF5 writes an empty (and unreadable) file to disk upon export fail #1947

Open muonmax opened 2 years ago

muonmax commented 2 years ago


Description: When exporting a DataFrame to HDF5 format, if the export fails, a small (possibly empty) and unreadable HDF5 file is still written to disk even though the call raised an error.

Software information

Additional information To reproduce:

  1. Create an empty Pandas DataFrame:

    import pandas as pd
    empty_df = pd.DataFrame([], columns = ["A", "B"])
  2. Create vaex DataFrame

    import vaex
    empty_vaex = vaex.from_pandas(empty_df)
  3. Export vaex DataFrame to hdf5

    empty_vaex.export("empty_hdf5.hdf5") # NOTE: also occurs with export_hdf5

  4. Observe the expected error:

    
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-42-4873481ec44b> in <module>
    ----> 1 vaex.from_pandas(empty_df).export("empty.hdf5")

    ~/.local/lib/python3.8/site-packages/vaex/dataframe.py in export(self, path, progress, chunk_size, parallel, fs_options, fs)
       6593             self.export_feather(path, parallel=parallel, fs_options=fs_options)
       6594         elif naked_path.endswith('.hdf5'):
    -> 6595             self.export_hdf5(path, progress=progress, parallel=parallel)
       6596         elif naked_path.endswith('.fits'):
       6597             self.export_fits(path, progress=progress)

    ~/.local/lib/python3.8/site-packages/vaex/dataframe.py in export_hdf5(self, path, byteorder, progress, chunk_size, parallel, column_count, writer_threads, group, mode)
       6806         progressbar_write = progressbar.add("write data")
       6807         with Writer(path=path, group=group, mode=mode, byteorder=byteorder) as writer:
    -> 6808             writer.layout(self, progress=progressbar_layout)
       6809             writer.write(
       6810                 self,

    ~/.local/lib/python3.8/site-packages/vaex/hdf5/writer.py in layout(self, df, progress)
         46         N = len(df)
         47         if N == 0:
    ---> 48             raise ValueError("Cannot layout empty table")
         49         column_names = df.get_column_names()
         50

    ValueError: Cannot layout empty table


  5. Observe that a file `empty_hdf5.hdf5` is created. Attempting to load the file with `vaex.open("empty_hdf5.hdf5")` produces the following error:

    TypeError                                 Traceback (most recent call last)
    ~/.local/lib/python3.8/site-packages/IPython/core/formatters.py in __call__(self, obj)
        343             method = get_real_method(obj, self.print_method)
        344             if method is not None:
    --> 345                 return method()
        346             return None
        347         else:

    ~/.local/lib/python3.8/site-packages/vaex/dataframe.py in _repr_html_(self)
       4116         """Representation for Jupyter."""
       4117         self._output_css()
    -> 4118         return self._head_and_tail_table()
       4119
       4120     def __str__(self):

    ~/.local/lib/python3.8/site-packages/vaex/dataframe.py in _head_and_tail_table(self, n, format)
       3863         n = n or vaex.settings.display.max_rows
       3864         N = _len(self)
    -> 3865         if N <= n:
       3866             return self._as_table(0, N, format=format)
       3867         else:

    TypeError: '<=' not supported between instances of 'NoneType' and 'int'
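
Until the export itself cleans up after a failure, one possible user-side workaround is to wrap the export call and delete the leftover file when it raises. The following is only a sketch: `export_or_cleanup` is a hypothetical helper, not part of vaex, and it assumes the failed export has already closed the file by the time the exception propagates.

```python
import os
from typing import Callable


def export_or_cleanup(export: Callable[[str], None], path: str) -> None:
    """Call export(path); if it raises, remove the file it left behind.

    Hypothetical helper for illustration, not a vaex API.
    """
    existed_before = os.path.exists(path)
    try:
        export(path)
    except Exception:
        # Only delete a file that this failed call created, so a file that
        # was already present at `path` is never removed by the cleanup.
        if not existed_before and os.path.exists(path):
            os.remove(path)
        raise
```

With the DataFrame from the steps above, `export_or_cleanup(empty_vaex.export_hdf5, "empty_hdf5.hdf5")` would still raise the `ValueError`, but should no longer leave the unreadable file on disk.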

maartenbreddels commented 2 years ago

Thanks for reporting.

There are two issues.