vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

TypeError: Shapes do not match #1692

Closed nmweizi closed 3 years ago

nmweizi commented 3 years ago
mysql 8.0

select count(*) from test;
990000

import pandas as pd
import vaex

# conn is an existing MySQL connection
for df in pd.read_sql('select * from test', con=conn, chunksize=10000):
    df_v = vaex.from_pandas(df)
    df_v.export_hdf5('/tmp/test.h5', mode='a', chunk_size=10000)
ERROR:MainThread:vaex.hdf5.writer:error creating dataset for 'id', with type int64 
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/vaex/hdf5/writer.py", line 70, in layout
    self.column_writers[name] = ColumnWriterPrimitive(self.columns, name, dtypes[name], shape, has_null[name], self.byteorder)
  File "/usr/local/lib/python3.9/site-packages/vaex/hdf5/writer.py", line 147, in __init__
    self.array = self.h5group.require_dataset('data', shape=shape, dtype=dtype.numpy.newbyteorder(byteorder))
  File "/usr/local/lib/python3.9/site-packages/h5py/_hl/group.py", line 239, in require_dataset
    raise TypeError("Shapes do not match (existing %s vs new %s)" % (dset.shape, shape))
TypeError: Shapes do not match (existing (10000,) vs new (9900,))
ERROR:MainThread:vaex.hdf5.writer:error creating dataset for 'a1', with type string 
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/vaex/hdf5/writer.py", line 68, in layout
    self.column_writers[name] = ColumnWriterString(self.columns, name, dtypes[name], shape, str_byte_length[name], has_null_str[name])
  File "/usr/local/lib/python3.9/site-packages/vaex/hdf5/writer.py", line 202, in __init__
    self.array = self.h5group.require_dataset('data', shape=data_shape, dtype='S1')
  File "/usr/local/lib/python3.9/site-packages/h5py/_hl/group.py", line 239, in require_dataset
    raise TypeError("Shapes do not match (existing %s vs new %s)" % (dset.shape, shape))
TypeError: Shapes do not match (existing (30000,) vs new (array(29700),))
ERROR:MainThread:vaex.hdf5.writer:error creating dataset for 'a2', with type string 
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/vaex/hdf5/writer.py", line 68, in layout
    self.column_writers[name] = ColumnWriterString(self.columns, name, dtypes[name], shape, str_byte_length[name], has_null_str[name])
  File "/usr/local/lib/python3.9/site-packages/vaex/hdf5/writer.py", line 202, in __init__
    self.array = self.h5group.require_dataset('data', shape=data_shape, dtype='S1')
  File "/usr/local/lib/python3.9/site-packages/h5py/_hl/group.py", line 239, in require_dataset
    raise TypeError("Shapes do not match (existing %s vs new %s)" % (dset.shape, shape))
TypeError: Shapes do not match (existing (30000,) vs new (array(29700),))

Software information

JovanVeljanoski commented 3 years ago

Please read the docstring of export_hdf5. Here is the key part:

:param str group: Write the data into a custom group in the hdf5 file.
:param str mode: If set to "w" (write), an existing file will be overwritten. If set to "a", one can append additional data to the hdf5 file, but it needs to be in a different group.

So an example would be

for i in range(3):
    df = vaex.example()
    df.export_hdf5('./tmp.hdf5', mode='a', group=str(i))

But then when opening the file you must specify the group:

# for example
vaex.open('./tmp.hdf5', group='1')

If your goal is to convert a database into HDF5 so you can work with it more comfortably in vaex, it is easier (and recommended) to export each chunk to its own file on disk, then open and concatenate those dataframes and export them to a single file. You don't strictly have to do the final export, but it gives slightly better performance. This process is described in more detail elsewhere on this issue board.

If you want to continue with a single hdf5 file following my example (I assume that was your original idea), it will require some more custom code before you are able to use all the data. That is perfectly fine if you want to go that route.

I hope this helps!

nmweizi commented 3 years ago

Thank you very much, @JovanVeljanoski!