vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

TypeError: Shapes do not match #1692

Closed nmweizi closed 3 years ago

nmweizi commented 3 years ago
mysql 8.0

select count(*) from test;
990000

import pandas as pd
import vaex

# conn is an existing MySQL connection
for df in pd.read_sql('select * from test', con=conn, chunksize=10000):
    df_v = vaex.from_pandas(df)
    df_v.export_hdf5('/tmp/test.h5', mode='a', chunk_size=10000)
ERROR:MainThread:vaex.hdf5.writer:error creating dataset for 'id', with type int64 
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/vaex/hdf5/writer.py", line 70, in layout
    self.column_writers[name] = ColumnWriterPrimitive(self.columns, name, dtypes[name], shape, has_null[name], self.byteorder)
  File "/usr/local/lib/python3.9/site-packages/vaex/hdf5/writer.py", line 147, in __init__
    self.array = self.h5group.require_dataset('data', shape=shape, dtype=dtype.numpy.newbyteorder(byteorder))
  File "/usr/local/lib/python3.9/site-packages/h5py/_hl/group.py", line 239, in require_dataset
    raise TypeError("Shapes do not match (existing %s vs new %s)" % (dset.shape, shape))
TypeError: Shapes do not match (existing (10000,) vs new (9900,))
ERROR:MainThread:vaex.hdf5.writer:error creating dataset for 'a1', with type string 
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/vaex/hdf5/writer.py", line 68, in layout
    self.column_writers[name] = ColumnWriterString(self.columns, name, dtypes[name], shape, str_byte_length[name], has_null_str[name])
  File "/usr/local/lib/python3.9/site-packages/vaex/hdf5/writer.py", line 202, in __init__
    self.array = self.h5group.require_dataset('data', shape=data_shape, dtype='S1')
  File "/usr/local/lib/python3.9/site-packages/h5py/_hl/group.py", line 239, in require_dataset
    raise TypeError("Shapes do not match (existing %s vs new %s)" % (dset.shape, shape))
TypeError: Shapes do not match (existing (30000,) vs new (array(29700),))
ERROR:MainThread:vaex.hdf5.writer:error creating dataset for 'a2', with type string 
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/vaex/hdf5/writer.py", line 68, in layout
    self.column_writers[name] = ColumnWriterString(self.columns, name, dtypes[name], shape, str_byte_length[name], has_null_str[name])
  File "/usr/local/lib/python3.9/site-packages/vaex/hdf5/writer.py", line 202, in __init__
    self.array = self.h5group.require_dataset('data', shape=data_shape, dtype='S1')
  File "/usr/local/lib/python3.9/site-packages/h5py/_hl/group.py", line 239, in require_dataset
    raise TypeError("Shapes do not match (existing %s vs new %s)" % (dset.shape, shape))
TypeError: Shapes do not match (existing (30000,) vs new (array(29700),))

Software information

JovanVeljanoski commented 3 years ago

Please read the docstring of export_hdf5. Here is the key part:

:param str group: Write the data into a custom group in the hdf5 file.
:param str mode: If set to "w" (write), an existing file will be overwritten. If set to "a", one can append additional data to the hdf5 file, but it needs to be in a different group.

So an example would be

for i in range(3):
    df = vaex.example()
    df.export_hdf5('./tmp.hdf5', mode='a', group=str(i))

But then when opening the file you must specify the group:

# for example
vaex.open('./tmp.hdf5', group='1')

If your goal is to convert a database into HDF5 so you can work with it more comfortably in vaex, it is easier (and recommended) to export each chunk to its own file on disk, then open and concatenate those dataframes and export them to a single file. You don't strictly have to do the final export, but it gives slightly better performance. This process is described in more detail elsewhere on this issue board.

If you want to continue with a single hdf5 file following my example (I assume that was your original idea), it will require some more custom code before you are able to use all the data. That is perfectly fine if you want to go that route.

I hope this helps!

nmweizi commented 3 years ago

Thank you very much, @JovanVeljanoski!