vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

hdf5 with multidimensional arrays #276

Open neuralvis opened 5 years ago

neuralvis commented 5 years ago

I am trying to read an HDF5 file that contains a scalar field arranged in a 3D rectangular grid. Is there a suggested way to load the data in vaex?

[srinivm] $ h5ls data_7.500E-04.h5/data
CH2                      Dataset {1024, 1024, 1024}
CH2(S)                   Dataset {1024, 1024, 1024}
CH2O                     Dataset {1024, 1024, 1024}
CH3                      Dataset {1024, 1024, 1024}
CH4                      Dataset {1024, 1024, 1024}
CO                       Dataset {1024, 1024, 1024}

I tried the recommended approach from the documentation, but it seems that the data frame constructed by vaex only recognizes 1024 rows?

maartenbreddels commented 5 years ago

Hi,

Thanks for your patience :) Vaex does not really support multidimensional data, although I have nothing against supporting it. Would you like to 'flatten' the data instead, meaning it will have 1024**3 rows? I'd do that using (untested code):

array_dict = {name: ar.reshape(-1) for name, ar in df.columns.items()}
df_flat = vaex.from_arrays(**array_dict)
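
A minimal sketch of the flattening step, using plain NumPy arrays on a small 4x4x4 grid as a stand-in for the 1024**3 datasets. The helper name `flatten_columns` and the sample values are assumptions for illustration only; the hand-off to `vaex.from_arrays` is left as a comment and is just as untested as the snippet above.

```python
import numpy as np


def flatten_columns(columns):
    """Return a dict mapping each name to a 1-D version of its array."""
    return {name: np.asarray(ar).reshape(-1) for name, ar in columns.items()}


# Small stand-in for the real datasets (hypothetical values).
grid_shape = (4, 4, 4)
columns = {
    "CH4": np.arange(64, dtype=np.float64).reshape(grid_shape),
    "CO": np.ones(grid_shape),
}

flat = flatten_columns(columns)
n_rows = len(flat["CH4"])  # one row per grid cell: 4 * 4 * 4

# With vaex installed, the flattened arrays could then be passed on:
# df_flat = vaex.from_arrays(**flat)
```

Note that flattening drops the grid coordinates: for a C-ordered array, row `i` corresponds to the grid index `np.unravel_index(i, grid_shape)`.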

Regards,

Maarten

PS: reshape is a bit safer than ravel when it comes to not making copies.
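
To illustrate the PS, here is a small NumPy check, assuming a strided (non-contiguous) array as the example input. For contiguous arrays both `reshape(-1)` and `ravel()` return views; for the strided case `reshape(-1)` can still return a view, while `ravel()` falls back to a copy.

```python
import numpy as np

contiguous = np.arange(24).reshape(2, 3, 4)
strided = np.arange(10)[::2]  # non-contiguous view onto every other element

# Contiguous input: check that both calls share memory with the original.
both_views = bool(
    np.shares_memory(contiguous, contiguous.reshape(-1))
    and np.shares_memory(contiguous, contiguous.ravel())
)

# Strided input: check each call separately.
reshape_is_view = bool(np.shares_memory(strided, strided.reshape(-1)))
ravel_is_view = bool(np.shares_memory(strided, strided.ravel()))

print(both_views, reshape_is_view, ravel_is_view)
```

For arrays of 1024**3 elements an unexpected copy is expensive, which is why the view-preserving variant is preferable here.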