vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

hdf5 file not able to read in Vaex, from Azure Blob storage #2148

Open nagarajmmu opened 2 years ago

nagarajmmu commented 2 years ago

Hi

I am using HSDS to create an HDF5 file in Azure Blob Storage, as shown below.

fHSDS = h5pyd.File(HSDS_PATH + FILE_NAME, "w")
dset_hsds = fHSDS.create_dataset(DATASET_NAME, (NUM_ROWS, NUM_COLS), dtype='float64', maxshape=(None, NUM_COLS), chunks=(CHUNK_SIZE[0], CHUNK_SIZE[1]))
for iRow in range(0, NUM_ROWS, CHUNK_SIZE[0]):
    dset_hsds[iRow:iRow+CHUNK_SIZE[0]-1, :] = randomData[iRow:iRow+CHUNK_SIZE[0]-1, :]
fHSDS.close()
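
A side note on the loop above, unrelated to the read error: Python slice ends are exclusive, so `iRow:iRow+CHUNK_SIZE[0]-1` covers one row fewer than a chunk, and the last row of each full chunk is never assigned. The sketch below is a pure-Python illustration with no HSDS involved. `rows_written` and the small sizes are hypothetical, chosen only to show which row indices each slice bound touches.

```python
def rows_written(num_rows, chunk_rows, end_offset):
    """Return the set of row indices touched by the chunked slices."""
    written = set()
    for start in range(0, num_rows, chunk_rows):
        stop = start + chunk_rows + end_offset
        # Slicing a range clamps to its bounds, just as slicing an array does.
        written.update(range(num_rows)[start:stop])
    return written


NUM_ROWS, CHUNK_ROWS = 12, 4  # hypothetical sizes, for illustration only

all_rows = set(range(NUM_ROWS))
missed_with_minus_one = sorted(all_rows - rows_written(NUM_ROWS, CHUNK_ROWS, -1))
missed_without = sorted(all_rows - rows_written(NUM_ROWS, CHUNK_ROWS, 0))

print(missed_with_minus_one)  # rows skipped by start:start+chunk-1
print(missed_without)         # rows skipped by start:start+chunk
```

If every row is meant to be copied, dropping the `-1` from both slices is the usual fix.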

Whenever I try to read the same HDF5 file from blob storage with Vaex, using the code below, I get "FileNotFoundError: /blob_name/home/testFile_fromPython.h5".

df = vaex.open("/blob_name/home/testFile_fromPython.h5", fs=fs)

With the same code, I can read Parquet and CSV files from blob storage into a Vaex DataFrame.

The same scenario works locally: when I create an HDF5 file on local disk and read it with Vaex, I get a DataFrame as expected.

Please help me read an HDF5 file from Azure Blob Storage.
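
Since the error is a FileNotFoundError, one quick check is whether the filesystem object passed as `fs` actually sees a single object at that path. If HSDS wrote the data in its own multi-object layout rather than as one HDF5 file, nothing would exist at that path, which would be consistent with the error. A minimal sketch, assuming `fs` follows the fsspec interface (`exists` and `ls` with `detail=False`); `describe_path` is a hypothetical helper, and `FakeFS` is only a stand-in so the snippet runs without Azure credentials.

```python
import posixpath


def describe_path(fs, path):
    """Report whether `path` exists on `fs` and what sits next to it."""
    parent = posixpath.dirname(path)
    siblings = sorted(fs.ls(parent, detail=False)) if fs.exists(parent) else []
    return {"path": path, "exists": fs.exists(path), "siblings": siblings}


class FakeFS:
    """Stand-in exposing the two fsspec-style methods used above (not a real filesystem)."""

    def __init__(self, paths):
        self._paths = set(paths)

    def exists(self, path):
        prefix = path.rstrip("/") + "/"
        return path in self._paths or any(p.startswith(prefix) for p in self._paths)

    def ls(self, path, detail=False):
        # Simplification: returns every stored path under the prefix.
        prefix = path.rstrip("/") + "/"
        return [p for p in self._paths if p.startswith(prefix)]


# Hypothetical container contents: a Parquet file, but no .h5 object.
fs = FakeFS(["/blob_name/home/data.parquet"])
report = describe_path(fs, "/blob_name/home/testFile_fromPython.h5")
print(report)
```

With the real filesystem object in place of `FakeFS`, the `siblings` list shows what the container holds next to the expected file.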

Thanks in advance.

JovanVeljanoski commented 2 years ago

I don't know if it matters, but can you change the file extension to .hdf5?

Also, can you please format your code? It is very hard to figure out what is happening right now.