vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Reading h5ad file fails #373

Open · saksham219 opened this issue 5 years ago

saksham219 commented 5 years ago

I am trying to read an h5ad file using:

df = vaex.open("../GSE99254_clusters_annotated.h5ad")

However, it fails with the following error (screenshot attached in the original issue).

maartenbreddels commented 5 years ago

I'm not sure I can help you. It seems to me that h5ad is some kind of hdf5 format, but I cannot find information on it. What you could do manually is extract the numpy arrays from it and pass those to vaex.
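
A minimal sketch of that manual route, assuming the main matrix is stored as a dense HDF5 dataset named "X" (the usual AnnData layout); the dataset name and column naming here are illustrative, not confirmed for this file:

    import h5py
    import vaex

    # The dataset name "X" is an assumption -- inspect the file
    # (e.g. with f.visit(print)) to find the actual layout, and note
    # that a sparse matrix would be stored as a group, not a dataset.
    with h5py.File("GSE99254_clusters_annotated.h5ad", "r") as f:
        matrix = f["X"][:]  # note: this reads the full matrix into memory

    # Build a vaex DataFrame, one vaex column per matrix column.
    df = vaex.from_arrays(**{f"col_{i}": matrix[:, i] for i in range(matrix.shape[1])})

Note that this reads everything into memory, which is exactly the limitation discussed below.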

saksham219 commented 5 years ago

Thanks for the suggestion. h5ad is indeed an hdf5 format (an example h5ad dataset). I was using it through vaex by passing it in as numpy arrays, but the problem with that approach is that it loads the complete data into memory, which is not possible for my dataset of a million rows and a thousand columns. I wanted to read it using vaex.open() so that it could use the memory-mapping features of hdf5.

JovanVeljanoski commented 5 years ago

Hi @saksham219

Thanks for trying out vaex. The hdf5 file format is quite flexible: the tables/data can be stored in various ways, so it is far from simple for vaex to support many of them.

However, I may be able to suggest an alternative: converting your h5ad file to a vaex hdf5 file. The simplest way to do this is to read in as much of your h5ad file as fits in memory and export it to hdf5 with vaex. Then read in the next N rows and export them to another hdf5 file, and so on, until you've gone through the entire h5ad file. Once you are done, you can use vaex.open_many() to read in all the hdf5 files, which will result in a single DataFrame object, just as if you had opened a single large hdf5 file. If you want, you can then export this to produce a single hdf5 file, or continue working with the multiple smaller files.
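
A sketch of this chunked conversion, under the same assumption that the matrix lives in a dense HDF5 dataset named "X" (adjust the dataset name, chunk size, and column names to your file):

    import h5py
    import vaex

    CHUNK = 100_000  # rows per chunk; tune to what fits in RAM

    parts = []
    with h5py.File("GSE99254_clusters_annotated.h5ad", "r") as f:
        data = f["X"]  # "X" is an assumption -- check your file's layout
        n_rows, n_cols = data.shape
        for i, start in enumerate(range(0, n_rows, CHUNK)):
            chunk = data[start:start + CHUNK]  # only this slice is read into memory
            df = vaex.from_arrays(**{f"col_{j}": chunk[:, j] for j in range(n_cols)})
            path = f"part_{i}.hdf5"
            df.export_hdf5(path)
            parts.append(path)

    # Memory-map all the parts as one DataFrame, as if it were one big file.
    df = vaex.open_many(parts)
    # Optionally produce a single hdf5 file:
    df.export_hdf5("combined.hdf5")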

Alternatively (the same idea really), you can read in and export the data on a column rather than a row basis, and then use join to create a single large DataFrame, which you can export or just work with.
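
A sketch of the column-wise variant, under the same "X" assumption; the synthetic row_id key used for the join is illustrative:

    import h5py
    import numpy as np
    import vaex

    with h5py.File("GSE99254_clusters_annotated.h5ad", "r") as f:
        data = f["X"]
        n_rows, n_cols = data.shape
        for j in range(n_cols):
            column = data[:, j]  # only one column in memory at a time
            df = vaex.from_arrays(row_id=np.arange(n_rows), **{f"col_{j}": column})
            df.export_hdf5(f"col_{j}.hdf5")

    # Join the per-column files on the synthetic row_id key
    # to form a single wide DataFrame.
    df = vaex.open("col_0.hdf5")
    for j in range(1, n_cols):
        df = df.join(vaex.open(f"col_{j}.hdf5"), on="row_id", rsuffix=f"_{j}")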

I understand that these are not ideal solutions, but they are not really that hard or lengthy to implement.

I hope this helps.

saksham219 commented 5 years ago

Hi @JovanVeljanoski

Thanks for all the suggestions. I will try the first one out and convert the h5ad file into a vaex hdf5 file. I just have one question regarding the second method you suggested: will loading the data column-wise into a vaex DataFrame still load it into memory? Since the issue I am facing is that my dataset is too large to load all at once into memory, I don't understand how loading it into a DataFrame will help.

Thanks!

JovanVeljanoski commented 5 years ago

Hi,

Well, actually, the two approaches are technically similar. The power of vaex is that you don't really "open" files in the conventional sense and read them into memory; opening just memory-maps them. The data is only read when needed, and in chunks, so RAM should never be an issue.

So in the end it kind of depends on your usage. If you expect to use only a subset of the columns frequently but want to have everything at hand, I'd use the 2nd approach. If you think you'll need everything all the time, I'd use the first one.

Also, exporting to a single large hdf5 file at the end should improve performance a bit.

I hope this helps.

maartenbreddels commented 5 years ago

Indeed, what Jovan is saying. By doing it one column at a time, or one chunk at a time, not everything needs to be in memory. I'll keep the issue open to see if I can support this file type; thanks for the link to the example dataset. In any case, I hope you managed to convert the file.

saksham219 commented 5 years ago

Hi @maartenbreddels @JovanVeljanoski

Thank you for the help. I was able to make an hdf5 file for my dataset. I was going through the vaex documentation, and it mentions that the function vaex.from_arrays() creates an 'in memory DataFrame' from numpy arrays. Does this mean that it creates a memory-mapped object, or does it actually load the whole dataset with all its arrays into RAM?