vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second šŸš€
https://vaex.io
MIT License

vaex doesn't scale well with increased number of columns? #1616

Open nitinmnsn opened 2 years ago

nitinmnsn commented 2 years ago

I think vaex is unusable if the number of columns is large. The following 1-row dataframe takes 30 minutes to be converted and read in:

import pandas as pd
import vaex as v
import numpy as np

# 1 row x 100_000 columns
dfs = pd.DataFrame(np.arange(100000).reshape(1, -1))
dfs.shape  # (1, 100000)

dfs.to_csv('test.csv', index=False)

# convert the CSV to HDF5 and open the result
vdf = v.open('test.csv', convert=True)

Am I missing something, or should vaex only be used when the number of columns is small? What is the suggested threshold if that is the case?
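
For reference, a sketch of the same repro with per-step timing, so the CSV write and the convert/open can be measured separately (the timing code here is only illustrative):

import time
import numpy as np
import pandas as pd
import vaex as v

dfs = pd.DataFrame(np.arange(100000).reshape(1, -1))  # 1 row x 100_000 columns

t0 = time.time()
dfs.to_csv('test.csv', index=False)
print(f'to_csv: {time.time() - t0:.1f}s')

t0 = time.time()
vdf = v.open('test.csv', convert=True)  # CSV -> HDF5 conversion + open
print(f'open(convert=True): {time.time() - t0:.1f}s')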

Ben-Epstein commented 2 years ago

My guess here (not a maintainer) is that it's a combination of two facts:

  1. I believe vaex is optimized for columnar storage, not row storage.
  2. When reading from a CSV, vaex has to put all of the data into memory, and it loads it via pandas.

With those 2 things in mind, try changing the last line to pdf = pd.read_csv('test.csv') and see what happens. From my testing, it seems pandas is also struggling quite a bit to open this file.

This isn't a sufficient answer, and I'd also be interested to understand it better. That said, that code ran for me in about 3 minutes, not 30, but even so that's quite long. The write took <1 sec; the open seems to be the challenge. But I'm not sure whether this is a vaex-specific issue or not.
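
For comparison, a minimal sketch that times just the pandas read of the same file (numbers will of course vary by machine):

import time
import pandas as pd

t0 = time.time()
pdf = pd.read_csv('test.csv')  # the 1 x 100_000 file written above
print(f'pd.read_csv: {time.time() - t0:.1f}s, shape={pdf.shape}')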

maartenbreddels commented 2 years ago

Interesting results, yeah, our focus is on tall dataframes, but we can take a look at whether we can make it perform better with more columns. I'm not sure 100_000 columns make sense for Vaex though.

nitinmnsn commented 2 years ago

(quoting Ben-Epstein's reply above)

Thank you for investing your time into this @Ben-Epstein.

I do not think that reading from CSV necessarily requires putting all the data in memory. You can convert the CSV to HDF5 (which vaex does by choosing appropriate chunks so that they don't result in OOM) and then use that HDF5 file from then on. Conversion from CSV to HDF5 and then reading from that HDF5 file can be achieved with one single command: vdf = v.open('test.csv', convert=True).

The 30-minute runtime I reported includes converting the CSV to HDF5 as well; you are not doing that step. Without the HDF5 conversion I don't think vaex makes much sense. We do need a memory-mapped table on disk that vaex can leverage.
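
As a sketch of that workflow (assuming the converted file is written next to the CSV as test.csv.hdf5, which I believe is vaex's default naming):

import vaex as v

# one-time conversion: the CSV is read in chunks and an HDF5 copy is written to disk
vdf = v.open('test.csv', convert=True)

# later sessions can memory-map the HDF5 file directly, skipping the CSV entirely
vdf = v.open('test.csv.hdf5')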

nitinmnsn commented 2 years ago

Interesting results, yeah, our focus is on tall dataframes, but we can take a look at whether we can make it perform better with more columns. I'm not sure 100_000 columns make sense for Vaex though.

To start with, thank you tons for bringing us this great tool. You astrophysicists rock!!

In my opinion, memory problems due to column-heavy datasets are much more common than those due to row-heavy datasets. But that could be very specific to my work. Nevertheless, I would be overjoyed if vaex made our life easier for wide datasets as well.

JovanVeljanoski commented 2 years ago

Just out of curiosity - what is your use case that requires 100_000 columns? Or are you simply trying to find the limitations of the library?

nitinmnsn commented 2 years ago

@JovanVeljanoski We have a relational database containing 14 tables, with relations as much as 4 levels deep. Applying even a crude deep feature synthesis to them with just 2 or 3 transform primitives, a few aggregate primitives, and a few training windows results in a feature explosion of this scale.

I could have managed the number of generated features more actively, but I thought I would throw compute at the problem and let a pipeline of feature selection methodologies reduce the number of features for me. To make the dataset amenable to feature selection I needed some massaging of the data, like frequency encoding of categorical columns, all-null column removal, and duplicate feature removal (duplicates do show up in deep feature synthesis), and this is where I thought of using vaex. Vaex does not work on pandas-generated HDF5 files, and vaex took a lot of time to convert the dataset to HDF5, so I looked into why. This is what I found.
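
For context, by massaging I mean steps like the following, sketched here in plain pandas (the function and its column handling are purely illustrative):

import pandas as pd

def massage(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # frequency-encode the categorical columns
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].map(df[col].value_counts())
    # drop columns that are entirely null
    df = df.dropna(axis=1, how='all')
    # drop duplicate features (identical columns do show up in deep feature synthesis)
    df = df.loc[:, ~df.T.duplicated()]
    return df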

Also, even if I had been very careful with DFS, I would still have 4k features per window. I had 1 month, 3 months, 6 months, 12 months, and ever (all history) as the training windows, so I would still have 20k features, which I am now guessing vaex would struggle with.

This is why I wanted to know if you are already aware of this and could suggest some limits on dataset dimensions that should be kept in mind before deciding to go the vaex way.

For now, I have solved the problem using PySpark, but I still wanted to understand and learn vaex.

Ben-Epstein commented 2 years ago

@nitinmnsn I do think that even with chunking, for any particular row, vaex will have to read it entirely into memory.

Datasets are still commonly stored in text-based file formats such as CSV. Since text-based file formats are not memory-mappable, they have to be read in memory. If the contents of a CSV file fits into the available RAM, one can simply do:

Vaex is using pandas for reading CSV files in the background

then regarding

df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)

When the above line is executed, Vaex will read the CSV in chunks, and convert each chunk to a temporary HDF5 file on disk.

My assumption is that each chunk is read with pandas, which would involve bringing that chunk into memory. So for any given row, all N columns are read into memory.
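
A rough sketch of what that chunked conversion amounts to (not vaex's actual implementation, just the idea that every chunk still spans all N columns):

import pandas as pd
import vaex

parts = []
for i, chunk in enumerate(pd.read_csv('./my_data/my_big_file.csv', chunksize=5_000_000)):
    # each chunk holds chunk_size rows x ALL columns in memory at once
    part = vaex.from_pandas(chunk, copy_index=False)
    part.export_hdf5(f'part_{i}.hdf5')
    parts.append(f'part_{i}.hdf5')

df = vaex.open_many(parts)  # lazily concatenate the HDF5 parts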

nitinmnsn commented 2 years ago

@Ben-Epstein I think so as well. Partial row reads would not happen, so if I don't have enough RAM to read in even one row, then I have a bigger pickle to deal with. I follow that CSVs are not memory-mappable, which is why they have to be converted to HDF5. I agree that each chunk gets read into memory so that the entire file does not have to be read in. But I think the problem would be while executing

df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)

Irrespective of the chunk size, if the number of columns is too large then it will take an uncomfortably long time to convert './my_data/my_big_file.csv' to HDF5.
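
If one wanted to check how the conversion time scales with column count, a small sketch like this would do (hypothetical sizes; timings will be machine-dependent):

import time
import numpy as np
import pandas as pd
import vaex

for n_cols in (100, 1_000, 10_000):
    pd.DataFrame(np.ones((100, n_cols))).to_csv(f'wide_{n_cols}.csv', index=False)
    t0 = time.time()
    vaex.open(f'wide_{n_cols}.csv', convert=True)
    print(f'{n_cols} columns: {time.time() - t0:.1f}s')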