vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Slow groupby after adding column from array #2252

Open · statsrunner opened this issue 1 year ago

statsrunner commented 1 year ago

I have an original file with 100M lines. I create a DataFrame dfv by importing it from .csv via vaex.from_csv. I filter the data frame according to certain conditions to create dfv_filtered. I then run groupby and aggregate via sum on one of the columns. This runs fine, in about 10 seconds.

I now take dfv_filtered and extract one of its columns as a numpy array via dfv_filtered.x.values. I manipulate this array to my liking, then add it back to dfv_filtered via dfv_filtered['new_column'] = name_of_np_array. I then create yet another column by multiplying dfv_filtered['new_column'] * dfv_filtered['existing_column']. Now when I run groupby it takes several minutes, and I don't understand why. The dtypes are all the same and the dataframe still seems virtual, so why would it take so much longer?
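The steps above can be sketched as follows. This is only an illustration of the extract-manipulate-assign pattern being described; the data and the elementwise manipulation (`np.sqrt` here) are hypothetical stand-ins for whatever the real column and transformation were:

```python
import numpy as np

# Hypothetical stand-in for the column pulled out via dfv_filtered.x.values
x = np.arange(1_000_000, dtype="float64")

# Arbitrary elementwise manipulation of the extracted array
name_of_np_array = np.sqrt(x) + 1.0

# The array is then assigned back as a column, e.g.:
#   dfv_filtered['new_column'] = name_of_np_array
#   dfv_filtered['product'] = dfv_filtered['new_column'] * dfv_filtered['existing_column']
```

Note that assigning a numpy array this way adds an in-memory (materialized) column, whereas a column defined as an expression on existing columns stays virtual.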

If I simply take dfv_filtered, copy one of its existing columns over and over, adding it as a new column each time, and then run groupby, it still runs in ~10 seconds.

Which step of my process is making it slower?

maartenbreddels commented 1 year ago

To me this also doesn't make sense. Could you make a reproducible example (just using np.arange etc.)?