I have an original file with 100M lines. I create a dataframe, dfv, by importing it from .csv via vaex.from_csv. I filter the dataframe on certain conditions to create dfv_filtered. I then run groupby and aggregate via sum on one of the columns. This runs fine, in about 10 sec.
I now take dfv_filtered and pull one of its columns out as an array via dfv_filtered.x.values. I convert this into a numpy array, manipulate it to my liking, and add it back to dfv_filtered via dfv_filtered['new_column'] = name_of_np_array. I then create yet another column by multiplying dfv_filtered['new_column'] * dfv_filtered['existing_column']. Now when I run groupby, it takes several minutes, and I don't understand why. The dtypes are all the same and the dataframe still seems to be virtual, so why would it take so much longer?
If I simply take dfv_filtered, copy one of its existing columns over and over, adding it as a new column each time, and then run groupby, it still runs in ~10 sec.
Which step of my process is the one making it slower?
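To narrow this down, I could time each step separately. Below is a minimal sketch of that idea using only the standard library. The helper names `time_steps` and `slowest` are my own, hypothetical ones, and the lambdas are placeholders standing in for the actual vaex calls (for example, the groupby before and after adding the new column); they are not real vaex code.

```python
import time


def time_steps(steps):
    """Run each (label, callable) pair in order and return [(label, seconds), ...]."""
    timings = []
    for label, fn in steps:
        start = time.perf_counter()
        fn()  # execute the step being measured
        timings.append((label, time.perf_counter() - start))
    return timings


def slowest(timings):
    """Return the label of the step with the largest elapsed time."""
    return max(timings, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    # Placeholder workloads (assumptions): swap in the real steps, e.g. the
    # groupby on the original columns and the groupby after adding the column.
    steps = [
        ("groupby on original columns", lambda: sum(range(10_000))),
        ("groupby after adding numpy column", lambda: time.sleep(0.05)),
    ]
    results = time_steps(steps)
    for label, seconds in results:
        print(f"{label}: {seconds:.3f} s")
    print("slowest step:", slowest(results))
```

Timing the array extraction, the column assignment, the multiplication, and the groupby as separate entries should at least show whether the extra time is spent in the groupby itself or in one of the earlier steps.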