modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

What's the fastest way to add a new column that already has the same partitions (probably)? #7391

Open Liquidmasl opened 2 weeks ago

Liquidmasl commented 2 weeks ago

There are a bunch of ways to add a column to a dataframe.

What is the fastest with Modin?

Say I get a new column by applying a function to another one:

new_c = df['column'].apply(lambda x: abs(x))

The resulting Series should have the same partitions as the dataframe, right?

We can use merge, or concat, or just do

df['new_col'] = new_c

which is the most readable IMO.

There are probably a few other ways as well.

But which one is the fastest?
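For reference, the candidates mentioned above can be compared with a small timing harness. This is only a sketch: it uses plain pandas so that it runs anywhere, on the assumption that swapping the import for `import modin.pandas as pd` gives the Modin numbers. The frame size, the function names, and the `assign` variant are illustrative choices, not something from the Modin docs. Note also that Modin may execute asynchronously depending on the engine, so a fair measurement there may need to force the result to materialize.

```python
import timeit

import numpy as np
import pandas as pd  # assumption: swap for `import modin.pandas as pd` to measure Modin

# Illustrative data; the size is arbitrary.
df = pd.DataFrame({"column": np.arange(-50_000, 50_000)})
new_c = df["column"].abs()  # vectorized equivalent of .apply(lambda x: abs(x))


def by_setitem():
    out = df.copy()  # copy so every run starts from the same frame (adds copy cost)
    out["new_col"] = new_c
    return out


def by_assign():
    # assign() returns a new frame and leaves df untouched
    return df.assign(new_col=new_c)


def by_concat():
    # the Series must be named so the new column gets a label
    return pd.concat([df, new_c.rename("new_col")], axis=1)


candidates = {"setitem": by_setitem, "assign": by_assign, "concat": by_concat}

# Sanity check: all three approaches must produce the same frame.
reference = by_setitem()
for name, fn in candidates.items():
    assert fn().equals(reference), name

# Best of 3 repeats, 5 calls each.
timings = {
    name: min(timeit.repeat(fn, number=5, repeat=3))
    for name, fn in candidates.items()
}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:8s} {seconds:.4f} s for 5 calls")
```

The ranking will depend on the engine, the partitioning, and the data size, so it is worth running this against the real dataframe rather than trusting the toy numbers.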

Thank you!

Liquidmasl commented 2 weeks ago

And also:

How do I add multiple columns at once?

concat? Will it play nice with partitions?

Because

df[['col1','col2']] = <some np array with 2 columns and the correct number of rows>

just defaults to pandas, apparently because inserting with an unhashable key is not supported?

I don't want to make a new Modin dataframe out of the NumPy array for concatenation, because I don't want to cause trouble with partitions that don't fit.
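Two alternatives to the list-key assignment, sketched below, avoid building a second dataframe from the array: assigning one column at a time with a scalar key, or passing all columns to `assign` at once. The code uses plain pandas so it runs anywhere, on the assumption that the import can be swapped for `import modin.pandas as pd`. Whether either variant actually avoids the fallback to pandas in Modin is not verified here. The array contents and column names are made up for illustration.

```python
import numpy as np
import pandas as pd  # assumption: swap for `import modin.pandas as pd` to try it on Modin

df = pd.DataFrame({"column": [-3, 1, -2, 4]})

# Hypothetical 2-column array with one row per dataframe row.
values = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
names = ["col1", "col2"]

# Option A: one scalar-key assignment per column (each key is hashable).
df_loop = df.copy()
for i, name in enumerate(names):
    df_loop[name] = values[:, i]

# Option B: assign() with one keyword argument per column; returns a new frame.
df_assigned = df.assign(**{name: values[:, i] for i, name in enumerate(names)})

print(df_loop.equals(df_assigned))
```

Both variants hand the columns over as 1-D arrays matched to rows by position, so no index alignment is involved.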