Closed · FBruzzesi closed this 1 week ago
nice - there may be some issue on py38, but the rest looks good!
should we wait until altair / marimo's CIs are fixed to merge?
Fixed old versions, what's the deal with TPCH taking so long now?
looks like yesterday it went from 2 minutes to 15 minutes
ok it was dask, I've removed it from the TPCH CI and opened an issue on their repo
did you test this against the plotly branch locally?
Yes, though not as big of a change as I would expect - e.g. I would expect `np.full(1_000_000, 1)` to be visibly faster than `[1] * 1_000_000`.

Edit: Is there a way to run it in isolation with your Kaggle notebook? I can clone that
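To measure just that allocation on its own (outside the notebook), a minimal `timeit` sketch along these lines could be used - this is illustrative, not the benchmark from the thread:

```python
import timeit

import numpy as np

# Build a constant column of 1_000_000 elements two ways:
# a Python list via repetition vs. a NumPy array via np.full.
n = 1_000_000

list_time = timeit.timeit(lambda: [1] * n, number=100)
full_time = timeit.timeit(lambda: np.full(n, 1), number=100)

print(f"[1] * n      : {list_time:.4f}s over 100 runs")
print(f"np.full(n, 1): {full_time:.4f}s over 100 runs")
```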
yeah, where there's `pip install git+https://github.com/narwhals-dev/narwhals`, change that to, for example, `pip install git+https://github.com/narwhals-dev/narwhals@perf/pyarrow-with-columns`
@MarcoGorelli for 1M rows, 50 columns I cannot see any change in performance for the `with_columns` statement, both when working with chunked arrays and scalars. The two approaches seem to be equivalent. I leave it up to you whether this syntax is any better than the previous one.
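Not the notebook itself, but a minimal sketch of the kind of comparison described here, assuming narwhals with the pyarrow backend (the column names and table shape are illustrative):

```python
import time

import numpy as np
import pyarrow as pa

import narwhals as nw

# Roughly match the setup above: a 1M-row, 50-column pyarrow table.
n_rows, n_cols = 1_000_000, 50
rng = np.random.default_rng(0)
table = pa.table({f"col_{i}": rng.random(n_rows) for i in range(n_cols)})

df = nw.from_native(table)

# Time with_columns with a broadcast scalar and with a full-length column.
start = time.perf_counter()
df.with_columns(nw.lit(1).alias("scalar_col"))
print(f"scalar broadcast   : {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
df.with_columns((nw.col("col_0") + 1).alias("derived_col"))
print(f"full-length column : {time.perf_counter() - start:.3f}s")
```

Running this once against main and once against the PR branch would show whether any difference is visible at this scale.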
thanks for checking! I think I prefer this one, if you agree let's ship it
While profiling for plotly, py-spy indicates that we spend a lot of time in `validate_dataframe_comparand` for the pyarrow case. This is called only in `with_columns` methods. This PR proposes two changes:

- Use `np.full` in `validate_dataframe_comparand` instead of `[const] * length`.
- Change `with_columns` to use pyarrow-native methods to insert a column value. I expect this to be faster than the current approach of concatenating the already existing columns with the new ones; the caveat is if the number of new columns is order(s) of magnitude greater than the existing ones. In the majority of scenarios I would expect this not to be the case, but this is the reason I am opening this as an RFC (see the sketch after this list).
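Not the actual diff, but a minimal sketch of the two ideas, assuming pyarrow and numpy; `validate_dataframe_comparand` and `with_columns` are the names from the PR, while the helpers below are hypothetical:

```python
from __future__ import annotations

import numpy as np
import pyarrow as pa


def broadcast_scalar(const, length: int) -> pa.ChunkedArray:
    # Change 1: build the broadcast column with np.full instead of
    # the previous [const] * length Python list.
    return pa.chunked_array([pa.array(np.full(length, const))])


def with_new_columns(table: pa.Table, new_columns: dict[str, pa.ChunkedArray]) -> pa.Table:
    # Change 2: insert each new column with pyarrow's native append_column
    # rather than concatenating the existing columns with the new ones into
    # a freshly built table. (Replacing an existing column would need
    # set_column instead; omitted here for brevity.)
    for name, column in new_columns.items():
        table = table.append_column(name, column)
    return table


table = pa.table({"a": [1, 2, 3]})
ones = broadcast_scalar(1, table.num_rows)
print(with_new_columns(table, {"b": ones}))
```

The caveat mentioned above still applies: appending columns one by one is cheap when only a few columns are added, but the trade-off could flip if the number of new columns dwarfs the existing ones.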