Closed · FBruzzesi closed this 1 week ago
nice - there may be some issue on py38, but the rest looks good!
should we wait until altair / marimo's CIs are fixed to merge?
Fixed old versions, what's the deal with TPCH taking so long now?
looks like yesterday it went from 2 minutes to 15 minutes
ok it was dask, I've removed it from the TPCH CI and opened an issue on their repo
did you test this against the plotly branch locally?
Yes, though not as big of a change as I would expect - e.g. I would expect `np.full(1_000_000, 1)` to be visibly faster than `[1] * 1_000_000`.

Edit: Is there a way to run it in isolation with your Kaggle notebook? I can clone that
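To measure just that allocation on its own (outside the notebook), a minimal `timeit` sketch along these lines could be used - this is illustrative, not the benchmark from the thread:

```python
import timeit

import numpy as np

# Build a constant column of 1_000_000 elements two ways:
# a Python list via repetition vs. a NumPy array via np.full.
n = 1_000_000

list_time = timeit.timeit(lambda: [1] * n, number=100)
full_time = timeit.timeit(lambda: np.full(n, 1), number=100)

print(f"[1] * n      : {list_time:.4f}s over 100 runs")
print(f"np.full(n, 1): {full_time:.4f}s over 100 runs")
```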
yeah, where there's `pip install git+https://github.com/narwhals-dev/narwhals`, change that to, for example, `pip install git+https://github.com/narwhals-dev/narwhals@perf/pyarrow-with-columns`
@MarcoGorelli for 1M rows, 50 columns I cannot see any change in performance for the `with_columns` statement, both when working with chunked arrays and scalars. The two approaches seem to be equivalent. I leave it up to you whether this syntax is any better than the previous one.
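Not the notebook itself, but a minimal sketch of the kind of comparison described here, assuming narwhals with the pyarrow backend (the column names and table shape are illustrative):

```python
import time

import numpy as np
import pyarrow as pa

import narwhals as nw

# Roughly match the setup above: a 1M-row, 50-column pyarrow table.
n_rows, n_cols = 1_000_000, 50
rng = np.random.default_rng(0)
table = pa.table({f"col_{i}": rng.random(n_rows) for i in range(n_cols)})

df = nw.from_native(table)

# Time with_columns with a broadcast scalar and with a full-length column.
start = time.perf_counter()
df.with_columns(nw.lit(1).alias("scalar_col"))
print(f"scalar broadcast   : {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
df.with_columns((nw.col("col_0") + 1).alias("derived_col"))
print(f"full-length column : {time.perf_counter() - start:.3f}s")
```

Running this once against main and once against the PR branch would show whether any difference is visible at this scale.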
thanks for checking! I think I prefer this one, if you agree let's ship it
While profiling for plotly, py-spy indicates that we spend a lot of time in `validate_dataframe_comparand` for the pyarrow case. This is called only in `with_columns` methods. This PR proposes two changes:

- Use `np.full` in `validate_dataframe_comparand` instead of `[const] * length`.
- Change `with_columns` to use pyarrow-native methods to insert a column value. I expect this to be faster than the current approach of concatenating the already existing columns with the new ones; the caveat is if the number of new columns is order(s) of magnitude greater than the existing ones. In the majority of scenarios I would expect this not to be the case, but this is the reason I am opening this as an RFC (see the sketch after this list).
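Not the actual diff, but a minimal sketch of the two ideas, assuming pyarrow and numpy; `validate_dataframe_comparand` and `with_columns` are the names from the PR, while the helpers below are hypothetical:

```python
from __future__ import annotations

import numpy as np
import pyarrow as pa


def broadcast_scalar(const, length: int) -> pa.ChunkedArray:
    # Change 1: build the broadcast column with np.full instead of
    # the previous [const] * length Python list.
    return pa.chunked_array([pa.array(np.full(length, const))])


def with_new_columns(table: pa.Table, new_columns: dict[str, pa.ChunkedArray]) -> pa.Table:
    # Change 2: insert each new column with pyarrow's native append_column
    # rather than concatenating the existing columns with the new ones into
    # a freshly built table. (Replacing an existing column would need
    # set_column instead; omitted here for brevity.)
    for name, column in new_columns.items():
        table = table.append_column(name, column)
    return table


table = pa.table({"a": [1, 2, 3]})
ones = broadcast_scalar(1, table.num_rows)
print(with_new_columns(table, {"b": ones}))
```

The caveat mentioned above still applies: appending columns one by one is cheap when only a few columns are added, but the trade-off could flip if the number of new columns dwarfs the existing ones.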