Open lostmygithubaccount opened 3 months ago
this was very nasty...it took me a while to figure out what changed (adding that drop-columns call) -- I'm going to move forward by just removing `sink_parquet` in my code in favor of `collect().write_parquet()`, as it was causing other issues too
the columns not being dropped was particularly surprising. but if this is something y'all are addressing in the new streaming engine, or there's something obvious here, of course feel free to close
and one more note: I also observed this in 0.20.30 -- I upgraded to see if it fixed itself
`sink_parquet` uses the streaming engine, whereas `collect().write_parquet()` uses the in-memory engine. We are completely redesigning the streaming engine and will likely not improve the performance of the current one (it will be discontinued).
We don't recommend using the streaming engine at the moment (if it works for you, great), but we are not happy with it. If you use it for benchmarking, I think you should make clear that it is polars-streaming you are benchmarking. ;)
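To make the distinction concrete, a minimal sketch of the two write paths (the input path here is a placeholder):

```python
import polars as pl

lf = pl.scan_parquet("data/*.parquet")  # placeholder input

# streams batches straight to disk via the (current) streaming engine
lf.sink_parquet("out.parquet")

# materializes the full result in memory, then writes it
lf.collect().write_parquet("out.parquet")
```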
On the mentioned bug, can you create an MWE that only shows the bug and can be repeated on a small dataset without any dependencies?
that's helpful, thanks! I'll probably avoid streaming for this but keep it flexible so we can redo it once the new engine is out (and clearly note what we're using) -- btw I'll share the benchmark in the communities before publishing for any other feedback or corrections (hopefully this week)
I'll also try to reproduce this on smaller data with a better MWE (but won't be a high priority for me), feel free to close this out
Checks
Reproducible example

this is slightly involved but you should be able to copy/paste below after `pip install 'ibis-framework[duckdb]'` in addition to having Polars installed. I am on the latest release of Polars (0.20.31). this breaks down at `sf=20`, works fine on `sf=10`.

function to generate the data:
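for illustration, a minimal sketch of a generator along these lines, assuming DuckDB's `tpch` extension; the `gen_tpch` name and the hive directory layout (with `sf` and `n` keys) are hypothetical:

```python
from pathlib import Path

import duckdb

def gen_tpch(sf: int, n: int = 1, base_dir: str = "tpch") -> None:
    # generate TPC-H tables at the given scale factor in an
    # in-memory DuckDB database
    con = duckdb.connect()
    con.execute("INSTALL tpch; LOAD tpch;")
    con.execute(f"CALL dbgen(sf={sf})")
    # write each table into a hive-partitioned layout,
    # e.g. tpch/lineitem/sf=10/n=1/0.parquet
    for table in ("lineitem", "orders", "customer", "supplier", "nation"):
        out = Path(base_dir) / table / f"sf={sf}" / f"n={n}"
        out.mkdir(parents=True, exist_ok=True)
        path = out / "0.parquet"
        con.execute(f"COPY {table} TO '{path}' (FORMAT PARQUET)")
```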
run for `sf=10` and `sf=20`:
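using the hypothetical `gen_tpch` sketched above:

```python
for sf in (10, 20):
    gen_tpch(sf=sf)
```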
now we can read the data:
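a sketch of reading one table back with Polars (paths follow the hypothetical layout above):

```python
import polars as pl

# with a hive-style path, polars picks up the sf and n directory
# keys as extra columns by default
lineitem = pl.scan_parquet("tpch/lineitem/sf=10/n=1/*.parquet")
print(lineitem.columns)  # TPC-H columns plus "sf" and "n"
```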
you'll notice Polars by default (and Ibis on the DuckDB/Polars backends) creates the hive-partitioned `sf` and `n` as columns in the data. this was throwing some things off, so in my longer `get_polars_tables` function I dropped those columns. you can then read in the tables:
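a minimal sketch of what `get_polars_tables` might look like (the real function is longer; the names and layout follow the hypothetical sketches above):

```python
import polars as pl

# scan each table lazily and drop the hive partition columns
def get_polars_tables(sf: int, base_dir: str = "tpch") -> dict[str, pl.LazyFrame]:
    names = ("lineitem", "orders", "customer", "supplier", "nation")
    return {
        name: pl.scan_parquet(f"{base_dir}/{name}/sf={sf}/n=*/*.parquet").drop(["sf", "n"])
        for name in names
    }

tables = get_polars_tables(sf=20)
```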
perhaps a separate bug, but I'll move forward -- at this point the dataframes still have the `n` and `sf` columns, even though they should have been dropped. this does not seem to be an issue in the eager API. now we define q7:
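for reference, a lazy TPC-H q7 roughly along these lines (column names follow the TPC-H spec; the table variables come from the hypothetical `get_polars_tables` sketch above, and this is not necessarily the exact query used):

```python
from datetime import date

import polars as pl

lineitem, orders, customer, supplier, nation = (
    tables[name] for name in ("lineitem", "orders", "customer", "supplier", "nation")
)

n1 = nation.select("n_nationkey", pl.col("n_name").alias("supp_nation"))
n2 = nation.select("n_nationkey", pl.col("n_name").alias("cust_nation"))

q7 = (
    lineitem.join(supplier, left_on="l_suppkey", right_on="s_suppkey")
    .join(orders, left_on="l_orderkey", right_on="o_orderkey")
    .join(customer, left_on="o_custkey", right_on="c_custkey")
    .join(n1, left_on="s_nationkey", right_on="n_nationkey")
    .join(n2, left_on="c_nationkey", right_on="n_nationkey")
    .filter(
        ((pl.col("supp_nation") == "FRANCE") & (pl.col("cust_nation") == "GERMANY"))
        | ((pl.col("supp_nation") == "GERMANY") & (pl.col("cust_nation") == "FRANCE"))
    )
    .filter(pl.col("l_shipdate").is_between(date(1995, 1, 1), date(1996, 12, 31)))
    .with_columns(
        pl.col("l_shipdate").dt.year().alias("l_year"),
        (pl.col("l_extendedprice") * (1 - pl.col("l_discount"))).alias("volume"),
    )
    .group_by("supp_nation", "cust_nation", "l_year")
    .agg(pl.sum("volume").alias("revenue"))
    .sort("supp_nation", "cust_nation", "l_year")
)
```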
and run it, calling `sink_parquet` on the result:
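a sketch of the write step with rough timing (the output path is a placeholder; the timings in the comment are the ones reported below):

```python
import time

start = time.perf_counter()
q7.sink_parquet("q7.parquet")  # ~9s at sf=10, ~60s at sf=20 per the numbers below
print(f"sink_parquet took {time.perf_counter() - start:.1f}s")
```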
at `sf=10` it works fine, but at `sf=20` it hangs for a very long time. it also uses 100% CPU while doing this. as I'm writing this it actually did finish -- while `.collect().write_parquet()` takes 1.5s at `sf=20`, the `sink_parquet` call takes ~9s at `sf=10` and ~60s at `sf=20`
I was originally testing this at `sf=50` and `sf=100` so assumed it was hanging forever, particularly compared to the previous numbers I was seeing before I added those drop-column calls. I'll still submit this.

NOTE: the log output below was too long (`parquet file must be read...`) for GitHub so I deleted a bunch of it; that seems like it'd be the issue though (reading the parquet file(s) a ton of times?)

Log output
Issue description

two potential issues:

- noticed columns aren't dropped for LazyFrames when they should be (and are for regular DataFrames)
- potential performance issue involving dropping columns + `sink_parquet`

Expected behavior

- columns are dropped
- no performance issue w/ the above (I can work around this w/ `.collect().write_parquet` it seems)
Installed versions