Closed Columbus240 closed 3 days ago
I can reproduce in R but not python. The code related to Parquet files has changed a lot upstream so it's likely they have already fixed it but it hasn't been available in a release of rust-polars yet (same as #1246). I'll mark this upstream for now and we'll see if this is resolved after the next rust-polars release.
I can reproduce in R but not python.
How about Python Polars 1.7.1? That should be the same code base of R Polars 0.20.0.
I can reproduce with 1.7.1. This must have been fixed upstream
Thanks for the quick response. In that case I'll try to work around the problems and eagerly await the new releases of Rust Polars and R Polars.
I have looked at the new Rust release and it appears that there are about 70 compile errors that need significant changes to fix. I probably won't work on it so if anyone could do it it would be greatly appreciated. (The current code base is so far removed from Python Polars that fixing it in this situation would be very difficult, so work on #1152 is a priority.)
I'm confused about your message, do you prefer pushing the rewrite in the next release or have another "standard" release here with rust polars 0.44?
I'm confused about your message, do you prefer pushing the rewrite in the next release or have another "standard" release here with rust polars 0.44?
I wanted to update the current main branch to polars 0.44 if possible, but I gave up because it seemed like too much work.
My question was more about whether we have another release before the one containing the rewrite. If we don't, I don't see the point of updating main here if it's gonna be replaced anyway. I'd rather work on the rewrite as well.
I see, in that case I was going to do the release because there is nothing preventing me from doing so. It's just that I don't know if I can include the Rust Polars update there.
@Columbus240 Could you try the new version?
The problem is gone with polars
0.21.0
Similar to #1246, I detected another problem in the streaming engine. This time it is very important for the bug to appear, to start with
$scan_parquet(…)
and to end with$sink_parquet(…)
. If the whole dataframe is materialized in memory, the problem does not appear.A minimal example follows. Please play around with the number 3 670 015. On my machine, this is the lowest number where the problem still occurs.
I believe, that some code related to Parquet is involved in the bug, because the following code does not exhibit the problem, even if the number of repetitions is increased tenfold.
On the other hand, if the input is read as Parquet and the output is stored in CSV, the problem persists. This can be seen in the following example:
The output of
polars_info()
:Edit: fixed the reference to the other issue. Edit2: It is possible to replace
$group_by('A')$agg()
byunique()
and detect the same problem.