pola-rs / r-polars

Polars R binding
https://pola-rs.github.io/r-polars/
Other
479 stars 36 forks source link

Bug in streaming engine involving Parquet #1262

Closed Columbus240 closed 3 days ago

Columbus240 commented 1 month ago

Similar to #1246, I detected another problem in the streaming engine. This time it is very important for the bug to appear, to start with $scan_parquet(…) and to end with $sink_parquet(…). If the whole dataframe is materialized in memory, the problem does not appear.

A minimal example follows. Please play around with the number 3 670 015. On my machine, this is the lowest number where the problem still occurs.

library(polars)

# 3'670'014 is too low. Computation is correct.
# 3'670'015 is just high enough. Computation is wrong.
df_part1 = pl$DataFrame(A = rep('1', 3670015))
df_part2 = pl$DataFrame(A = c('2'))

pl$concat(df_part1, df_part2)$write_parquet('test_0.parquet')

df_agg = pl$scan_parquet('test_0.parquet')$group_by('A')$agg()
df_agg$sink_parquet('test_1.parquet')
df_agg_out = pl$read_parquet('test_1.parquet')
df_agg_c = df_agg$collect()

df_agg_out$height # returns 1 but should be 2
df_agg_c$height # returns 2
df_agg_out
df_agg_c

I believe, that some code related to Parquet is involved in the bug, because the following code does not exhibit the problem, even if the number of repetitions is increased tenfold.

library(polars)

df_part1 = pl$DataFrame(A = rep('1', 3670015))
df_part2 = pl$DataFrame(A = c('2'))

pl$concat(df_part1, df_part2)$write_csv('test_0.csv')

df_agg = pl$scan_csv('test_0.csv')$group_by('A')$agg()
df_agg$sink_parquet('test_1.parquet')
df_agg_out = pl$read_parquet('test_1.parquet')
df_agg_c = df_agg$collect()

df_agg_out$height # returns 2
df_agg_c$height # returns 2
df_agg_out
df_agg_c

On the other hand, if the input is read as Parquet and the output is stored in CSV, the problem persists. This can be seen in the following example:

library(polars)

df_part1 = pl$DataFrame(A = rep('1', 3670015))
df_part2 = pl$DataFrame(A = c('2'))

pl$concat(df_part1, df_part2)$write_parquet('test_0.parquet')

df_agg = pl$scan_parquet('test_0.parquet')$group_by('A')$agg()
df_agg$sink_csv('test_1.csv')
df_agg_out = pl$read_csv('test_1.csv')
df_agg_c = df_agg$collect()

df_agg_out$height # returns 1 but should be 2
df_agg_c$height # returns 2
df_agg_out
df_agg_c

The output of polars_info():

Polars R package version : 0.20.0
Rust Polars crate version: 0.43.1

Thread pool size: 14 

Features:                               
default                    TRUE
full_features              TRUE
disable_limit_max_threads  TRUE
nightly                    TRUE
sql                        TRUE
rpolars_debug_print       FALSE

Code completion: deactivated

Edit: fixed the reference to the other issue. Edit2: It is possible to replace $group_by('A')$agg() by unique() and detect the same problem.

etiennebacher commented 1 month ago

I can reproduce in R but not python. The code related to Parquet files has changed a lot upstream so it's likely they have already fixed it but it hasn't been available in a release of rust-polars yet (same as #1246). I'll mark this upstream for now and we'll see if this is resolved after the next rust-polars release.

eitsupi commented 1 month ago

I can reproduce in R but not python.

How about Python Polars 1.7.1? That should be the same code base of R Polars 0.20.0.

etiennebacher commented 1 month ago

I can reproduce with 1.7.1. This must have been fixed upstream

Columbus240 commented 1 month ago

Thanks for the quick response. In that case I'll try to work around the problems and eagerly await the new releases of Rust Polars and R Polars.

eitsupi commented 3 weeks ago

I have looked at the new Rust release and it appears that there are about 70 compile errors that need significant changes to fix. I probably won't work on it so if anyone could do it it would be greatly appreciated. (The current code base is so far removed from Python Polars that fixing it in this situation would be very difficult, so work on #1152 is a priority.)

etiennebacher commented 3 weeks ago

I'm confused about your message, do you prefer pushing the rewrite in the next release or have another "standard" release here with rust polars 0.44?

eitsupi commented 3 weeks ago

I'm confused about your message, do you prefer pushing the rewrite in the next release or have another "standard" release here with rust polars 0.44?

I wanted to update the current main branch to polars 0.44 if possible, but I gave up because it seemed like too much work.

etiennebacher commented 3 weeks ago

My question was more about whether we have another release before the one containing the rewrite. If we don't, I don't see the point of updating main here if it's gonna be replaced anyway. I'd rather work on the rewrite as well.

eitsupi commented 3 weeks ago

I see, in that case I was going to do the release because there is nothing preventing me from doing so. It's just that I don't know if I can include the Rust Polars update there.

eitsupi commented 5 days ago

@Columbus240 Could you try the new version?

etiennebacher commented 3 days ago

The problem is gone with polars 0.21.0