Open AltamashRafiq opened 2 weeks ago
I am unable to reproduce the discrepancy between the two queries.
Can you share an example that shows the problem with the original data?
I cannot provide the original data as it is customer confidential. I'll investigate the issue more on Monday and hopefully can recreate it with a reproducible example. Sorry for not having one at the time of this post :(
Checks
Reproducible example
I have been unable to reproduce this issue without using my own data. Below is a dataset that looks like my existing data (same dtypes) but is not the same. I have been unable to replicate the issue with this fake data.
Log output
No response
Issue description
Filtering this column with data.filter(data["col_727"] == 1) executes in 0.5s, whereas filtering with data.filter(pl.col("col_727") == 1) executes in 13.6s. The column is a float64 column containing only the values 0.0 and 1.0, with no null values. The ratio of 0.0 to 1.0 is 2:1, as in the fake data I've shared. What might be causing this sizable discrepancy? Could it be that Polars is not properly distributing compute across CPUs when pl.col is used?
Expected behavior
Execution times are the same or very similar.
Installed versions