Closed · ion-elgreco closed this 7 months ago
@stinodego Here is a reproducible example:
import datetime
import random

import polars as pl  # missing from the original snippet


def random_date(start, end):
    """Generate a random datetime between `start` and `end`."""
    return start + datetime.timedelta(
        # Get a random number of seconds between `start` and `end`
        seconds=random.randint(0, int((end - start).total_seconds())),
    )


df = pl.DataFrame({
    "id": list(range(0, 1000)) * 1500,
    "start_date": [
        random_date(datetime.datetime(2015, 1, 1), datetime.datetime(2020, 1, 1))
        for _ in range(1_500_000)
    ],
}).with_columns(end_date=pl.col("start_date") + pl.duration(hours=random.randint(24, 240)))
parts = df.sample(1700).select(
    "id",
    pl.concat_list(pl.col("start_date", "end_date")).list.mean().alias("date").cast(pl.Datetime),
)
Execute the Polars code:
df_polars_only = df.lazy().join(
    parts.lazy(),
    how="cross",
).filter(
    # After the cross join, the colliding `id` column from `parts` is
    # suffixed `id_right` by default; the original snippet compared
    # `id` to itself, which is always true.
    (pl.col("id") == pl.col("id_right"))
    & (pl.col("start_date") <= pl.col("date"))
    & (pl.col("end_date") >= pl.col("date"))
).collect(streaming=True)
Timings: 16.7 s ± 1.98 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now execute in DuckDB:
import duckdb  # missing from the original snippet

sqlcode = """
SELECT *
FROM df
CROSS JOIN parts
WHERE
    df.id = parts.id
    AND df.start_date <= parts.date
    AND df.end_date >= parts.date
"""
duckdb.sql(sqlcode).pl()
Timings: 18.7 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is why Polars really needs non-equi joins. DuckDB's EXPLAIN
will tell you that it converts this cross join to an inner join:
┌───────────────────────────┐
│         PROJECTION        │
│    ─ ─ ─ ─ ─ ─ ─ ─ ─ ─    │
│             id            │
│         start_date        │
│          end_date         │
│             id            │
│            date           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         HASH_JOIN         │
│    ─ ─ ─ ─ ─ ─ ─ ─ ─ ─    │
│           INNER           │
│          id = id          │
│     end_date >= date      ├──────────────┐
│    start_date <= date     │              │
│    ─ ─ ─ ─ ─ ─ ─ ─ ─ ─    │              │
│           EC: 1           │              │
│          Cost: 1          │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         ARROW_SCAN        ││         ARROW_SCAN        │
│    ─ ─ ─ ─ ─ ─ ─ ─ ─ ─    ││    ─ ─ ─ ─ ─ ─ ─ ─ ─ ─    │
│             id            ││             id            │
│         start_date        ││            date           │
│          end_date         ││    ─ ─ ─ ─ ─ ─ ─ ─ ─ ─    │
│    ─ ─ ─ ─ ─ ─ ─ ─ ─ ─    ││           EC: 1           │
│           EC: 1           ││                           │
└───────────────────────────┘└───────────────────────────┘
To get it, just use:
sqlcode = """
EXPLAIN
SELECT *
FROM df
CROSS JOIN parts
WHERE
    df.id = parts.id
    AND df.start_date <= parts.date
    AND df.end_date >= parts.date
"""
print(duckdb.sql(sqlcode).pl().get_column("explain_value").to_list()[0])
The underlying issue is https://github.com/pola-rs/polars/issues/10068.
@avimallu right, then I can just rewrite it as an inner join ^^
But not as an inner join on inequality conditions, since Polars doesn't support those yet, right?
(Don't know if an inner non-equi join has a specific name.)
Just doing an inner join on id first and then filtering afterwards gives the same results.
Polars doesn't have non-equi joins yet. There is a tracking issue #10068
Checks
Reproducible example
See comment below.
Log output
No response
Issue description
Cross joins in Polars are a lot slower than in DuckDB, and streaming is the only way to get a result at all. If I don't project away unneeded columns from df with a select, it takes 10+ minutes; if I select only the relevant columns, it takes about 40 seconds on 0.20.18 (on 0.20.10 it took twice as long, so that has improved).
This will take 10+ minutes
With DuckDB on Polars DataFrames:
Expected behavior
Be as fast as DuckDB xD
Installed versions