kszlim opened 8 months ago
Minimal repro:
import polars as pl

df = pl.DataFrame(
    {
        "id": [67] * 20,
        "time_ns": [
            15016000000, 15126000000, 15236000000, 15346000000, 15456000000,
            15566000000, 15676000000, 15786000000, 15896000000, 16006000000,
            16116000000, 16226000001, 16336000001, 16446000001, 16556000000,
            16666000000, 16776000000, 16886000001, 16996000001, 17106000001,
        ],
    }
).set_sorted("time_ns")
df.group_by_dynamic("time_ns", every="10i", check_sorted=False).agg(pl.col("id").alias("group"))
Determining the groups takes a long time.
If you're making groups every 10 units and your measurements span about 2 billion units, then that's a lot of groups. There's probably some fast path that could be introduced to avoid creating so many of them, though.
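For a rough sense of scale (back-of-the-envelope arithmetic based on the repro above, not figures given in the thread):

span = 17_106_000_001 - 15_016_000_000  # index range covered by the 20 rows
print(span // 10)  # ~209_000_000 candidate windows of width 10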
Yes, we seem to iterate A LOT! Care to look at that one? Then I will do the pivots. :D
I think this isn't so simple to speed up; there's already an early continue. This may require a larger refactor.
Oh, I didn't realize we went in steps of 10 through 2 billion units. Ok.. :/
Yeah, it's the equivalent of having one observation every 2 minutes and then resampling so they're every 10 microseconds.
So I think a slowdown is expected. Not saying it's not addressable, but I don't think it's at all common to do this, so it's low priority compared with other open issues.
Is there a way to make it work with a time and/or duration datatype? I guess I could convert the column to seconds and then it should work fine with indices?
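One way to read that idea, sketched here as an assumption rather than anything confirmed in the thread: rescale the clock so the window size is no longer orders of magnitude smaller than the spacing between points. The scale factor and the every value below are illustrative choices.

import polars as pl

# Integer-divide the nanosecond clock down to seconds, so a window of a
# couple of index units spans several observations instead of a tiny
# fraction of one gap.
df_seconds = (
    df.with_columns((pl.col("time_ns") // 1_000_000_000).alias("time_s"))
    .set_sorted("time_s")
)
df_seconds.group_by_dynamic("time_s", every="2i").agg(pl.col("id").alias("group"))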
Regardless of what dtype you convert it to, if your every is 8 orders of magnitude smaller than the distance between points, then there's going to be a perf impact.
May I ask what your use case is here? I think you may be better off using a different operation (truncate, perhaps?)
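A sketch of the truncate suggestion on the raw integer clock (the bin width is an arbitrary assumption, not a value from the thread). Flooring each timestamp to a bin and grouping on that only materializes buckets that actually contain rows, rather than stepping through every empty window:

import polars as pl

bin_width = 1_000_000_000  # assumed bin size, in the same units as time_ns
(
    df.group_by(
        (pl.col("time_ns") // bin_width * bin_width).alias("bin"),
        maintain_order=True,
    )
    .agg(pl.col("id").mean())
)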
Just trying to do a lazy downsample within groups.
If you're doing an operation on every 10 elements, you could try something like unstack, although you're going to generate a lot of columns. For this I would almost suggest to_numpy().reshape(-1, 10).mean(axis=1) or something of the sort.
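A runnable version of that NumPy suggestion, assuming the column length is a multiple of the window size (and trimming the tail if it isn't):

import numpy as np

window = 10
values = df["id"].to_numpy()
# Drop any trailing partial window, then average each block of `window` rows.
trimmed = values[: len(values) - len(values) % window]
means = trimmed.reshape(-1, window).mean(axis=1)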
I'm trying to downsample my data to about 50 Hz, but my data isn't labeled by timestamp; instead it's just some sort of monotonic clock from a given epoch.
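One hedged way to express that with a time dtype (everything below is an assumption layered on the repro, not something confirmed in the thread): reinterpret the monotonic nanosecond clock as a Datetime, where the arbitrary epoch doesn't matter for binning, and resample with 20 ms windows for a 50 Hz target.

import polars as pl

(
    df.with_columns(pl.col("time_ns").cast(pl.Datetime("ns")).alias("ts"))
    .set_sorted("ts")
    .group_by_dynamic("ts", every="20ms")
    .agg(pl.col("id").mean())
)

On the 20-row repro each 20 ms window holds at most one sample, so this is only a shape check; the "within groups" part of the question would presumably go through group_by_dynamic's grouping parameter.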
Reproducible example
repro.parquet.zip
Unzip the attached parquet and then try to run it.
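For reference, running it would look roughly like this (the extracted file name is an assumption based on the attachment name):

import polars as pl

df = pl.read_parquet("repro.parquet")  # assumed name of the file inside the zip
df = df.set_sorted("time_ns")
df.group_by_dynamic("time_ns", every="10i", check_sorted=False).agg(pl.col("id").alias("group"))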
Issue description
Pathologically slow; there must be some sort of exponential behavior.
Expected behavior
Should run fast.
Installed versions
This reproduces in versions of polars since at least 0.19.14. It doesn't seem to change if I write the data to parquet before the group_by_dynamic and read it back, i.e. there's nothing broken about the parquet file encoding.