pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add time-based shifting to .shift method #12740

Open mkleinbort opened 12 months ago

mkleinbort commented 12 months ago

Description

It'd be nice to be able to write

df.with_columns(x_before = pl.col('x').shift('7d', by='date'))

Where 'date' is some other date or datetime column.

There are many edge cases this would need to handle - mainly what to do if there is no unique record '7d' ago in this example.

MarcoGorelli commented 12 months ago

hi @mkleinbort - could you show input with expected output please?

orlp commented 12 months ago

I definitely would not want this to be part of the shift operator. Besides, can't you just do this (assuming this is what you meant)?

pl.col('x').filter(pl.col('date') > pl.col('date').min() + pl.duration(days=7))

mkleinbort-ic commented 12 months ago

@MarcoGorelli - here is an example

import polars as pl
from datetime import date

# The values are time-indexed, but the dates are not evenly spaced:
df = pl.DataFrame({
    'date': [date(2020,1,1),date(2020,1,1),date(2020,1,2),date(2020,1,3),date(2020,1,4),date(2020,1,4)],
    'entity':['A','B','A','B','A','B',],
    'value':['A1', 'B1', 'A2', 'B2', 'A3','B3']
})

Then the ask is:

df.with_columns(
     value_one_row_before = pl.col('value').shift(1).over('entity'), 
     value_one_day_before = pl.col('value').shift('1d', by='date').over('entity'), 
)

# shape: (6, 5)
# ┌────────────┬────────┬───────┬──────────────────────┬──────────────────────┐
# │ date       ┆ entity ┆ value ┆ value_one_row_before ┆ value_one_day_before │
# │ ---        ┆ ---    ┆ ---   ┆ ---                  ┆ ---                  │
# │ date       ┆ str    ┆ str   ┆ str                  ┆ str                  │
# ╞════════════╪════════╪═══════╪══════════════════════╪══════════════════════╡
# │ 2020-01-01 ┆ A      ┆ A1    ┆ null                 ┆ null                 │
# │ 2020-01-01 ┆ B      ┆ B1    ┆ null                 ┆ null                 │
# │ 2020-01-02 ┆ A      ┆ A2    ┆ A1                   ┆ A1                   │
# │ 2020-01-03 ┆ B      ┆ B2    ┆ B1                   ┆ null                 │
# │ 2020-01-04 ┆ A      ┆ A3    ┆ A2                   ┆ null                 │
# │ 2020-01-04 ┆ B      ┆ B3    ┆ B2                   ┆ B2                   │
# └────────────┴────────┴───────┴──────────────────────┴──────────────────────┘

Note this could be done in other ways in the current API (either re-sampling the dataframe to daily data before using shift, or using a self-join); I just think this API would be pretty clean for working with time-indexed data.

I'm unsure whether this should implement some form of "nearest/earliest/latest" logic when there is no value at the exact date expected.

One way to do it via a join would be:

df.pipe(
    lambda x: x.join(
        x.select(
            'entity',
            pl.col('date').dt.offset_by('1d'), 
            pl.col('value').alias('value_one_day_before')
        ), 
        how='left', on=('date', 'entity')
    )
)

orlp commented 12 months ago

@mkleinbort-ic

I'm unsure whether this should implement some form of "nearest/earliest/latest" logic

That would be an asof join. https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.join_asof.html#polars.LazyFrame.join_asof

To me this just feels like it is a special case of a join and not something particularly fundamental.
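
A minimal sketch of that asof-join route (toy frame and column names, not from the thread): offsetting the right-hand key forward by 7d makes a backward asof match mean "the nearest value at or before 7 days ago".

import polars as pl
from datetime import date

df = pl.DataFrame({
    'date': [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 8), date(2020, 1, 10)],
    'x': [1.0, 2.0, 3.0, 4.0],
})

# Push the lookup dates forward by 7d, so that matching backward at date t
# picks the last row whose original date is at or before t - 7d.
lookup = df.select(
    pl.col('date').dt.offset_by('7d'),
    pl.col('x').alias('x_7d_before'),
)

# For grouped data a `by='entity'` argument would apply, and `tolerance`
# could bound how stale the match may be.
result = df.sort('date').join_asof(lookup.sort('date'), on='date', strategy='backward')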

mkleinbort-ic commented 12 months ago

Yes, it's the same logic.

Maybe I should broaden the ask to introducing a pl.Expr.shift_by method, similar to pl.Expr.sort_by, so as not to over-complicate the very fundamental .shift() method.

To me this just feels like it is a special case of a join and not something particularly fundamental.

It's pretty important for year-on-year calculations, or any other such expression.

At least grant me that code like this is pretty readable:

(df
    .with_columns(
        same_store_sales_yoy_growth = (pl.col('sales') / pl.col('sales').shift_by('date', '1y')).over('storeId'),
        same_store_sales_mom_growth = (pl.col('sales') / pl.col('sales').shift_by('date', '1mo')).over('storeId'),
        same_store_sales_wow_growth = (pl.col('sales') / pl.col('sales').shift_by('date', '1w')).over('storeId'),
    )
)

orlp commented 12 months ago

@mkleinbort-ic I don't find that particularly readable because I have no idea what it's supposed to be doing. At the very least it appears to be missing some sort of grouping/aggregation by period.

Like, what is one row of that df supposed to be - the sales of a single day at the given date for a given store? In that case it would not be yoy growth, but more like comparing the sales of the 1st of January 2001 to the sales of the 1st of January 2000. And the existence of 29 Feb 2004 would mess up the year-over-year results compared to 2005...

Furthermore, such a shift would introduce nulls, so a large portion of your array would not have any results.

mkleinbort-ic commented 12 months ago

@orlp - you understood the logic of my code well. And of course, there are issues with exact matching of dates for daily sales data, but that is pretty standard and typically handled downstream (forward-fills are good at this, but there are better approaches depending on the use case).

I don't think we need to get into a discussion of time series analysis for retail data - it suffices to say that sometimes week-on-week sales (e.g. comparing Monday to Monday) are more important than month-on-month sales (e.g. Feb 3rd to March 3rd), but sometimes year-on-year values are most important (Dec 25th '22 to Dec 25th '23).

The main point I am getting at is that this code:

(df
    .with_columns(
        same_store_sales_yoy_growth = (pl.col('sales') / pl.col('sales').shift_by('date', '1y')).over('storeId'),
        same_store_sales_mom_growth = (pl.col('sales') / pl.col('sales').shift_by('date', '1mo')).over('storeId'),
        same_store_sales_wow_growth = (pl.col('sales') / pl.col('sales').shift_by('date', '1w')).over('storeId'),
        same_store_sales_dod_growth = (pl.col('sales') / pl.col('sales').shift_by('date', '1d')).over('storeId'),
    )
)

could be implemented with a few joins (e.g. a left join on all calendar days) and the .shift operator - but I find my proposed .shift_by more readable and composable.
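
As a rough illustration of that equivalence, the whole proposal can be sketched as a thin wrapper over such a self-join (the helper name and signature below are hypothetical, not an existing polars API):

import polars as pl

def shift_by(df: pl.DataFrame, value: str, date: str, offset: str,
             group: str | None = None) -> pl.DataFrame:
    # Left-join the frame against itself with the date key pushed forward
    # by `offset`, so each row sees the value from exactly `offset` earlier.
    keys = [date] if group is None else [date, group]
    lookup = df.select(
        *keys, pl.col(value).alias(f'{value}_{offset}_before')
    ).with_columns(pl.col(date).dt.offset_by(offset))
    return df.join(lookup, on=keys, how='left')

# Hypothetical usage, mirroring the yoy line above:
# shift_by(df, value='sales', date='date', offset='1y', group='storeId')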

MarcoGorelli commented 12 months ago

why not use rolling with, say, '1y', and then in the aggregation do something like pl.col('sales').last() / pl.col('sales').first()?
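
Spelled out, that suggestion might look like the following - a sketch assuming the 'date'/'sales'/'storeId' schema from the earlier example and a polars version where DataFrame.rolling takes an index_column:

import polars as pl

out = (
    df.sort('storeId', 'date')
    .rolling(index_column='date', period='1y', group_by='storeId')
    .agg((pl.col('sales').last() / pl.col('sales').first()).alias('yoy_growth'))
)
# Caveat (raised in the reply below): .first() is merely the oldest row
# inside the window, which need not sit exactly one year back.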

mkleinbort-ic commented 12 months ago

Here is some toy data to work through it (we are now talking about daily UFO sighting counts, by UFO shape):

df = (pl
      .read_csv('https://raw.githubusercontent.com/mkleinbort/resource-datasets/master/ufo_sightings/ufo.csv')
      .with_columns(Datetime = pl.col('Time').str.strptime(pl.Datetime, "%m/%d/%Y %H:%M"))
      .with_columns(Date = pl.col('Datetime').dt.date())
      .group_by('Shape Reported', 'Date').count().rename({'count':'Count'})
      .sort('Date', 'Shape Reported')
)

# shape: (48_165, 3)
# ┌────────────────┬────────────┬───────┐
# │ Shape Reported ┆ Date       ┆ Count │
# │ ---            ┆ ---        ┆ ---   │
# │ str            ┆ date       ┆ u32   │
# ╞════════════════╪════════════╪═══════╡
# │ TRIANGLE       ┆ 1930-06-02 ┆ 1     │
# │ OTHER          ┆ 1930-07-01 ┆ 1     │
# │ OVAL           ┆ 1931-02-16 ┆ 1     │
# │ DISK           ┆ 1931-06-02 ┆ 1     │
# │ …              ┆ …          ┆ …     │
# │ OVAL           ┆ 2014-09-04 ┆ 2     │
# │ TRIANGLE       ┆ 2014-09-04 ┆ 3     │
# │ null           ┆ 2014-09-05 ┆ 1     │
# │ LIGHT          ┆ 2014-09-05 ┆ 3     │
# └────────────────┴────────────┴───────┘

@MarcoGorelli - don't get me wrong - the problem is solvable; it's just painful to do in the current API. Also note that the .first()/.last() approach over a rolling group is not quite what I had in mind, as the rolling group includes the period but is not exactly the period (i.e. the first row might not be exactly 1y earlier).

Here you can see one solution, but you'll note it's a bit verbose:

# Helper: a frame with one row per calendar day across the data's range
import pandas as pd

df_all_dates = pl.DataFrame(
    {'Date': pd.date_range(df.select('Date').min().item(), df.select('Date').max().item())}
).cast({'Date': pl.Date})

(df_all_dates
 .join(df, on=['Date'], how='left')
 .join(df.with_columns(pl.col('Date').dt.offset_by('-1d')), on=['Shape Reported', 'Date'], suffix='_1d', how='left')
 .join(df.with_columns(pl.col('Date').dt.offset_by('-1w')), on=['Shape Reported', 'Date'], suffix='_1w', how='left')
 .join(df.with_columns(pl.col('Date').dt.offset_by('-1mo')), on=['Shape Reported', 'Date'], suffix='_1mo', how='left')
 .join(df.with_columns(pl.col('Date').dt.offset_by('-1y')),  on=['Shape Reported', 'Date'], suffix='_1y', how='left')
 ).select(
     'Date', 'Shape Reported',
     ((pl.col('Count_1d', 'Count_1w', 'Count_1mo', 'Count_1y') / pl.col('Count')).pow(-1) - 1)
         .name.map(lambda t: t.replace('Count_', '% Change over '))
 ).filter(
     pl.col('Shape Reported') == 'LIGHT'
 ).tail(10)

# shape: (10, 6)
# ┌────────────┬────────────────┬─────────────────┬────────────────┬────────────────┬────────────────┐
# │ Date       ┆ Shape Reported ┆ % Change over   ┆ % Change over  ┆ % Change over  ┆ % Change over  │
# │ ---        ┆ ---            ┆ 1d              ┆ 1w             ┆ 1mo            ┆ 1y             │
# │ date       ┆ str            ┆ ---             ┆ ---            ┆ ---            ┆ ---            │
# │            ┆                ┆ f64             ┆ f64            ┆ f64            ┆ f64            │
# ╞════════════╪════════════════╪═════════════════╪════════════════╪════════════════╪════════════════╡
# │ 2014-08-27 ┆ LIGHT          ┆ -0.8            ┆ -0.777778      ┆ null           ┆ null           │
# │ 2014-08-28 ┆ LIGHT          ┆ 1.0             ┆ -0.090909      ┆ null           ┆ null           │
# │ 2014-08-29 ┆ LIGHT          ┆ 0.666667        ┆ 0.666667       ┆ null           ┆ null           │
# │ 2014-08-30 ┆ LIGHT          ┆ -0.571429       ┆ null           ┆ null           ┆ null           │
# │ …          ┆ …              ┆ …               ┆ …              ┆ …              ┆ …              │
# │ 2014-09-02 ┆ LIGHT          ┆ -0.777778       ┆ null           ┆ null           ┆ null           │
# │ 2014-09-03 ┆ LIGHT          ┆ -0.181818       ┆ null           ┆ null           ┆ null           │
# │ 2014-09-04 ┆ LIGHT          ┆ 2.666667        ┆ null           ┆ null           ┆ null           │
# │ 2014-09-05 ┆ LIGHT          ┆ null            ┆ null           ┆ null           ┆ null           │
# └────────────┴────────────────┴─────────────────┴────────────────┴────────────────┴────────────────┘
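
For the fixed-length offsets, the re-sampling route mentioned earlier is more compact: upsample to one row per day so that a plain row shift becomes a time shift. A sketch, assuming a polars version where DataFrame.upsample accepts a group_by parameter:

import polars as pl

# One row per shape per calendar day; rows inserted by the upsample have
# null counts, and the group key must be forward-filled.
daily = (
    df.sort('Shape Reported', 'Date')
    .upsample(time_column='Date', every='1d', group_by='Shape Reported')
    .with_columns(pl.col('Shape Reported').forward_fill())
)

# Now 1 row == 1 day, so shift(1) means '1d' and shift(7) means '1w'.
out = daily.with_columns(
    Count_1d=pl.col('Count').shift(1).over('Shape Reported'),
    Count_1w=pl.col('Count').shift(7).over('Shape Reported'),
)
# Calendar offsets ('1mo', '1y') have no fixed row count, so they still
# need the join approach above.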

tim-x-y-z commented 2 months ago

shift_by would be a very useful addition imo, and I think it would fit well with the rolling_*_by family of functions.
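
For context, the existing rolling_*_by expressions already pair a temporal window with a separate key column; a shift_by in that style would only swap the aggregation for a lookup. The first line below is current polars API, the second is the hypothetical addition:

import polars as pl

# Existing: a temporal window keyed on another column.
weekly_mean = pl.col('Count').rolling_mean_by('Date', window_size='7d')

# Hypothetical, in the same spirit (not currently in polars):
# shifted = pl.col('Count').shift_by('Date', '7d')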