[Open] mkleinbort opened this issue 12 months ago
hi @mkleinbort - could you show input with expected output please?
I definitely would not want this to be part of the shift operator. Besides, can't you just do this (assuming this is what you meant)?
pl.col('x').filter(pl.col('date') > pl.col('date').min() + pl.duration(days=7))
@MarcoGorelli - here is an example
import polars as pl
from datetime import date

# The values are time-indexed, but the dates are not evenly spaced:
df = pl.DataFrame({
    'date': [date(2020,1,1), date(2020,1,1), date(2020,1,2), date(2020,1,3), date(2020,1,4), date(2020,1,4)],
    'entity': ['A', 'B', 'A', 'B', 'A', 'B'],
    'value': ['A1', 'B1', 'A2', 'B2', 'A3', 'B3'],
})
Then the ask is
df.with_columns(
value_one_row_before = pl.col('value').shift(1).over('entity'),
value_one_day_before = pl.col('value').shift('1d', by='date').over('entity'),
)
# shape: (6, 5)
# ┌────────────┬────────┬───────┬──────────────────────┬──────────────────────┐
# │ date ┆ entity ┆ value ┆ value_one_row_before ┆ value_one_day_before │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ date ┆ str ┆ str ┆ str ┆ str │
# ╞════════════╪════════╪═══════╪══════════════════════╪══════════════════════╡
# │ 2020-01-01 ┆ A ┆ A1 ┆ null ┆ null │
# │ 2020-01-01 ┆ B ┆ B1 ┆ null ┆ null │
# │ 2020-01-02 ┆ A ┆ A2 ┆ A1 ┆ A1 │
# │ 2020-01-03 ┆ B ┆ B2 ┆ B1 ┆ null │
# │ 2020-01-04 ┆ A ┆ A3 ┆ A2 ┆ null │
# │ 2020-01-04 ┆ B ┆ B3 ┆ B2 ┆ B2 │
# └────────────┴────────┴───────┴──────────────────────┴──────────────────────┘
Note this could be done in other ways in the current API (either re-sampling the dataframe to daily data before using shift, or using a self-join); I just think this API would be pretty clean for working with time-indexed data.
I'm unsure whether this should implement some form of "nearest/earliest/latest" logic when there is no value at the exact date expected.
One way to do it via a join would be:
df.pipe(
lambda x: x.join(
x.select(
'entity',
pl.col('date').dt.offset_by('1d'),
pl.col('value').alias('value_one_day_before')
),
how='left', on=('date','entity')
)
)
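For completeness, the re-sampling route I mentioned could look roughly like this (just a sketch: it assumes a recent polars where upsample takes group_by - older versions call it by - and it forward-fills the group key on the rows the upsample inserts):
# Sketch of the "re-sample to daily, then shift" alternative
(df
    .sort('entity', 'date')
    .upsample(time_column='date', every='1d', group_by='entity', maintain_order=True)
    .with_columns(pl.col('entity').forward_fill())  # fill the group key on inserted rows
    .with_columns(value_one_day_before = pl.col('value').shift(1).over('entity'))
    .drop_nulls('value')  # keep only the original (non-inserted) rows
)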
@mkleinbort-ic
I'm unsure whether this should implement some form of "nearest/earliest/latest" logic
That would be an asof join. https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.join_asof.html#polars.LazyFrame.join_asof
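For example (a rough sketch, reusing the toy frame from above; join_asof needs both sides sorted by the asof key):
# A backward asof join picks, per entity, the latest value that is at least one day old
df.sort('date').join_asof(
    df.sort('date').select(
        'entity',
        pl.col('date').dt.offset_by('1d'),
        pl.col('value').alias('value_one_day_before'),
    ),
    on='date',
    by='entity',
    strategy='backward',
)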
To me this just feels like it is a special case of a join and not something particularly fundamental.
Yes, it's the same logic.
Maybe I should broaden the ask to introducing a pl.Expr.shift_by method, similar to pl.Expr.sort_by, so as not to over-complicate the very fundamental .shift() method.
To me this just feels like it is a special case of a join and not something particularly fundamental.
It's pretty important for year-on-year calculations, or any other such expression.
At least grant me that code like this is pretty readable:
(df
.with_columns(
same_store_sales_yoy_growth = (pl.col('sales') / pl.col('sales').shift_by('date', '1y')).over('storeId'),
same_store_sales_mom_growth = (pl.col('sales') / pl.col('sales').shift_by('date', '1mo')).over('storeId'),
same_store_sales_wow_growth = (pl.col('sales') / pl.col('sales').shift_by('date', '1w')).over('storeId'),
)
)
@mkleinbort-ic I don't find that particularly readable because I have no idea what it's supposed to be doing. At the very least it appears to be missing some sort of grouping/aggregation by period.
Like, what is one row of that df supposed to be - the sales of a single day at the given date for a given store? In that case it would not be yoy growth, but more like comparing the sales of the 1st of January 2001 to the sales of the 1st of January 2000. And the existence of 29 Feb 2004 would mess up year-over-year results compared to 2005...
Furthermore, such a shift would introduce nulls, so a large portion of your array would not have any results.
@orlp - you understood the logic of my code well. And of course, there are issues with exact matching of dates for daily sales data, but that is pretty standard and typically handled downstream (forward-fills are good at this, but there are better approaches depending on the use case).
I don't think we need to get into a discussion of time series analysis for retail data - it suffices to say that sometimes w/w sales (e.g. comparing Monday to Monday) are more important than month-on-month sales (e.g. Feb 3rd to March 3rd), but sometimes year-on-year values are most important (Dec 25th '22 to Dec 25th '23).
The main point I am getting at is that this code:
(df
.with_columns(
same_store_sales_yoy_growth = (pl.col('sales') / pl.col('sales').shift_by('date', '1y')).over('storeId'),
same_store_sales_mom_growth = (pl.col('sales') / pl.col('sales').shift_by('date', '1mo')).over('storeId'),
same_store_sales_wow_growth = (pl.col('sales') / pl.col('sales').shift_by('date', '1w')).over('storeId'),
same_store_sales_dod_growth = (pl.col('sales') / pl.col('sales').shift_by('date', '1d')).over('storeId'),
)
)
could be implemented with a few joins (e.g. a left join on all calendar days) and the .shift operator - but I find my proposed .shift_by more readable and composable.
why not use rolling with say '1y', and then in the aggregation do something like pl.col('sales').last() / pl.col('sales').first()?
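Something along these lines (a sketch only, using the column names from your sales example; the grouping argument is group_by on recent polars versions, by on older ones, and sales_df stands in for your hypothetical frame):
# For each row, aggregate over the trailing year per store
(sales_df
    .sort('date')
    .rolling(index_column='date', period='1y', group_by='storeId')
    .agg((pl.col('sales').last() / pl.col('sales').first()).alias('sales_yoy_ratio'))
)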
Here is some toy data to work through it (we are now talking about UFO sighting counts, by UFO shape, by day)
df = (pl
.read_csv('https://raw.githubusercontent.com/mkleinbort/resource-datasets/master/ufo_sightings/ufo.csv')
.with_columns(Datetime = pl.col('Time').str.strptime(pl.Datetime, "%m/%d/%Y %H:%M"))
.with_columns(Date = pl.col('Datetime').dt.date())
.group_by('Shape Reported', 'Date').count().rename({'count':'Count'})
.sort('Date', 'Shape Reported')
)
# shape: (48_165, 3)
# ┌────────────────┬────────────┬───────┐
# │ Shape Reported ┆ Date ┆ Count │
# │ --- ┆ --- ┆ --- │
# │ str ┆ date ┆ u32 │
# ╞════════════════╪════════════╪═══════╡
# │ TRIANGLE ┆ 1930-06-02 ┆ 1 │
# │ OTHER ┆ 1930-07-01 ┆ 1 │
# │ OVAL ┆ 1931-02-16 ┆ 1 │
# │ DISK ┆ 1931-06-02 ┆ 1 │
# │ … ┆ … ┆ … │
# │ OVAL ┆ 2014-09-04 ┆ 2 │
# │ TRIANGLE ┆ 2014-09-04 ┆ 3 │
# │ null ┆ 2014-09-05 ┆ 1 │
# │ LIGHT ┆ 2014-09-05 ┆ 3 │
# └────────────────┴────────────┴───────┘
@MarcoGorelli - don't get me wrong - the problem is solvable, it's just painful to do in the current API. Also note that the .first()/.last() approach over a rolling group is not quite what I had in mind, as the rolling group includes the period, but is not exactly the period (i.e. the first row might not be exactly 1y earlier).
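To illustrate that caveat with a toy example (a sketch, assuming DataFrame.rolling from a recent polars):
import polars as pl
from datetime import date

toy = pl.DataFrame({
    'Date': [date(2013, 11, 20), date(2014, 9, 5)],
    'Count': [10, 30],
})

# With period='1y', first() is just the earliest row inside the trailing window,
# not necessarily the row exactly one year earlier.
toy.rolling(index_column='Date', period='1y').agg(
    pl.col('Count').first().alias('first_in_window')
)
# For the 2014-09-05 row, first_in_window is the 2013-11-20 value (10),
# which is nowhere near exactly 1y before.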
Here you can see one solution, but you'll note it's a bit verbose:
# Helper
import pandas as pd  # pd.date_range is used to build the full calendar

df_all_dates = pl.DataFrame(
    {'Date': pd.date_range(df.select('Date').min().item(), df.select('Date').max().item())}
).cast({'Date': pl.Date})
(df_all_dates
.join(df, on=['Date'], how='left')
.join(df.with_columns(pl.col('Date').dt.offset_by('-1d')), on=['Shape Reported', 'Date'], suffix='_1d', how='left')
.join(df.with_columns(pl.col('Date').dt.offset_by('-1w')), on=['Shape Reported', 'Date'], suffix='_1w', how='left')
.join(df.with_columns(pl.col('Date').dt.offset_by('-1mo')), on=['Shape Reported', 'Date'], suffix='_1mo', how='left')
.join(df.with_columns(pl.col('Date').dt.offset_by('-1y')), on=['Shape Reported', 'Date'], suffix='_1y', how='left')
).select(
'Date', 'Shape Reported',
((pl.col('Count_1d','Count_1w','Count_1mo','Count_1y')/pl.col('Count')).pow(-1)-1).name.map(lambda t: t.replace('Count_', '% Change over '))
).filter(
pl.col('Shape Reported') == 'LIGHT'
).tail(10)
# shape: (10, 6)
# ┌────────────┬────────────────┬───────────────┬───────────────┬───────────────┬───────────────┐
# │ Date       ┆ Shape Reported ┆ % Change over ┆ % Change over ┆ % Change over ┆ % Change over │
# │ ---        ┆ ---            ┆ 1d            ┆ 1w            ┆ 1mo           ┆ 1y            │
# │ date       ┆ str            ┆ ---           ┆ ---           ┆ ---           ┆ ---           │
# │            ┆                ┆ f64           ┆ f64           ┆ f64           ┆ f64           │
# ╞════════════╪════════════════╪═══════════════╪═══════════════╪═══════════════╪═══════════════╡
# │ 2014-08-27 ┆ LIGHT          ┆ -0.8          ┆ -0.777778     ┆ null          ┆ null          │
# │ 2014-08-28 ┆ LIGHT          ┆ 1.0           ┆ -0.090909     ┆ null          ┆ null          │
# │ 2014-08-29 ┆ LIGHT          ┆ 0.666667      ┆ 0.666667      ┆ null          ┆ null          │
# │ 2014-08-30 ┆ LIGHT          ┆ -0.571429     ┆ null          ┆ null          ┆ null          │
# │ …          ┆ …              ┆ …             ┆ …             ┆ …             ┆ …             │
# │ 2014-09-02 ┆ LIGHT          ┆ -0.777778     ┆ null          ┆ null          ┆ null          │
# │ 2014-09-03 ┆ LIGHT          ┆ -0.181818     ┆ null          ┆ null          ┆ null          │
# │ 2014-09-04 ┆ LIGHT          ┆ 2.666667      ┆ null          ┆ null          ┆ null          │
# │ 2014-09-05 ┆ LIGHT          ┆ null          ┆ null          ┆ null          ┆ null          │
# └────────────┴────────────────┴───────────────┴───────────────┴───────────────┴───────────────┘
shift_by would be a very useful addition imo, and I think it would fit well with the rolling_*_by type of functions
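For comparison, this is the kind of *_by pattern it would sit next to (rolling_mean_by exists in recent polars; the frame and column names below are just placeholders, and shift_by itself is the proposed method - it does not exist yet):
import polars as pl
from datetime import date

tmp = pl.DataFrame({
    'date': [date(2020, 1, 1), date(2020, 1, 5), date(2020, 1, 9)],
    'x': [1.0, 2.0, 4.0],
})

tmp.with_columns(
    x_7d_mean = pl.col('x').rolling_mean_by('date', window_size='7d'),
    # x_7d_before = pl.col('x').shift_by('date', '7d'),  # proposed, not yet available
)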
Description
It'd be nice to be able to write
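Something like the following (reconstructed from the replies above - the exact snippet from the original post may have differed slightly):
pl.col('x').shift('7d', by='date')  # proposed: shift by a '7d' duration along the 'date' column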
Where 'date' is some other date or datetime column.
There are many edge cases this would need to handle - mainly what to do if there is no unique record '7d' ago in this example.