pydata / patsy

Describing statistical models in Python using symbolic formulas

Forecasting incrementally with patsy #126

Open spillz opened 6 years ago

spillz commented 6 years ago

This is either a feature request or a request for help with current functionality. I am working with unbalanced panel data and using patsy to forecast some series. Here's a basic example:

import io, pandas, patsy

#raw panel data indexed on ID, YEAR. Y is the forecast variable of interest. There are no gaps in the data for an individual entity but the panel is potentially unbalanced (meaning different start/end dates).
data = '''ID,YEAR,Y,B,C,D
1,1999,0,2,3,4
1,2000,.,2,3,4
1,2001,.,2,3,4
1,2002,.,2,3,4
2,1996,1,2,3,4
2,1997,.,2,3,4
3,1998,3,2,3,4
3,1999,3,2,3,4
3,2000,.,2,3,4
3,2001,3,2,3,4
'''
data = io.StringIO(data)
df = pandas.read_csv(data, index_col=['ID','YEAR'], na_values=['.'])
print(df)

def lag(series, n=1):
    return series.groupby(level=0).shift(n)

formula = '1+lag(Y)+B+C+D' #This is the forecast equation for Y
x = patsy.dmatrix(formula,df, return_type='dataframe')
params = pandas.Series([1,2,3,4,5], index=x.columns) #these are the coefficients on the forecast vars

#Now forecast year by year
for yr in range(1997,2010):
    ind = df.index.get_level_values('YEAR')==yr
    x = patsy.dmatrix(formula,df, return_type='dataframe').reindex(df.index) 
    x = x.loc[ind]
    df.loc[ind, 'Y'] = df.loc[ind, 'Y'].fillna(x@params)
    print('================')
    print(yr)
    print(df)

Note that to produce the entire forecast we need to call dmatrix over and over. The problem that I'm having is that it is quite inefficient to have to call dmatrix on the entire DataFrame repeatedly, but because the forecast formula can contain arbitrary numbers of lags I can't just pass in a df filtered to the current year (or a set number of lags from the current year). What would be ideal is if I could replace

    ind = df.index.get_level_values('YEAR')==yr
    x = patsy.dmatrix(formula,df, return_type='dataframe').reindex(df.index) 
    x = x.loc[ind]

with a version of dmatrix that takes a boolean row mask and evaluates and returns only the rows that are needed:

    ind = df.index.get_level_values('YEAR')==yr
    x = patsy.dmatrix(formula,df, return_type='dataframe', rows=ind) #proposed argument: evaluates only the rows where ind==True and returns a dataframe with only those rows

I thought incr_dbuilder might be able to handle this, but it seems to expect each chunk to be completely independent of the previous chunks. That won't work in the time series/panel context, where the lagged values for one row come from earlier rows.
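One possible workaround with the existing API, sketched here under assumptions, is to pass dmatrix only the history of the entities that actually have a row in the current year, then keep that year's rows. This is not the requested rows= argument. The helper name design_rows_for_year is hypothetical, lag is defined inside it so that patsy picks it up from the calling frame, and the gain depends on how unbalanced the panel is.

```python
import io

import pandas as pd
import patsy


def design_rows_for_year(formula, df, yr):
    """Hypothetical helper: build design-matrix rows for a single year."""

    def lag(series, n=1):
        # Defined locally so patsy finds it when it captures this frame.
        return series.groupby(level=0).shift(n)

    ids = df.index.get_level_values('ID')
    years = df.index.get_level_values('YEAR')
    # Keep the full history (up to yr) of entities that have a row in yr,
    # so lags of any length still see the values they need.
    active = ids[years == yr].unique()
    keep = ids.isin(active) & (years <= yr)
    sub = df[keep]
    x = patsy.dmatrix(formula, sub, return_type='dataframe')
    # patsy drops rows with missing values; reindex restores them as NaN.
    target = sub.index[sub.index.get_level_values('YEAR') == yr]
    return x.reindex(target)


raw = '''ID,YEAR,Y,B,C,D
1,1999,0,2,3,4
1,2000,.,2,3,4
1,2001,.,2,3,4
1,2002,.,2,3,4
2,1996,1,2,3,4
2,1997,.,2,3,4
3,1998,3,2,3,4
3,1999,3,2,3,4
3,2000,.,2,3,4
3,2001,3,2,3,4
'''
df = pd.read_csv(io.StringIO(raw), index_col=['ID', 'YEAR'], na_values=['.'])

# Only entities 1 and 3 have a row in 2000, so entity 2 is never evaluated.
x_2000 = design_rows_for_year('1+lag(Y)+B+C+D', df, 2000)
print(x_2000)
```

In the forecasting loop this would stand in for the dmatrix/reindex/loc triple. Years in which no entity has a row should be skipped before calling the helper, since it does not guard against an empty frame.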

MatthewGerber commented 3 years ago

@spillz Did you find a solution to your problem? I've run into a similar issue. In my case, I am calling dmatrix repeatedly (e.g., tens of thousands of times), passing a different DataFrame each time. The DataFrame is small (e.g., 4 rows), but the repeated calls are quite slow. See the attached call graph from profiling.

[Image: profiling call graph, "Screen Shot 2020-12-23 at 8 31 09 AM"]
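For the many-small-frames case, one thing that may help is to build the design once and reuse it, so the formula is not parsed and analyzed again on every call. The sketch below uses patsy.build_design_matrices with the design_info taken from a single dmatrix call. The formula, the column names, and the build helper are illustrative assumptions, and whether this addresses the bottleneck shown in the profile above is not confirmed.

```python
import pandas as pd
import patsy

# Illustrative formula and columns; substitute the real ones.
formula = '1 + B + C + D'

# Build the design once from a representative frame. Any categorical levels
# or stateful transforms are fixed from this data.
template = pd.DataFrame({'B': [2.0, 1.0], 'C': [3.0, 0.0], 'D': [4.0, 5.0]})
design_info = patsy.dmatrix(formula, template).design_info


def build(frame):
    """Hypothetical helper: reuse the stored design for a new small frame."""
    (x,) = patsy.build_design_matrices(
        [design_info], frame, return_type='dataframe'
    )
    return x


small = pd.DataFrame({'B': [1.0, 2.0], 'C': [0.5, 1.5], 'D': [3.0, 4.0]})
x_small = build(small)
print(x_small)
```

Each later frame must provide the same columns as the template. Because the design is fixed up front, a categorical level that never appeared in the template would be rejected rather than silently added.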