Open spillz opened 6 years ago
@spillz Did you find a solution to your problem? I've run into a similar issue. In my case, I am calling dmatrix repeatedly (e.g., tens of thousands of times), passing a different DataFrame each time. The DataFrame is small (e.g., 4 rows), but the repeated calls are quite slow. See the attached call graph from profiling.
This is either a feature request or a request for help with current functionality. I am doing some work with unbalanced panel data that involves using patsy to forecast some series. Here's a basic example:
Note that to produce the entire forecast we need to call dmatrix over and over. The problem that I'm having is that it is quite inefficient to have to call dmatrix on the entire DataFrame repeatedly, but because the forecast formula can contain arbitrary numbers of lags I can't just pass in a df filtered to the current year (or a set number of lags from the current year). What would be ideal is if I could replace
with a version of dmatrix that takes a boolean rows argument and only evaluates and returns the rows that are needed.

I thought incr_dbuilder might be able to handle this, but it seems to expect each chunk to be completely separate from previous chunks. That won't work in the time series/panel context.
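
For the repeated-call overhead mentioned in this thread, one possible partial mitigation is to parse the formula once and reuse the resulting design_info with patsy's build_design_matrices, selecting rows afterwards. The sketch below is only an illustration under assumptions: the DataFrame, its column names, the formula, and the dmatrix_rows helper are all hypothetical and not part of patsy or of the original example. It still evaluates every row that is passed in, so it does not provide the row-restricted evaluation requested above; it only avoids re-parsing the formula on each call.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix, build_design_matrices

# Hypothetical data; the column names are illustrative only.
df = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003],
    "x": [1.0, 2.0, 3.0, 4.0],
})

# Parse the formula a single time and keep the resulting DesignInfo.
design_info = dmatrix("x + I(x ** 2)", df).design_info


def dmatrix_rows(design_info, data, rows):
    """Hypothetical helper (not a patsy function).

    Builds the design matrix for all of `data` from an existing
    DesignInfo, then returns only the rows selected by the boolean
    mask `rows`. All rows are still evaluated, so any terms that
    depend on other rows see the full data.
    """
    (full,) = build_design_matrices([design_info], data)
    return np.asarray(full)[np.asarray(rows, dtype=bool)]


# Keep only the rows for the year currently being forecast.
mask = (df["year"] == 2003).to_numpy()
current = dmatrix_rows(design_info, df, mask)
```

Inside a forecasting loop, design_info would be created once before the loop and dmatrix_rows called on each iteration with the updated data. Whether this helps in practice depends on how much of the time is spent in formula parsing versus evaluation, which the call graph should show.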