Closed jbredeche closed 5 years ago
Nice, I will review this in the following days.
thanks @luca-s.
To make sure I understand, the use case is that someone might have several factors (f1
, f2
, f3
) each on the same date range, but a different set of assets. And thus they'd want to do this:
forward_returns = compute_forward_returns(f1, prices, periods=[1,2])
factor_data_1 = get_clean_factor(f1, forward_returns)
factor_data_2 = get_clean_factor(f2, forward_returns)
factor_data_3 = get_clean_factor(f3, forward_returns)
And now, with my changes, they'd instead have to do this:
forward_returns1 = compute_forward_returns(f1, prices, periods=[1,2])
factor_data_1 = get_clean_factor(f1, forward_returns1)
forward_returns2 = compute_forward_returns(f2, prices, periods=[1,2])
factor_data_2 = get_clean_factor(f2, forward_returns2)
forward_returns3 = compute_forward_returns(f3, prices, periods=[1,2])
factor_data_3 = get_clean_factor(f3, forward_returns3)
Given the big performance improvements (at least for medium or large datasets) in compute_forward_returns
, maybe this API change is OK because it's much faster to calculate forward returns now?
Or should we keep this use case, and add a flag to compute_forward_returns
that governs whether we chop assets? (if we don't, we have to go back to the very slow existing codepath)
Given the big performance improvements (at least for medium or large datasets) in compute_forward_returns, maybe this API change is OK because it's much faster to calculate forward returns now?
I agree. If someone complains about this change we can still add a flag to compute_forward_returns
as you suggested.
I'll go on and merge this.
For decent sized inputs (~1M+ asset/day values in
prices
and ~1M asset/day values infactor
), we were spending enormous amounts of time doing index alignment when combining things with different indices.This change makes it so that we are never combining two dataseries with different indices.
The returned dataframe from
compute_forward_returns
used to contain date/asset pairs that didn't exist infactor
, which made it very slow to add thefactor
series into the dataframe inget_clean_factor
. Now,compute_forward_returns
only returns date/asset pairs that exist infactor
.Locally, for a
prices
df with ~4400 unique assets and 3500 days, and afactor
series spanning 859 unique assets across 1308 days,get_clean_factor_and_forward_returns
went from ~203 seconds to ~5 seconds.Note: I'm not a pandas expert at all, so very likely there is something I did that looks weird. Happy to take any suggestions.