quantopian / alphalens

Performance analysis of predictive (alpha) stock factors
http://quantopian.github.io/alphalens
Apache License 2.0
3.2k stars 1.12k forks source link

ENH: Speed up `compute_forward_returns` and `get_clean_factor`. #327

Closed jbredeche closed 5 years ago

jbredeche commented 5 years ago

For decent sized inputs (~1M+ asset/day values in prices and ~1M asset/day values in factor), we were spending enormous amounts of time doing index alignment when combining things with different indices.

This change makes it so that we are never combining two dataseries with different indices.

The returned dataframe from compute_forward_returns used to contain date/asset pairs that didn't exist in factor, which made it very slow to add the factor series into the dataframe in get_clean_factor. Now, compute_forward_returns only returns date/asset pairs that exist in factor.

Locally, for a prices df with ~4400 unique assets and 3500 days, and a factor series spanning 859 unique assets across 1308 days, get_clean_factor_and_forward_returns went from ~203 seconds to ~5 seconds.

Note: I'm not a pandas expert at all, so very likely there is something I did that looks weird. Happy to take any suggestions.

luca-s commented 5 years ago

Nice, I will review this in the following days.

jbredeche commented 5 years ago

thanks @luca-s.

To make sure I understand, the use case is that someone might have several factors (f1, f2, f3) each on the same date range, but a different set of assets. And thus they'd want to do this:

forward_returns = compute_forward_returns(f1, prices, periods=[1,2])

factor_data_1 = get_clean_factor(f1, forward_returns)
factor_data_2 = get_clean_factor(f2, forward_returns)
factor_data_3 = get_clean_factor(f3, forward_returns)

And now, with my changes, they'd instead have to do this:

forward_returns1 = compute_forward_returns(f1, prices, periods=[1,2])
factor_data_1 = get_clean_factor(f1, forward_returns1)

forward_returns2 = compute_forward_returns(f2, prices, periods=[1,2])
factor_data_2 = get_clean_factor(f2, forward_returns2)

forward_returns3 = compute_forward_returns(f3, prices, periods=[1,2])
factor_data_3 = get_clean_factor(f3, forward_returns3)

Given the big performance improvements (at least for medium or large datasets) in compute_forward_returns, maybe this API change is OK because it's much faster to calculate forward returns now?

Or should we keep this use case, and add a flag to compute_forward_returns that governs whether we chop assets? (if we don't, we have to go back to the very slow existing codepath)

luca-s commented 5 years ago

Given the big performance improvements (at least for medium or large datasets) in compute_forward_returns, maybe this API change is OK because it's much faster to calculate forward returns now?

I agree. If someone complains about this change we can still add a flag to compute_forward_returns as you suggested.

I'll go on and merge this.