jmccorriston opened this issue 4 years ago
@luca-s - I'd love to get your thoughts on this!
@jmccorriston I believe you can safely go ahead with this change and simplify `cumulative_returns`. Back in time, the cumulative returns code was something like `daily_returns.add(1).cumprod().plot(...)`, which is pretty fast. The result is an approximation of cumulative returns that works well for 90% of use cases (I believe).
Just be aware that if you go back to that implementation you will lose the ability to (correctly) compute cumulative returns for:

- periods longer than 1 day (more generally, periods longer than the frequency of the factor data)

I believe all of the above is fine, as you are interested in daily factors.
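For concreteness, here is a toy version of that old fast approach (illustrative names and data, not the actual old code); `daily_returns` is assumed to be a pandas Series of one-period simple returns:

```python
import pandas as pd

# illustrative one-period simple returns on a business-day index
idx = pd.bdate_range("2020-01-01", periods=5)
daily_returns = pd.Series([0.01, -0.02, 0.015, 0.0, 0.005], index=idx)

# compound per-period returns into a cumulative-returns curve:
# (1 + r_1) * (1 + r_2) * ... * (1 + r_t) - 1
cumulative = daily_returns.add(1).cumprod().sub(1)
print(cumulative)
```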
Thanks for the quick response, @luca-s!
To be clear, when you say that such a change would lose the ability to compute cumulative returns for periods longer than a day, do you mean weekly/monthly/etc. factor data? I could definitely be wrong about this, but I was under the impression that factors with slower periods aren't yet supported, given the requirement that the `freq` of the input's `DateTimeIndex` has to be `Day`, `BDay`, or `CDay`.
Do you have an example that runs with a slower period? My guess is I'm just misinterpreting the meaning of 'period' in your explanation.
I took a read through the tutorial and did another pass over the code, and I think I understand the limitation. I think it's important to support the use case where the `period` is greater than 1 day. I'll have to dig a bit more into the rate-limiting steps in the `cumulative_returns` function to see how we can speed things up while still supporting this use case.
In the meantime, would it be sufficient to reframe the solution as taking the mean of the next N N-day cumulative returns to achieve a similar (same?) result as the subportfolio technique? For instance, if my factor is daily but I want the 5D returns, could I take the next 5 5-day means and average them? Apologies if this is the same as the current implementation. I'm trying to think about how we might be able to express this as a rolling computation instead of iterating over subportfolios (in case that makes things faster).
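A rough sketch of the kind of rolling computation I have in mind (a hypothetical helper, not existing Alphalens code), assuming daily simple returns and an N-day period:

```python
import numpy as np
import pandas as pd

def mean_of_shifted_nday_returns(daily_returns: pd.Series, n: int) -> pd.Series:
    # N-day compounded return ending at each day
    nday = daily_returns.add(1).rolling(n).apply(np.prod, raw=True).sub(1)
    # average N copies offset by 0..N-1 days, one per staggered start date
    shifted = [nday.shift(i) for i in range(n)]
    return pd.concat(shifted, axis=1).mean(axis=1)
```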
@jmccorriston my previous reply was not totally correct, but the matter is quite subtle and I didn't want to get too deep into the details... but I will do it now ;)
Initially Alphalens supported only daily data: it assumed that `factor_data` was a daily-frequency dataframe (actually trading-day frequency: no weekends or public holidays) and that the `prices` dataframe followed the same assumption. Also, `periods` was assumed to mean days (e.g. `periods=(1, 3, 5)` meant 1-day, 3-day, and 5-day returns). Finally, cumulative returns were plotted only for the 1-day period.

Given those assumptions, the code `daily_returns.add(1).cumprod().plot(...)` computes the cumulative return correctly (almost: the returns are reported one day earlier than they should be, so Monday's returns are plotted on the previous Friday, Tuesday's returns are reported on Monday, and so on. This is "ok" if you assume contiguous daily data; it's just a one-day shift error).
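To illustrate the shift error with toy data (just an illustration, not the old code verbatim):

```python
import pandas as pd

# Mon, Tue, Wed business days
idx = pd.bdate_range("2020-01-06", periods=3)
# 1-day *forward* returns, indexed at the day the position is entered
daily_returns = pd.Series([0.01, 0.02, -0.01], index=idx)

early = daily_returns.add(1).cumprod()  # each return shows up one day early
fixed = early.shift(1)                  # re-align: report returns when realized
```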
The current code doesn't make any assumptions about `factor_data` frequency; `factor_data` doesn't even have to have any frequency at all (like an event-study based factor). Also, the `prices` dataframe doesn't have to have the same index as `factor_data`; it can have N prices for each entry in `factor_data` (e.g. look at this intraday factor). Because of the above generalization, the code became very complex.
If you'd like to simplify the `cumulative_returns` function to `daily_returns.add(1).cumprod()`, then it will no longer depend on the `period` variable, that's it. It will still work with any factor frequency (daily, weekly, monthly, intraday), but it will not compute cumulative returns for periods longer than the factor frequency (in that case you would need to compute parallel portfolios and merge them).

I know it is tricky, and maybe you are right to remove these bits of code even if it loses generality. Let me know if you need help with code internals; I have a rough idea of what needs to be changed to simplify the `cumulative_returns` function.
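For intuition, here is a rough sketch of the parallel-portfolios idea (my own simplification, not the actual Alphalens implementation): for an N-day period, compound each of the N staggered streams of non-overlapping N-day returns into its own equity curve, then average the curves.

```python
import pandas as pd

def subportfolio_cumulative_returns(nday_returns: pd.Series, n: int) -> pd.Series:
    """nday_returns: N-day forward returns, one observation per day."""
    curves = []
    for offset in range(n):
        legs = nday_returns.iloc[offset::n]      # this sub-portfolio's trades
        curve = (
            legs.add(1).cumprod()                # compound its own trades only
            .reindex(nday_returns.index)
            .ffill()
            .fillna(1.0)                         # flat at 1 before the first trade
        )
        curves.append(curve)
    # average the N equity curves to get the aggregate cumulative return
    return pd.concat(curves, axis=1).mean(axis=1).sub(1)
```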
> In the meantime, would it be sufficient to reframe the solution as taking the mean of the next N N-day cumulative returns to achieve a similar (same?) result as the subportfolio technique? For instance, if my factor is daily but I want the 5D returns, could I take the next 5 5-day means and average them?
Unfortunately it is mathematically not identical. I don't know if it can work as an approximation, though.
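A toy example of why the two differ (made-up numbers): compounding the average of the streams is not the same as averaging the compounded streams, because the mean does not commute with the product.

```python
a = [0.10, -0.10]  # sub-portfolio A per-period returns
b = [-0.10, 0.10]  # sub-portfolio B per-period returns

# average first, then compound: the mean return is 0 in both periods
avg_then_compound = (1 + 0.0) * (1 + 0.0)              # -> 1.00

# compound each stream, then average the equity curves
compound_then_avg = (1.10 * 0.90 + 0.90 * 1.10) / 2    # -> 0.99

print(avg_then_compound, compound_then_avg)
```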
Thanks for the extra detail, Luca! I plan to take a crack at this on Tuesday next week. My plan is to try to implement it in terms of the cumulative returns function in empyrical (`cum_returns`), and possibly address the off-by-one error that you were describing above. I'm an average coder at best, so I'll ping you when I make progress in case I'm heading in a different direction from what you're envisioning.
@luca-s I spent some more time thinking about this and poking around the code base today. My tentative plan is to move some of the sub-portfolio logic into the `utils` module (is that the right technical term?). The way I think about it is that the `performance` module is responsible for computing metrics, the `plotting` module is responsible for plotting those metrics, and the tearsheet module groups sets of metrics and plots into 'analyses'.

My experience using Alphalens so far gives me the expectation (as a user) that everything in the `performance` module should take appropriately formatted factor and forward returns data as input. Any functionality or tooling that aims to get user data into the appropriate format for functions in the `performance` module should live in `utils` (this was inspired by the fact that `get_clean_factor_and_forward_returns` and friends exist there). This way, the functions in `performance` can make stronger assumptions about the structure and content of the input data. Does that make sense to you?
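In code, the contract I'm imagining looks roughly like this (using the existing public entry point; `my_factor` and `prices` are placeholders, not a final design):

```python
from alphalens.utils import get_clean_factor_and_forward_returns
from alphalens import performance

# utils: shape raw user data into the standardized factor_data format
factor_data = get_clean_factor_and_forward_returns(
    my_factor,           # MultiIndex (date, asset) Series of factor values
    prices,              # wide DataFrame of prices, one column per asset
    periods=(1, 5, 10),
)

# performance: compute metrics, assuming the standardized format
mean_ret_by_q, _std_err = performance.mean_return_by_quantile(factor_data)
```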
Description
On Quantopian, Alphalens has become more of a central figure as we have been running challenges where the submissions are made as Alphalens tearsheets. In most of these notebooks, the slowest step of running the notebook from top to bottom is generating the full Alphalens tearsheet. Recently, I did some profiling of `create_full_tear_sheet` to see if there were any relatively simple opportunities to speed it up. One component that immediately popped up was `cumulative_returns`, which was taking ~65% of the total time of running a full tearsheet, and >75% of `create_returns_tear_sheet`.

I took a look at the `cumulative_returns` definition, and I was a little surprised by the complexity. Digging into it a bit, it seems that most of the complexity is a product of supporting `period`s that are smaller (faster?) than the frequency of the provided `returns` data. I'm wondering if it would make sense to simplify the implementation of `cumulative_returns` and drop support for the case where the period is less than, or different from, the frequency of the `returns` data.
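A sketch of how one might reproduce this kind of profiling (assumes a prepared `factor_data` frame; not the exact commands I ran):

```python
import cProfile
import pstats

from alphalens.tears import create_full_tear_sheet

# profile a full tearsheet run and rank functions by cumulative time
cProfile.run("create_full_tear_sheet(factor_data)", "tearsheet.prof")
pstats.Stats("tearsheet.prof").sort_stats("cumulative").print_stats(15)
```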
High-Level Suggestion

Without knowing more about the community that uses Alphalens outside of Quantopian, my first suggestion would be to drop support for the case where the `period` is less than the period of the `returns` data. I was surprised that this case was supported, mostly because I didn't realize that `cumulative_returns` was using interpolation to fill in data points that were required to compute the cumulative return for the specified `period`. Additionally, I think it might be a good idea to leverage the cumulative returns function in empyrical so that results are more likely to line up with other quant finance projects/tools.

By dropping support for the case where the `period` is less than the period of the `returns` data and by implementing `cumulative_returns` in terms of `cum_returns` in `empyrical`, my expectation is that it will become easier to optimize the function for performance.

If there's still a desire to support computing cumulative returns with interpolated returns data, maybe it could be split into a separate function. I read through the code and I don't think I fully understand the current implementation, but I understand that this might be an important use case for some folks.
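A minimal sketch of what that could look like (assuming one-period simple returns and no sub-period interpolation; this is a suggestion, not the current implementation):

```python
import empyrical as ep
import pandas as pd

def cumulative_returns(returns: pd.Series) -> pd.Series:
    # delegate compounding to empyrical so results line up with other
    # quant tooling; equivalent to returns.add(1).cumprod() when
    # starting_value=1
    return ep.cum_returns(returns, starting_value=1)
```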
Reproducible example

Download link: al_sample_data.csv
Output (note that actual runtime varies quite a bit between runs/machines, but the % breakdown of cumtime by function remains roughly the same):
Versions

- Alphalens: 0.3.6
- Python: 3.7.5
- pandas: 0.25.3
- matplotlib: 3.1.2