quantopian / alphalens

Performance analysis of predictive (alpha) stock factors
http://quantopian.github.io/alphalens
Apache License 2.0

Request: Simplify cumulative_returns definition. #357

Open jmccorriston opened 4 years ago

jmccorriston commented 4 years ago

Description

On Quantopian, Alphalens has become more of a central tool as we have been running challenges where submissions are made as Alphalens tear sheets. In most of these notebooks, the slowest step when running the notebook from top to bottom is generating the full Alphalens tear sheet. Recently, I did some profiling of create_full_tear_sheet to see if there were any relatively simple opportunities to speed it up. One component that immediately stood out was cumulative_returns, which was taking ~65% of the total time of running a full tear sheet, and >75% of create_returns_tear_sheet.

I took a look at the cumulative_returns definition, and I was a little surprised by its complexity. Digging into it a bit, it seems that most of the complexity comes from supporting periods that are shorter than the frequency of the provided returns data. I'm wondering if it would make sense to simplify the implementation of cumulative_returns and drop support for the case where the period is shorter than, or otherwise different from, the frequency of the returns data.

High-Level Suggestion

Without knowing more about the community that uses Alphalens outside of Quantopian, my first suggestion would be to drop support for the case where the period is shorter than the period of the returns data. I was surprised that this case was supported, mostly because I didn't realize that cumulative_returns was using interpolation to fill in the data points required to compute the cumulative return for the specified period. Additionally, I think it would be a good idea to leverage the cumulative returns function in empyrical so that results are more likely to line up with other quant finance projects/tools.

By dropping support for the case where the period is less than the period of the returns data, and by implementing cumulative_returns in terms of cum_returns in empyrical (see the sketch below), my expectation is that it will become easier to optimize the function for performance.

If there's still a desire to support computing cumulative returns with interpolated returns data, maybe that could be split into a separate function. I read through the code and I don't think I fully understand the current implementation, but I understand that this might be an important use case for some folks.
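For concreteness, a minimal sketch of what the simplified definition could look like (cumulative_returns_simple is just a placeholder name, not the current Alphalens API; empyrical.cum_returns is the real empyrical function):

import pandas as pd
import empyrical as ep

def cumulative_returns_simple(returns: pd.Series) -> pd.Series:
    # Delegate the compounding to empyrical so results line up with
    # other quant finance tools; with starting_value=1 this is
    # equivalent to returns.add(1).cumprod().
    return ep.cum_returns(returns, starting_value=1)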

Reproducible example (download link: al_sample_data.csv):

import pandas as pd
import cProfile
import pstats
from alphalens.tears import create_returns_tear_sheet

# Include if running in a Jupyter notebook.
%matplotlib inline

# Load the sample factor data (indexed by date and asset) from the attached CSV.
al_inputs = pd.read_csv('al_sample_data.csv', index_col=['date', 'asset'], parse_dates=True)

def run_returns_tear_sheet():
    create_returns_tear_sheet(al_inputs)

# Profile the tear sheet and dump the stats to disk.
p = cProfile.Profile()
p.runcall(run_returns_tear_sheet)
p.dump_stats('returns_tearsheet_profile.stats')

# Print the 20 most expensive calls, sorted by cumulative time.
stats = pstats.Stats('returns_tearsheet_profile.stats')
stats.sort_stats('cumtime').print_stats(20)

Output (note that actual runtime varies quite a bit between runs/machines, but the % breakdown of cumtime by function remains roughly the same):

Fri Jan 31 10:09:55 2020    returns_tearsheet_profile.stats

         62029996 function calls (61402118 primitive calls) in 130.569 seconds

   Ordered by: cumulative time
   List reduced from 3735 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  130.587  130.587 <ipython-input-1-eb0a21d53746>:10(run_returns_tear_sheet)
        1    0.001    0.001  130.587  130.587 /Users/jmccorriston/quant-repos/alphalens/alphalens/plotting.py:38(call_w_context)
        1    0.038    0.038  130.566  130.566 /Users/jmccorriston/quant-repos/alphalens/alphalens/tears.py:165(create_returns_tear_sheet)
        6    0.863    0.144   98.381   16.397 /Users/jmccorriston/quant-repos/alphalens/alphalens/performance.py:332(cumulative_returns)
        1    0.000    0.000   80.395   80.395 /Users/jmccorriston/quant-repos/alphalens/alphalens/plotting.py:757(plot_cumulative_returns_by_quantile)
        7    0.000    0.000   80.307   11.472 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/frame.py:6737(apply)
        7    0.000    0.000   80.298   11.471 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/apply.py:144(get_result)
        7    0.001    0.000   80.297   11.471 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/apply.py:261(apply_standard)
       11    0.000    0.000   80.247    7.295 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/apply.py:111(f)
        7    0.000    0.000   56.973    8.139 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/apply.py:297(apply_series_generator)
    13608    0.093    0.000   47.133    0.003 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/series.py:1188(__setitem__)
    13608    0.069    0.000   46.881    0.003 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/series.py:1191(setitem)
     4536    0.136    0.000   46.233    0.010 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/series.py:1261(_set_with)
     4536    0.469    0.000   45.371    0.010 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/series.py:1303(_set_labels)
22715/18179    0.485    0.000   43.101    0.002 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/indexes/base.py:2957(get_indexer)
     4541    0.082    0.000   31.554    0.007 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/indexes/datetimelike.py:686(astype)
     4536    0.035    0.000   30.160    0.007 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py:706(astype)
     4541    0.054    0.000   29.979    0.007 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/arrays/datetimelike.py:516(astype)
     4541    0.023    0.000   29.849    0.007 /Users/jmccorriston/.virtualenvs/alphalens_env/lib/python3.7/site-packages/pandas/core/arrays/datetimelike.py:346(_box_values)
     4548    3.760    0.001   29.825    0.007 {pandas._libs.lib.map_infer}


jmccorriston commented 4 years ago

@luca-s - I'd love to get your thoughts on this!

luca-s commented 4 years ago

@jmccorriston I believe you can safely go ahead with this change and simplify cumulative_returns. Back in the day, the cumulative returns code was something like daily_returns.add(1).cumprod().plot(...), which is pretty fast. The result is an approximation of the cumulative returns that works well for 90% of use cases (I believe).

Just be aware that if you go back to that implementation you will lose the ability to (correctly) compute cumulative returns for:

- periods longer than the frequency of the factor data (e.g. 5D returns on a daily factor)
- factor data without a regular frequency (e.g. an event study)
- intraday factors

I believe all of the above is fine, as you are interested in daily factors.

jmccorriston commented 4 years ago

Thanks for the quick response, @luca-s!

To be clear, when you say that such a change would lose the ability to compute cumulative returns for periods longer than a day, do you mean weekly/monthly/etc. factor data? I could definitely be wrong about this, but I was under the impression that factors with slower periods aren't yet supported, given the requirement that the freq of the input's DatetimeIndex has to be Day, BDay, or CDay.

Do you have an example that runs with a slower period? My guess is I'm just misinterpreting the meaning of 'period' in your explanation.

jmccorriston commented 4 years ago

I took a read through the tutorial and did another pass over the code, and I think I understand the limitation now. I think it's important to support the use case where the period is > 1 day. I'll have to dig a bit more into the rate-limiting steps in the cumulative_returns function to see how we can speed things up while still supporting this use case.

In the meantime, would it be sufficient to reframe the solution as taking the mean of the next N N-day cumulative returns to achieve a similar (same?) result as the subportfolio technique? For instance, if my factor is daily but I want the 5D returns, could I take the next 5 5-day cumulative returns and average them? Apologies if this is the same as the current implementation. I'm trying to think about how we might express this as a rolling computation (sketched below) instead of iterating over subportfolios, in case that makes things faster.
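Something like this rough sketch is what I have in mind (approx_period_cum_returns is just a hypothetical helper name, not existing Alphalens code):

import pandas as pd

def approx_period_cum_returns(daily_returns: pd.Series, period: int = 5) -> pd.Series:
    # N-day compounded return ending at each date, computed over a
    # rolling window of daily returns.
    nday = daily_returns.add(1).rolling(period).apply(lambda w: w.prod(), raw=True).sub(1)
    # Average the `period` overlapping N-day returns at each date.
    # Note: this is an approximation, not necessarily identical to
    # compounding parallel sub-portfolios.
    return nday.rolling(period).mean()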

luca-s commented 4 years ago

@jmccorriston my previous reply was not totally correct, but the matter is quite subtle and I didn't want to go too deep into the details... but I will now ;)

Initially, Alphalens supported only daily data: it assumed that factor_data was a daily-frequency dataframe (actually trading-day frequency: no weekends or public holidays) and that the prices dataframe followed the same assumption. Also, periods was assumed to mean days (e.g. periods=(1,3,5) meant 1-day, 3-day, and 5-day returns). Finally, the cumulative returns were plotted only for the 1-day period.

Given those assumptions, the code daily_returns.add(1).cumprod().plot(...) computes the cumulative return correctly (almost: the returns are reported one day earlier than they should be, so Monday's returns are plotted on the previous Friday, Tuesday's returns are reported on Monday, and so on. This is "ok" if you assume contiguous daily data; it's just a one-day shift error).
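To make the shift concrete, here is a toy illustration with made-up numbers (1D forward returns are labeled with the factor date, so the return realized on Monday carries Friday's label):

import pandas as pd

# Fri Jan 3, Mon Jan 6, Tue Jan 7 (2020)
idx = pd.bdate_range('2020-01-03', periods=3)
# 1D forward returns labeled by factor date: the value at Friday is
# the return realized over the following Monday.
fwd_1d = pd.Series([0.01, -0.02, 0.005], index=idx)

cum = fwd_1d.add(1).cumprod()
# cum at Friday already includes Monday's move; shifting the index by
# one business day lines the curve up with when returns were earned.
aligned = cum.shift(1, freq='B')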

The current code doesn't make any assumptions about the factor_data frequency; factor_data doesn't even need to have a frequency at all (like an event-study-based factor). Also, the prices dataframe doesn't have to have the same index as factor_data; it can have N prices for each entry in factor_data (e.g. look at this intraday factor).

Because of this generalization, the code became very complex.

If you'd like to simplify the cumulative_returns function to daily_returns.add(1).cumprod(), then it will no longer depend on the period variable, that's it. It will still work with any factor frequency (daily, weekly, monthly, intraday), but it will not compute cumulative returns for periods longer than the factor frequency (in that case you would need to compute parallel portfolios and merge them, as sketched below).
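Roughly, the parallel-portfolio idea looks like this (a simplified sketch with a hypothetical function name, not the actual Alphalens implementation; it assumes a regular index and N-day returns labeled at their entry dates):

import pandas as pd

def cum_returns_via_subportfolios(period_returns: pd.Series, period: int) -> pd.Series:
    curves = []
    for offset in range(period):
        # Sub-portfolio that trades every `period` rows, starting at a
        # different offset, so the portfolios overlap in time.
        sub = period_returns.iloc[offset::period]
        curves.append(sub.add(1).cumprod())
    # Align each equity curve on the full index; before a sub-portfolio
    # starts, treat its equity as flat at 1. Then average the curves.
    merged = pd.concat(curves, axis=1).reindex(period_returns.index).ffill().fillna(1)
    return merged.mean(axis=1)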

I know it is tricky, and maybe you are right to remove these bits of code even if that loses generality. Let me know if you need help with the code internals; I have a rough idea of what needs to be changed to simplify the cumulative_returns function.

luca-s commented 4 years ago

In the meantime, would it be sufficient to reframe the solution as taking the mean of the next N N-day cumulative returns to achieve a similar (same?) result as the subportfolio technique? For instance, if my factor is daily but I want the 5D returns, could I take the next 5 5-day cumulative returns and average them?

Unfortunately, it is not mathematically identical. I don't know whether it can work as an approximation, though.

jmccorriston commented 4 years ago

Thanks for the extra detail, Luca! I plan to take a crack at this on Tuesday next week. My plan is to try to implement it in terms of the cum_returns definition in empyrical, and possibly address the off-by-one error that you were describing above. I'm an average coder at best, so I'll ping you when I make progress in case I'm heading in a different direction from what you're envisioning.

jmccorriston commented 4 years ago

@luca-s I spent some more time thinking about this and poking around the code base today. My tentative plan is to move some of the sub-portfolio logic into the utils module (is that the right technical term?). The way I think about it is that the performance module is responsible for computing metrics, the plotting module is responsible for plotting those metrics, and the tears module groups sets of metrics and plots into 'analyses'.

My experience using Alphalens so far gives me the expectation (as a user) that everything in the performance module should take appropriately formatted factor and forward returns data as input. Any functionality or tooling that aims to get user data into the appropriate format for functions in the performance module should live in utils (this was inspired by the fact that get_clean_factor_and_forward_returns and friends already exist there). This way, the functions in performance can make stronger assumptions about the structure and content of the input data. Does that make sense to you?
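To illustrate the division of responsibilities I'm describing, here's a toy end-to-end flow with synthetic data (the prices and the momentum factor are made up purely for illustration; get_clean_factor_and_forward_returns and factor_returns are the real Alphalens functions):

import numpy as np
import pandas as pd
from alphalens.utils import get_clean_factor_and_forward_returns
from alphalens.performance import factor_returns

# Synthetic prices standing in for real user data.
np.random.seed(42)
dates = pd.bdate_range('2020-01-01', periods=60)
assets = ['A', 'B', 'C', 'D', 'E']
prices = pd.DataFrame(
    np.cumprod(1 + 0.01 * np.random.randn(len(dates), len(assets)), axis=0) * 100,
    index=dates, columns=assets)

# Toy momentum factor, stacked into the (date, asset) MultiIndex shape.
factor = prices.pct_change(5).shift(1).stack().dropna()
factor.index = factor.index.set_names(['date', 'asset'])

# utils: shape raw user data into the canonical factor_data format.
factor_data = get_clean_factor_and_forward_returns(factor, prices, quantiles=3, periods=(1, 5))

# performance: compute metrics on the already-clean input.
returns = factor_returns(factor_data)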