quantopian / alphalens

Performance analysis of predictive (alpha) stock factors
http://quantopian.github.io/alphalens
Apache License 2.0

ENH: quantile/bin numbering schema should be more flexible #336

Open quantopiancal opened 5 years ago

quantopiancal commented 5 years ago

Problem Description

If you pass a DataFrame whose factor_quantile column holds a non-contiguous set of ints (for example, one that starts at 1 but skips a value) to create_turnover_tear_sheet(), you get the following error:

TypeError: unsupported operand type(s) for +: 'numpy.ndarray' and 'Timedelta'

The ints must also start at 1 and increase by 1. For example, if factor_data['factor_quantile'].unique() is [1,2,3,4,5], you're good to go, but if it is [1,3,4,5] or [2,3,4,5], the error is raised.

I believe this is caused by Alphalens looking for a continuous list (starting with 1) because of the range statement on this line: https://github.com/quantopian/alphalens/blob/master/alphalens/tears.py#L415
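
To illustrate the suspected cause, here is a minimal sketch in plain Python (not Alphalens code; the labels list is made up) comparing the quantile numbers a range(1, max + 1) loop would visit with the labels actually present in the data:

```python
# Hypothetical quantile labels left after filtering out quantile 2.
labels = [1, 3, 4, 5, 3, 1]

# What a range-based loop iterates over: every integer from 1 to the max.
visited = list(range(1, int(max(labels)) + 1))

# Labels that actually occur in the data.
present = sorted(set(labels))

# Quantiles the loop visits even though no rows carry that label.
missing = [q for q in visited if q not in present]

print(visited)  # [1, 2, 3, 4, 5]
print(present)  # [1, 3, 4, 5]
print(missing)  # [2]
```

If this reading is right, quantile 2 is still requested from the turnover computation even though it has no rows.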

Please provide a minimal, self-contained, and reproducible example:

from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
from quantopian.pipeline.data import factset, USEquityPricing
from quantopian.pipeline.factors import AverageDollarVolume

from alphalens.tears import create_turnover_tear_sheet
from alphalens.utils import get_clean_factor_and_forward_returns

def make_pipeline():
    market_cap_filter = factset.Fundamentals.mkt_val.latest > 500000000
    volume_filter = AverageDollarVolume(window_length=200) > 2500000
    price_filter = USEquityPricing.close.latest > 5
    base_universe = market_cap_filter & volume_filter & price_filter

    factor_to_analyze = factset.Fundamentals.assets.latest

    return Pipeline(
        columns = {'factor_to_analyze': factor_to_analyze},
        screen = base_universe & factor_to_analyze.notnull()
    )

pipeline_output = run_pipeline(make_pipeline(), '2015-1-1', '2016-1-1')
pricing_data = get_pricing(pipeline_output.index.levels[1], '2015-1-1', '2016-6-1', fields='open_price')

factor_data = get_clean_factor_and_forward_returns(
    factor = pipeline_output['factor_to_analyze'],
    prices = pricing_data,
)

create_turnover_tear_sheet(factor_data[factor_data['factor_quantile'].isin([1, 3, 4, 5])])

Please provide the full traceback:

TypeError                                 Traceback (most recent call last)
<ipython-input-5-1e0de69d03d0> in <module>()
----> 1 create_turnover_tear_sheet(factor_data[factor_data['factor_quantile'].isin([1, 3, 4, 5])])

/usr/local/lib/python2.7/dist-packages/alphalens/plotting.pyc in call_w_context(*args, **kwargs)
     43             with plotting_context(), axes_style(), color_palette:
     44                 sns.despine(left=True)
---> 45                 return func(*args, **kwargs)
     46         else:
     47             return func(*args, **kwargs)

/usr/local/lib/python2.7/dist-packages/alphalens/tears.pyc in create_turnover_tear_sheet(factor_data, turnover_periods)
    415                        for q in range(1, int(quantile_factor.max()) + 1)],
    416                       axis=1)
--> 417             for p in turnover_periods}
    418 
    419     autocorrelation = pd.concat(

/usr/local/lib/python2.7/dist-packages/alphalens/tears.pyc in <dictcomp>((p,))
    415                        for q in range(1, int(quantile_factor.max()) + 1)],
    416                       axis=1)
--> 417             for p in turnover_periods}
    418 
    419     autocorrelation = pd.concat(

/usr/local/lib/python2.7/dist-packages/alphalens/performance.pyc in quantile_turnover(quantile_factor, quantile, period)
    738         shifted_idx = utils.add_custom_calendar_timedelta(
    739                 quant_name_sets.index, -pd.Timedelta(period),
--> 740                 quantile_factor.index.levels[0].freq)
    741         name_shifted = quant_name_sets.reindex(shifted_idx)
    742         name_shifted.index = quant_name_sets.index

/usr/local/lib/python2.7/dist-packages/alphalens/utils.pyc in add_custom_calendar_timedelta(input, timedelta, freq)
    918     days = timedelta.components.days
    919     offset = timedelta - pd.Timedelta(days=days)
--> 920     return input + freq * days + offset
    921 
    922 

/usr/local/lib/python2.7/dist-packages/pandas/indexes/base.pyc in __add__(self, other)
   1648         if isinstance(other, Index):
   1649             return self.union(other)
-> 1650         return Index(np.array(self) + other)
   1651 
   1652     def __radd__(self, other):

TypeError: unsupported operand type(s) for +: 'numpy.ndarray' and 'Timedelta'

Python / Alphalens versions are whatever the Quantopian research platform is running on as of March 21st 2019.

twiecki commented 5 years ago

CC @luca-s

luca-s commented 5 years ago

Thanks for reporting this, @quantopiancal. You are right that Alphalens assumes the quantile/bin labels start at 1 and increase by 1 with no gaps. The code has been built around this assumption. There are probably only a few places where it is relied on, but I cannot list them from memory.

I don't think this is a big issue, since the input data can be pre-processed to make it suitable for Alphalens, but hard-coded assumptions in the code are never nice. If you would like to submit a PR that relaxes this constraint, it would be very welcome.
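
As a rough illustration of that kind of pre-processing, the sketch below remaps whatever quantile labels are present onto 1..N. The helper name relabel_quantiles and the toy DataFrame are hypothetical and not part of Alphalens. Note that remapping changes what each label means (the original quantile 3 becomes 2 here), so it only makes sense when the relative order of the quantiles is what matters.

```python
import pandas as pd


def relabel_quantiles(factor_data, column="factor_quantile"):
    """Return a copy whose quantile labels are remapped to 1..N with no gaps.

    Hypothetical helper for illustration only; not an Alphalens function.
    """
    # Sorted labels that actually occur, e.g. [1, 3, 4, 5].
    present = sorted(factor_data[column].unique())
    # Map each existing label to its 1-based rank, e.g. {1: 1, 3: 2, 4: 3, 5: 4}.
    mapping = {old: new for new, old in enumerate(present, start=1)}
    out = factor_data.copy()
    out[column] = out[column].map(mapping)
    return out


# Toy stand-in for factor_data with quantile 2 missing.
toy = pd.DataFrame({"factor_quantile": [1, 3, 4, 5, 3]})
relabelled = relabel_quantiles(toy)
print(relabelled["factor_quantile"].tolist())  # [1, 2, 3, 4, 2]
```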

jimportico commented 4 years ago

Hi @luca-s - Changing line 415 in alphalens.tears.create_turnover_tear_sheet from:

    quantile_turnover = \
        {p: pd.concat([perf.quantile_turnover(quantile_factor, q, p)
                       for q in range(1, int(quantile_factor.max()) + 1)],
                      axis=1)
            for p in turnover_periods}

to


    quantile_turnover = \
        {p: pd.concat([perf.quantile_turnover(quantile_factor, q, p)
                       for q in quantile_factor.sort_values().unique().tolist()],
                      axis=1)
            for p in turnover_periods}

relaxes the constraint discussed here. Note that the TypeError @quantopiancal reported in the original post is tied to an empty Series being produced for the empty quantiles when working with pandas 0.18.1 (and potentially older versions). I've tested with pandas >= 0.20.3, and there the tear sheets run even without the modification above.
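
For reference, a small sketch of what the replacement expression evaluates to. The toy Series below is made up and only stands in for quantile_factor:

```python
import pandas as pd

# Toy stand-in for quantile_factor: quantile 2 is absent.
quantile_factor = pd.Series([5, 1, 3, 4, 3, 1])

# The expression from the proposed change: sorted labels that actually occur.
labels = quantile_factor.sort_values().unique().tolist()

print(labels)  # [1, 3, 4, 5]
```

Only quantiles that have rows are iterated, so no empty quantile is passed to perf.quantile_turnover.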

I'll submit a PR with this change as a next step.