quantopian / alphalens

Performance analysis of predictive (alpha) stock factors
http://quantopian.github.io/alphalens
Apache License 2.0

Feature request: Use alphalens with returns instead of prices #253

Closed neman018 closed 6 years ago

neman018 commented 6 years ago

Is there a way to run Alphalens using my own custom input matrix of returns (i.e. custom factor-adjusted) rather than inputting prices (i.e. the "pricing" table in the examples) for each period?

luca-s commented 6 years ago

You could skip the call to utils.get_clean_factor_and_forward_returns and format your data according to what the tears.* functions expect (the format is the return value of utils.get_clean_factor_and_forward_returns; have a look at the docstrings).

More generally, utils.get_clean_factor_and_forward_returns is a utility function; you don't have to call it if you don't need it, but you can have a look at the source code, which might help you format your data.
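
For illustration, here is a minimal sketch of building a factor_data-shaped frame by hand from pre-computed returns. The period column names and the quantile labelling below are assumptions (they vary between alphalens versions), so check the docstring of utils.get_clean_factor_and_forward_returns for the exact format your version expects:

import pandas as pd

# hypothetical pre-computed inputs, indexed by a (date, asset) MultiIndex
dates = pd.to_datetime(['2018-01-02', '2018-01-03'])
assets = ['AAPL', 'MSFT']
index = pd.MultiIndex.from_product([dates, assets], names=['date', 'asset'])

factor_data = pd.DataFrame(index=index)
factor_data['1D'] = [0.010, -0.020, 0.030, 0.000]   # forward return over 1 period
factor_data['5D'] = [0.050, -0.010, 0.040, 0.020]   # forward return over 5 periods
factor_data['factor'] = [0.5, -0.3, 0.8, 0.1]       # raw factor values
# per-date quantile labels (1 = lowest factor values)
factor_data['factor_quantile'] = (factor_data.groupby(level='date')['factor']
                                  .transform(lambda x: pd.qcut(x, 2, labels=False) + 1))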

luca-s commented 6 years ago

@all I wonder how often it happens that returns are available instead of prices. If that's a common scenario we have to think about providing a utility function that prepares and formats the input data from returns information.

neman018 commented 6 years ago

Any chance you guys will add a utility function for returns (vs. stock prices)? I think this might be useful for a lot of users who construct their own factor models and consequently custom return streams.

If not, I'll try to implement it myself.

HereticSK commented 6 years ago

I believe it is quite common to iterate over different factors while the returns do not have to change. It would be a lot easier and faster to have a utility function that constructs the input data from returns.

Currently, I implement this on my own. But I believe it may be a good idea to include this in the package, if it is common enough.

twiecki commented 6 years ago

@luca-s Seems like there is indeed quite a bit of demand for it.

luca-s commented 6 years ago

@neman018 Could you provide some more details about your use case? Could you provide an example API that would suit your case and which would be general enough to be a new Alphalens API?

luca-s commented 6 years ago

@HereticSK as you have already implemented it, would it be possible to share your ideas so that we can evaluate them?

HereticSK commented 6 years ago

Actually we've discussed something similar in #199 . I think the pain point here is that get_clean_factor_and_forward_returns repeats the slow forward-return computation every time. So I simply split out this part from it and reuse everything else.

I made this create_factor_data function based on get_clean_factor_and_forward_returns, where the forward-return computation is replaced with pre-computed values; in the signature, prices is replaced with forward_returns and periods is removed.

def create_factor_data(factor,
                       forward_returns,
                       groupby=None,
                       by_group=False,
                       quantiles=5,
                       bins=None,
                       groupby_labels=None):

    # Replace calling compute_forward_returns with pre-computed values
    merged_data = forward_returns.copy() 

    # Keep everything else unchanged below
    ......

Now the analysis code becomes something like:

periods = (20, 40, 60, 120)
forward_returns = alphalens.utils.compute_forward_returns(prices=pricing, periods=periods)

factor_data_1 = create_factor_data(
    factor=factor_1, 
    forward_returns=forward_returns,
    quantiles=10
)

factor_data_2 = create_factor_data(
    factor=factor_2, 
    forward_returns=forward_returns,
    quantiles=10
)

I find that reusing the pre-computed forward returns saves a lot of time, as I often test different factors with the same set of forward returns.

luca-s commented 6 years ago

@HereticSK your solution seems reasonable.

What about adding a forward_returns argument to get_clean_factor_and_forward_returns? Only one of prices and forward_returns should be non-None: if prices is provided then forward_returns will be computed, otherwise they will be copied as in your code.
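
A rough sketch of what that mutual exclusion could look like (argument names and defaults here are only the ones discussed in this thread, not a final API):

def get_clean_factor_and_forward_returns(factor,
                                         prices=None,
                                         forward_returns=None,
                                         periods=(1, 5, 10),
                                         **kwargs):
    # exactly one of `prices` and `forward_returns` must be supplied
    if (prices is None) == (forward_returns is None):
        raise ValueError("pass either `prices` or `forward_returns`, not both")

    if prices is not None:
        forward_returns = ...  # compute forward returns from prices, as today
    else:
        forward_returns = forward_returns.copy()

    # ... rest of the current implementation, unchanged ...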

The other approach would be to add another function, get_clean_factor, which requires forward_returns but not prices. In this case get_clean_factor_and_forward_returns would simply compute forward_returns from prices and then call get_clean_factor (which would do all the work currently done by get_clean_factor_and_forward_returns except the forward_returns computation).

My only concern is keeping the API clean; I don't want to make it confusing.

@twiecki What do you think about this?

luca-s commented 6 years ago

@neman018 Would the proposed solution be enough to fix your use case too?

twiecki commented 6 years ago

@luca-s I like your first proposal.

HereticSK commented 6 years ago

For the first proposal, it should be noted that both the prices and periods arguments are related to the forward-return computation. For clarity, only one of forward_returns and (prices, periods) should be non-None, even though allowing periods to be non-None won't cause trouble. I am afraid that would make the validation logic a little too verbose. Also, allowing both prices and forward_returns seems to make the function do too much, in my view.

In that sense, the second proposal looks better to me. But if backward compatibility is of greater concern, the first one is good.

luca-s commented 6 years ago

@HereticSK oh right, I forgot about periods, thanks. Well, we can simply skip the check on periods if prices is not present, as you suggested.

The second proposal wouldn't break compatibility; I imagine something like this:

# implemented with the current get_clean_factor_and_forward_returns body and signature
# except for the forward_returns computation and periods
def get_clean_factor(factor, forward_returns, ...):
    ...

# same function signature as the current implementation but it would call get_clean_factor
def get_clean_factor_and_forward_returns(factor, prices, periods, ...):
    forward_returns = ...  # compute forward_returns
    return get_clean_factor(factor, forward_returns, ...)

Still, having a single function seems more user friendly, even though, from a code point of view, it is a little messy.

HereticSK commented 6 years ago

@luca-s Yes, I agree that a single function is more user friendly.

Yet there is another concern about the first proposal. The name get_clean_factor_and_forward_returns, as I see it, somehow indicates that forward_returns is an output, while it would accept forward_returns as input. This seems a bit confusing. Just my personal perception, I don't know if it is a serious problem.

luca-s commented 6 years ago

I cannot make up my mind on what the best approach is. I have a slight preference for the second approach, as does @HereticSK: it seems cleaner and the API is more obvious, even though we lose the advantage of having a single entry point for the user.

@twiecki So I am just double-checking with you that the first option is the one we want to go for, and then we can move on with the implementation.

luca-s commented 6 years ago

@HereticSK would you be interested in providing a PR?

HereticSK commented 6 years ago

I'd be happy to do it. Let's see what @twiecki thinks. I will start working on it once we decide which approach to go with.

twiecki commented 6 years ago

I'm fine with either one. Let's ask our in-house API expert @ssanderson.

ssanderson commented 6 years ago

Sorry, I missed this ping. Will look at this today.

neman018 commented 6 years ago

Thanks for pushing this along guys - I think it will be very useful for the community!

ssanderson commented 6 years ago

I'm in favor of @HereticSK's suggestion in https://github.com/quantopian/alphalens/issues/253#issuecomment-361794546, to factor out a separate create_factor_data function and have the implementation of get_clean_factor_and_forward_returns be composed out of get_forward_returns and create_factor_data.

I think it's hard in general to make objective arguments about API design, but here's an attempt at a justification for why I prefer this design over adding another optional argument to get_clean_factor_and_forward_returns:

ssanderson commented 6 years ago

Sub-comment: if we separate out a new function, it would be nice to have a more descriptive name than "create_factor_data". Is there a better word than "data" we could use for "stacked dataframe of forward returns plus group and factor information"?

The two metaphors I associate with alphalens are science experiments (e.g. looking at the factor data under a microscope) and prospecting (e.g. panning for gold in your factor data). "Survey"? "Study"? "Experiment"? "Assay"? "Analysis"?

twiecki commented 6 years ago

Thanks @ssanderson, I like your arguments.

HereticSK commented 6 years ago

As I got working on a PR, I have had some second thoughts; I hope to discuss them with you guys @luca-s @twiecki @ssanderson .

The current proposal we've agreed on:

# implemented with the current get_clean_factor_and_forward_returns body and signature
# except for the forward_returns computation and periods
def get_clean_factor(factor, forward_returns, ...):
    ...

# same function signature as the current implementation but it would call get_clean_factor
def get_clean_factor_and_forward_returns(factor, prices, periods, ...):
    forward_returns = ...  # compute forward_returns
    return get_clean_factor(factor, forward_returns, ...)

There are 3 kinds of logic here:

  • take prices, compute forward returns, return a forward_returns data frame
  • take a factor series, return a data frame containing the original factor, the quantized version, and optionally the grouping information
  • combine the forward_returns and the factor data frame into a single factor_data data frame, and report data loss information

Is it a better idea to make get_clean_factor only responsible for the second kind of logic, i.e. quantizing the factor, and have a new function get_clean_factor_data (ignore the naming for the moment) deal with the third? This is slightly different from the current proposal, but seems much cleaner. Something like this:

# get_clean_factor is only responsible for processing the factor
def get_clean_factor(factor, quantiles, groupby, group_labels, ...):
    ...

# get_clean_factor_data can be used to combine different sets of
# pre-computed factor and forward_returns
# arguments related to processing factor and forward_returns are
# completely factored out, only max_loss remains
def get_clean_factor_data(factor, forward_returns, max_loss):
    factor = get_clean_factor(factor, ...)
    merged_data = ...  # merge forward_returns and factor

    # handle the data loss information
    ...

# get_clean_factor_and_forward_returns keeps the current signature
# but is implemented as a simple wrapper
def get_clean_factor_and_forward_returns(factor, prices, periods, ...):
    forward_returns = ...  # compute forward_returns
    factor_data = get_clean_factor_data(factor, forward_returns)
    return factor_data

If this is the right way to go, there's something I am not sure about. Currently, computing quantized factor involves calling utils.quantize_factor, whose factor_data argument assumes that the factor series is already merged into a data frame with forward_returns and grouping columns. It does almost the same thing as get_clean_factor(factor, quantiles, groupby, group_labels, ....).

If we accept changing its function signature, there's no need to have a separate get_clean_factor function. If I am not mistaken, utils.quantize_factor is not used elsewhere in the package except get_clean_factor_and_forward_returns. So I think it is safe to change?

HereticSK commented 6 years ago

As for @ssanderson's sub-comment about naming, I do agree that the word "data" is a bit too general. I did struggle for some time to find a better name. "factor_data" is the best I can come up with, as it is widely and consistently used in the tears module as the core data structure. It may be just clear enough to use in the context of alphalens. Do let me know if there are any better names.

luca-s commented 6 years ago

@ssanderson thank you for your comments and for explaining your reasoning so carefully. I agree with your statements and I am also fine with changing factor_data to some other name, but I would suggest doing that in a different PR whose purpose is only refactoring. This is because factor_data is used extensively in most Alphalens function signatures; if we change the name we have to do it everywhere in one single PR (and there will be many, many changes).

luca-s commented 6 years ago

@HereticSK Your proposal seems even cleaner. I believe we can initially build this new API without publicising it in the example NBs. We will see later if it works well or needs refinements; we can consider it an implementation detail for now. @ssanderson ?

Currently, computing quantized factor involves calling utils.quantize_factor, whose factor_data argument assumes that the factor series is already merged into a data frame with forward_returns and grouping columns. It does almost the same thing as get_clean_factor(factor, quantiles, groupby, group_labels, ....).

Actually the input of utils.quantize_factor could be a much simpler DataFrame than factor_data: it only requires a 'factor' column and optionally 'group'; it doesn't use forward returns information. We should probably fix the docstring.
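
For example, something along these lines should be enough (the exact quantize_factor signature may differ between versions, so treat this as illustrative):

import pandas as pd
from alphalens.utils import quantize_factor

# a (date, asset)-indexed frame with only the columns quantize_factor reads:
# 'factor', plus 'group' only when by_group=True
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2018-01-02', '2018-01-03']), ['AAPL', 'MSFT', 'XOM']],
    names=['date', 'asset'])
minimal = pd.DataFrame({'factor': [0.3, -0.1, 0.5, 0.2, 0.0, -0.4]}, index=idx)

factor_quantile = quantize_factor(minimal, quantiles=3)  # per-date quantile labels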

If we accept changing its function signature, there's no need to have a separate get_clean_factor function. If I am not mistaken, utils.quantize_factor is not used elsewhere in the package except get_clean_factor_and_forward_returns. So I think it is safe to change?

get_clean_factor should compute the grouping too, so they perform different computations.

A few comments:

  1. get_clean_factor_data could be called combine_clean_factor_and_forward_returns maybe (but I am fine with other names)?
  2. the data loss can happen in get_clean_factor too, due to the binning
HereticSK commented 6 years ago

@luca-s

If we accept changing its function signature, there's no need to have a separate get_clean_factor function.

Originally I meant having a single function that both computes the grouping and quantizes the factor, returning a data frame containing the original factor. Something like

def quantize_factor(factor, quantiles, groupby, group_labels, ...):
    # compute the grouping, the same code as
    # what is currently in get_clean_factor_and_forward_returns
    if groupby is not None:
        ...

    # compute the quantized factor

    # return a merged data frame
    return merged_data

The reason why I have this idea:

  1. quantize_factor has to know the grouping information. If this information comes from the input factor_data, as in the current implementation, the user has to create a data frame before calling the function, which can be inconvenient. Alternatively, we can add a groupby argument to accept the grouping series/dict. In that case, quantize_factor becomes exactly the same as get_clean_factor. Either way the grouping info has to come along with the factor, so I thought it would be simpler to keep only one function.
  2. Currently quantize_factor returns a factor_quantile series. It is not used anywhere other than inside get_clean_factor_and_forward_returns. Are there any use cases where we just want a factor_quantile series rather than a factor_data data frame?

But now I am hesitant. Keeping quantize_factor provides more flexibility and makes get_clean_factor cleaner.

get_clean_factor_data could be called combine_clean_factor_and_forward_returns maybe (but I am fine with other names)?

combine_clean_factor_and_forward_returns is good, consistent with the current API get_clean_factor_and_forward_returns. Yet get_factor_data/get_clean_factor_data has its advantage. It indicates its output to be a factor_data, which in turn becomes the input of further analysis. I am good with both.

the data loss can happen in get_clean_factor too, due to the binning

This is another detail I am not sure about. Currently, both the data loss in the forward_returns computation and in the binning phase are reported in get_clean_factor_and_forward_returns. Should we handle both kinds of data loss in combine_clean_factor_and_forward_returns? Or handle the binning loss in get_clean_factor and the forward-returns loss in combine_clean_factor_and_forward_returns?

luca-s commented 6 years ago

But now I am hesitant. Keeping quantize_factor provides more flexibility and makes get_clean_factor cleaner.

I would not change quantize_factor; it is a well-defined function which has a single purpose and does it well.

combine_clean_factor_and_forward_returns is good, consistent with the current API get_clean_factor_and_forward_returns. Yet get_factor_data/get_clean_factor_data has its advantage. It indicates its output to be a factor_data, which in turn becomes the input of further analysis. I am good with both.

I would go with combine_clean_factor_and_forward_returns

This is another detail I am not sure about. Currently, both the data loss in the forward_returns computation and in the binning phase are reported in get_clean_factor_and_forward_returns. Should we handle both kinds of data loss in combine_clean_factor_and_forward_returns? Or handle the binning loss in get_clean_factor and the forward-returns loss in combine_clean_factor_and_forward_returns?

Regardless of where the data loss occurs, I would report/print the data loss information in both get_clean_factor_and_forward_returns and combine_clean_factor_and_forward_returns, which are the functions that return factor_data.
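
Something like the following could be shared by both functions (a sketch only; the message wording and the exception type are illustrative, alphalens has its own max_loss handling):

def report_data_loss(initial_length, final_length, max_loss):
    # fraction of (date, asset) rows dropped while building factor_data
    loss = (initial_length - final_length) / float(initial_length)
    print("Dropped %.1f%% entries from factor data (max_loss is %.1f%%)"
          % (loss * 100, max_loss * 100))
    if loss > max_loss:
        # alphalens would raise its own error type here
        raise ValueError("max_loss (%.1f%%) exceeded" % (max_loss * 100))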

ssanderson commented 6 years ago

Sorry for the slow reply. This is in my queue to look at tomorrow.

MichaelJMath commented 6 years ago

get_clean_factor_data could be called combine_clean_factor_and_forward_returns maybe (but I am fine with other names)?

My vote would also be for combine_clean_factor_and_forward_returns as it seems to be a bit more intuitive as to what the function actually does.

ssanderson commented 6 years ago

There are 3 kinds of logic here:

  • take prices, compute forward returns, return a forward_returns data frame
  • take a factor series, return a data frame containing the original factor, the quantized version, and optionally the grouping information
  • combine the forward_returns and the factor data frame into a single factor_data data frame, report data loss information

This seems like a reasonable decomposition. I think it might be helpful to write out what we think the expected use cases would be for these functions. It sounds to me like there are three kinds of users who want to construct a factor_data frame:

  1. Users who have prices and a single factor and want to build a factor_data frame they can pass to other alphalens functions. These users should continue to use get_clean_factor_and_forward_returns.
  2. Users who have prices and multiple factors. These users should call forward_returns on their prices and then call get_clean_factor_data(factor, forward_returns, max_loss) for each of their factors, which saves the computational cost of re-computing forward returns multiple times.
  3. Users who already have forward returns and one or more factors. These users should just call get_clean_factor_data(factor, forward_returns) on each of their factors.

Am I missing anything here? From this analysis it seems to me like it doesn't matter much, from an API perspective, whether we separate out the logic of factor processing from joining the factor with the forward returns; I don't think we expect any users to call combine_factor_and_forward_returns directly. It might still be cleaner from an implementation perspective to split out the combination function, but I think that would just be an implementation detail, as @luca-s noted above.
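
In code, cases 2 and 3 above would both boil down to something like this (function names are the ones proposed in this thread, not an existing API, and prices/factor_1/factor_2 are the user's own data):

# case 2: users with prices compute the shared forward returns once
fwd_returns = forward_returns(prices, periods=(1, 5, 10))

# cases 2 and 3: build one factor_data frame per factor from the shared returns
factor_data_1 = get_clean_factor_data(factor_1, fwd_returns, max_loss=0.35)
factor_data_2 = get_clean_factor_data(factor_2, fwd_returns, max_loss=0.35)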

ssanderson commented 6 years ago

As an aside, in the long term I'm not sure I'm convinced we need combining logic at all. I commented in https://github.com/quantopian/alphalens/pull/258#issuecomment-364843134 that the current factor_data layout forces a lot of duplication if you want to analyze multiple factors.

Depending on how important a use case it is to analyze multiple factors (and it seems like the desire to efficiently analyze multiple factors is one of the motivating ideas for this thread), I think we might want to consider breaking apart the monolithic factor_data format into returns + group columns (which can be shared across multiple analyses) and factor-specific factor value/quantile columns. That's a much larger API change than what we've been discussing so far though, so it probably deserves its own discussion thread.

luca-s commented 6 years ago

It seems like we all reached an agreement. Let me summarize and add some more details.

The official APIs we would like to have are:

  1. forward_returns(factor, prices, ...): we can refactor and rename the existing compute_forward_returns so that it accepts factor and prices and also performs the few steps that are currently done inside get_clean_factor_and_forward_returns, that is:
def forward_returns(factor, prices, ...):
    factor = factor.copy()
    factor.index = factor.index.rename(['date', 'asset'])
    factor_dateindex = factor.index.get_level_values('date').unique()

    # copy here the current body of compute_forward_returns
    [...]
  2. get_clean_factor_data(factor, forward_returns, max_loss, ...). This would do what get_clean_factor_and_forward_returns currently does except for the forward_returns computation (it would also accept all the same arguments).

  3. get_clean_factor_and_forward_returns: behaves the same as now. It can be re-implemented by calling compute_forward_returns + get_clean_factor_data

From the implementation point of view we could also have combine_factor_and_forward_returns and get_clean_factor but we don't expect to maintain backward compatibility for those.

Regarding the split of factor_data into returns + group columns, I agree we should cover that on a separate thread.

HereticSK commented 6 years ago

forward_returns(factor, prices,...) we can refactor and rename the existing compute_forward_returns so that it accepts factor and prices and also performs the few steps that are currently done inside get_clean_factor_and_forward_returns

@luca-s is this addressing @ssanderson 's second case?

Users who have prices and multiple factors. These users should call forward_returns on their prices and then call get_clean_factor_data(factor, forward_returns, max_loss) for each of their factors, which saves the computational cost of re-computing forward returns multiple times.

If I'm not mistaken, what he referred to as forward_returns is exactly the same as the current compute_forward_returns. The calling code for this case should be something like the snippet below, which does not seem to require refactoring compute_forward_returns:

forward_returns = compute_forward_returns(prices, periods, ...)
factor_data = get_clean_factor_data(factor, forward_returns, ...)

I believe the logic for computing forward returns is quite independent of factors. Am I missing anything?

That summarizes to refactoring one current API function and introducing one new API function:

get_clean_factor_and_forward_returns: behaves the same as now. It can be re-implemented by calling compute_forward_returns + get_clean_factor_data

get_clean_factor_data(factor, forward_returns, max_loss, ...). This would do what get_clean_factor_and_forward_returns currently does except for the forward_returns computation (it would also accept all the same arguments).

Since they are considered the new API, would it be a good idea to rename get_clean_factor_data to get_factor_data_from_forward_returns and get_clean_factor_and_forward_returns to get_factor_data_from_prices? That seems consistent. The current get_clean_factor_and_forward_returns can be kept as an alias for get_factor_data_from_prices for backward compatibility.

luca-s commented 6 years ago

@luca-s is this addressing @ssanderson 's second case?

Yes

If I'm not mistaken, what he referred to as forward_returns is exactly the same as the current compute_forward_returns.

Not exactly: the first argument of compute_forward_returns is factor_idx, and the intent of the little piece of code I posted above was to make the function more user friendly and let it accept factor and prices so that you can write:

forward_returns = compute_forward_returns(factor, prices, periods, ...)
factor_data = get_clean_factor_data(factor, forward_returns, ...)

We don't need to rename compute_forward_returns to forward_returns if we don't want to; I just wanted to use the function names used by @ssanderson to avoid confusion.

I believe the logic for computing forward returns is quite independent of factors. Am I missing anything?

It actually needs the factor index. Is it possible that you are looking at an old version of compute_forward_returns? It has been changed recently.

luca-s commented 6 years ago

Regarding the API names I wouldn't rename get_clean_factor_and_forward_returns but it maybe makes sense to rename get_clean_factor_data to get_clean_factor.

HereticSK commented 6 years ago

Oh sorry, I was looking at the old version. Got it now; I will try to submit a PR this week.

HereticSK commented 6 years ago

Hi, guys. I just submitted a PR, implementing the new API as we discussed earlier. Please review and let me know if there are questions.

As I tried to refactor compute_forward_returns, I wondered: would it be against our original purpose to have compute_forward_returns depend on the factor? Originally, we wanted to decouple forward returns and factors so that users can analyze different factors using the same set of forward returns. Under the current design, compute_forward_returns(factor, prices), the user has to recompute forward returns for different factors, unless he/she assumes that the forward returns computed with one factor apply to other factors.

If I am not mistaken, the factor in compute_forward_returns is only used to infer the trading calendar and get the custom frequency. Is it possible to do this without the factor? Say, assume the price index covers the full set of trading sessions, treat the set difference between the full calendar and the price calendar as custom holidays, and construct a CustomBusinessDay with these holidays?
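
A rough sketch of that idea for the daily case (the inference logic is purely illustrative; CustomBusinessDay is the pandas offset mentioned above):

import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay

def trading_day_from_prices(prices):
    # treat every weekday in the covered range that never appears in the
    # price index as a custom holiday
    traded = prices.index.normalize().unique()
    full = pd.bdate_range(traded.min(), traded.max())
    holidays = full.difference(traded)
    return CustomBusinessDay(holidays=holidays.tolist())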

luca-s commented 6 years ago

@HereticSK I understand your point, though it's quite safe to assume that the forward returns computed with one factor apply to other factors, which allows re-using forward returns. As long as multiple factors have the same date index the forward returns will be the same, e.g. multiple factors computed daily at market open.

I don't believe it would be easy to remove factor from the compute_forward_returns arguments; there are many scenarios covered by Alphalens where your proposed solution wouldn't work. Have a look at the intraday NB example, where the prices look like this:

[image: intraday prices table from the example notebook]

Given those prices and periods=(1, 2, 3, 6) the function compute_forward_returns has to infer that the actual time periods are 1h, 3h, 1D, 2D. Without having the factor date index, which gives the information of when the trades happen, I don't believe the function can compute the required output.

luca-s commented 6 years ago

Also, I thought again about the new function names (get_clean_factor_and_forward_returns and get_clean_factor) and I understand why you prefer to change the names; the current ones are not very explanatory. It might be good to rename the functions, but please note that if we change get_clean_factor_and_forward_returns then all example NBs and many tests have to be updated too. If we really want to go down this path then it would be better to create a separate PR involving only the renaming. That would make for an easier-to-read git history.

EDIT: my comment on creating a separate PR doesn't make any sense, as we can have multiple commits on the same PR. So as long as @twiecki and @ssanderson agree, we can rename get_clean_factor_and_forward_returns (and change example NBs and tests accordingly)

HereticSK commented 6 years ago

Without having the factor date index, which gives the information of when the trades happen, I don't believe the function can compute the required output.

Sorry, I am not following.

As illustrated in the intraday_factor example:

The pricing data passed to alphalens should contain the entry price for the assets so it must reflect the next available price after a factor value was observed at a given timestamp. ...... The pricing data must also contain the exit price for the assets

As long as the factor timestamps are a subset of the price timestamps, a given factor timestamp can always find its corresponding entry price, and thus the corresponding forward return.

In terms of the code, factor_idx is first intersected with prices.index, and then passed to infer_trading_calendar. From here on, factor_idx is a subset of prices.index:

    factor_idx = factor_idx.intersection(prices.index)

    forward_returns = pd.DataFrame(index=pd.MultiIndex.from_product(
        [factor_idx, prices.columns], names=['date', 'asset']))

    freq = infer_trading_calendar(factor_idx, prices.index)

And then in infer_trading_calendar, factor_idx and prices.index are unioned:

def infer_trading_calendar(factor_idx, prices_idx):
    full_idx = factor_idx.union(prices_idx)

Is there a round trip here? Does full_idx always equal prices_idx? Am I missing anything?

The following piece of code affects the shape of the forward_returns data frame. Can this be put in get_clean_factor?

    for period in periods:

        #
        # build forward returns
        #
        fwdret = prices.pct_change(period).shift(-period).reindex(factor_idx)
luca-s commented 6 years ago

As long as the factor timestamps are a subset of the price timestamps, a given factor timestamp can always find its corresponding entry price, and thus the corresponding forward return.

I agree but there are two subtleties to consider:

  1. The forward returns a factor needs are the ones computed starting from the factor date index; we don't need the forward returns computed for every date in prices. Given that one of the points of this issue is to save computation time, we don't want to increase the computation required for the common scenario of a single factor. In the intraday example, if we computed the forward returns for each date in the prices index we would increase the computation time by 3 times (the prices index is 3 times larger than the factor index) and the memory required too.
  2. Depending on the factor index used to compute the forward returns, we might end up with different time periods. For example, if we look at the intraday NB, the inferred time periods 1h, 3h, 1D, 2D would be different for a factor with a different date index, e.g. a factor computed at 12:30 instead of 9:30.

Is there a round trip here? Does full_idx always equal prices_idx? Am I missing anything?

We don't know if prices is correctly populated; it might miss some dates from factor. To be correct in the computation we perform a union of the two indices instead of relying on the correctness of prices.

The following piece of code affects the shape of the forward_returns data frame. Can this be put in get_clean_factor?

I don't get it. If the code affects forward returns, isn't it ok to have it inside compute_forward_returns?

HereticSK commented 6 years ago

we don't want to increase the computation required for the common scenario of a single factor

In fwdret = prices.pct_change(period).shift(-period).reindex(factor_idx) the forward returns are actually computed for all price dates, and then part of them is discarded. It is the time for stacking data frames and combining columns that is saved, rather than the time for computing forward returns. I understand that it can be significantly more expensive to concatenate columns than to compute pct_changes, so in this case I agree that there is a performance gain.

But when it comes to the multi-factor analysis case, if the length of the forward return data frame depends on the factor, the user has to carefully choose a factor to compute forward returns with, so that the forward returns can be reused in other analyses, since custom factor algorithms may introduce an unpredictable number of missing values.

Maybe it does not need to subset the forward returns frame in compute_forward_returns. Aligning factor and forward_returns is eventually handled in get_clean_factor(factor, forward_returns).

There seems to be a trade-off here. Personally, I think that in the single-factor scenario forward returns are computed only once, so the overhead is acceptable, and it may not often be significant. On the other hand, of course, multi-factor users can, in a tricky way, get the full-date forward returns by passing the prices themselves as a factor.
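
Spelled out, that trick would look something like this (illustrative only; it just manufactures a factor-shaped series whose index covers every price timestamp, since only the index matters to compute_forward_returns):

periods = (1, 2, 3, 6)  # whatever periods are needed

# stack prices into a (date, asset)-indexed series and use it as a dummy factor
dummy_factor = prices.stack()
dummy_factor.index = dummy_factor.index.rename(['date', 'asset'])

full_forward_returns = compute_forward_returns(dummy_factor, prices, periods)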

We don't know if prices is correctly populated, it might miss some dates from factor. To be correct in the computation we perform a union of the two indices instead of relying on the correctness of prices

  1. In practice, I assume that pricing data, as raw data, is less likely to omit any trading dates than factors computed with custom algorithms. This is reliable at least in my local market. Is this assumption so weak in other markets that we have to pay special attention?
  2. In terms of the code, full_idx = factor_idx.union(prices_idx), where factor_idx has already become the result of the intersection between factor_idx and prices_idx, effectively a subset of prices_idx. The union of a subset and prices_idx itself always yields prices_idx itself. In other words, factor_idx does not matter here. Am I missing anything?
luca-s commented 6 years ago

Maybe it does not need to subset the forward returns frame in compute_forward_returns. Aligning factor and forward_returns is eventually handled in get_clean_factor(factor, forward_returns).

I guess we could accept a small decrease in performance, but how could you fix problem 2 from my previous comment?

The proposal of @ssanderson would probably help here ( "we might want to consider breaking apart the monolithic factor_data format into returns + group columns (which can be shared across multiple analyses) and factor-specific factor value/quantile columns" )

In terms of the code, full_idx = factor_idx.union(prices_idx), where factor_idx has already become the result of the intersection between factor_idx and prices_idx, effectively a subset of prices_idx. The union of a subset and prices_idx itself always yields prices_idx itself. In other words, factor_idx does not matter here. Am I missing anything?

No, you are right. infer_trading_calendar should have been called at the beginning of the function, before the factor index mangling. Could you fix that too in the PR?

HereticSK commented 6 years ago

The proposal of @ssanderson would probably help here ( "we might want to consider breaking apart the monolithic factor_data format into returns + group columns (which can be shared across multiple analyses) and factor-specific factor value/quantile columns" )

Sounds nice. I'd love to discuss this further.

For problem 2, if the factor is computed at 12:30, which in this case is the market close, then '1h' and '2h' should not apply here. The asset must be held for at least one day. Other periods like '1D1h' can be inferred accordingly.

I believe the key is not when the factor is computed, but whether the price index contains the factor index. Here I assume 2 things:

  1. The strategy is executed as soon as the factor is observed, e.g., a factor observed at 12:30 uses the price at 12:30 as the entry price. This does not support users who want their positions filled at other prices.

  2. The price index contains the full set of trading dates/hours. The frequency is high enough that one factor value at a particular timestamp corresponds to one price. As long as this assumption holds, I believe we can compute the correct timedelta using pandas' CustomBusinessHour, without referring to the factor index.

luca-s commented 6 years ago

For problem 2, if the factor is computed at 12:30, which in this case is the market close, then '1h' and '2h' should not apply here. The asset must be held for at least one day. Other periods like '1D1h' can be inferred accordingly.

compute_forward_returns doesn't accept '1h' and '2h' as arguments; those are inferred from the input. The issue here is that the same periods=(1, 2, 3, 6) can lead to different timedeltas (1h, 3h, 1D, 2D is only one possibility) depending on the factor date index.

HereticSK commented 6 years ago

Oh, I understand your point now. I believe what I mentioned above still applies. We can always compute the forward returns for different periods. By '1h' and '1D1h' I mean the inferred timedeltas. In this case, for period=1, the corresponding timedelta is '1D'; for period=2, the corresponding timedelta is '1D1h'.

We can always compute the forward returns for different periods, but some of them may not apply to a specific factor. For example, there are no corresponding '1h' forward returns for factors computed at 12:30, which is the market close in this example.

The inferred timedelta only depends on the trading dates/hours. Can we just let users compute a bunch of periods, for which the correct timedeltas can be inferred, and then match the factor index and the forward returns index in get_clean_factor, so that the applicable periods can be picked?
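
As a sketch of that matching step (assuming forward_returns is (date, asset)-indexed with one column per inferred period, and factor is the usual factor series):

# align the shared forward_returns frame to this factor's timestamps ...
aligned = forward_returns.reindex(factor.index)

# ... and keep only the periods that actually apply to this factor,
# i.e. drop period columns that are empty for these timestamps
applicable = aligned.dropna(axis=1, how='all')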

luca-s commented 6 years ago

@HereticSK if you manage to remove factor from compute_forward_returns that would certainly be an improvement, so I am happy with that. If you manage to do that just add a commit to #270 or create a new PR and we'll review that.

luca-s commented 6 years ago

This is another idea to consider. Since the only problem in passing factor to compute_forward_returns is what @HereticSK highlighted above:

But when it comes to the multi-factor analysis case, if the length of the forward return data frame depends on the factor, the user has to carefully choose a factor to compute forward returns with, so that the forward returns can be reused in other analyses, since custom factor algorithms may introduce an unpredictable number of missing values.

Why don't we let compute_forward_returns accept multiple factors (or multiple factor indices)? Then the code would join all the indices of those factors before computing the forward returns. This would avoid having missing values in the multiple-factor scenario.
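
One possible shape for that, hedged as a sketch (compute_forward_returns_multi is not an existing function):

import pandas as pd

def compute_forward_returns_multi(factors, prices, periods=(1, 5, 10)):
    # join the indices of all factors so the resulting forward_returns
    # frame covers every factor's timestamps
    joined_index = factors[0].index
    for f in factors[1:]:
        joined_index = joined_index.union(f.index)
    # only the index matters here, so a constant dummy factor is enough
    dummy = pd.Series(0.0, index=joined_index)
    return compute_forward_returns(dummy, prices, periods)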