Closed: neman018 closed this issue 6 years ago.
You could skip the call to utils.get_clean_factor_and_forward_returns and format your data according to what the tears.* functions expect (the format is the return value of utils.get_clean_factor_and_forward_returns; have a look at the docstrings).
More generally, utils.get_clean_factor_and_forward_returns is a utility function: you don't have to call it if you don't need it, but you can have a look at the source code, which might help you format your data.
@all I wonder how often it happens that returns are available instead of prices. If that's a common scenario we have to think about providing a utility function that prepares and formats the input data from returns information.
Any chance you guys will add a utility function for returns (vs. stock prices)? I think this might be useful for a lot of users who construct their own factor models and consequently custom return streams.
If not, I'll try to implement it myself.
I believe it is quite common to iterate over different factors while the returns do not change. It would be a lot easier and faster to have a utility function that constructs the input data from returns.
Currently, I implement this on my own. But I believe it may be a good idea to include this in the package, if it is common enough.
@luca-s Seems like there is indeed quite a bit of demand for it.
@neman018 Could you provide some more details about your use case? Could you provide an example API that would suit your case and which would be general enough to be a new Alphalens API?
@HereticSK as you have already implemented it, would it be possible to share your ideas so that we can evaluate them?
Actually we've discussed something similar in #199 . I think the pain point here is that get_clean_factor_and_forward_returns repeats the slow forward-return computation every time. So I simply split this part out of it and reuse everything else.
I made this create_factor_data function based on get_clean_factor_and_forward_returns, where computing forward returns is replaced with pre-computed values; in the signature, prices is replaced with forward_returns, and periods is removed.
def create_factor_data(factor,
                       forward_returns,
                       groupby=None,
                       by_group=False,
                       quantiles=5,
                       bins=None,
                       groupby_labels=None):
    # Replace the call to compute_forward_returns with pre-computed values
    merged_data = forward_returns.copy()
    # Keep everything else unchanged below
    ......
Now the analysis code becomes something like:
periods = (20, 40, 60, 120)
forward_returns = alphalens.utils.compute_forward_returns(prices=pricing, periods=periods)

factor_data_1 = create_factor_data(factor=factor_1,
                                   forward_returns=forward_returns,
                                   quantiles=10)

factor_data_2 = create_factor_data(factor=factor_2,
                                   forward_returns=forward_returns,
                                   quantiles=10)
I find that reusing the pre-computed forward returns saves a lot of time, as I often test different factors against the same set of forward returns.
@HereticSK your solution seems reasonable.
What about adding a forward_returns argument to get_clean_factor_and_forward_returns? Only one of prices and forward_returns should be non-None: if prices is provided, the forward_returns will be computed; otherwise they will be copied, as in your code.
The other approach would be to add another function, get_clean_factor, which requires forward_returns but not prices. In this case get_clean_factor_and_forward_returns would simply compute forward_returns from prices and then call get_clean_factor (which would do all the work currently done by get_clean_factor_and_forward_returns except the forward_returns computation).
My only concern is keeping the API clean; I don't want to make it confusing.
@twiecki What do you think about this?
@neman018 Would the proposed solution be enough to fix your use case too?
@luca-s I like your first proposal.
For the first proposal, it should be noted that both the prices and periods arguments are related to forward-return computation. For clarity, only one of forward_returns and (prices, periods) should be non-None, even though allowing periods to be non-None wouldn't cause trouble. I am afraid that would make the validation logic a little too verbose. Also, allowing both prices and forward_returns seems to make the function do too much, as I perceive it.
In that sense, the second proposal looks better to me. But if backward compatibility is of greater concern, the first one is good.
@HereticSK oh right, I forgot about periods, thanks. Well, we can simply skip the check on periods when prices is not present, as you suggested.
The second proposal wouldn't break compatibility. I imagine something like this:
# implemented like the current get_clean_factor_and_forward_returns (same signature)
# except for the forward_returns computation and periods
def get_clean_factor(factor, forward_returns, ....):
    ...

# same function signature as the current implementation, but it would call get_clean_factor
def get_clean_factor_and_forward_returns(factor, prices, periods, ....):
    forward_returns = ...  # compute forward_returns
    return get_clean_factor(factor, forward_returns, ....)
Still, having a single function seems more user-friendly, even though from a code point of view it is a little messy.
@luca-s Yes, I agree that a single function is more user friendly.
Yet there is another concern about the first proposal. The name get_clean_factor_and_forward_returns, as I see it, somehow indicates that forward_returns is an output, while here it would accept forward_returns as input. This seems a bit confusing. Just my personal perception; I don't know if it is a serious problem.
I cannot make up my mind on which is the best approach. I have a slight preference for the second approach, as does @HereticSK: it seems cleaner and the APIs are more obvious, even though we lose the advantage of having a single entry point for the user.
@twiecki So, I am just double checking with you that the first option is the one we want to go for and then move on with the implementation.
@HereticSK would you be interested in providing a PR?
I'd be happy to do it. Let's see what @twiecki thinks. I will start working on it once we decide which approach to go with.
I'm fine with either one. Let's ask our in-house API expert @ssanderson.
Sorry, I missed this ping. Will look at this today.
Thanks for pushing this along guys - I think it will be very useful for the community!
I'm in favor of @HereticSK's suggestion in https://github.com/quantopian/alphalens/issues/253#issuecomment-361794546 to factor out a separate create_factor_data function and have the implementation of get_clean_factor_and_forward_returns be composed out of get_forward_returns and create_factor_data.
I think it's hard in general to make objective arguments about API design, but here's an attempt at a justification for why I prefer this design over adding another optional argument to get_clean_factor_and_forward_returns:
In general, I prefer decomposing a do_x_and_y into separate do_x and do_y functions. A user can always write do_x_and_y themselves in terms of do_x and do_y, but they can't always extract just the do_x or the do_y part if that's all they want or need.
In some cases do_x_and_y is such a common pattern that it's worth having the builtin for convenience's sake (and typing convenience matters more for a library like alphalens, where the expected user is doing interactive analysis, than it would for something like a library providing low-level linear algebra primitives), but if we also provide do_x and do_y as separate builtins, then advanced users who care about performance can build exactly what they need out of the more primitive operations.
Sub-comment: if we separate out a new function, it would be nice to have a more descriptive name than "create_factor_data". Is there a better word than "data" we could use for "stacked dataframe of forward returns plus group and factor information"?
The two metaphors I associate with alphalens are science experiments (e.g. looking at the factor data under a microscope) and prospecting (e.g. panning for gold in your factor data). "survey"? "study"? "experiment"? "assay"? "analysis"?
Thanks @ssanderson, I like your arguments.
As I got to work on a PR, I had some second thoughts; I hope to discuss them with you guys @luca-s @twiecki @ssanderson .
The current proposal we've agreed on:
# implemented like the current get_clean_factor_and_forward_returns (same signature)
# except for the forward_returns computation and periods
def get_clean_factor(factor, forward_returns, ....):
    ...

# same function signature as the current implementation, but it would call get_clean_factor
def get_clean_factor_and_forward_returns(factor, prices, periods, ....):
    forward_returns = ...  # compute forward_returns
    return get_clean_factor(factor, forward_returns, ....)
There are 3 kinds of logic here:
- take prices, compute forward returns, and return a forward_returns data frame
- take a factor series and return a data frame containing the original factor, the quantized version, and optionally the grouping information
- combine the forward_returns and the factor data frame into a single factor_data data frame, and report data-loss information
Is it a better idea to make get_clean_factor only responsible for the second kind of logic, i.e. quantizing the factor, and have a new function get_clean_factor_data (ignore the naming for the moment) deal with the third one? Slightly different from the current proposal, but it seems much cleaner. Something like this:
# get_clean_factor is only responsible for processing the factor
def get_clean_factor(factor, quantiles, groupby, group_labels, ....):
    ...

# get_clean_factor_data can be used to combine different sets of
# pre-computed factor and forward_returns;
# arguments related to processing factor and forward_returns are
# completely factored out, only max_loss remains
def get_clean_factor_data(factor, forward_returns, max_loss):
    factor = get_clean_factor(factor, ....)
    merged_data = ...  # merge forward_returns and factor
    # handle data-loss information
    ...

# get_clean_factor_and_forward_returns keeps the current signature
# but is implemented as a simple wrapper
def get_clean_factor_and_forward_returns(factor, prices, periods, ....):
    forward_returns = ...  # compute forward_returns
    factor = get_clean_factor(factor, ....)
    factor_data = get_clean_factor_data(factor, forward_returns)
    return factor_data
If this is the right way to go, there's something I am not sure about. Currently, computing the quantized factor involves calling utils.quantize_factor, whose factor_data argument assumes that the factor series is already merged into a data frame with forward_returns and grouping columns. It does almost the same thing as get_clean_factor(factor, quantiles, groupby, group_labels, ....).
If we accept changing its function signature, there's no need to have a separate get_clean_factor function. If I am not mistaken, utils.quantize_factor is not used elsewhere in the package except in get_clean_factor_and_forward_returns. So I think it is safe to change?
As for @ssanderson 's sub-comment about naming, I do agree that the word "data" is a bit too general. I struggled for some time to find a better name; "factor_data" is the best I could come up with, as it is widely and consistently used in the tears module as the core data structure. It may be just clear enough to use in the context of alphalens. Do let me know if there are better names.
@ssanderson thank you for your comments and for explaining your reasoning so carefully. I agree with your statements and I am also fine with changing factor_data to some other name, but I would suggest doing that in a different PR whose purpose is only the refactoring. This is because factor_data is used extensively in most Alphalens function signatures; if we change the name, we have to do it everywhere in one single PR (and there will be many, many changes).
@HereticSK Your proposal seems even cleaner. I believe we can initially build these new APIs without publicising them in the example NBs. We will see later whether they work well or need refinements; we can consider them implementation details for now. @ssanderson ?
Currently, computing quantized factor involves calling utils.quantize_factor, whose factor_data argument assumes that the factor series is already merged into a data frame with forward_returns and grouping columns. It does almost the same thing as get_clean_factor(factor, quantiles, groupby, group_labels, ....).
Actually the input of utils.quantize_factor can be a much simpler DataFrame than factor_data: it only requires a 'factor' column and optionally 'group'; it doesn't use forward-returns information. We should probably fix the docstring.
If we accept changing its function signature, there's no need to have a separate get_clean_factor function. If I am not mistaken, utils.quantize_factor is not used elsewhere in the package except get_clean_factor_and_forward_returns. So I think it is safe to change?
get_clean_factor should compute the grouping too, so they perform different computations.
A few comments:
- get_clean_factor_data could maybe be called combine_clean_factor_and_forward_returns (but I am fine with other names)?
- the data loss can happen in get_clean_factor too, due to the binning
@luca-s
If we accept changing its function signature, there's no need to have a separate get_clean_factor function.
Originally I meant having a single function that both computes the grouping and quantizes the factor, returning a data frame containing the original factor. Something like:
def quantize_factor(factor, quantiles, groupby, group_labels, ....):
    # compute the grouping, the same code as
    # what is currently in get_clean_factor_and_forward_returns
    if groupby is not None:
        ......
    # compute the quantized factor
    # return a merged data frame
    return merged_data
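As a concrete illustration of this idea, here is a self-contained sketch in plain pandas (pd.qcut for the quantization; the function name and the simplified single-level index are illustrative, not the Alphalens implementation):

```python
import pandas as pd

def quantize_with_grouping(factor, quantiles=5, groupby=None, group_labels=None):
    """Illustrative sketch: return a frame holding the original factor,
    an optional 'group' column, and the factor quantile."""
    merged_data = factor.to_frame('factor')
    if groupby is not None:
        groups = pd.Series(groupby).reindex(factor.index)
        if group_labels is not None:
            groups = groups.map(group_labels)
        merged_data['group'] = groups
    # assign each value to one of `quantiles` equal-sized buckets (1-based)
    merged_data['factor_quantile'] = pd.qcut(factor, quantiles, labels=False) + 1
    return merged_data

factor = pd.Series([0.1, 0.5, 0.3, 0.9, 0.7],
                   index=['A', 'B', 'C', 'D', 'E'])
result = quantize_with_grouping(factor, quantiles=5)
```

With 5 values and 5 quantiles each observation lands in its own bucket, so 'A' (the smallest) gets quantile 1 and 'D' (the largest) gets quantile 5.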
The reason why I have this idea is:
1) quantize_factor has to know the grouping information. If this information comes from the input factor_data, as in the current implementation, the user has to create a data frame before calling the function, which can be inconvenient. Alternatively, we can add a groupby argument to accept the grouping series/dict; in that case, quantize_factor becomes exactly the same as get_clean_factor. Either way the grouping info has to come along with the factor, so I thought it would be simpler to keep only one function.
2) currently quantize_factor returns a factor_quantile series. It is not used anywhere other than inside get_clean_factor_and_forward_returns. Are there any use cases where we just want a factor_quantile series rather than a factor_data data frame?
But now I am hesitant. Keeping quantize_factor provides more flexibility and makes get_clean_factor cleaner.
get_clean_factor_data could be called combine_clean_factor_and_forward_returns maybe (but I am fine with other names)?
combine_clean_factor_and_forward_returns is good, consistent with the current API get_clean_factor_and_forward_returns.
Yet get_factor_data/get_clean_factor_data has its advantage: it indicates that its output is a factor_data, which in turn becomes the input of further analysis. I am good with both.
the data loss can happen in get_clean_factor too, due to the binning
This is another detail I am not sure about. Currently, both the data loss in the forward_returns computation and in the binning phase is reported in get_clean_factor_and_forward_returns. Should we handle both kinds of data loss in combine_clean_factor_and_forward_returns? Or handle the binning loss in get_clean_factor and the forward-returns loss in combine_clean_factor_and_forward_returns?
But now I am hesitant. Keeping quantize_factor provides more flexibility and makes get_clean_factor cleaner.
I would not change quantize_factor; it is a well-defined function that has a single purpose and does it well.
combine_clean_factor_and_forward_returns is good, consistent with the current API get_clean_factor_and_forward_returns. Yet get_factor_data/get_clean_factor_data has its advantage. It indicates its output to be a factor_data, which in turn becomes the input of further analysis. I am good with both.
I would go with combine_clean_factor_and_forward_returns
This is another detail I am not sure. Currently, both the data loss in the forward_returns computation and the binning phase are reported in get_clean_factor_and_forward_returns. Should we handle both kind of data loss in combine_clean_factor_and_forward_returns? Or handle binning loss in handling in get_clean_factor, fwd loss in combine_clean_factor_and_forward_returns?
Regardless of where the data loss occurs, I would report/print the data-loss information in both get_clean_factor_and_forward_returns and combine_clean_factor_and_forward_returns, which are the functions that return factor_data.
Sorry for the slow reply. This is in my queue to look at tomorrow.
get_clean_factor_data could be called combine_clean_factor_and_forward_returns maybe (but I am fine with other names)?
My vote would also be for combine_clean_factor_and_forward_returns, as it seems a bit more intuitive as to what the function actually does.
There are 3 kinds of logic here:
- take prices, compute forward returns, return a forward_returns data frame
- take a factor series, return a data frame containing the original factor, the quantized version, and optionally the grouping information
- combine the forward_returns and the factor data frame into a single factor_data data frame, report data loss information
This seems like a reasonable decomposition. I think it might be helpful to write out what we think the expected use cases would be for these functions. It sounds to me like there are three kinds of users who want to construct a factor_data frame:
- Users who have prices and a single factor. These users just want a factor_data frame they can pass to other alphalens functions. These users should continue to use get_clean_factor_and_forward_returns.
- Users who have prices and multiple factors. These users should call forward_returns on their prices and then call get_clean_factor_data(factor, forward_returns, max_loss) for each of their factors, which saves the computational cost of re-computing forward returns multiple times.
- Users who already have forward returns. These users should call get_clean_factor_data(factor, forward_returns) on each of their factors.
Am I missing anything here? From this analysis it seems to me like it doesn't matter much from an API perspective whether we separate out the logic of factor processing and of joining the factor with the forward returns; I don't think we expect any users to call combine_factor_and_forward_returns? It might still be cleaner from an implementation perspective to split out the combination function, but I think that would just be an implementation detail, as @luca-s noted above.
As an aside, in the long term I'm not sure I'm convinced we need the combining logic at all. I commented in https://github.com/quantopian/alphalens/pull/258#issuecomment-364843134 that the current factor_data layout forces a lot of duplication if you want to analyze multiple factors.
Depending on how important a use case it is to analyze multiple factors (and the desire to efficiently analyze multiple factors seems to be one of the motivating ideas for this thread), I think we might want to consider breaking apart the monolithic factor_data format into returns + group columns (which can be shared across multiple analyses) and factor-specific factor value/quantile columns. That's a much larger API change than what we've been discussing so far, though, so it probably deserves its own discussion thread.
It seems like we all reached an agreement. Let me summarize and add some more details.
The official APIs we would like to have are:
- forward_returns(factor, prices,...): we can refactor and rename the existing compute_forward_returns so that it accepts factor and prices and also performs the few steps that are currently done inside get_clean_factor_and_forward_returns, that is:

def forward_returns(factor, prices,...):
    factor = factor.copy()
    factor.index = factor.index.rename(['date', 'asset'])
    factor_dateindex = factor.index.get_level_values('date').unique()
    # copy here the current body of compute_forward_returns
    [...]

- get_clean_factor_data(factor, forward_returns, max_loss, ...): this would do what get_clean_factor_and_forward_returns currently does except for the forward_returns computation (it would also accept all the same arguments).
- get_clean_factor_and_forward_returns: behaves the same as now. It can be re-implemented by calling compute_forward_returns + get_clean_factor_data.

From the implementation point of view we could also have combine_factor_and_forward_returns and get_clean_factor, but we don't expect to maintain backward compatibility for those.
Regarding the split of factor_data into returns + group columns, I agree we should cover that on a separate thread.
forward_returns(factor, prices,...) we can refactor and rename the existing compute_forward_returns so that it accepts factor and prices and also performs the few steps that are currently done inside get_clean_factor_and_forward_returns
@luca-s is this addressing @ssanderson 's second case?
Users who have prices and multiple factors. These users should call forward_returns on their prices and then call get_clean_factor_data(factor, forward_returns, max_loss) for each of their factors, which saves the computational cost of re-computing forward returns multiple times.
If I'm not mistaken, what he referred to as forward_returns is exactly the same as the current compute_forward_returns. The calling code for this case would be something like the below, which does not seem to require refactoring compute_forward_returns:

forward_returns = compute_forward_returns(prices, periods, ...)
factor_data = get_clean_factor_data(factor, forward_returns, ...)

I believe the logic for computing forward returns is quite independent of factors. Am I missing anything?
That summarizes to refactoring 1 current API function and introducing 1 new API function:
get_clean_factor_and_forward_returns: behave the same as now. It can be re-implemented calling compute_forward_returns + get_clean_factor_data
get_clean_factor_data(factor, forward_returns, max_loss, ...). This would do what get_clean_factor_and_forward_returns currently does except for the forward_return computation (It would also accept all the arguments).
Since they are considered the new API, is it a good idea to rename get_clean_factor_data to get_factor_data_from_forward_returns and get_clean_factor_and_forward_returns to get_factor_data_from_prices? That seems consistent. The current get_clean_factor_and_forward_returns could be kept as an alias for get_factor_data_from_prices for backward compatibility.
@luca-s is this addressing @ssanderson 's second case?
Yes
If I'm not mistaken, what he referred to as forward_returns is exactly the same as the current compute_forward_return.
Not exactly: the first argument of compute_forward_returns is factor_idx, and the intent of the little piece of code I posted above was to make the function more user-friendly and let it accept factor and prices, so that you can write:

forward_returns = compute_forward_returns(factor, prices, periods, ...)
factor_data = get_clean_factor_data(factor, forward_returns, ...)

We don't need to rename compute_forward_returns to forward_returns if we don't want to; I just wanted to use the function names used by @ssanderson to avoid confusion.
I believe the logic for computing forward returns is quite independent of factors. Am I missing anything?
It actually needs the factor index. Is it possible that you are looking at an old version of compute_forward_returns? It has been changed recently.
Regarding the API names, I wouldn't rename get_clean_factor_and_forward_returns, but it maybe makes sense to rename get_clean_factor_data to get_clean_factor.
Oh sorry, I was looking at the old version. Got it now; I will try to submit a PR this week.
Hi, guys. I just submitted a PR, implementing the new API as we discussed earlier. Please review and let me know if there are questions.
As I tried to refactor compute_forward_returns, I wondered: would it be against our original purpose to have compute_forward_returns depend on the factor? Originally, we wanted to decouple forward returns from factors so that users can analyze different factors using the same set of forward returns. Under the current design, compute_forward_returns(factor, prices), the user has to recompute forward returns for different factors, unless he/she assumes that the forward returns computed with one factor apply to the other factors.
If I am not mistaken, the factor in compute_forward_returns is only used to infer the trading calendar and get the custom frequency. Is it possible to do this without the factor? Say, by assuming the price index is the full trading calendar, treating the set difference between the full calendar and the price calendar as custom holidays, and constructing a CustomBusinessDay with these holidays?
@HereticSK I understand your point, though it's quite safe to assume that the forward returns computed with one factor apply to other factors, which allows re-using the forward returns. As long as multiple factors have the same date index, the forward returns will be the same, e.g. multiple factors computed daily at market open.
I don't believe it would be easy to remove factor from the compute_forward_returns arguments; there are many scenarios covered by Alphalens where your proposed solution wouldn't work. Have a look at the intraday NB example, where the prices are these:
Given those prices and periods=(1, 2, 3, 6), the function compute_forward_returns has to infer that the actual time periods are 1h, 3h, 1D, 2D. Without the factor date index, which gives the information of when the trades happen, I don't believe the function can compute the required output.
Also, I thought again about the new function names (get_clean_factor_and_forward_returns and get_clean_factor) and I understand why you prefer to change them; the current ones are not very explanatory. It might be good to rename the functions, but please note that if we change get_clean_factor_and_forward_returns, then all the example NBs and many tests have to be updated too. If we really want to go down this path, then it would be better to create a separate PR which involves only the renaming. That would create an easier-to-read git history.
EDIT: my comment on creating a separate PR doesn't make any sense, as we can have multiple commits on the same PR. So as long as @twiecki and @ssanderson agree, we can rename get_clean_factor_and_forward_returns (and change the example NBs and tests accordingly).
Without having the factor date index, which gives the information of when the trades happen, I don't believe the function can compute the required output.
Sorry, I am not following.
As illustrated in the intraday_factor example:
The pricing data passed to alphalens should contain the entry price for the assets so it must reflect the next available price after a factor value was observed at a given timestamp. ...... The pricing data must also contain the exit price for the assets
As long as the factor timestamps are a subset of the price timestamps, a given factor timestamp can always find its corresponding entry price, and thus the corresponding forward return.
In terms of the code, factor_idx is first intersected with prices.index and then passed to infer_trading_calendar. From here on, factor_idx is a subset of prices.index:

factor_idx = factor_idx.intersection(prices.index)
forward_returns = pd.DataFrame(index=pd.MultiIndex.from_product(
    [factor_idx, prices.columns], names=['date', 'asset']))
freq = infer_trading_calendar(factor_idx, prices.index)
And then in infer_trading_calendar, factor_idx and prices.index are unioned:

def infer_trading_calendar(factor_idx, prices_idx):
    full_idx = factor_idx.union(prices_idx)
Is there a round trip here? Does full_idx always equal prices_idx? Am I missing anything?
The following piece of code affects the shape of the forward_returns data frame. Can this be put in get_clean_factor?
for period in periods:
    #
    # build forward returns
    #
    fwdret = prices.pct_change(period).shift(-period).reindex(factor_idx)
As long as the factor timestamps is a subset of the price timestamps, a given factor timestamp can always find its corresponding entry price, and thus the corresponding forward return.
I agree, but there are two subtleties to consider:
1. When we have the factor date index, we don't need the forward returns computed for every date in prices. Given that one of the points of this issue is to save computation time, we don't want to increase the computation required for the common scenario of a single factor. In the intraday example, if we computed the forward returns for each date in the prices index, we would increase the computation time by 3 times (the prices index is 3 times larger than the factor index) and the memory required too.
Is there a round trip here? Does full_idx always equal prices_idx? Am I missing anything?
2. We don't know if prices is correctly populated; it might miss some dates from factor. To be correct in the computation, we perform a union of the two indices instead of relying on the correctness of prices.
The following piece of code affects the shape of the forward_returns data frame. Can this be put in get_clean_factor?
I don't get it. If the code affects the forward returns, isn't it OK to have it inside compute_forward_returns?
we don't want to increase the computation required for the common scenario of a single factor
fwdret = prices.pct_change(period).shift(-period).reindex(factor_idx)
the forward returns are actually computed for all price dates, and then part of them is discarded. What is saved is the time for stacking data frames and combining columns, rather than the time for computing forward returns. I understand that it can be significantly more expensive to concatenate columns than to compute pct_changes; in this case I agree that there is a performance gain.
But when it comes to the multi-factor analysis case, if the length of the forward-returns data frame depends on the factor, the user has to carefully choose a factor to compute the forward returns with so that they can be reused in other analyses, since custom factor algorithms may introduce an unpredictable number of missing values.
Maybe we don't need to subset the forward-returns frame in compute_forward_returns; aligning factor and forward_returns is eventually handled in get_clean_factor(factor, forward_returns) anyway.
There seems to be a trade-off here. Personally, I think that in the single-factor scenario the forward returns are computed only once, so the overhead is acceptable and may not often be significant. On the other hand, of course, multi-factor users can, in a hacky way, get the full-date forward returns by passing the prices themselves as a factor.
We don't know if prices is correctly populated, it might miss some dates from factor. To be correct in the computation we perform a union of the two indices instead of relying on the correctness of prices
- In practice, I assume that pricing data, being raw data, is less likely to omit trading dates than factors computed with custom algorithms. This is reliable at least in my local market. Is this assumption so weak in other markets that we have to pay special attention to it?
- In terms of the code, full_idx = factor_idx.union(prices_idx), where factor_idx has already become the result of the intersection between factor_idx and prices_idx, effectively a subset of prices_idx. The union of a subset with prices_idx always yields prices_idx itself. In other words, factor_idx does not matter here. Am I missing anything?
Maybe we don't need to subset the forward-returns frame in compute_forward_returns; aligning factor and forward_returns is eventually handled in get_clean_factor(factor, forward_returns).
I guess we could accept a little decrease in performance, but how could you fix problem 2 from my previous comment?
The proposal of @ssanderson would probably help here ( "we might want to consider breaking apart the monolithic factor_data format into returns + group columns (which can be shared across multiple analyses) and factor-specific factor value/quantile columns" )
In terms of the code, `full_idx = factor_idx.union(prices_idx)`, where `factor_idx` has already become the result of the intersection between `factor_idx` and `prices_idx`, effectively a subset of `prices_idx`. The union of a subset with `prices_idx` always yields `prices_idx` itself. In other words, `factor_idx` does not matter here. Am I missing anything?
No, you are right. `infer_trading_calendar` should have been called at the beginning of the function, before the factor index mangling. Could you fix that too in the PR?
The proposal of @ssanderson would probably help here ( "we might want to consider breaking apart the monolithic factor_data format into returns + group columns (which can be shared across multiple analyses) and factor-specific factor value/quantile columns" )
Sounds nice. I'd love to discuss this further.
For problem 2, if the factor is computed at 12:30, which in this case is the market close, then '1h' and '2h' should not apply here: the asset must be held for at least one day. Other periods like '1D1h' can be inferred accordingly.
I believe the key is not when the factor is computed, but whether the price index contains the factor index. Here I assume two things:
The strategy is executed as soon as the factor is observed, e.g. a factor observed at 12:30 uses the price at 12:30 as the entry price. This does not support users who want their positions filled at other prices.
The price index contains the full trading dates/hours. The frequency is high enough that one factor value at a particular timestamp corresponds to one price. As long as this assumption holds, I believe we can compute the correct timedelta using pandas' `CustomBusinessHour`, without referring to the factor index.
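Under these two assumptions the idea can be sketched with a toy index (this is an illustration, not Alphalens code; `CustomBusinessHour` would generalize it to arbitrary trading calendars): the holding-period timedelta implied by `period` bars can be read off the price index itself, with no reference to the factor index.

```python
import pandas as pd

# Two trading days, three hourly bars each, market close at 12:30
# as in the example above.
price_idx = pd.DatetimeIndex(
    ["2020-01-06 10:30", "2020-01-06 11:30", "2020-01-06 12:30",
     "2020-01-07 10:30", "2020-01-07 11:30", "2020-01-07 12:30"]
)

def period_timedelta(idx, loc, period):
    """Timedelta implied by holding `period` bars starting at position `loc`."""
    return idx[loc + period] - idx[loc]

# period=1 from 10:30 is one hour ...
assert period_timedelta(price_idx, 0, 1) == pd.Timedelta("1h")
# ... but period=1 from the 12:30 close spans overnight
assert period_timedelta(price_idx, 2, 1) == pd.Timedelta("22h")
```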
For problem 2, if the factor is computed at 12:30, which in this case is the market close, then '1h' and '2h' should not apply here: the asset must be held for at least one day. Other periods like '1D1h' can be inferred accordingly.
`compute_forward_returns` doesn't accept '1h' and '2h' as arguments; those are inferred from the input. The issue here is that the same `periods=(1, 2, 3, 6)` can lead to different timedeltas (1h, 3h, 1D, 2D is only one possibility) depending on the factor date index.
Oh, I understand your point now. I believe what I mentioned above still applies. We can always compute the forward returns for different periods. By '1h' and '1D1h' I mean the inferred timedelta. In this case, for period=1 the corresponding timedelta is '1D'; for period=2 it is '1D1h'.
We can always compute the forward returns for different periods. But some of them may not apply to a specific factor. For example, there is no corresponding '1h' forward return for a factor computed at 12:30, which is the market close in this example.
The inferred timedelta only depends on the trading dates/hours. Can we just let users compute a bunch of periods, for which the correct timedelta can be inferred, and then match the factor index and forward-returns index in `get_clean_factor`, so that the applicable periods can be picked?
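The matching step might look like the following sketch (column names and data are illustrative, not Alphalens API): compute forward returns for several periods over the full price index, then, at the alignment step, keep only the period columns that actually apply to this factor, i.e. are not entirely NaN after alignment.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(
    ["2020-01-06 11:30", "2020-01-06 12:30",
     "2020-01-07 11:30", "2020-01-07 12:30"]
)
forward_returns = pd.DataFrame(
    {
        # '1h' only exists when there is another bar one hour later that day
        "1h": [0.01, np.nan, 0.02, np.nan],
        "1D": [0.03, 0.04, 0.05, 0.06],
    },
    index=dates,
)

# a factor computed only at the 12:30 close
factor_idx = pd.DatetimeIndex(["2020-01-06 12:30", "2020-01-07 12:30"])

aligned = forward_returns.loc[factor_idx]
applicable = aligned.dropna(axis=1, how="all")
assert list(applicable.columns) == ["1D"]  # '1h' never applies at the close
```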
@HereticSK if you manage to remove `factor` from `compute_forward_returns` that would certainly be an improvement, so I am happy with that. If you manage to do that, just add a commit to #270 or create a new PR and we'll review it.
This is another idea to consider. Since the only problem in passing `factor` to `compute_forward_returns` is what @HereticSK highlighted above:
But when it comes to the multi-factor analysis case, if the length of the forward return data frame depends on the factor, the user has to carefully choose a factor to compute forward returns so that the forward returns can be reused in other analysis, since custom factor algorithms may introduce unpredictable number of missing values.
Why don't we let `compute_forward_returns` accept multiple factors (or multiple factor indices)? The code would then join all the factor indices before computing the forward returns. This would avoid missing values in the multiple-factor scenario.
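A sketch of that proposal (the helper name is illustrative, not current Alphalens API): join the indices of all factors first, then compute forward returns once over the union so every factor can reuse the same frame.

```python
import pandas as pd

def union_factor_index(factors):
    """Union of the indices of several factor Series/DataFrames."""
    idx = factors[0].index
    for f in factors[1:]:
        idx = idx.union(f.index)
    return idx

f1 = pd.Series(1.0, index=pd.date_range("2020-01-01", periods=3, freq="D"))
f2 = pd.Series(2.0, index=pd.date_range("2020-01-02", periods=3, freq="D"))

full_idx = union_factor_index([f1, f2])
assert len(full_idx) == 4  # Jan 1-4: no factor's dates are dropped
```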
Is there a way to run Alphalens using my own custom input matrix of returns (ie. custom factor adjusted) rather than inputting Prices (ie. the "pricing" table in the examples) for each period?
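One possible workaround (an assumption on my part, not an official Alphalens API): rebuild a synthetic price series from the custom returns with a cumulative product, then pass it to `utils.get_clean_factor_and_forward_returns` as usual. The base level (100 here) is arbitrary, since forward returns only depend on relative changes.

```python
import pandas as pd

# Custom per-period returns, one column per asset.
returns = pd.DataFrame(
    {"A": [0.01, -0.02, 0.03], "B": [0.00, 0.01, -0.01]},
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)

# Synthetic price path: compounding the returns from an arbitrary base.
synthetic_prices = 100.0 * (1.0 + returns).cumprod()
```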