Closed MichaelJMath closed 6 years ago
@MichaelJMath there seems to be a lot of development here, thank you very much for your work. I will start reviewing this PR soon. Would you mind rebasing your branch on quantopian:factors_interactions? I have just rebased it on quantopian:master.
@luca-s I tried rebasing this morning, and git said everything was up-to-date. I think I mistakenly created the new branch off master instead of factors_interactions.
Would it be easier for me to just close this pull request and open up a duplicate one now that factors_interaction has been rebased on master? I'm assuming this would remove all of the old commits showing on this pull request that are unrelated to this request.
I believe your solution would work but the scenario we are facing it is a very common one so it should be pretty easy to fix without creating a new PR:
# update your upstream information
git fetch upstream -p
# rebase on top of quantopian:factors_interactions
git checkout two_factor_interaction
git rebase upstream/factors_interactions
git push origin two_factor_interaction
I tried the following at my Bash prompt. I'm still troubleshooting, but figured I'd post my progress.
Mike (factors_interactions) alphalens $ git remote -v
origin https://github.com/MichaelJMath/alphalens.git (fetch)
origin https://github.com/MichaelJMath/alphalens.git (push)
upstream https://github.com/quantopian/alphalens.git (fetch)
upstream https://github.com/quantopian/alphalens.git (push)
Mike (factors_interactions) alphalens $ git fetch upstream -p
Mike (factors_interactions) alphalens $ git checkout two_factor_interaction
Switched to branch 'two_factor_interaction'
Mike (two_factor_interaction) alphalens $ git rebase upstream/factors_interactions
Current branch two_factor_interaction is up to date.
Mike (two_factor_interaction) alphalens $ git push origin two_factor_interaction
Everything up-to-date
So you were right, your branch two_factor_interaction
is up-to-date. Could you try pushing with --force option? That might force github to refresh this PR. Otherwise you can go with your initial suggestion of opening a new PR.
This looks really promising @MichaelJMath!
Same results for git push --force
Mike (two_factor_interaction) alphalens $ git push origin two_factor_interaction --force
Everything up-to-date
If I look at my network graph on Github, it is clearly not showing the rebase as my two_factor_interaction branch is still based off master.
This is in contrast to the alphalens network graph, which is showing my branch based off factors_interactions.
I'm closing this PR and opening a duplicate to try and resolve the issue.
@MichaelJMath since quantopian:factors_interactions
is a development branch I rebased/merged your PR there. Now it looks fine and we can all work on this branch.
@MichaelJMath I had a quick view at the PR, I like your change overall and here are my comments:
- Comments and suggestions on the
utils.join_factor_with_factor_and_forward_returns
function
- Should there be a wrapper that builds the multi_factor_data dataframe in one step. (i.e. wrap this function with
get_clean_factor_and_forward_returns
?
My idea was that the user have to call get_clean_factor_and_forward_returns
on both factor1 and factor2 data to obtain factor_data1 and factor_data2 and then to call tears.create_factors_interaction_tear_sheet
passing the latter two. Is there an advantage to merge the two factors data together? If so we could call utils.join_factor_with_factor_and_forward_returns
inside tears.create_factors_interaction_tear_sheet
so that the users can be unaware of this detail.
Please let me know your thought on this.
- I'm not too familiar with creating effective unit tests, so any feedback on this module is appreciated.
Your test looks fine. We haven't created a framework for testing tear sheets output yet (see PR #218) so testing only for exceptions is fine now. More generally speaking, there is a test for each functions that performs computation (so the functions inside performance and utils) as you haven't added new computational functions there is no need for more tests. Also it is ok starting with few tests that cover the most common scenarios and then, once bugs are found, we can add more tests to avoid regressions.
As a suggestion, it's better to add tests at the end of the PR when the new API/functionalities are well defined, otherwise you might end up rewriting the tests.
- In regards to Change 3 above:
- My first thought, following suggestion of @luca-s, was to create a separate performance module which would contain all functions for this sort of computation.
- Since the existing performance module already contains a lot of the needed functionality, I thought maybe I would create a wrapper function in this new module that added the necessary functionality.
- However, in
perf.mean_return_by_quantile
, I needed to add a parameter to this function to make it work in a clean manner. Not sure how I could have done that with a wrapper.- So I guess my question is, what are the community's thoughts on how I dealt with this particular issue, and also what are thoughts on related considerations going forward?
I believe your change is fine. There are no general rules, depending on the particular case it might be better to add a wrapper or to add an argument, like you did.
We have to pay particular attention to the API primarily intended for the users (all the tears.* functions and utils.get_clean_factor_and_forward_returns). We try our best to not break them but if we do then we have to document it in the release notes.
- Any other comments/guidance on path of development going forward is greatly appreciated.
The more you become familiar with the project the easier will be but I believe you already pay lots of attention to the code you write and I am happy with your PR.
- Also, let me know if there are too many changes in this pull request for efficient/easy review.
There is no limits in the amount of changes, but the idea is that you create commits so that who looks at the project history has a clue on what happened. They must summarize each new achievement in the project, a self-contained task, you don't want to enter too many details. Try to describe what you achieved with a commit and not what you did, they should look like a release note.
Also you can use one of this prefixes in the commit message:
API: an (incompatible) API change BLD: change related to building numpy BUG: bug fix DEP: deprecate something, or remove a deprecated object DEV: development tool or utility DOC: documentation ENH: enhancement MAINT: maintenance commit (refactoring, typos, etc.) REV: revert an earlier commit STY: style fix (whitespace, PEP8) TST: addition or modification of tests REL: related to releasing numpy
I forgot two small details:
You don't want to call plt.show() here and it would be better to follow the example of the other tears.* functions:
GridFigure
in create_factors_interaction_tear_sheet
gf
to plot_multi_factor_quantile_returns
plt.show()
and gf.close()
at the end of create_factors_interaction_tear_sheet
.Also, this you shouldn't rely on the active figure, it would be better to operate on the passed ax
My idea was that the user have to call get_clean_factor_and_forward_returns on both factor1 and factor2 data to obtain factor_data1 and factor_data2 and then to call tears.create_factors_interaction_tear_sheet passing the latter two.
On this point, I agree with allowing the user to pass factor1 and factor2 data into get_clean_factor_and_forward_returns
. Furthermore, while I am initially focusing development on the two factor case for simplicity, I think more generally, users should be able to pass as many factors as desired (up to a reasonable amount). In the future, I think there will be parts of the interaction/correlation tear sheet that deal with combinations of more than two factors.
Regarding implementation, do you think the factor argument should be allowed to take a DataFrame and/or a list of Series?
Is there an advantage to merge the two factors data together? If so we could call utils.join_factor_with_factor_and_forward_returns inside tears.create_factors_interaction_tear_sheet so that the users can be unaware of this detail.
Initially, I was just trying to keep the code separate for no real good reason. After some thought, I'm thinking there might be good reason to keep the join_factor_with_factor_and_forward_returns
function. To add a factor post hoc, the user might not want to rerun get_clean_factor_and_forward_returns
as it would duplicate calculations and take longer than necessary.
On the other hand, this function only adds minimal additional functionality. As an alternative, the user could just call quantize_factor
and join the factor and factor_quantiles to the "factor_data" DataFrame without the need for the join_factor_with_factor_and_fforward_returns
function.
What's your opinion? Do you think this additional function just clutters up the code base without really adding anything major functionality?
You don't want to call plt.show() here and it would be better to follow the example of the other tears.* functions:
create a GridFigure in create_factors_interaction_tear_sheet pass gf to plot_multi_factor_quantile_returns call plt.show() and gf.close() at the end of create_factors_interaction_tear_sheet. Also, this you shouldn't rely on the active figure, it would be better to operate on the passed ax
This was an oversight on my part. I'll fix and submit a PR.
@luca-s , Thanks for the advice on how to improve my commit messages. I’ll start using the prefixes, and work on commenting on what I achieved while letting the code show what I did.
On this point, I agree with allowing the user to pass factor1 and factor2 data into get_clean_factor_and_forward_returns. Furthermore, while I am initially focusing development on the two factor case for simplicity, I think more generally, users should be able to pass as many factors as desired (up to a reasonable amount). In the future, I think there will be parts of the interaction/correlation tear sheet that deal with combinations of more than two factors.
I wasn't very good in my explanation, what I meant was that I would like the user to call get_clean_factor_and_forward_returns
on one factor first and then on another factor (not factor1 and factor2 together) and then pass factor1 and factor2 to tears.create_factors_interaction_tear_sheet.
My idea was to have a prototype like this:
def create_factors_interaction_tear_sheet(factor_data1, factor_data2):
....
Or following your suggestion of handling more than 2 factors, the function could accept a list of factor_data. Also I agree that allowing more than 2 factors makes sense in some parts of the interaction/correlation tear sheet.
def create_factors_interaction_tear_sheet(factor_data_list):
....
The idea behind this choice is that factor_data
is the "standard" input to most Alphalens functions and I would like to keep it that way to make the new API more user friendly. Internally, create_factors_interaction_tear_sheet can combine the two factors as required to perform the computation, but from the user point of view the new API would follow the current convention.
The issue #253 should be considered too, as it might also come in handy.
Please let me know if you foresee any shortcomings if we go through this path.
The idea behind this choice is that factor_data is the "standard" input to most Alphalens functions and I would like to keep it that way to make the new API more user friendly. Internally, create_factors_interaction_tear_sheet can combine the two factors as required to perform the computation, but from the user point of view the new API would follow the current convention.
The issue #253 should be considered too, as it might also come in handy.
Please let me know if you foresee any shortcomings if we go through this path.
I follow what you are saying now. There aren't any shortcomings that I foresee. Issue #253 addresses my concern about users having to duplicate the forward returns calculation on each call to get_clean_factor_and_forward_returns
. I will follow that thread, and as that issue gets finalized, I will submit another pull request with changes.
The idea behind this choice is that factor_data is the "standard" input to most Alphalens functions and I would like to keep it that way to make the new API more user friendly. Internally, create_factors_interaction_tear_sheet can combine the two factors as required to perform the computation, but from the user point of view the new API would follow the current convention.
@luca-s one worry I have about standardizing on this representation is that it means duplicating a fair amount of data if you want to analyze multiple factors. To take an example from the (awesome by the way!) docstring of get_clean_factor_and_forward_returns
, the current standard unit of analysis in alphalens looks like this:
-------------------------------------------------------------------
| | 1D | 5D | 10D |factor|group|factor_quantile
-------------------------------------------------------------------
date | asset | | | | | |
-------------------------------------------------------------------
| AAPL | 0.09|-0.01|-0.079| 0.5 | G1 | 3
--------------------------------------------------------
| BA | 0.02| 0.06| 0.020| -1.1 | G2 | 5
--------------------------------------------------------
2014-01-01 | CMG | 0.03| 0.09| 0.036| 1.7 | G2 | 1
--------------------------------------------------------
| DAL |-0.02|-0.06|-0.029| -0.1 | G3 | 5
--------------------------------------------------------
| LULU |-0.03| 0.05|-0.009| 2.7 | G1 | 2
--------------------------------------------------------
Of the six columns in that frame, only two of them (factor
and factor_quantile
) are specific to the factor under analysis. If a user wants to work with multiple factors, every factor_data
will duplicate the forward returns and group columns.
EDIT: deleted my previous comment. @ssanderson I agree and your point is exactly as @MichaelJMath suggested so let's go with that.
@MichaelJMath @ssanderson by the way, do we all agree we still need the function utils.join_factor_with_factor_and_fforward_returns
which combine multiple factors with forward returns
(which will probably make use of the new API developed in #253 )?
This begins the development of the "factors_interaction_tearsheet" from issue #219. The goal of this pull request is to get feedback on whether this branch seems to be going in the right direction.
Description of Changes
join_factor_with_factor_and_forward_returns
functionget_clean_factor_and_forward_returns
that joins an additional factor to thefactor_data
dataframe returned byget_clean_factor_and_forward_returns
.multi_factor_data
", will be the core source/data structure providing the necessary data for the factors_interaction_tearsheet computations.perf.mean_return_by_quantile
to take an additional parameter so that it can group by multiple factor quantiles.plot_multi_factor_quantile_returns
, to create an annotated heatmap of mean returns by two-factor quantile bins.tears.create_factors_interaction_tear_sheet
as the entry point to the multi-factor tearsheet.Requesting Feedback
utils.join_factor_with_factor_and_forward_returns
functionget_clean_factor_and_forward_returns
?perf.mean_return_by_quantile
, I needed to add a parameter to this function to make it work in a clean manner. Not sure how I could have done that with a wrapper.