quantopian / alphalens

Performance analysis of predictive (alpha) stock factors
http://quantopian.github.io/alphalens
Apache License 2.0

Two Factor Interaction Development: Initial Data Structure, Test Module, and Plot #258

Closed MichaelJMath closed 6 years ago

MichaelJMath commented 6 years ago

This begins the development of the "factors_interaction_tearsheet" from issue #219. The goal of this pull request is to get feedback on whether this branch seems to be going in the right direction.

Description of Changes

  1. Create join_factor_with_factor_and_forward_returns function
    • Creates a function complementary to get_clean_factor_and_forward_returns that joins an additional factor to the factor_data dataframe returned by get_clean_factor_and_forward_returns (see the rough usage sketch after this list).
    • The returned dataframe, call it "multi_factor_data", will be the core data structure feeding the factors_interaction_tearsheet computations.
  2. Create an associated test module.
  3. Modify perf.mean_return_by_quantile to take an additional parameter so that it can group by multiple factor quantiles.
  4. Add first plotting function, plot_multi_factor_quantile_returns, to create an annotated heatmap of mean returns by two-factor quantile bins.
  5. Create the tears.create_factors_interaction_tear_sheet as the entry point to the multi-factor tearsheet.
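To make the intended flow concrete, here is a rough usage sketch of how these pieces fit together. The join function's exact signature and keyword arguments are my working assumptions for this branch, not a final API, and factor1, factor2 and prices are the user's own inputs:

from alphalens import tears, utils

# Standard single-factor preparation, as users already do today.
factor_data = utils.get_clean_factor_and_forward_returns(
    factor1, prices, quantiles=5, periods=(1, 5, 10))

# Join a second factor (and its quantile column) onto that frame; the
# result is the "multi_factor_data" frame described above.
multi_factor_data = utils.join_factor_with_factor_and_forward_returns(
    factor_data, factor2, quantiles=5)

# Entry point for the new multi-factor tear sheet.
tears.create_factors_interaction_tear_sheet(multi_factor_data)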

Requesting Feedback

  1. Comments and suggestions on the utils.join_factor_with_factor_and_forward_returns function
    1. Should there be a wrapper that builds the multi_factor_data dataframe in one step (i.e., wrap this function with get_clean_factor_and_forward_returns)?
  2. I'm not too familiar with creating effective unit tests, so any feedback on this module is appreciated.
  3. Regarding Change 3 above:
    1. My first thought, following @luca-s's suggestion, was to create a separate performance module that would contain all functions for this sort of computation.
    2. Since the existing performance module already contains a lot of the needed functionality, I thought I might instead create a wrapper function in the new module that added the necessary functionality.
    3. However, I needed to add a parameter to perf.mean_return_by_quantile itself to make it work cleanly, and I'm not sure how I could have done that with a wrapper.
    4. So my question is: what are the community's thoughts on how I handled this particular issue, and on related considerations going forward?
  4. Any other comments or guidance on the path of development going forward would be greatly appreciated.
  5. Also, let me know if there are too many changes in this pull request for efficient/easy review.
luca-s commented 6 years ago

@MichaelJMath there seems to be a lot of development here, thank you very much for your work. I will start reviewing this PR soon. Would you mind rebasing your branch on quantopian:factors_interactions? I have just rebased it on quantopian:master.

MichaelJMath commented 6 years ago

@luca-s I tried rebasing this morning, and git said everything was up-to-date. I think I mistakenly created the new branch off master instead of factors_interactions.

Would it be easier for me to just close this pull request and open a duplicate one now that factors_interactions has been rebased on master? I'm assuming this would remove all of the old commits showing on this pull request that are unrelated to this change.

luca-s commented 6 years ago

I believe your solution would work, but the scenario we are facing is a very common one, so it should be pretty easy to fix without creating a new PR:

# update your upstream information
git fetch upstream -p 

# rebase on top of quantopian:factors_interactions
git checkout two_factor_interaction
git rebase upstream/factors_interactions
git push origin two_factor_interaction
MichaelJMath commented 6 years ago

I tried the following at my Bash prompt. I'm still troubleshooting, but figured I'd post my progress.

Mike (factors_interactions) alphalens $ git remote -v
    origin  https://github.com/MichaelJMath/alphalens.git (fetch)
    origin  https://github.com/MichaelJMath/alphalens.git (push)
    upstream        https://github.com/quantopian/alphalens.git (fetch)
    upstream        https://github.com/quantopian/alphalens.git (push)

Mike (factors_interactions) alphalens $ git fetch upstream -p
Mike (factors_interactions) alphalens $ git checkout two_factor_interaction
    Switched to branch 'two_factor_interaction'

Mike (two_factor_interaction) alphalens $ git rebase upstream/factors_interactions
    Current branch two_factor_interaction is up to date.

Mike (two_factor_interaction) alphalens $ git push origin two_factor_interaction
    Everything up-to-date
luca-s commented 6 years ago

So you were right, your branch two_factor_interaction is up to date. Could you try pushing with the --force option? That might force GitHub to refresh this PR. Otherwise, you can go with your initial suggestion of opening a new PR.

twiecki commented 6 years ago

This looks really promising @MichaelJMath!

MichaelJMath commented 6 years ago

Same result with git push --force:

Mike (two_factor_interaction) alphalens $ git push origin two_factor_interaction --force
    Everything up-to-date

If I look at my network graph on GitHub, it clearly does not show the rebase: my two_factor_interaction branch is still based off master. (Screenshot: my fork's network graph.)

This is in contrast to the alphalens network graph, which shows my branch based off factors_interactions. (Screenshot: alphalens network graph.)

MichaelJMath commented 6 years ago

I'm closing this PR and opening a duplicate to try and resolve the issue.

luca-s commented 6 years ago

@MichaelJMath since quantopian:factors_interactions is a development branch, I rebased/merged your PR there. Now it looks fine, and we can all work on this branch.

luca-s commented 6 years ago

@MichaelJMath I had a quick look at the PR. I like your changes overall; here are my comments:

  1. Comments and suggestions on the utils.join_factor_with_factor_and_forward_returns function
    1. Should there be a wrapper that builds the multi_factor_data dataframe in one step (i.e., wrap this function with get_clean_factor_and_forward_returns)?

My idea was that the user would call get_clean_factor_and_forward_returns on both factor1 and factor2 data to obtain factor_data1 and factor_data2, and then call tears.create_factors_interaction_tear_sheet passing the latter two. Is there an advantage to merging the two factors' data together? If so, we could call utils.join_factor_with_factor_and_forward_returns inside tears.create_factors_interaction_tear_sheet so that users can be unaware of this detail.

Please let me know your thoughts on this.

  2. I'm not too familiar with creating effective unit tests, so any feedback on this module is appreciated.

Your tests look fine. We haven't created a framework for testing tear sheet output yet (see PR #218), so testing only for exceptions is fine for now. More generally, there is a test for each function that performs computation (i.e., the functions inside performance and utils); since you haven't added new computational functions, there is no need for more tests. It is also OK to start with a few tests that cover the most common scenarios and then, as bugs are found, add more tests to avoid regressions.

As a suggestion, it's better to add tests at the end of the PR, when the new API/functionality is well defined; otherwise you might end up rewriting the tests.
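For example, a minimal "runs without exceptions" test could look roughly like the following. The join helper and tear sheet entry point come from this PR's direction, so their signatures here are assumptions, not a settled API:

import unittest

import numpy as np
import pandas as pd

from alphalens import tears, utils


class MultiFactorSmokeTest(unittest.TestCase):

    def test_tear_sheet_runs_without_error(self):
        # Tiny random panel: 4 assets, 30 factor dates; prices extend a
        # few extra days so forward returns exist for the last dates.
        price_dates = pd.date_range('2017-01-02', periods=40, freq='B')
        factor_dates = price_dates[:30]
        assets = ['A', 'B', 'C', 'D']
        rng = np.random.RandomState(42)

        prices = pd.DataFrame(
            100 + rng.randn(len(price_dates), len(assets)).cumsum(axis=0),
            index=price_dates, columns=assets)

        index = pd.MultiIndex.from_product([factor_dates, assets],
                                           names=['date', 'asset'])
        factor1 = pd.Series(rng.randn(len(index)), index=index)
        factor2 = pd.Series(rng.randn(len(index)), index=index)

        factor_data = utils.get_clean_factor_and_forward_returns(
            factor1, prices, quantiles=2, periods=(1,))

        # The two calls below use the new pieces proposed in this PR;
        # only check that nothing raises.
        multi_factor_data = utils.join_factor_with_factor_and_forward_returns(
            factor_data, factor2, quantiles=2)
        tears.create_factors_interaction_tear_sheet(multi_factor_data)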

  3. Regarding Change 3 above:
    1. My first thought, following @luca-s's suggestion, was to create a separate performance module that would contain all functions for this sort of computation.
    2. Since the existing performance module already contains a lot of the needed functionality, I thought I might instead create a wrapper function in the new module that added the necessary functionality.
    3. However, I needed to add a parameter to perf.mean_return_by_quantile itself to make it work cleanly, and I'm not sure how I could have done that with a wrapper.
    4. So my question is: what are the community's thoughts on how I handled this particular issue, and on related considerations going forward?

I believe your change is fine. There are no general rules; depending on the particular case, it might be better to add a wrapper or to add an argument, as you did.

We have to pay particular attention to the API primarily intended for users (all the tears.* functions and utils.get_clean_factor_and_forward_returns). We try our best not to break it, but if we do, we have to document it in the release notes.

  4. Any other comments or guidance on the path of development going forward would be greatly appreciated.

The more familiar you become with the project, the easier it will be, but I believe you already pay a lot of attention to the code you write, and I am happy with your PR.

  5. Also, let me know if there are too many changes in this pull request for efficient/easy review.

There is no limit on the number of changes, but the idea is that you create commits so that whoever looks at the project history has a clue about what happened. Each commit should summarize a new achievement in the project, a self-contained task; you don't want to go into too many details. Try to describe what you achieved with a commit rather than what you did; commit messages should read like release notes.

Also, you can use one of these prefixes in the commit message:

API: an (incompatible) API change
BLD: change related to building numpy
BUG: bug fix
DEP: deprecate something, or remove a deprecated object
DEV: development tool or utility
DOC: documentation
ENH: enhancement
MAINT: maintenance commit (refactoring, typos, etc.)
REV: revert an earlier commit
STY: style fix (whitespace, PEP8)
TST: addition or modification of tests
REL: related to releasing numpy
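For example, the heatmap plotting change might carry a commit message like "ENH: add two-factor quantile mean-returns heatmap" (an illustrative message, not one taken from this PR).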

luca-s commented 6 years ago

I forgot two small details:

You don't want to call plt.show() here and it would be better to follow the example of the other tears.* functions:

Also, here you shouldn't rely on the active figure; it would be better to operate on the passed ax.

MichaelJMath commented 6 years ago

My idea was that the user would call get_clean_factor_and_forward_returns on both factor1 and factor2 data to obtain factor_data1 and factor_data2, and then call tears.create_factors_interaction_tear_sheet passing the latter two.

On this point, I agree with allowing the user to pass factor1 and factor2 data into get_clean_factor_and_forward_returns. Furthermore, while I am initially focusing development on the two-factor case for simplicity, I think that, more generally, users should be able to pass as many factors as desired (up to a reasonable number). In the future, I think there will be parts of the interaction/correlation tear sheet that deal with combinations of more than two factors.

Regarding implementation, do you think the factor argument should be allowed to take a DataFrame and/or a list of Series?

Is there an advantage to merging the two factors' data together? If so, we could call utils.join_factor_with_factor_and_forward_returns inside tears.create_factors_interaction_tear_sheet so that users can be unaware of this detail.

Initially, I was just trying to keep the code separate, for no particularly good reason. After some thought, though, I think there may be a good reason to keep the join_factor_with_factor_and_forward_returns function: to add a factor post hoc, the user might not want to rerun get_clean_factor_and_forward_returns, as that would duplicate calculations and take longer than necessary.

On the other hand, this function adds only minimal functionality. As an alternative, the user could simply call quantize_factor and join the factor and its quantiles onto the "factor_data" DataFrame, without needing join_factor_with_factor_and_forward_returns at all.
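A rough sketch of that manual alternative, assuming quantize_factor lives in utils and using illustrative column names:

import pandas as pd

from alphalens import utils

# factor_data: frame returned by get_clean_factor_and_forward_returns for
# the first factor; factor2: a second factor Series indexed by (date, asset).
factor2 = factor2.reindex(factor_data.index)

# quantize_factor expects a frame with a 'factor' column, so wrap the
# second factor in a minimal frame before bucketing it by date.
factor2_quantile = utils.quantize_factor(
    pd.DataFrame({'factor': factor2}), quantiles=5)

multi_factor_data = factor_data.copy()
multi_factor_data['factor2'] = factor2
multi_factor_data['factor2_quantile'] = factor2_quantile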

What's your opinion? Do you think this additional function just clutters up the code base without really adding any major functionality?

MichaelJMath commented 6 years ago

You don't want to call plt.show() here and it would be better to follow the example of the other tears.* functions:

  • create a GridFigure in create_factors_interaction_tear_sheet
  • pass gf to plot_multi_factor_quantile_returns
  • call plt.show() and gf.close() at the end of create_factors_interaction_tear_sheet

Also, here you shouldn't rely on the active figure; it would be better to operate on the passed ax.

This was an oversight on my part. I'll fix it and submit a PR.
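For the record, the corrected flow I have in mind looks roughly like this; plot_multi_factor_quantile_returns is the new function from this branch, so its signature here is only a sketch:

import matplotlib.pyplot as plt

from alphalens import performance as perf
from alphalens import plotting
from alphalens.tears import GridFigure


def create_factors_interaction_tear_sheet(multi_factor_data):
    # Mean returns by quantile; the extra multi-factor grouping argument
    # added in this PR is omitted here for brevity.
    mean_ret, _ = perf.mean_return_by_quantile(multi_factor_data)

    # One GridFigure owns the layout; each plotting function draws on the
    # Axes it is handed instead of on the active figure.
    gf = GridFigure(rows=1, cols=1)
    plotting.plot_multi_factor_quantile_returns(mean_ret, ax=gf.next_row())

    # Show and close once, at the end, like the other tears.* functions.
    plt.show()
    gf.close()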

MichaelJMath commented 6 years ago

@luca-s, thanks for the advice on how to improve my commit messages. I'll start using the prefixes and work on describing what I achieved while letting the code show what I did.

luca-s commented 6 years ago

On this point, I agree with allowing the user to pass factor1 and factor2 data into get_clean_factor_and_forward_returns. Furthermore, while I am initially focusing development on the two-factor case for simplicity, I think that, more generally, users should be able to pass as many factors as desired (up to a reasonable number). In the future, I think there will be parts of the interaction/correlation tear sheet that deal with combinations of more than two factors.

I wasn't very clear in my explanation. What I meant was that I would like the user to call get_clean_factor_and_forward_returns on one factor first and then on the other factor (not on factor1 and factor2 together), and then pass the two resulting factor_data frames to tears.create_factors_interaction_tear_sheet. My idea was to have a prototype like this:

def create_factors_interaction_tear_sheet(factor_data1, factor_data2):
    ....

Or, following your suggestion of handling more than two factors, the function could accept a list of factor_data frames. I also agree that allowing more than two factors makes sense in some parts of the interaction/correlation tear sheet.

def create_factors_interaction_tear_sheet(factor_data_list):
    ....

The idea behind this choice is that factor_data is the "standard" input to most Alphalens functions, and I would like to keep it that way to make the new API more user friendly. Internally, create_factors_interaction_tear_sheet can combine the two factors as required to perform the computation, but from the user's point of view the new API would follow the current convention.
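Concretely, the usage I have in mind is roughly the following (factor1, factor2 and prices are the user's own inputs; the tear sheet signature is the one sketched above, still under discussion):

from alphalens import tears, utils

# Build the standard factor_data frame once per factor, exactly as for
# the existing single-factor tear sheets.
factor_data1 = utils.get_clean_factor_and_forward_returns(
    factor1, prices, quantiles=5, periods=(1, 5, 10))
factor_data2 = utils.get_clean_factor_and_forward_returns(
    factor2, prices, quantiles=5, periods=(1, 5, 10))

# The interaction tear sheet merges the frames internally, so the user
# never has to build the combined structure.
tears.create_factors_interaction_tear_sheet(factor_data1, factor_data2)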

Issue #253 should be considered too, as it might also come in handy.

Please let me know if you foresee any shortcomings if we go down this path.

MichaelJMath commented 6 years ago

The idea behind this choice is that factor_data is the "standard" input to most Alphalens functions, and I would like to keep it that way to make the new API more user friendly. Internally, create_factors_interaction_tear_sheet can combine the two factors as required to perform the computation, but from the user's point of view the new API would follow the current convention.

Issue #253 should be considered too, as it might also come in handy.

Please let me know if you foresee any shortcomings if we go down this path.

I follow what you are saying now. There aren't any shortcomings that I foresee. Issue #253 addresses my concern about users having to duplicate the forward returns calculation on each call to get_clean_factor_and_forward_returns. I will follow that thread, and as that issue gets finalized, I will submit another pull request with changes.

ssanderson commented 6 years ago

The idea behind this choice is that factor_data is the "standard" input to most Alphalens functions, and I would like to keep it that way to make the new API more user friendly. Internally, create_factors_interaction_tear_sheet can combine the two factors as required to perform the computation, but from the user's point of view the new API would follow the current convention.

@luca-s one worry I have about standardizing on this representation is that it means duplicating a fair amount of data if you want to analyze multiple factors. To take an example from the (awesome by the way!) docstring of get_clean_factor_and_forward_returns, the current standard unit of analysis in alphalens looks like this:

           -------------------------------------------------------------------
                      |       | 1D  | 5D  | 10D  |factor|group|factor_quantile
           -------------------------------------------------------------------
               date   | asset |     |     |      |      |     |
           -------------------------------------------------------------------
                      | AAPL  | 0.09|-0.01|-0.079|  0.5 |  G1 |      3
                      --------------------------------------------------------
                      | BA    | 0.02| 0.06| 0.020| -1.1 |  G2 |      5
                      --------------------------------------------------------
           2014-01-01 | CMG   | 0.03| 0.09| 0.036|  1.7 |  G2 |      1
                      --------------------------------------------------------
                      | DAL   |-0.02|-0.06|-0.029| -0.1 |  G3 |      5
                      --------------------------------------------------------
                      | LULU  |-0.03| 0.05|-0.009|  2.7 |  G1 |      2
                      --------------------------------------------------------

Of the six columns in that frame, only two (factor and factor_quantile) are specific to the factor under analysis. If a user wants to work with multiple factors, every factor_data frame will duplicate the forward-returns and group columns.

luca-s commented 6 years ago

EDIT: I deleted my previous comment. @ssanderson I agree, and your point is exactly what @MichaelJMath suggested, so let's go with that.

luca-s commented 6 years ago

@MichaelJMath @ssanderson by the way, do we all agree that we still need the function utils.join_factor_with_factor_and_forward_returns, which combines multiple factors with forward returns (and will probably make use of the new API developed in #253)?