quantopian / alphalens

Performance analysis of predictive (alpha) stock factors
http://quantopian.github.io/alphalens
Apache License 2.0

Interaction and Correlation Features #219

Closed MichaelJMath closed 4 years ago

MichaelJMath commented 6 years ago

This post describes a possible enhancement, but I'm not sure it fits the alphalens project's objective, as it deals with factor combination. I am proposing it because alphalens already has some of the functions in place to make this type of analysis easy. However, I realize it may be better suited to a separate project.

When combining alpha factors, analysts often look at the correlation between factors to determine whether a new factor adds anything "new" to the signal or simply measures the same thing as an existing one. (Ideally, we want factors with low correlation.)
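As a quick sketch of that correlation check (the data and column names here are purely hypothetical, not alphalens API), pandas makes the pairwise comparison a one-liner:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical factor values over the same asset/date universe
factors = pd.DataFrame({
    'momentum': rng.normal(size=500),
    'size': rng.normal(size=500),
})
# A third factor that is largely redundant with momentum
factors['momentum_variant'] = factors['momentum'] + 0.1 * rng.normal(size=500)

# Pairwise correlation matrix; values near +/-1 suggest two
# factors carry overlapping information
corr = factors.corr()
print(corr.round(2))
```

A high off-diagonal entry (e.g. between `momentum` and `momentum_variant` above) would indicate that combining the two factors adds little new signal.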

Additionally, it is interesting to know whether two factors have an interaction effect, where their effects combine multiplicatively rather than additively.

References on interaction effects:
https://courses.washington.edu/smartpsy/interactions.htm
https://en.wikipedia.org/wiki/Interaction_(statistics)

Below, I have included a link to a Jupyter notebook that uses the size and momentum factors from the Ken French data library as an example of what the output might look like. (The implementation in alphalens, if included, would of course be different.) https://www.dropbox.com/s/owh8rfumuch4c9g/Momentum%20and%20Size%20Interaction-2.ipynb?dl=0

Here is a snippet of the output from the notebook: [images: heatmap, interaction plot, correlation plot]

As mentioned above, I'm not 100% sure this fits the alphalens project, as alphalens tends to focus on analysis of a single alpha factor. However, I bring it up because alphalens already has some of the functions that make this type of factor-combination analysis easy. Essentially, the factor_data DataFrame (returned from utils.get_clean_factor_and_forward_returns()) already contains the data needed for the interaction analysis: forward returns, factor values, and quantiles.

Any feedback or questions are appreciated.

luca-s commented 6 years ago

I see this inter-factor analysis as complementary to Alphalens and I believe it would fit well in the project. I have thought about this feature myself a few times, though I hadn't thought about the specific information I'd want to see.

Assuming the Alphalens project managers/admins agree with us that this would be a nice addition to the project, a more robust proposal would need to be formulated: what kind of information would we like to compute? Your two examples are fine as an initial step, but we need a clear design of what information we want to provide.

My feeling is that a multiple-factor analysis is the natural addition to Alphalens, but we need to figure out the big picture before moving forward.

twiecki commented 6 years ago

This looks great; I'm definitely in favor of including something like this. It's also relevant when thinking about risk factors. However, I'm not sure why you split this up by "Weak", "Medium", and "Strong" rather than regressing on the factor values directly?

MichaelJMath commented 6 years ago

@twiecki I figured that by cutting the factors into quantiles, the interaction plot would be similar to the returns tear sheet, which displays mean return by quantile. It would also allow the user to look at the interaction of classifiers with other classifiers (as opposed to just factors).

That said, I agree with you. We should probably use a regression method as long as a continuous factor is provided for the x-axis. (However, the second factor would still need to be cut into quantiles.)

I also like the idea of providing an option for the regression/smoothing method used (e.g. OLS or LOWESS). Something like LOWESS would allow the user to analyze non-linear relationships. Here is an example of what the plot might look like using some simulated factor and return data: [image: smoothers]

Here is the code used to generate these plots along with some additional output for context:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import statsmodels.api as sm

# Set number of simulated data points
num_obs = 1000

np.random.seed(88)

# Create two uncorrelated factors (standardized to mu=0, sigma=1)
factor_1 = np.random.normal(size=num_obs)
factor_2 = np.random.normal(size=num_obs)

# Cut Factor_2 into quantiles
q2 = pd.qcut(factor_2, 3, labels=[1,2,3])

# Create Simulated Returns
returns = (np.exp(0.8*factor_1) + 
           0.3*factor_2 + 
           0.5*factor_1*factor_2 + 
           np.random.normal(0,1,num_obs))

# Combine all of the above into a DataFrame
df = pd.DataFrame({'factor_1': factor_1, 
                   'factor_2': factor_2, 
                   'returns': returns, 
                   'quantile_2':q2})
print(df.head())

[image: df.head() sample output]

def plot_scatter_grid(df):
    """Plot a grid of scatter plots faceted by Factor 2 quantile."""
    cmap = mpl.cm.get_cmap('Set1')
    plt.figure(figsize=(10, 10))
    for q in range(3):
        # Get data for quantile q+1
        mask = df['quantile_2'] == (q + 1)
        data = df.loc[mask, ['factor_1', 'returns']]

        # Create scatter plot
        color = cmap(q)
        plt.subplot(2, 2, q + 1)
        data.plot(x='factor_1', y='returns',
                  kind='scatter', ax=plt.gca(),
                  color=color, alpha=0.5,
                  title="Q%s of Factor 2" % (q + 1))

    plt.tight_layout()
    plt.show()

plot_scatter_grid(df)

[image: scatter grid]

def plot_smoothers_by_quantile(df, method='lowess'):
    """Plot a smoother of returns vs. factor_1 for each Factor 2 quantile."""
    ax = plt.gca()
    cmap = mpl.cm.get_cmap('Set1')
    for q in range(3):
        mask = df['quantile_2'] == (q + 1)
        data = df.loc[mask, ['factor_1', 'returns']]
        color = cmap(q)

        xs = data['factor_1']
        ys = data['returns']
        # Generate smoother
        if method == 'lowess':
            smoother = sm.nonparametric.lowess(ys, xs)
            smoother_x = smoother[:, 0]
            smoother_y = smoother[:, 1]
        else:
            # Include an intercept so the OLS fit is not forced through the origin
            model = sm.OLS(ys, sm.add_constant(xs)).fit()
            smoother_x = xs.sort_values()
            smoother_y = model.predict(sm.add_constant(smoother_x))

        ax.plot(smoother_x, smoother_y, color=color,
                label="Q%s of Factor 2" % (q + 1))

    ax.set_xlabel('Factor 1')
    ax.set_ylabel('Mean Return')
    ax.legend()

plt.figure(figsize=(16, 5))

# Plot LOWESS smoothers
plt.subplot(121)
plot_smoothers_by_quantile(df, method='lowess')
plt.title('LOWESS Smoother')

# Plot OLS regression smoothers
plt.subplot(122)
plot_smoothers_by_quantile(df, method='ols')
plt.title('OLS Smoother')
plt.show()

[image: smoothers]

MichaelJMath commented 6 years ago

@luca-s If the project managers/admins agree that this is worthwhile to pursue, what would be the best way to brainstorm/submit ideas and/or create a proposal?

twiecki commented 6 years ago

I agree that a linear regression might not be in agreement with how the rest of the library handles things, but splitting the data into quantiles and then running a regression within each quantile separately seems odd. If we're worried about stability/robustness, could we use a rank correlation instead?
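As a quick sketch of why rank correlation is more robust here (simulated data, not alphalens code): Spearman correlation only depends on ranks, so it captures any monotone relationship that a linear fit would understate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 500

# A factor and a monotone-but-nonlinear transform of it (plus a little noise)
factor = rng.normal(size=n)
transformed = np.exp(factor) + 0.1 * rng.normal(size=n)

# Pearson correlation is dragged down by the nonlinearity;
# Spearman rank correlation captures the monotone relationship
pearson = stats.pearsonr(factor, transformed)[0]
spearman = stats.spearmanr(factor, transformed)[0]
print('Pearson: %.2f  Spearman: %.2f' % (pearson, spearman))
```

Rank correlation would also be consistent with alphalens' use of the Spearman-based information coefficient elsewhere.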

MichaelJMath commented 6 years ago

@twiecki What I was going for initially is what was done in this paper by Xiong and Ibbotson, on page 9 in Tables 2 and 3: [image: Ibbotson table]

The above pretty much matches the grid/heatmap I posted in my first post (except that I used the Fama-French factors). It did not use any form of regression: I simply took the mean of the double-sorted portfolio return bins and then plotted that table in a form often used to analyze interactions in factorial experiments. See below: [image: factorial-experiment interaction plot]
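For concreteness, the double-sorted mean-return table can be sketched in a few lines of pandas (the data and column names here are simulated and illustrative only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000

# Two independent factors and returns containing an interaction term
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
rets = 0.5 * f1 + 0.3 * f2 + 0.5 * f1 * f2 + rng.normal(size=n)

df = pd.DataFrame({'f1': f1, 'f2': f2, 'returns': rets})
df['q1'] = pd.qcut(df['f1'], 3, labels=[1, 2, 3])
df['q2'] = pd.qcut(df['f2'], 3, labels=[1, 2, 3])

# Mean return of each double-sorted bin
# (rows: f1 quantile, columns: f2 quantile)
table = df.pivot_table(index='q1', columns='q2',
                       values='returns', aggfunc='mean')
print(table.round(3))
```

With an interaction present, the spread across `f1` quantiles is much wider in the top `f2` tercile than in the bottom one, which is exactly what the non-parallel lines in the interaction plot show.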

Regarding using a regression, I referenced this article by Richard Williams at the University of Notre Dame, which deals more specifically with interaction analysis of continuous variables. The process would be as follows:

  1. Run a multivariate regression with an interaction term.
  2. Choose a variable of interest to analyze the marginal effects.
  3. Set the other variable to various "values of interest".
  4. Calculate the slope given the regression coefficients and the value from step 3.
  5. Plot the different lines for various "values of interest".
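The steps above can be sketched with statsmodels' formula API (simulated data with a known interaction; the coefficients and names are illustrative only):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 1000

# Simulated factors and returns with a known interaction effect
df = pd.DataFrame({'f1': rng.normal(size=n), 'f2': rng.normal(size=n)})
df['returns'] = (0.8 * df['f1'] + 0.3 * df['f2']
                 + 0.5 * df['f1'] * df['f2'] + rng.normal(size=n))

# Step 1: multivariate regression with an interaction term
# ('f1 * f2' expands to f1 + f2 + f1:f2)
model = smf.ols('returns ~ f1 * f2', data=df).fit()

# Steps 2-5: marginal effect of f1 at chosen "values of interest" of f2:
# d(returns)/d(f1) = b_f1 + b_interaction * f2
b = model.params
for f2_val in (-1.0, 0.0, 1.0):
    slope = b['f1'] + b['f1:f2'] * f2_val
    print('f2 = %+.1f -> marginal slope of f1 = %.2f' % (f2_val, slope))
```

Plotting the fitted line for each `f2` value of interest then gives the fan of non-parallel lines that signals an interaction.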

Using the simulated data from earlier, this would come out as follows:

[images: regression formula, OLS results, continuous-factor interaction plot]

Hopefully this helps clarify. I appreciate the community's feedback.

luca-s commented 6 years ago

@twiecki and @MichaelJMath: if we all agree, we could create a feature branch ('factors_interactions'?) where @MichaelJMath can start submitting his PRs for this new analysis, and we can contribute as well. When the branch reaches a maturity level we are happy with, we can merge it to master.

luca-s commented 6 years ago

I created 'factors_interactions' branch. @MichaelJMath you can now create PRs on that branch if you like and we can all work on that branch too.

MichaelJMath commented 6 years ago

@luca-s, Sounds good. Thanks!

luca-s commented 6 years ago

@MichaelJMath I don't know if you have already thought about how to organize the code. In case you haven't, my view is that we could add a new API in alphalens.tear as the entry point for the factor Interaction and Correlation tear sheet, something like alphalens.tear.create_factors_interaction_tear_sheet. We could then create another file, similar to performance.py, to hold all the computational functions, while the plotting code can go in the existing plotting.py file.

In case you have already planned something, feel free to ignore my comment and keep going with your idea.

twiecki commented 6 years ago

I agree with separating this from the rest of the code base.


MichaelJMath commented 6 years ago

@luca-s , That sounds logical to me.

luca-s commented 6 years ago

@MichaelJMath I believe your PR is definitely going in the right direction. I wrote a few comments on #258, but other than that I don't see issues.

Just a comment regarding the interaction-effects plot, in case you are planning to work on that next: I like your initial proposal, the one that considers both factor1 and factor2 split into quantiles. I would add the regression-based version only later, as the former is more general (it works with any factor, not only continuous ones).

Thank you.

luca-s commented 6 years ago

Another thought I had is to use the returns-decomposition code from #185 to plot how much of the returns from one factor can be explained by the other factor. Do you believe this would be interesting?

EDIT: the user could also run the risk-analysis tool in pyfolio, using one factor's returns as the strategy returns and the second factor's returns as the risk-factor returns.

MichaelJMath commented 6 years ago

@luca-s

Another thought I had is to use returns decomposition code from #185 to plot how much the returns from one factor can be explained by the other factor. Do you believe this would be interesting?

Yes, I do think this would be interesting. I have also briefly looked at your open thread #225, which looks interesting. I think both of these ideas fit the theme of this new tear sheet. Essentially, in this multifactor tear sheet we are trying to find factors that combine to provide a stronger signal. So, in my opinion, it seeks to answer questions like:

  1. How original/new is this factor that I am testing?
  2. How much does one factor or combination of factors explain this factor?
  3. Do factors combine additively or multiplicatively?

You mention pyfolio's existing tools and Quantopian's Risk model in #225. I believe all these tools help to answer these questions. As you mention, it would be nice to have this information in alphalens to use in the research phase of algo development before implementing your algo in the backtest phase.

I need to do a little more research on pyfolio and the risk model, which is why I hadn't commented on thread #225 yet.

luca-s commented 6 years ago

So, in my opinion, this goal seeks to answer questions like:

  1. How original/new is this factor that I am testing?
  2. How much does one factor or combination of factors explain this factor?
  3. Do factors combine additively or multiplicatively?

I agree. Having a clear view of what questions we are trying to answer will certainly drive the development in the right direction.