robertmartin8 / PyPortfolioOpt

Financial portfolio optimisation in python, including classical efficient frontier, Black-Litterman, Hierarchical Risk Parity
https://pyportfolioopt.readthedocs.io/
MIT License

The objective is not DCP. #82

Closed filipeteotonio closed 4 years ago

filipeteotonio commented 4 years ago

I've been using PyPortfolioOpt for a couple of days now and I'm finding it very interesting. However, it seems that for some combinations of assets in my dataset, the max_sharpe method will not work. The error comes from cvxpy:

DCPError: Problem does not follow DCP rules. Specifically:
The objective is not DCP. Its following subexpressions are not:
QuadForm(var20890, [[1.28992571 0.05675357 0.23213086 0.04030597]
 [0.05675357 0.39647776 0.3670981  0.09351321]
 [0.23213086 0.3670981  0.37204403 0.19053016]
 [0.04030597 0.09351321 0.19053016 0.17553035]])

Any ideas on why this error would occur?

robertmartin8 commented 4 years ago

Hi Filipe,

Thanks for reaching out! Could you provide a sample of the code you are using? Or let me know if you have added any constraints and objectives.

Best, Robert


robertmartin8 commented 4 years ago

oops, accidentally closed.

as-com commented 4 years ago

I was able to trigger this error with the example code in the readme:

import pandas as pd
from pypfopt import EfficientFrontier
from pypfopt import risk_models
from pypfopt import expected_returns

# Read in price data
df = pd.read_csv("stonks.csv", parse_dates=True, index_col="Date")

# Calculate expected returns and sample covariance
mu = expected_returns.mean_historical_return(df)
S = risk_models.sample_cov(df)

# Optimise for maximal Sharpe ratio
ef = EfficientFrontier(mu, S)
raw_weights = ef.max_sharpe()
cleaned_weights = ef.clean_weights()
ef.save_weights_to_file("weights.csv")  # saves to file
print(cleaned_weights)
ef.portfolio_performance(verbose=True)

And this CSV file: stonks.zip

robertmartin8 commented 4 years ago

@as-com

Thanks so much! I'm working on it now. One of my preliminary (strange) discoveries is that the covariance matrix is not actually positive semidefinite, which explains why the optimiser is failing (it also fails for min_volatility!).

import numpy as np

# cov matrix from stonks.csv
S = np.array([[0.03818144, 0.04182824],
              [0.04182824, 0.04149209]])
S.trace()  # gives 0.0797
np.linalg.det(S)  # gives -0.00016

For a 2x2 matrix, the matrix is positive semidefinite if and only if both the trace and determinant are ≥ 0.
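Equivalently, one can look at the eigenvalues directly (a quick check, assuming numpy):

import numpy as np

S = np.array([[0.03818144, 0.04182824],
              [0.04182824, 0.04149209]])
print(np.linalg.eigvalsh(S))  # one eigenvalue is negative, so S is not PSD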

jonathanng commented 4 years ago

For numerical stability, you should add a small amount of noise to the diagonal.

robertmartin8 commented 4 years ago

@jonathanng

Thanks for the suggestion! I'll look into it. But surely the noise isn't guaranteed to make it positive semidefinite?

jonathanng commented 4 years ago

https://math.stackexchange.com/questions/665026/adding-elements-to-diagonal-of-symmetric-matrix-to-ensure-positive-definiteness
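In short, per the linked answer: adding a constant delta to the diagonal shifts every eigenvalue up by exactly delta, so any delta at least as large as the magnitude of the most negative eigenvalue is guaranteed to make the matrix positive semidefinite. A minimal sketch with the matrix from above:

import numpy as np

S = np.array([[0.03818144, 0.04182824],
              [0.04182824, 0.04149209]])

# Eigenvalues of S + delta * I are eig(S) + delta, so delta = -min(eig)
# lifts the negative eigenvalue exactly to zero.
delta = -np.min(np.linalg.eigvalsh(S))
S_fixed = S + delta * np.eye(len(S))
print(np.linalg.eigvalsh(S_fixed))  # smallest eigenvalue is now ~0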

robertmartin8 commented 4 years ago

@jonathanng

Thanks for the link, I'll give it a read. I'm also considering a simpler solution of just clipping the eigenvalues at zero then reconstructing the matrix (as described here). @schneiderfelipe do you have any thoughts on this? Your linear algebra skills are much better than mine 😂

Anyway, @filipeteotonio, I think the best solution for the time being is to use a different estimator of the covariance matrix. I'd suggest the following:

S = risk_models.CovarianceShrinkage(df).ledoit_wolf()

schneiderfelipe commented 4 years ago

The following is a long reply.

Previous behavior

I was able to trigger this error with the example code in the readme [quoting @as-com's code and stonks.zip from above].

I made this code work with PyPortfolioOpt version 0.5.1. The calculated covariance matrix is the same:

➜  ~ ipython3
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd 
   ...: from pypfopt.efficient_frontier import EfficientFrontier 
   ...: from pypfopt import risk_models 
   ...: from pypfopt import expected_returns 
   ...:  
   ...: # Read in price data 
   ...: df = pd.read_csv("stonks.csv", parse_dates=True, index_col="Date") 
   ...:  
   ...: # Calculate expected returns and sample covariance 
   ...: mu = expected_returns.mean_historical_return(df) 
   ...: S_ = risk_models.sample_cov(df) 
   ...:  
   ...: # Optimise for maximal Sharpe ratio 
   ...: ef = EfficientFrontier(mu, S_) 
   ...: raw_weights = ef.max_sharpe() 
   ...: cleaned_weights = ef.clean_weights() 
   ...: ef.save_weights_to_file("weights.csv")  # saves to file 
   ...: print(cleaned_weights) 
   ...: ef.portfolio_performance(verbose=True)                                  
{'VTI': 0.0, 'MGK': 1.0}
Expected annual return: 11.4%
Annual volatility: 20.4%
Sharpe Ratio: 0.46
Out[1]: (0.11443438934256847, 0.20369607211761417, 0.46360437077079236)

In [2]: S_
Out[2]: 
          VTI       MGK
VTI  0.038181  0.041828
MGK  0.041828  0.041492

And it is not positive semidefinite, as observed by @robertmartin8 (the first eigenvalue, -0.00202422, is negative):

In [3]: import numpy as np                                                      

In [4]: np.linalg.eigh(S_)                                                       
Out[4]: 
(array([-0.00202422,  0.08169775]), array([[-0.72095193,  0.69298507],
        [ 0.69298507,  0.72095193]]))

I could reproduce the error with version 1.0.2 as well. I did all this to have a baseline for comparison.

Why not positive definite?

risk_models.sample_cov might produce a matrix with negative eigenvalues if one of them is close to zero (so that it randomly flips sign) or, as suggested here, if some data are missing. There are two ways we might get around this: shift the whole diagonal up by the magnitude of the most negative eigenvalue (method="diag"), or clip the negative eigenvalues at zero and reconstruct the matrix from its eigendecomposition (method="spectral").

I implemented both methods in the function below:

import numpy as np

from pypfopt import expected_returns

def sample_cov(prices, frequency=252, method=None):
    returns = expected_returns.returns_from_prices(prices)
    S = returns.cov()
    if method is not None:
        w, U = np.linalg.eigh(S)  # eigenvalues (ascending) and eigenvectors
    if method == "diag":
        # Shift the diagonal up by the most negative eigenvalue, if any.
        mw = np.min(w)
        if mw < 0:
            S -= mw * np.eye(len(S))
    elif method == "spectral":
        # Clip negative eigenvalues at zero and reconstruct the matrix.
        w = np.where(w > 0, w, 0)
        S = U @ np.diag(w) @ U.T
    return S * frequency

method="diag"

In [5]: import pandas as pd 
   ...: from pypfopt.efficient_frontier import EfficientFrontier 
   ...: from pypfopt import risk_models 
   ...: from pypfopt import expected_returns 
   ...:  
   ...: # Read in price data 
   ...: df = pd.read_csv("stonks.csv", parse_dates=True, index_col="Date") 
   ...:  
   ...: # Calculate expected returns and sample covariance 
   ...: mu = expected_returns.mean_historical_return(df) 
   ...: S = sample_cov(df, method="diag") 
   ...:  
   ...: # Optimise for maximal Sharpe ratio 
   ...: ef = EfficientFrontier(mu, S) 
   ...: raw_weights = ef.max_sharpe() 
   ...: cleaned_weights = ef.clean_weights() 
   ...: ef.save_weights_to_file("weights.csv")  # saves to file 
   ...: print(cleaned_weights) 
   ...: ef.portfolio_performance(verbose=True)                                  
{'VTI': 0.0, 'MGK': 1.0}
Expected annual return: 11.4%
Annual volatility: 20.9%
Sharpe Ratio: 0.45
Out[5]: (0.11443438934256848, 0.20860562383796466, 0.4526934010941182)

In [6]: S                                                                       
Out[6]: 
          VTI       MGK
VTI  0.040206  0.041828
MGK  0.041828  0.043516

In [7]: S - S_                                                                  
Out[7]: 
          VTI       MGK
VTI  0.002024  0.000000
MGK  0.000000  0.002024

In [8]: np.linalg.norm(S - S_)                                                  
Out[8]: 0.0028626744284440757

Some remarks: only the diagonal changes (each variance grows by 0.002024, the magnitude of the negative eigenvalue), so the covariances are untouched; the Frobenius distance to the original matrix is about 0.00286.

method="spectral"

In [9]: import pandas as pd 
   ...: from pypfopt.efficient_frontier import EfficientFrontier 
   ...: from pypfopt import risk_models 
   ...: from pypfopt import expected_returns 
   ...:  
   ...: # Read in price data 
   ...: df = pd.read_csv("stonks.csv", parse_dates=True, index_col="Date") 
   ...:  
   ...: # Calculate expected returns and sample covariance 
   ...: mu = expected_returns.mean_historical_return(df) 
   ...: S = sample_cov(df, method="spectral") 
   ...:  
   ...: # Optimise for maximal Sharpe ratio 
   ...: ef = EfficientFrontier(mu, S) 
   ...: raw_weights = ef.max_sharpe() 
   ...: cleaned_weights = ef.clean_weights() 
   ...: ef.save_weights_to_file("weights.csv")  # saves to file 
   ...: print(cleaned_weights) 
   ...: ef.portfolio_performance(verbose=True)                                  
{'VTI': 0.0, 'MGK': 1.0}
Expected annual return: 11.4%
Annual volatility: 20.6%
Sharpe Ratio: 0.46
Out[9]: (0.11443438934256848, 0.20606837668700312, 0.4582672550772053)

In [10]: S                                                                      
Out[10]: 
array([[0.03923357, 0.04081692],
       [0.04081692, 0.04246418]])

In [11]: S - S_                                                                  
Out[11]: 
          VTI       MGK
VTI  0.001052 -0.001011
MGK -0.001011  0.000972

In [12]: np.linalg.norm(S - S_)                                                  
Out[12]: 0.002024216500682136

Some remarks: this time every entry changes, but the Frobenius distance is about 0.002024, exactly the magnitude of the clipped eigenvalue; also note that the result is a NumPy array rather than a DataFrame, since the matrix is rebuilt from its eigendecomposition.

Bottom line

I think both methods deserve attention: method="diag" is conservative (it homogeneously increases risk only), which is interesting from an investment point of view. But method="spectral" is mathematically optimal, and in my opinion the way to go if you trust your data.
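As a quick numerical check (a sketch using the covariance matrix from the session above): clipping eigenvalues is the nearest-PSD projection in the Frobenius norm, so the distance achieved by the spectral fix equals the norm of the clipped eigenvalues, which no positive semidefinite matrix can beat.

import numpy as np

S_ = np.array([[0.03818144, 0.04182824],
               [0.04182824, 0.04149209]])  # sample covariance from the session

w, U = np.linalg.eigh(S_)
S_spectral = U @ np.diag(np.where(w > 0, w, 0)) @ U.T

print(np.linalg.norm(S_spectral - S_))        # ~0.002024
print(np.linalg.norm(np.where(w < 0, w, 0)))  # ~0.002024, the theoretical minimum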

robertmartin8 commented 4 years ago

@schneiderfelipe

Thanks for the excellent work! My thoughts are as follows, let me know what you think!

Firstly, I think this is quite an advanced topic so I don't want the user to have to understand too much about what's going on. Thus in most cases, I think this fixing process should happen automatically on the backend, without the user having to explicitly choose a fixing method.

My plan is to define a new function fix_nonpositive_semidefinite(matrix, fix_method="spectral") in risk_models, which checks whether the input matrix is positive semidefinite and fixes it if not. This method will be called at the end of all the other risk models (e.g. return fix_nonpositive_semidefinite(cov)), though I will raise a warning if the original matrix isn't PSD, just so the user is aware. I will probably have to provide a kwarg in all the functions so the choice of method can be passed down.
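For concreteness, a minimal sketch of what such a function might look like, combining the two fixes discussed above (the internals here are assumptions, not the final implementation):

import warnings

import numpy as np

def fix_nonpositive_semidefinite(matrix, fix_method="spectral"):
    w, V = np.linalg.eigh(matrix)
    if np.min(w) > 0 or np.isclose(np.min(w), 0):
        return matrix  # already (numerically) PSD, nothing to fix
    warnings.warn("Covariance matrix is not positive semidefinite; amending eigenvalues.")
    if fix_method == "spectral":
        # Clip negative eigenvalues at zero and reconstruct the matrix.
        w = np.where(w > 0, w, 0)
        return V @ np.diag(w) @ V.T
    elif fix_method == "diag":
        # Shift the diagonal by slightly more than the most negative
        # eigenvalue (the 1.1 factor is discussed below).
        return matrix - 1.1 * np.min(w) * np.eye(len(matrix))
    raise NotImplementedError("Method {} not implemented".format(fix_method))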

As for the default method, based on your analysis I think spectral makes more sense. In particular, I don't think it matters whether assets are treated equally or unequally: e.g. if I have two assets with variances 0.05 and 0.15 and subtract 0.04 from both ('equal' treatment), the result is variances of 0.01 and 0.11, which somehow feels very imbalanced anyway.

On another note, I've decided to adopt your suggestion from issue #3, i.e. having a risk_matrix(method="sample_cov") master function. I don't plan on deprecating the other API just yet.

robertmartin8 commented 4 years ago

To get the diag method to work, I had to subtract a little bit more than the min eigenvalue (otherwise I think there are floating point issues around zero).

fixed_matrix = matrix - 1.1 * min_eig * np.eye(len(matrix))

schneiderfelipe commented 4 years ago

To get the diag method to work, I had to subtract a little bit more than the min eigenvalue (otherwise I think there are floating point issues around zero).

fixed_matrix = matrix - 1.1 * min_eig * np.eye(len(matrix))

Interesting, does not doing that break cvxpy? I tested this and, indeed, the corrected matrix ends up having a small negative eigenvalue, but it is so small (~1e-18) that cvxpy didn't complain.

robertmartin8 commented 4 years ago

@schneiderfelipe

I didn't test it with cvxpy, but my _is_positive_semidefinite method (which attempts a Cholesky factorisation) returns False, so it messes up the error checking workflow. I figure multiplying by 1.1 is a relatively harmless way of amending this.
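For reference, a minimal sketch of what such a Cholesky-based check might look like (an assumption, not the exact library code):

import numpy as np

def _is_positive_semidefinite(matrix):
    try:
        np.linalg.cholesky(matrix)  # succeeds only for positive definite input
        return True
    except np.linalg.LinAlgError:
        return False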

schneiderfelipe commented 4 years ago

@robertmartin8

I didn't test it with cvxpy, but my _is_positive_semidefinite method (which attempts a Cholesky factorisation) returns False, so it messes up the error checking workflow. I figure multiplying by 1.1 is a relatively harmless way of amending this.

Cholesky decomposition works for positive definite matrices only (all eigenvalues strictly greater than zero), so it fails for a matrix with a zero eigenvalue:

In [1]: import numpy as np                                                      
In [2]: np.linalg.cholesky(np.diag([0, 1, 2]))                                          
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-4-9739668864ab> in <module>
----> 1 np.linalg.cholesky(np.diag([0, 1, 2]))

<__array_function__ internals> in cholesky(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/numpy/linalg/linalg.py in cholesky(a)
    753     t, result_t = _commonType(a)
    754     signature = 'D->D' if isComplexType(t) else 'd->d'
--> 755     r = gufunc(a, signature=signature, extobj=extobj)
    756     return wrap(r.astype(result_t, copy=False))
    757 

/usr/local/lib/python3.6/dist-packages/numpy/linalg/linalg.py in _raise_linalgerror_nonposdef(err, flag)
     98 
     99 def _raise_linalgerror_nonposdef(err, flag):
--> 100     raise LinAlgError("Matrix is not positive definite")
    101 
    102 def _raise_linalgerror_eigenvalues_nonconvergence(err, flag):

LinAlgError: Matrix is not positive definite

So, attempting a Cholesky decomposition actually checks for strict positive definiteness. You might check for positive semidefiniteness with the following:

def _is_positive_semidefinite(a):
    w = np.linalg.eigvalsh(a)  # eigenvalues in ascending order
    min_eig = np.min(w)
    # Treat tiny negative eigenvalues (floating-point noise) as zero.
    return min_eig > 0 or np.isclose(min_eig, 0)

This ensures really small negative numbers are considered zero:

In [4]: _is_positive_semidefinite(np.diag([0, 1, 2]))
Out[4]: True

filipeteotonio commented 4 years ago

@robertmartin8 I've been inactive for the last week, which is why I didn't reply, but it's great to see that the discussion moved forward and the fix is already merged! Thank you, guys.