skfolio / skfolio

Python library for portfolio optimization built on top of scikit-learn
https://skfolio.org
BSD 3-Clause "New" or "Revised" License
1.01k stars 76 forks source link

[BUG] IndexError when using HRP for a small set of indexes #58

Closed 8W9aG closed 1 month ago

8W9aG commented 1 month ago

Describe the bug

IndexError is raised when computing the maximum amount of clusters when the amount of columns in the returns dataframe is low.

│ /Users/xxx/dev/xxx/venv/lib/python3.11/site-packages/skfolio/optimization/cluster/hierar │
│ chical/_hrp.py:326 in fit                                                                        │
│                                                                                                  │
│   325 │   │                                                                                      │
│ ❱ 326 │   │   self.hierarchical_clustering_estimator_.fit(distance)                              │
│   327                                                                                            │
│                                                                                                  │
│ /Users/xxx/dev/xxx/venv/lib/python3.11/site-packages/skfolio/cluster/_hierarchical.py:19 │
│ 6 in fit                                                                                         │
│                                                                                                  │
│   195 │   │   if max_clusters is None:                                                           │
│ ❱ 196 │   │   │   max_clusters = compute_optimal_n_clusters(                                     │
│   197 │   │   │   │   distance=X,                                                                │
│                                                                                                  │
│ /Users/xxx/dev/xxx/venv/lib/python3.11/site-packages/skfolio/utils/stats.py:454 in       │
│ compute_optimal_n_clusters                                                                       │
│                                                                                                  │
│   453 │   for k in range(max_clusters):                                                          │
│ ❱ 454 │   │   level = cut_tree[:, n - k - 1]                                                     │
│   455 │   │   cluster_density = []                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
IndexError: index -4 is out of bounds for axis 1 with size 3

To Reproduce

X_before.parquet.zip

import pandas as pd
from skfolio.optimization import HierarchicalRiskParity
from skfolio.moments import DenoiseCovariance
from skfolio.distance import CovarianceDistance
from skfolio import RiskMeasure

df = pd.read_parquet("X_before.parquet")
model = HierarchicalRiskParity(risk_measure=RiskMeasure.MEAN_ABSOLUTE_DEVIATION,portfolio_params={"name": "HRP-MAD-Ward-DenoisedPearson"},distance_estimator=CovarianceDistance(covariance_estimator=DenoiseCovariance()))
model.fit(df)

Expected behavior

I believe this comes from the code here which defines max clusters at a minimum as 8, whereas the amount of columns could be less than 8.

Additional context

If HRP shouldn't be used for less than 8 return columns this could be exposed and allowed to be checked for by the consumer to avoid catching the IndexError. Otherwise I would suggest taking off the 8 max and changing it to the length of the columns array.

Versions 0.2.1

HugoDelatte commented 1 month ago

Good spot! It has been fixed in v0.2.2