nyanp / optiver-realized-volatility-prediction

23 stars 6 forks source link

bug(KNN FIT): Neighbors metric_params expecting 'VI', not 'V' #1

Open ryanrussell opened 2 years ago

ryanrussell commented 2 years ago

Hi,

I'm getting this from lightgbm-clustering-model.ipynb

Traceback (most recent call last):
  File "/examples/nyan-optiver/lightgbm-clustering-model.py", line 340, in <module>
    k_neighbors_p1 = Neighbors(pivot, 2, metric='mahalanobis', metric_params={'V':np.cov(pivot.values.T)})
  File "/examples/nyan-optiver/lightgbm-clustering-model.py", line 325, in __init__
    self.distances, self.neighbors = nn.kneighbors(pivot, return_distance=True)
  File "/opt/bitnami/python/lib/python3.10/site-packages/sklearn/neighbors/_base.py", line 752, in kneighbors
    chunked_results = list(
  File "/opt/bitnami/python/lib/python3.10/site-packages/sklearn/metrics/pairwise.py", line 1709, in pairwise_distances_chunked
    params = _precompute_metric_params(X, Y, metric=metric, **kwds)
  File "/opt/bitnami/python/lib/python3.10/site-packages/sklearn/metrics/pairwise.py", line 1551, in _precompute_metric_params
    raise ValueError(
ValueError: The 'VI' parameter is required for the mahalanobis metric when Y is passed.

Comes from your KNN Fit notebook:

k_neighbors_p1 = Neighbors(pivot, 2, metric='mahalanobis', metric_params={'V':np.cov(pivot.values.T)})
...
k_neighbors_size = Neighbors(pivot, 2, metric='mahalanobis', metric_params={'V':np.cov(pivot.values.T)})

Changing these from V -> VI per the error message resolves and runs. Is this supposed to be VI?

python --version
Python 3.10.4
...

pip list | grep 'sci\|sklearn'
scikit-learn             1.0.2
scipy                    1.8.0
sklearn                  0.0

Happy to submit a PR with the change, but wanted to ask first if this is correct.

Also, I've done some refactoring and Dockerizing for our purposes. If you like, I can submit a PR back with a Dockerfile and a docker compose yaml file to make this easier for people to bring up on their own machines.

nyanp commented 2 years ago

@ryanrussell Thanks for the report :) This code originally worked with scikit-learn 0.23.2 on Kaggle notebook environment (here), but due to an API change on scikit-learn, it no longer works with the newer version of scikit-learn.

https://github.com/scikit-learn/scikit-learn/pull/16993

Please note that VI means the inverse matrix of V, so simply renaming the keywords will not work as intended. Therefore, explicitly calculating the inverse matrix with reference to the PR above will solve this problem.

Of course, pull requests to fix it and add a Dockerfile are welcome!