ip200 / venn-abers

Python implementation of binary and multi-class Venn-ABERS calibration
MIT License

Categorical Support #10

Closed (NMVRodrigues closed 10 months ago)

NMVRodrigues commented 11 months ago

Hi, Currently it is not possible to apply the multi-class VA to estimators or data that contain categorical variables, because of the restrictions of the OneVsOneClassifier. I may be mistaken, but whenever I pass a CatBoost model trained on categorical data, an error is thrown during the input validation of OneVsOneClassifier, while a fully numerical CatBoost model works fine (trace at the end). Are there plans to extend VA to support models trained on categorical data?

Traceback (most recent call last):
  File "/home/aime/Desktop/Nuno/UC/uc_conformal_calibration.py", line 154, in <module>
    va_cv.fit(np.asarray(X_train), np.asarray(y_train))
  File "/home/aime/miniconda3/envs/uc/lib/python3.9/site-packages/venn_abers/venn_abers.py", line 771, in fit
    self.va_calibrator.fit(_x_train, _y_train)
  File "/home/aime/miniconda3/envs/uc/lib/python3.9/site-packages/venn_abers/venn_abers.py", line 546, in fit
    self.clf_ovo = OneVsOneClassifier(self.estimator).fit(_x_train, _y_train)
  File "/home/aime/miniconda3/envs/uc/lib/python3.9/site-packages/sklearn/multiclass.py", line 676, in fit
    X, y = self._validate_data(
  File "/home/aime/miniconda3/envs/uc/lib/python3.9/site-packages/sklearn/base.py", line 584, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/home/aime/miniconda3/envs/uc/lib/python3.9/site-packages/sklearn/utils/validation.py", line 1106, in check_X_y
    X = check_array(
  File "/home/aime/miniconda3/envs/uc/lib/python3.9/site-packages/sklearn/utils/validation.py", line 879, in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
  File "/home/aime/miniconda3/envs/uc/lib/python3.9/site-packages/sklearn/utils/_array_api.py", line 185, in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
ValueError: could not convert string to float: 'Present'
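
The last frames of the trace suggest that the failure comes from casting the feature matrix to a numeric array before any estimator is fitted. A minimal sketch of that step in isolation, assuming a toy row that contains the string 'Present' (the row itself is made up for illustration; only the string comes from the trace):

```python
import numpy as np

# Hypothetical toy row mixing numbers with a string category.
# 'Present' is the value named in the traceback above.
X = np.array([[1.0, 2.0, "Present"]], dtype=object)

try:
    # Casting an object array that holds strings to float fails,
    # which is what the numeric input validation runs into.
    np.asarray(X, dtype=float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If this reading is right, the error is independent of CatBoost: the string column is rejected before the wrapped estimator ever sees the data.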

Best regards, Nuno

ip200 commented 11 months ago

Hi Nuno

Thank you very much for your message, and I am sorry about this issue. Currently, the package aims to support multi-class classification with categorical y-labels as well as numerical ones. During testing I tried it with various scikit-learn classifiers and it seems to work fine. Below is a snippet of example code applied to a toy dataset, also using CatBoost, which seems to work for me:

import numpy as np
import pandas as pd
from venn_abers import VennAbersCalibrator
from catboost import CatBoostClassifier

data = {'X1': [7, 6, 5, 2, 5, 7, 3, 7, 2, 1, 5],
        'X2': [20, 21, 19, 18, 7, 12, 4, 12, 8, 3, 7],
        'X3': [6.1, 5.9, 6.0, 6.1, 5, 23, 5.5, 6.1, 4.5, 5.1, 5.5],
        'y_label': ['M', 'N', 'F', 'F', 'N', 'F', 'F', 'M', 'N', 'N', 'F']
        }

df = pd.DataFrame(data)
X = df.iloc[:, :-1].values
y = df.y_label.values
clf = CatBoostClassifier(verbose=False)
va = VennAbersCalibrator(estimator=clf, inductive=False, n_splits=2, random_state=101)
clf.fit(X, y)
va.fit(X, y)
p_pred = clf.predict_proba(X)
y_pred = clf.predict(X)
p_prime = va.predict_proba(X)
y_prime = va.predict(X, one_hot=False)

Would it perhaps be possible to have a sample of the code you're using to see what the issue may be? Best regards, Ivan

NMVRodrigues commented 11 months ago

Hi Ivan, Thank you for the swift reply!

I'm sorry for not being clear in the previous post. I meant that the issue is with categorical features, not labels. Example:

import numpy as np
import pandas as pd
from venn_abers import VennAbersCalibrator
from catboost import CatBoostClassifier
data = {'X1': [7, 6, 5, 2, 5, 7, 3, 7, 2, 1, 5],
        'X2': [20, 21, 19, 18, 7, 12, 4, 12, 8, 3, 7],
        'X3': [6.1, 5.9, 6.0, 6.1, 5, 23, 5.5, 6.1, 4.5, 5.1, 5.5],
        'X4': ['a', 'b', 'a', 'c', 'a', 'b', 'b', 'a', 'c', 'b', 'a'],
        'y_label': ['M', 'N', 'F', 'F', 'N', 'F', 'F', 'M', 'N', 'N', 'F']
        }

df = pd.DataFrame(data)
X = df.iloc[:, :-1].values
y = df.y_label.values
clf = CatBoostClassifier(verbose=False, cat_features=[3])
va = VennAbersCalibrator(estimator=clf, inductive=False, n_splits=2, random_state=101)
clf.fit(X,y)
va.fit(X,y)
p_pred = clf.predict_proba(X)
y_pred = clf.predict(X)
p_prime = va.predict_proba(X)
y_prime = va.predict(X, one_hot=False)

ip200 commented 11 months ago

Hi Nuno, thank you again for your message. I have published a branch, https://github.com/ip200/venn-abers/tree/categorical_support, which I hope will solve your issue. Would you mind trying it when you have a moment? In this case you just need to pass clf = CatBoostClassifier(verbose=False) to the VennAbersCalibrator, i.e. without the cat_features option. Thanks, Ivan

NMVRodrigues commented 10 months ago

Hi Ivan, sorry for the delay. I'm not sure I fully understand what you mean, since it is not possible to fit the CatBoostClassifier without providing cat_features when the dataset includes categorical features. I also tried simply passing the unfitted clf to the va, which produces the same error as before (a string cannot be converted to a float). I tried both approaches with the new branch you mentioned. Did you get working results?

Best, Nuno

ip200 commented 10 months ago

Hi Nuno

Thank you very much for trying. This is the snippet of code I tried on the new branch:

import pandas as pd
from src.venn_abers import VennAbersCalibrator
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

data = {'X1':[7, 6, 5, 2, 5, 7, 3, 7, 2, 1, 5],
        'X2':[20, 21, 19, 18, 7, 12, 4, 12, 8, 3, 7],
        'X3' : [6.1, 5.9, 6.0, 6.1, 5, 23, 5.5, 6.1, 4.5, 5.1, 5.5],
        'X4' : ['a', 'b', 'a', 'c', 'a', 'b', 'b', 'a', 'c', 'b', 'a'],
        'y_label': ['M','N', 'F', 'F', 'N', 'F', 'F', 'M', 'N', 'N', 'F']
        }

df = pd.DataFrame(data)
X = df.iloc[:, :-1]
y = df.y_label.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
clf = CatBoostClassifier(verbose=False)
va = VennAbersCalibrator(estimator=clf, inductive=True, cal_size=0.5, random_state=101)
va.fit(X_train, y_train)
p_prime = va.predict_proba(X_test)
y_prime = va.predict(X_test, one_hot=False)

which seems to work fine. Would it be possible to check if it runs for you?

Best,

Ivan

NMVRodrigues commented 10 months ago

Hi Ivan,

That works perfectly, yes. I had made a typo when doing git checkout, which is why it wasn't working. Thank you so much for addressing this! Will this change be added to the pip package soon?

Best, Nuno

pauzzz commented 10 months ago

Same here, would love to have this added to the pip package soon. I'm using LightGBM as a reference.

ip200 commented 10 months ago

Hi, the pip package now contains this latest update.

Please run

pip install venn-abers==1.4.1

Please note that categorical features are handled using pandas.get_dummies() within VennAbersCalibrator. I hope this is OK; if you encounter any issues, please let me know. Thanks, Ivan
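
For readers unfamiliar with pandas.get_dummies(), here is a rough sketch of what it does to the toy features used earlier in this thread. How VennAbersCalibrator calls it internally is not shown here, so the explicit `columns` and `dtype` arguments below are assumptions chosen for illustration only:

```python
import pandas as pd

# Toy features from the thread: three numeric columns and one string column.
X = pd.DataFrame({
    "X1": [7, 6, 5, 2, 5, 7, 3, 7, 2, 1, 5],
    "X2": [20, 21, 19, 18, 7, 12, 4, 12, 8, 3, 7],
    "X3": [6.1, 5.9, 6.0, 6.1, 5, 23, 5.5, 6.1, 4.5, 5.1, 5.5],
    "X4": ["a", "b", "a", "c", "a", "b", "b", "a", "c", "b", "a"],
})

# One-hot encode the string column. Naming the column and fixing the dtype
# is an illustrative choice, not necessarily what the package does.
encoded = pd.get_dummies(X, columns=["X4"], dtype=float)

# Every column is now numeric, so conversion to a float array succeeds.
numeric = encoded.to_numpy(dtype=float)

print(list(encoded.columns))
print(numeric.shape)
```

After this kind of encoding the model sees one indicator column per category rather than a native categorical column, which can behave differently from passing cat_features to CatBoost directly.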