NMVRodrigues closed this issue 10 months ago
Hi Nuno
Thank you very much for your message, and I am sorry about this issue. Currently, the package aims to support multi-class classification with categorical y-labels as well as numerical ones. During testing I tried it with various scikit-learn classifiers and it seems to work fine. Below is a snippet of example code applied to a toy dataset, also using CatBoost, which seems to work for me:
import numpy as np
import pandas as pd
from venn_abers import VennAbersCalibrator
from catboost import CatBoostClassifier
data = {'X1':[7, 6, 5, 2, 5, 7, 3, 7, 2, 1, 5],
'X2':[20, 21, 19, 18, 7, 12, 4, 12, 8, 3, 7],
'X3' : [6.1, 5.9, 6.0, 6.1, 5, 23, 5.5, 6.1, 4.5, 5.1, 5.5],
'y_label': ['M','N', 'F', 'F', 'N', 'F', 'F', 'M', 'N', 'N', 'F']
}
df = pd.DataFrame(data)
X = df.iloc[:, :-1].values
y = df.y_label.values
clf = CatBoostClassifier(verbose=False)
va = VennAbersCalibrator(estimator=clf, inductive=False, n_splits=2, random_state=101)
clf.fit(X,y)
va.fit(X,y)
p_pred = clf.predict_proba(X)
y_pred = clf.predict(X)
p_prime = va.predict_proba(X)
y_prime = va.predict(X, one_hot=False)
Would it perhaps be possible to have a sample of the code you're using to see what the issue may be? Best regards, Ivan
Hi Ivan, thank you for the swift reply!
I'm sorry for not being clear in my previous post. I meant that the issue is with categorical features, not labels. Example:
import numpy as np
import pandas as pd
from venn_abers import VennAbersCalibrator
from catboost import CatBoostClassifier
data = {'X1':[7, 6, 5, 2, 5, 7, 3, 7, 2, 1, 5],
'X2':[20, 21, 19, 18, 7, 12, 4, 12, 8, 3, 7],
'X3' : [6.1, 5.9, 6.0, 6.1, 5, 23, 5.5, 6.1, 4.5, 5.1, 5.5],
'X4' : ['a', 'b', 'a', 'c', 'a', 'b', 'b', 'a', 'c', 'b', 'a'],
'y_label': ['M','N', 'F', 'F', 'N', 'F', 'F', 'M', 'N', 'N', 'F']
}
df = pd.DataFrame(data)
X = df.iloc[:, :-1].values
y = df.y_label.values
clf = CatBoostClassifier(verbose=False, cat_features=[3])
va = VennAbersCalibrator(estimator=clf, inductive=False, n_splits=2, random_state=101)
clf.fit(X,y)
va.fit(X,y)
p_pred = clf.predict_proba(X)
y_pred = clf.predict(X)
p_prime = va.predict_proba(X)
y_prime = va.predict(X, one_hot=False)
Hi Nuno, thank you again for your message. I have published a branch https://github.com/ip200/venn-abers/tree/categorical_support which I hope will solve your issue. Would you mind trying it when you have a moment? In this instance you just need to pass the clf = CatBoostClassifier(verbose=False) to the VennAbersCalibrator, i.e. without the cat_features option. Thanks, Ivan
Hi Ivan, sorry for the delay.
I'm not sure I fully understand what you mean, since it is not possible to fit the CatBoostClassifier without providing cat_features when the dataset includes categorical features.
I also tried simply passing the clf, without fitting it, to the va, which produces the same error as before: a string cannot be converted to a float.
Tried both approaches with the new branch you mentioned.
Did you get working results?
Best, Nuno
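For readers following along, the "could not convert string to float" error mentioned above can be reproduced without either library. The sketch below is only an illustration of the general mechanism (a mixed numeric/string array being coerced to floats); it is an assumption that something similar happens inside the calibration pipeline, since the actual traceback is not included in this thread.

```python
import numpy as np
import pandas as pd

# Small frame with one numeric and one string-valued (categorical) column.
df = pd.DataFrame({
    "X1": [7, 6, 5],
    "X4": ["a", "b", "a"],
})

# .values on mixed columns yields a single object-dtype array.
X = df.values
print(X.dtype)

# Any step that forces the array to float fails on the string entries.
try:
    np.asarray(X, dtype=float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Any component that validates its input as a numeric array would fail in the same way, regardless of whether the underlying estimator can handle categorical columns itself.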
Hi Nuno,
Thank you very much for trying. This is the snippet of code I tried on the new branch:
import pandas as pd
from src.venn_abers import VennAbersCalibrator
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
data = {'X1':[7, 6, 5, 2, 5, 7, 3, 7, 2, 1, 5],
'X2':[20, 21, 19, 18, 7, 12, 4, 12, 8, 3, 7],
'X3' : [6.1, 5.9, 6.0, 6.1, 5, 23, 5.5, 6.1, 4.5, 5.1, 5.5],
'X4' : ['a', 'b', 'a', 'c', 'a', 'b', 'b', 'a', 'c', 'b', 'a'],
'y_label': ['M','N', 'F', 'F', 'N', 'F', 'F', 'M', 'N', 'N', 'F']
}
df = pd.DataFrame(data)
X = df.iloc[:, :-1]
y = df.y_label.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
clf = CatBoostClassifier(verbose=False)
va = VennAbersCalibrator(estimator=clf, inductive=True, cal_size=0.5, random_state=101)
va.fit(X_train, y_train)
p_prime = va.predict_proba(X_test)
y_prime = va.predict(X_test, one_hot=False)
which seems to work fine. Would it be possible to check if it runs for you?
Best,
Ivan
Hi Ivan,
Yes, that works perfectly. I had made a typo when doing git checkout, which is why it wasn't working. Thank you so much for addressing this! Will this change be added to the pip package soon?
Best, Nuno
Same here, I would love to have this added to the pip package soon. I'm using LightGBM as a reference.
Hi, the pip package now contains this latest update.
Please run
pip install venn-abers==1.4.1
Please note that the categorical features are handled using pandas.get_dummies() within VennAbersCalibrator. I hope this is OK. If you encounter any issues, please let me know. Thanks, Ivan
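As a rough illustration of what one-hot encoding with pandas.get_dummies() does to a frame like the toy dataset in this thread, here is a minimal sketch. It uses the default call on a shortened version of the data; the exact arguments VennAbersCalibrator passes internally are not stated here, so treat this as an assumption about the general behaviour rather than the package's implementation.

```python
import pandas as pd

# Shortened version of the toy data, with one categorical column (X4).
df = pd.DataFrame({
    "X1": [7, 6, 5, 2],
    "X3": [6.1, 5.9, 6.0, 6.1],
    "X4": ["a", "b", "a", "c"],
})

# Default get_dummies: numeric columns pass through unchanged,
# string columns are expanded into one indicator column per category.
encoded = pd.get_dummies(df)

print(list(encoded.columns))
print(encoded.shape)
```

One practical consequence is that the estimator sees only numeric indicator columns, so options such as CatBoost's cat_features are not needed for this path, which matches the advice earlier in the thread to pass the classifier without cat_features.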
Hi, currently it's not possible to apply the multi-class VA to estimators/data that contain categorical variables, due to the restrictions of the OneVsOneClassifier. (Unless I'm mistaken: whenever I pass a CatBoost model trained on categorical data, it throws an error in the evaluate function of OneVsOneClassifier, while a fully numerical CatBoost model works fine; trace at the end.) Are there plans to extend VA to include categorical models?
Best regards, Nuno