Open amueller opened 5 years ago
Hey Im pretty new, can I take a look at this and try to fix it?
This should reproduce the issue:
from sklearn.multioutput import ClassifierChain
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
Xs, ys = zip(*(make_classification(random_state=i, n_classes=3, n_informative=3) for i in range(3)))
X = np.hstack(Xs)
Y = np.transpose(ys)
ClassifierChain(LogisticRegression()).fit(X, Y).predict(X)
but actually this is running without complaint for me!!
@amueller, please provide a reproducible snippet??
ah, it's only decision_function
that's broken.
If predict
works, maybe fixing decision_function
is reasonable?
I didn't actually read #9245 in detail: it looks like the predict
part is already done in master?
from sklearn.multioutput import ClassifierChain
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import numpy as np
Xs, ys = zip(*(make_classification(random_state=i, n_classes=3, n_informative=3) for i in range(3)))
X = np.hstack(Xs)
Y = np.transpose(ys)
ClassifierChain(LogisticRegression()).fit(X, Y).decision_function(X)
Maybe we should try fixing decision_function and predict_proba and if that's too hard we should raise attribute errors in the multioutput multiclass case?
In case of raising an attribute error, is it enough to add an if statement, that calls the type_of_target function of Y? Something along the line of:
if type_of_target(Y_decision_chain) == 'multiclass-multioutput': raise AttributeError("decision_function currently does not support multiclass-multioutput as target type")
It shouldn't be hard to impenetrable support I think. type_of_target will be easiest to apply at fit time.
So what do we need to add for this support? As I understood decision_function expects a 2d-Array when calling estimator.decision_function(X_aug) but since in our case its a multiclass-multioutput target type it gets a list of 2d-arrays instead. Is it suitable to just add more collumns to the variable Y_decision_chain and fill it with the collumns from each 2d-array from the list of 2d-arrays we get?
Something like that. You might need to use the classes_ attribute of the underlying estimators to get the columns lined up when cv might result in a class being missing from one training set or another.
So I looked at it again and made some modifications but Im really not sure if I understood it correctly. This is the code that I changed in the decision_function.
decision_function_result = estimator.decision_function(X_aug)
if decision_function_result.shape[1] == 1:
Y_decision_chain[:, chain_idx] = decision_function_result
else:
Y_decision_chain[:, chain_idx] = decision_function_result[:,chain_idx]
Basicly each multi-output classifier in the Classifierchain is responsible for one row in the Y_decision_chain matrix. I threw away my previous idea of adding rows to Y_decision_chain , since it still need to conform the output rule of the decision_function. Is this solving the problem at hand? I'm sorry if it might be wrong or if I didn't understand you correctly. I dont know how else to use the decision_function of each estimator to modify the Y_decision_chain, since each class is being predicted multible times.
Hi, I'm new and looking for an issue to work on. Is this issue still open?
@michellemroy there's currently some work being done in #14654
I am new to the repo , is this issue still open?
Hi, can I work on this? I'm new so can someone guide me?
Hello, Can I use this issue as my first contribution? I'm new so please can someone guide me?
I would like to contribute but we should assign the tickets that are open but appear to be assigned.
If this issue hasn't been resolved already, iId like to contribute! It would be my first open-source contribution
Hi everyone, I'm new to the open source world and I want to contribute, please let me know a easy good first issue to start with. It will be really helpful if someone can guide me.
I am working on this!
Can anyone confirm is this issue still open or not??
As far as I can see, this is now closed - the code aboves still returns an error, but the error looks reasonably intelligible. I would close this issue unless others feel differently.
trying to run classifier chain on a multiclass problem with a reshaped y (which is interpreted as multioutput multiclass) doesn't work (see #9245), which is fine. But the error messages are really bad making it hard to understand what's going on.
We should detect that we're trying to do multioutput multiclass instead of multilabel and give an informative error.