iit-cs579 / main

CS579: Online Social Network Analysis at the Illinois Institute of Technology

Does anybody know what a negative cross-validation accuracy means in a linear regression model? #30

Closed jchen111 closed 9 years ago

jchen111 commented 9 years ago

Does anybody know what a negative cross-validation accuracy means in a linear regression model? We are fitting our data to the sklearn linear regression model and getting a negative accuracy, which really confuses me.

mramire8 commented 9 years ago

Accuracy is a proportion, so a negative value does not make sense as a measure of classifier performance. The range should be [0, 1]. How are you computing the accuracy?

Are you using the sklearn.metrics package to compute the performance? (http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)

Maria E. Ramirez-Loaiza Ph.D. Candidate CS Department - Machine Learning Laboratory Illinois Institute of Technology

jchen111 commented 9 years ago

I'm using http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html. Am I on the right track? I call it like this: cross_val_score(LinearRegression(), X, y, cv=cv)

mramire8 commented 9 years ago

I am guessing your cv is a cross-validation object from one of the available classes, such as KFold. I would try it like this:

classifier = LinearRegression()
scores = cross_val_score(classifier, X, y, cv=cv, scoring='accuracy')

scores should be an array with one value per fold of the cv. If the default scorer of your estimator is not accuracy, then the results you are getting are not that measure.

Try setting the scoring measure explicitly as 'accuracy' and see if that gives you values in the expected range.
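As a side note on the default scorer mentioned above: for scikit-learn regressors such as LinearRegression, the default score is the coefficient of determination R², not accuracy. R² equals 1 minus the ratio of the residual sum of squares to the total sum of squares, so it goes below zero whenever a model predicts worse than simply guessing the mean of y. That is one plausible explanation for a negative cross-validation score. The sketch below computes R² by hand in plain Python; the helper name r2_manual and the toy numbers are made up for illustration.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of y.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data, invented for this example.
y_true = [1.0, 2.0, 3.0, 4.0]
close_preds = [1.1, 1.9, 3.2, 3.8]     # near the targets
reversed_preds = [4.0, 3.0, 2.0, 1.0]  # worse than predicting the mean

r2_close = r2_manual(y_true, close_preds)
r2_reversed = r2_manual(y_true, reversed_preds)
print(r2_close, r2_reversed)
```

The second score is negative because the reversed predictions have a larger squared error than the constant mean predictor, which is not possible for accuracy but is perfectly valid for R².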

jchen111 commented 9 years ago

We followed your suggestion, and the code now looks like this:

def do_cv_linear(X, y, nfolds=10):
    cv = KFold(len(y), nfolds)
    return np.mean(cross_val_score(LinearRegression(), X, y, cv=cv,scoring = 'accuracy'))

and we get this error: ValueError: Can't handle mix of multiclass and continuous
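The same kind of error can be reproduced outside cross-validation, which may help show where it comes from. This is a minimal sketch, assuming a recent scikit-learn: accuracy_score compares class labels, so passing integer labels together with continuous predictions (what a regressor outputs) raises a ValueError. The toy values are invented, and the exact wording of the message differs between versions.

```python
from sklearn.metrics import accuracy_score

# Integer class labels, as a classification metric expects.
y_labels = [0, 1, 2, 1]
# Continuous outputs, like those returned by LinearRegression.predict.
y_continuous = [0.1, 0.9, 2.2, 1.4]

try:
    accuracy_score(y_labels, y_continuous)
    raised = False
except ValueError as err:
    # The metric refuses to compare label targets with continuous predictions.
    raised = True
    print("ValueError:", err)
```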

mramire8 commented 9 years ago

Yes, I see the problem: I misread Linear as Logistic. If you use LinearRegression, you need a scoring measure suited to regression tasks, for example mean squared error. Accuracy is for classification tasks. I am guessing your y vector holds numeric values rather than class labels.

If you are doing regression, use a regression measure such as "mean_absolute_error" or "mean_squared_error" (y is a vector of numeric values). If you are doing classification, use "accuracy" or "f1" (y is a vector of labels), according to what you want to measure.
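A minimal sketch of the regression case, assuming a recent scikit-learn release: there the error-based scorers are named "neg_mean_squared_error" and "neg_mean_absolute_error" (the names quoted above come from older versions), and KFold lives in sklearn.model_selection and takes n_splits. These scorers return negated errors so that larger is always better, which is another reason cross_val_score can legitimately return negative numbers. The synthetic data and variable names below are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (made up): y is continuous, not a label vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Negated MSE per fold: every value is <= 0 by construction.
neg_mse = cross_val_score(LinearRegression(), X, y, cv=cv,
                          scoring="neg_mean_squared_error")
# R^2 per fold: at most 1, and negative only for very poor fits.
r2 = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

mean_mse = -neg_mse.mean()  # flip the sign back to report an ordinary MSE
mean_r2 = r2.mean()
print("mean MSE:", mean_mse)
print("mean R^2:", mean_r2)
```

For a classification task, the same call would use a classifier such as LogisticRegression, a label vector for y, and scoring="accuracy" or "f1".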