scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

coef_ and intercept_ in MultinomialNB is pointless for binary classification #7233

Closed luoq closed 8 years ago

luoq commented 8 years ago

MultinomialNB inherits coef_ and intercept_ from BaseDiscreteNB, defined by


    def _get_coef(self):
        return (self.feature_log_prob_[1:]
                if len(self.classes_) == 2 else self.feature_log_prob_)

    def _get_intercept(self):
        return (self.class_log_prior_[1:]
                if len(self.classes_) == 2 else self.class_log_prior_)

    coef_ = property(_get_coef)
    intercept_ = property(_get_intercept)

For binary classification, in order to be consistent with LinearClassifierMixin, coef_ (intercept_) should be the difference between the second and first rows (elements).

amueller commented 8 years ago

The NBs are not consistent with LinearClassifierMixin IIRC. I don't think I understand your comment on what they should be instead.

luoq commented 8 years ago

The predict_proba method of MultinomialNB is the same as softmax(X @ feature_log_prob_.T + class_log_prior_).

For binary classification in LinearClassifierMixin, coef_ has shape (1, n_features) and intercept_ has length 1; we use p1 = logistic(X @ coef_.T + intercept_) as the probability of class 1 (self.classes_[1]) and 1 - p1 for class 0 (self.classes_[0]).

Since exp(s1)/(exp(s0)+exp(s1)) = 1/(1+exp(-(s1-s0))) = logistic(s1-s0), we could use the difference between the second and first rows (elements) of feature_log_prob_ (class_log_prior_) as coef_ (intercept_), and the predict_proba method becomes the same as in softmax logistic regression. We cannot write predict_proba with only the second row (element).
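A quick self-contained check of this identity on a toy dataset (a sketch; scipy.special.expit is used here as the logistic function):

```python
import numpy as np
from scipy.special import expit  # the logistic function
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X = rng.randint(5, size=(20, 6))
y = np.array([0] * 10 + [1] * 10)
clf = MultinomialNB().fit(X, y)

# Per-class scores: s_k(x) = x @ feature_log_prob_[k] + class_log_prior_[k]
scores = X @ clf.feature_log_prob_.T + clf.class_log_prior_

# softmax over the two scores equals the logistic of their difference
p1_softmax = np.exp(scores[:, 1]) / np.exp(scores).sum(axis=1)
p1_logistic = expit(scores[:, 1] - scores[:, 0])

assert np.allclose(p1_softmax, p1_logistic)
assert np.allclose(p1_softmax, clf.predict_proba(X)[:, 1])
```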

amueller commented 8 years ago

Sorry I still don't know what you mean by "row". Row in what?

And I'm not sure what you are saying. Are you saying the results of MultinomialNB.predict_proba are wrong? If so, can you give a self-contained example of that?

I'm not sure how any of this relates to LinearClassifierMixin. It is not used anywhere in MultinomialNB.

luoq commented 8 years ago

self.feature_log_prob_ is a 2-D array; "row" means a row of self.feature_log_prob_.

predict_proba is correct; it uses feature_log_prob_ and class_log_prior_ directly.

MultinomialNB is a linear classifier, even though it has no direct relation to LinearClassifierMixin in the code. In principle we could reuse the predict methods of softmax logistic regression if we set up coef_ and intercept_ as follows:

    def _get_coef(self):
        return (self.feature_log_prob_[1:]-self.feature_log_prob_[:1]
                if len(self.classes_) == 2 else self.feature_log_prob_)

    def _get_intercept(self):
        return (self.class_log_prior_[1:]-self.class_log_prior_[:1]
                if len(self.classes_) == 2 else self.class_log_prior_)

    coef_ = property(_get_coef)
    intercept_ = property(_get_intercept)

Otherwise, the reason for defining coef_ and intercept_ on MultinomialNB at all is unclear; they do not seem to be used anywhere.
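A quick check (a sketch, on a toy dataset) that the difference-based definitions let the logistic link reproduce predict_proba, whereas feature_log_prob_[1] alone, which is what coef_ currently returns, does not:

```python
import numpy as np
from scipy.special import expit
from sklearn.naive_bayes import MultinomialNB

X = np.random.RandomState(1).randint(5, size=(30, 4))
y = np.array([0] * 15 + [1] * 15)
clf = MultinomialNB().fit(X, y)

# Proposed: differences between the second and first rows/elements
coef = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
intercept = clf.class_log_prior_[1] - clf.class_log_prior_[0]
p1 = expit(X @ coef + intercept)
assert np.allclose(p1, clf.predict_proba(X)[:, 1])

# Current: the second row/element alone does not recover predict_proba
# through the logistic link
p1_row = expit(X @ clf.feature_log_prob_[1] + clf.class_log_prior_[1])
assert not np.allclose(p1_row, clf.predict_proba(X)[:, 1])
```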

luoq commented 8 years ago

Maybe we should remove this code from BaseDiscreteNB; it is quite pointless for BernoulliNB.

Though BernoulliNB also has linear decision boundaries and could be adapted to be consistent with softmax logistic regression (reusing its predict methods), the correct setting of coef_ and intercept_ is quite different.
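For illustration, a sketch (assuming the default binarize=0.0 and already-binary input) of how BernoulliNB's joint log-likelihood is linear in x: log P(x|c) = x . (log p_c - log(1-p_c)) + sum_i log(1-p_ci) + log prior_c, so the would-be coef_ involves both feature_log_prob_ and log(1-p), not a row of feature_log_prob_ alone:

```python
import numpy as np
from scipy.special import expit
from sklearn.naive_bayes import BernoulliNB

X = np.random.RandomState(2).randint(2, size=(30, 5))
y = np.array([0] * 15 + [1] * 15)
clf = BernoulliNB().fit(X, y)

log_p = clf.feature_log_prob_        # log P(x_i = 1 | c), shape (2, 5)
log_q = np.log1p(-np.exp(log_p))     # log P(x_i = 0 | c)

# Linear form: s_c(x) = x @ (log_p - log_q)_c + sum(log_q_c) + log prior_c
W = log_p - log_q
b = log_q.sum(axis=1) + clf.class_log_prior_
scores = X @ W.T + b

# The binary case again reduces to the logistic of the row difference
p1 = expit(scores[:, 1] - scores[:, 0])
assert np.allclose(p1, clf.predict_proba(X)[:, 1])
```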

luoq commented 8 years ago

This note in MultinomialNB may clarify my point about naive Bayes as a linear classifier:

    Notes
    -----
    For the rationale behind the names `coef_` and `intercept_`, i.e.
    naive Bayes as a linear classifier, see J. Rennie et al. (2003),
    Tackling the poor assumptions of naive Bayes text classifiers, ICML.
luoq commented 8 years ago

self.coef_ is only used at line 507:

        elif n_features != self.coef_.shape[1]:
            msg = "Number of features %d does not match previous data %d."
            raise ValueError(msg % (n_features, self.coef_.shape[-1]))

We can use feature_log_prob_ instead.

luoq commented 8 years ago

These changes were introduced in de16cb49. Note the comment:

    # XXX The following is a stopgap measure; we need to set the dimensions
    # of class_log_prior_ and feature_log_prob_ correctly.

It seems they have been left there ever since.

The reason for changing the dimensions for binary classification is unclear. Maybe to mimic coef_ in LogisticRegression?

luoq commented 8 years ago

We could just remove coef_ and intercept_ from BaseDiscreteNB and forget about treating "naive Bayes as a linear classifier".

Or we could fix coef_ and intercept_ for MultinomialNB and write correct coef_ and intercept_ for BernoulliNB. That would be useful at times.

amueller commented 8 years ago

Ok, so you're talking about this: #2237

amueller commented 8 years ago

The problem is that changing behavior like this might break people's code...

luoq commented 8 years ago

Duplicate of #2237.