smartcorelib / smartcore

A comprehensive library for machine learning and numerical computing. The library provides a set of tools for linear algebra, numerical computing, optimization, and enables a generic, powerful yet still efficient approach to machine learning.
https://smartcorelib.org/
Apache License 2.0

Prediction of GaussianNB panics in some conditions #245

Open chriamue opened 1 year ago

chriamue commented 1 year ago

I'm submitting a bug report.

Current Behaviour:

Prediction of GaussianNB panics when the model is trained on very few samples.

Expected Behaviour:

The assertion on `y_hat` may fail, but `predict` should not panic.

Steps to reproduce:

    #[test]
    fn run_gaussian_naive_bayes_with_few_samples() {
        let x = DenseMatrix::<f64>::from_2d_array(&[
            &[-1., -1.],
            &[-2., -1.],
            &[-3., -2.],
            &[1., 1.],
        ]);
        let y: Vec<u32> = vec![1, 1, 1, 2];

        let gnb = GaussianNB::fit(&x, &y, Default::default()).unwrap();

        assert_eq!(gnb.classes(), &[1, 2]);
        let y_hat = gnb.predict(&x).unwrap();
        assert_eq!(y_hat, y);
    }

Snapshot:

thread 'classifier::bayes::tests::test_detector' panicked at 'called `Option::unwrap()` on a `None` value', /Users/chriamue/.cargo/registry/src/github.com-1ecc6299db9ec823/smartcore-0.3.0/src/naive_bayes/mod.rs:109:67

https://github.com/smartcorelib/smartcore/blob/development/src/naive_bayes/mod.rs#L109

.map(|(class_index, class)| {
    (
        class,
        self.distribution.log_likelihood(class_index, &row)
            + self.distribution.prior(class_index).ln(),
    )
})
.max_by(|(_, p1), (_, p2)| p1.partial_cmp(p2).unwrap())

`p1` or `p2` is NaN, so `partial_cmp` returns `None`, which panics on `unwrap()`.

Do you want to work on this issue?

A solution would be to check the `log_likelihood` for NaN and return 0 instead, but I am not sure this is mathematically correct. https://github.com/smartcorelib/smartcore/blob/development/src/naive_bayes/gaussian.rs#L77

    fn log_likelihood<'a>(&self, class_index: usize, j: &'a Box<dyn ArrayView1<X> + 'a>) -> f64 {
        let mut likelihood = 0f64;
        for feature in 0..j.shape() {
            let value = X::to_f64(j.get(feature)).unwrap();
            let mean = self.theta[class_index][feature];
            let variance = self.var[class_index][feature];
            likelihood += self.calculate_log_probability(value, mean, variance);
        }
        if likelihood.is_nan() {
            0f64
        } else {
            likelihood
        }
    }

Another solution is to check for NaN in predict function: https://github.com/smartcorelib/smartcore/blob/development/src/naive_bayes/mod.rs#L102

.map(|(class_index, class)| {
    let mut log_likelihood = self.distribution.log_likelihood(class_index, &row);
    if log_likelihood.is_nan() {
        log_likelihood = 0.0;
    }
    (
        class,
        log_likelihood + self.distribution.prior(class_index).ln(),
    )
})
Mec-iS commented 1 year ago

thanks for finding this 👍

Mec-iS commented 1 year ago

I think the right approach would be to return an error if `log_likelihood` is NaN. Let me check.