rust-ml / linfa

A Rust machine learning framework.

Discrepancy in GMM covariance between sklearn and linfa #321

Closed · jamiesandenr closed this issue 1 year ago

jamiesandenr commented 1 year ago

There is a significant discrepancy between the covariances produced by Gaussian mixture model fitting in linfa-clustering and in scikit-learn.

To reproduce

OS: Windows 10 Business, version 21H2, build 19044.3086

Python

Environment:

# Python 3.10.10
numpy==1.24.2
scikit-learn==1.2.2

Code:

import numpy as np
from sklearn.mixture import GaussianMixture

SEED = 123
TOL = 1e-4
MAX_ITER = 500
COMPONENT_COUNT = 3
N_RUNS = 1

# Three 1-D clusters of 500 samples each, standard deviation 0.5.
# Note: this draw is not seeded; SEED below only fixes the GMM initialisation.
data_0 = np.random.normal(loc=0, scale=0.5, size=500)
data_1 = np.random.normal(loc=1, scale=0.5, size=500)
data_2 = np.random.normal(loc=2, scale=0.5, size=500)
data = np.concatenate([data_0, data_1, data_2], axis=0)

res = GaussianMixture(
    n_components=COMPONENT_COUNT,
    covariance_type="full",
    n_init=N_RUNS,
    tol=TOL,
    random_state=SEED,
    max_iter=MAX_ITER,
).fit(data.reshape(-1, 1))  # sklearn expects 2-D (n_samples, n_features) input

print("Means")
print(res.means_)
print("Weights")
print(res.weights_)
print("Covariances")
print(res.covariances_)

Output:

Means
[[2.03128121]
 [0.0749942 ]
 [1.1236147 ]]
Weights
[0.29105623 0.38220574 0.32673803]
Covariances
[[[0.24430291]]

 [[0.2784773 ]]

 [[0.25548545]]]

Rust

Environment: Cargo.lock.txt and Cargo.toml.txt (attached)

main.rs

use linfa::prelude::Fit;
use ndarray_rand::rand::SeedableRng;
use ndarray_rand::rand_distr::Normal;
use ndarray_rand::RandomExt;

const SEED: u64 = 123;
const TOL: f64 = 1e-4;
const MAX_ITER: u64 = 500;
const COMPONENT_COUNT: usize = 3;
const N_RUNS: u64 = 1;

fn main() {
    let rng = rand_xoshiro::Xoshiro256Plus::seed_from_u64(SEED);

    // The same three clusters: 500 samples each, standard deviation 0.5.
    // As in the Python version, this draw is not seeded; `rng` only seeds the GMM.
    let data_0 = ndarray::Array::random((500,), Normal::new(0., 0.5).unwrap());
    let data_1 = ndarray::Array::random((500,), Normal::new(1., 0.5).unwrap());
    let data_2 = ndarray::Array::random((500,), Normal::new(2., 0.5).unwrap());
    let data = ndarray::concatenate![ndarray::Axis(0), data_0, data_1, data_2];

    // linfa expects 2-D records, so add a feature axis; `insert_axis`
    // already returns an owned array, so `to_owned` is unnecessary.
    let data_2d = data.insert_axis(ndarray::Axis(1));
    let dataset = linfa::DatasetBase::from(data_2d);

    let res = linfa_clustering::GaussianMixtureModel::params(COMPONENT_COUNT)
        .n_runs(N_RUNS)
        .tolerance(TOL)
        .with_rng(rng)
        .max_n_iterations(MAX_ITER)
        .fit(&dataset)
        .expect("");

    println!("Means");
    println!("{}", res.means());
    println!("Weights");
    println!("{}", res.weights());
    println!("Covariances");
    println!("{}", res.covariances());
}

Output:

Means
[[1.0975859660989096],
 [2.0599598699542603],
 [-0.008753318474694901]]
Weights
[0.34407126094761353, 0.3052086360074772, 0.3507201030449094]
Covariances
[[[0.10158568331328428]],

 [[0.15278259627990826]],

 [[0.15664447153777117]]]

Expected behaviour

Means, weights and covariances are similar in both implementations. The two programs use different RNG streams, so exact equality is not expected, but the fitted parameters should agree to within sampling noise.

Actual behaviour

Means and weights are similar in both implementations, but linfa's covariances (0.102–0.157) are roughly half to two-thirds of scikit-learn's (0.244–0.278). I have observed this consistently across reruns; see the sanity check below.
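
For reference, each cluster is drawn with a standard deviation of 0.5, so the true within-component variance is 0.5² = 0.25. scikit-learn's fitted covariances sit near that value, while linfa's fall well below it. One quick way to confirm the baseline is to compute the sample variance of each generated cluster directly (a sketch reusing the ndarray-rand setup from the repro; variable names are illustrative):

use ndarray_rand::rand_distr::Normal;
use ndarray_rand::RandomExt;

fn main() {
    // The sample variance of each cluster should land near 0.5^2 = 0.25.
    for loc in [0.0, 1.0, 2.0] {
        let cluster = ndarray::Array::random((500,), Normal::new(loc, 0.5).unwrap());
        let mean = cluster.mean().unwrap();
        let var = cluster.mapv(|x| (x - mean).powi(2)).mean().unwrap();
        println!("loc = {}: sample variance = {:.3}", loc, var);
    }
}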

relf commented 1 year ago

Hi, thanks for the detailed report and good catch! The covariances were not updated during the iterative process. I've just opened a PR to fix this issue.
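
For anyone following along: in EM, the M-step must re-estimate each component's covariance from the current responsibilities on every iteration, alongside the means and weights. A minimal 1-D sketch of that update (illustrative code, not the actual PR; `resp` and `update_covariances` are hypothetical names):

use ndarray::{Array1, Array2, Axis};

// M-step covariance update for a 1-D GMM:
// cov_k = sum_n resp[n, k] * (x_n - mean_k)^2 / sum_n resp[n, k]
fn update_covariances(
    data: &Array1<f64>,
    resp: &Array2<f64>,
    means: &Array1<f64>,
) -> Array1<f64> {
    let nk = resp.sum_axis(Axis(0)); // effective sample count per component
    let mut cov = Array1::<f64>::zeros(means.len());
    for k in 0..means.len() {
        let mut acc = 0.0;
        for (n, &x) in data.iter().enumerate() {
            let d = x - means[k];
            acc += resp[[n, k]] * d * d;
        }
        cov[k] = acc / nk[k];
    }
    cov
}

If this step is skipped, the covariances keep whatever values the initialisation produced and never converge toward the data's true spread, which would explain the symptom reported above.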