Open anntzer opened 2 years ago
So... did you try a higher value of support_fraction? I suppose that should have resolved the problem.
Yes, but that's not really the point of the bug report: either this indicates an internal assertion error (which should not happen), or the message should be reworded to remove the "this should not happen", as explained in the original report.
After investigating, this is indeed a problem with the implementation, due to a failure to detect rank-deficient matrices (exp(-46), the reported bad determinant value, is close to floating point slop). The original article (Rousseeuw & Van Driessen 1999) actually has a whole section devoted to this case (section 5, "Exact fit situations"). I think it would indeed make sense to report the singular covariance matrix in this case: all the inliers lie within a sub-plane, but it is still useful to report what that plane is.
From a quick check, this can be verified by computing np.linalg.matrix_rank(covariance) (which takes floating point slop into account) just before the warning in _c_step(). Also, as of HEAD, the above repro doesn't work anymore (due to changes in shuffling) but the following one still does:
import numpy as np
from sklearn.covariance import MinCovDet

np.random.seed(1)
# Highly degenerate data: many repeats of 6 different points in the plane.
xs = np.concatenate([[np.random.rand(2) + (0, i)] * c for i, c in enumerate([20, 20, 20, 4, 4, 3])])
np.random.shuffle(xs)
mcd = MinCovDet(random_state=1).fit(xs)
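The rank check mentioned above can be illustrated on a toy case. This is only a minimal sketch, not the scikit-learn code: the data (nine points on the line y = 2x + 1) and the helper name is_rank_deficient are assumptions made for illustration. It shows np.linalg.matrix_rank flagging a covariance matrix whose points all lie in a sub-plane, without relying on the value of the determinant.

```python
import numpy as np


def is_rank_deficient(covariance):
    """Hypothetical helper (not part of scikit-learn).

    Returns True when the covariance matrix has fewer independent
    directions than features. matrix_rank uses an SVD with a tolerance,
    so rounding noise is not counted as an extra dimension.
    """
    n_features = covariance.shape[0]
    return bool(np.linalg.matrix_rank(covariance) < n_features)


# Assumed toy data: nine points lying exactly on the line y = 2x + 1.
x = np.arange(9, dtype=float)
points = np.column_stack([x, 2 * x + 1])

# Rows are observations, columns are features, hence rowvar=False.
covariance = np.cov(points, rowvar=False)

print("rank:", np.linalg.matrix_rank(covariance))
print("rank deficient:", is_rank_deficient(covariance))
```

A check of this kind, placed before the determinant comparison, would let the degenerate case be reported as such instead of as something that "should not happen".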
Describe the bug
Fitting the dataset given in the example below prints the following RuntimeWarning:
RuntimeWarning: Determinant has increased; this should not happen: log(det) > log(previous_det) (-6.080855240649472 > -46.216908559817156). You may want to try with a higher value of support_fraction (current value: 0.520).
It is not clear whether this is effectively an assertion error ("this should not happen"), i.e. a bug in the implementation, or a problem with the dataset (perhaps it's too degenerate for the fitter). The latter would be OK too, but the message's wording is ambiguous.
Steps/Code to Reproduce
Expected Results
No warning (or a clearer one, e.g. "the dataset is degenerate and cannot be fitted with the current settings; please increase support_fraction").
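As a rough sketch of what a clearer warning could look like, the snippet below checks the rank of a covariance matrix and emits an explicit message when it is singular. The function name warn_if_degenerate and the message wording are hypothetical and not taken from scikit-learn; only np.linalg.matrix_rank and warnings.warn are real APIs here.

```python
import warnings

import numpy as np


def warn_if_degenerate(covariance, support_fraction=None):
    """Hypothetical check (not scikit-learn API).

    Emits a RuntimeWarning when the covariance matrix is rank-deficient
    and returns the detected rank.
    """
    n_features = covariance.shape[0]
    rank = int(np.linalg.matrix_rank(covariance))
    if rank < n_features:
        message = (
            f"The covariance matrix of the support data is singular "
            f"(rank {rank} for {n_features} features): the dataset is "
            f"degenerate and cannot be fitted with the current settings; "
            f"please increase support_fraction"
        )
        if support_fraction is not None:
            message += f" (current value: {support_fraction:.3f})"
        warnings.warn(message + ".", RuntimeWarning)
    return rank


# Assumed example input: a singular 2x2 covariance matrix.
singular = np.array([[1.0, 2.0], [2.0, 4.0]])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    detected_rank = warn_if_degenerate(singular, support_fraction=0.52)

print(detected_rank, [str(w.message) for w in caught])
```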
Actual Results
RuntimeWarning: Determinant has increased; this should not happen: log(det) > log(previous_det) (-6.080855240649472 > -46.216908559817156). You may want to try with a higher value of support_fraction (current value: 0.520).
Versions