I've also come across the log(0) error in _assess_dimension_ (sklearn/decomposition/pca) in the wild. I have n_samples = 1017, n_features = 119, and the explained_variance drops off a cliff.
My hunch is that we could test in _assess_dimension_ for spectrum[i] - spectrum[j] < sys.float_info.epsilon and break out of the pa-gathering loops. I'd be happy to make a pull request along those lines, but I'm not sure that is the right thing to do.
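Concretely, something like this sketch (a hypothetical helper, mirroring the pa loop shown in the traceback later in this thread; the early break is the proposed change, not the current sklearn code):

```python
import sys
from math import log

def _pa_with_break(spectrum, spectrum_, rank, n_samples):
    """Sketch of the proposal: accumulate pa as _assess_dimension_ does,
    but stop as soon as two eigenvalues are numerically indistinguishable,
    so log() never sees a zero."""
    pa = 0.0
    for i in range(rank):
        for j in range(i + 1, len(spectrum)):
            if spectrum[i] - spectrum[j] < sys.float_info.epsilon:
                break  # identical up to float precision: log(0) ahead
            pa += log((spectrum[i] - spectrum[j]) *
                      (1. / spectrum_[j] - 1. / spectrum_[i])) + log(n_samples)
    return pa
```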
Why would you test for this difference? You could have (theoretically) two large but identical eigenvalues, right? Why not test if spectrum[i] < eps and then break?
Also, if v is 0, don't we just want to return -np.inf?
What is a sensible tolerance for v ~= 0.0? A pure equality test v == 0.0 will almost never be True.
Using:

```python
if sys.float_info.epsilon > abs(v):
    return -np.inf
```

gives this, which I think looks OK.
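For context, roughly where that guard would sit (a sketch, assuming the v and pv definitions used by _assess_dimension_; returning -np.inf from this hypothetical helper stands in for _assess_dimension_ returning early):

```python
import sys
import numpy as np

def _pv_with_guard(spectrum, rank, n_samples, n_features):
    # v is the average of the trailing eigenvalues; log(v) is the term
    # that raises "math domain error" when the tail is numerically zero.
    v = np.sum(spectrum[rank:]) / (n_features - rank)
    if sys.float_info.epsilon > abs(v):
        return -np.inf  # degenerate tail: treat this rank as impossible
    return -np.log(v) * n_samples * (n_features - rank) / 2.0
```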
If we do not test for v but only break if spectrum[i] < eps, we end up with a huge pa and hence ll.
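A small numeric illustration of that (hypothetical eigenvalues, both far above machine epsilon, so a spectrum[i] < eps break never fires):

```python
from math import log

# Two nearly identical eigenvalues, both well above float epsilon:
s_i = 1e-12 * (1 + 1e-10)
s_j = 1e-12
n_samples = 1017

term = log((s_i - s_j) * (1. / s_j - 1. / s_i)) + log(n_samples)
print(term)  # ~ -39: a single pair already contributes a large value
# Summed over many such pairs, pa becomes hugely negative, so
# ll = ... - pa / 2. becomes spuriously large.
```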
I forked, branched, and updated, including a couple of tests.
Please open a PR for people to review your code. What data did you run your tests on?
I haven't been able to construct a synthetic pathological X, and I am unable to provide the data from the wild. Since _assess_dimension_ and _infer_dimension_ are imported in test_pca, I thought I could just provide a small pathological spectrum directly in the test function.
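A sketch of what I had in mind (the import path is the one referenced in this thread, and the expected -np.inf encodes the proposed fix, so this would only pass on a patched build):

```python
import numpy as np
from sklearn.decomposition.pca import _assess_dimension_

def test_assess_dimension_pathological_spectrum():
    # Exactly zero trailing eigenvalues make v == 0, which currently
    # raises "ValueError: math domain error" in log(v).
    spectrum = np.array([1.0, 0.5, 0.0, 0.0])
    ll = _assess_dimension_(spectrum, rank=2, n_samples=1017, n_features=4)
    # Proposed behavior: a degenerate tail yields -inf instead of raising.
    assert ll == -np.inf
```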
Hi all, I get this error in _assess_dimension_(spectrum, rank, n_samples, n_features):

```
     85         for j in range(i + 1, len(spectrum)):
     86             pa += log((spectrum[i] - spectrum[j]) *
---> 87                        (1. / spectrum_[j] - 1. / spectrum_[i])) + log(n_samples)
     88
     89     ll = pu + pl + pv + pp - pa / 2. - rank * log(n_samples) / 2.

ValueError: math domain error
```
The settings of PCA are n_components='mle' and svd_solver='full'. Is this the same error as reported here?
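For what it's worth, a minimal sketch of those settings on synthetic low-rank data (hypothetical data; as noted later in this thread, a reliable synthetic reproducer was never found, so this may or may not trigger the error):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Rank-10 data with 119 features: the trailing eigenvalues collapse into
# numerical noise, the regime where the error has been reported.
X = rng.randn(1017, 10) @ rng.randn(10, 119)

pca = PCA(n_components='mle', svd_solver='full')
pca.fit(X)  # on affected versions this can raise "ValueError: math domain error"
```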
Is the bug now fixed?
I'm having the same kind of "L" shape, which makes me think the PCA did not go well (though without an error), without specifying any n_components. (sklearn==0.21.1)
There is a PR to fix it at #10359, which appears stalled. Help is welcome to finish it up.
We have found several problems in the implementation of the method to automatically tune the number of components of the PCA algorithms (among them, the for i in range(rank) loop). Possible solutions:
I have no idea for 2. We had the problem here with very small eigenvalues (within numerical noise) which were totally identical. I never managed to create a synthetic dataset that reproduces the problem, since even with symmetric datasets there is always a small difference (on the order of numerical precision) between theoretically identical eigenvalues.
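A quick illustration of that last point (a hypothetical construction: a covariance matrix with exactly repeated eigenvalues, recovered numerically):

```python
import numpy as np

rng = np.random.RandomState(0)
# Build a covariance with an exactly repeated eigenvalue, rotated by a
# random orthogonal matrix.
Q, _ = np.linalg.qr(rng.randn(5, 5))
C = Q @ np.diag([3.0, 1.0, 1.0, 1.0, 0.5]) @ Q.T

eig = np.sort(np.linalg.eigvalsh(C))[::-1]
print(eig - np.array([3.0, 1.0, 1.0, 1.0, 0.5]))
# Differences are on the order of machine precision: theoretically
# identical eigenvalues are almost never exactly equal in float64, so an
# exact-equality test would essentially never fire.
```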