I've also come across the log(0) error in _assess_dimension_ (sklearn/decomposition/pca) in the wild. I have n_samples = 1017, n_features = 119, and the explained_variance drops off a cliff.
My hunch is that we could test in _assess_dimension_ for spectrum[i] - spectrum[j] < sys.float_info.epsilon and break out of the pa-gathering loops. I'd be happy to make a pull request along those lines, but I'm not sure that is the right thing to do.
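Concretely, something like this sketch (a hypothetical helper, mirroring the pa loop shown in the traceback later in this thread; the early break is the proposed change, not the current sklearn code):

```python
import sys
from math import log

def _pa_with_break(spectrum, spectrum_, rank, n_samples):
    """Sketch of the proposal: accumulate pa as _assess_dimension_ does,
    but stop as soon as two eigenvalues are numerically indistinguishable,
    so log() never sees a zero."""
    pa = 0.0
    for i in range(rank):
        for j in range(i + 1, len(spectrum)):
            if spectrum[i] - spectrum[j] < sys.float_info.epsilon:
                break  # identical up to float precision: log(0) ahead
            pa += log((spectrum[i] - spectrum[j]) *
                      (1. / spectrum_[j] - 1. / spectrum_[i])) + log(n_samples)
    return pa
```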
Why would you test for this difference? You could have (theoretically) two large but identical eigenvalues, right? Why not test if spectrum[i] < eps and then break?
Also, if v is 0, don't we just want to return -np.inf?
What is a sensible tolerance for v ~= 0.0? A pure equality test v == 0.0 will almost never be True.
Using:

```python
if sys.float_info.epsilon > abs(v):
    return -np.inf
```

gives this, which I think looks OK.
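For context, roughly where that guard would sit (a sketch, assuming the v and pv definitions used by _assess_dimension_; returning -np.inf from this hypothetical helper stands in for _assess_dimension_ returning early):

```python
import sys
import numpy as np

def _pv_with_guard(spectrum, rank, n_samples, n_features):
    # v is the average of the trailing eigenvalues; log(v) is the term
    # that raises "math domain error" when the tail is numerically zero.
    v = np.sum(spectrum[rank:]) / (n_features - rank)
    if sys.float_info.epsilon > abs(v):
        return -np.inf  # degenerate tail: treat this rank as impossible
    return -np.log(v) * n_samples * (n_features - rank) / 2.0
```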
If we do not test for v but only break if spectrum[i] < eps, we end up with a huge pa and hence ll.
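A small numeric illustration of that (hypothetical eigenvalues, both far above machine epsilon, so a spectrum[i] < eps break never fires):

```python
from math import log

# Two nearly identical eigenvalues, both well above float epsilon:
s_i = 1e-12 * (1 + 1e-10)
s_j = 1e-12
n_samples = 1017

term = log((s_i - s_j) * (1. / s_j - 1. / s_i)) + log(n_samples)
print(term)  # ~ -39: a single pair already contributes a large value
# Summed over many such pairs, pa becomes hugely negative, so
# ll = ... - pa / 2. becomes spuriously large.
```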
I forked, branched, and updated, including a couple of tests.
Please open a PR for people to review your code. What data did you run your tests on?
I haven't been able to construct a synthetic pathological X, and I am unable to provide the data from the wild. Since _assess_dimension_ and _infer_dimension_ are imported in test_pca, I thought I could just provide a small pathological spectrum directly in the test function.
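A sketch of what I had in mind (the import path is the one referenced in this thread, and the expected -np.inf encodes the proposed fix, so this would only pass on a patched build):

```python
import numpy as np
from sklearn.decomposition.pca import _assess_dimension_

def test_assess_dimension_pathological_spectrum():
    # Exactly zero trailing eigenvalues make v == 0, which currently
    # raises "ValueError: math domain error" in log(v).
    spectrum = np.array([1.0, 0.5, 0.0, 0.0])
    ll = _assess_dimension_(spectrum, rank=2, n_samples=1017, n_features=4)
    # Proposed behavior: a degenerate tail yields -inf instead of raising.
    assert ll == -np.inf
```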
Hi all, I get this error in _assess_dimension_(spectrum, rank, n_samples, n_features):

```
     85         for j in range(i + 1, len(spectrum)):
     86             pa += log((spectrum[i] - spectrum[j]) *
---> 87                        (1. / spectrum_[j] - 1. / spectrum_[i])) + log(n_samples)
     88
     89     ll = pu + pl + pv + pp - pa / 2. - rank * log(n_samples) / 2.

ValueError: math domain error
```
The settings of PCA are n_components='mle' and svd_solver='full'. Is this the same error as reported here?
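For what it's worth, a minimal sketch of those settings on synthetic low-rank data (hypothetical data; as noted later in this thread, a reliable synthetic reproducer was never found, so this may or may not trigger the error):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Rank-10 data with 119 features: the trailing eigenvalues collapse into
# numerical noise, the regime where the error has been reported.
X = rng.randn(1017, 10) @ rng.randn(10, 119)

pca = PCA(n_components='mle', svd_solver='full')
pca.fit(X)  # on affected versions this can raise "ValueError: math domain error"
```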
Is the bug now fixed?
I'm having the same kind of "L" shape, which makes me think the PCA did not go well (though without an error), without specifying any n_components. (sklearn==0.21.1)
There is a PR to fix it at #10359, which appears stalled. Help is welcome to finish it up.
We have found several problems in the implementation of the method to automatically tune the number of components of the PCA algorithms (among them, the for i in range(rank) loop). Possible solutions:
I have no idea for 2. We had the problem here with very small eigenvalues (within numerical noise) which were totally identical. I never managed to create a synthetic dataset that reproduces the problem, since even with symmetric datasets there is always a small difference (on the order of numerical precision) between theoretically identical eigenvalues.
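A quick illustration of that last point (a hypothetical construction: a covariance matrix with exactly repeated eigenvalues, recovered numerically):

```python
import numpy as np

rng = np.random.RandomState(0)
# Build a covariance with an exactly repeated eigenvalue, rotated by a
# random orthogonal matrix.
Q, _ = np.linalg.qr(rng.randn(5, 5))
C = Q @ np.diag([3.0, 1.0, 1.0, 1.0, 0.5]) @ Q.T

eig = np.sort(np.linalg.eigvalsh(C))[::-1]
print(eig - np.array([3.0, 1.0, 1.0, 1.0, 0.5]))
# Differences are on the order of machine precision: theoretically
# identical eigenvalues are almost never exactly equal in float64, so an
# exact-equality test would essentially never fire.
```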