scikit-learn-contrib / scikit-matter

A collection of scikit-learn compatible utilities that implement methods born out of the materials science and chemistry communities
https://scikit-matter.readthedocs.io/en/v0.2.0/
BSD 3-Clause "New" or "Revised" License
70 stars 18 forks source link

Fix/check estimator #196

Closed rosecers closed 1 year ago

rosecers commented 1 year ago

scikit-learn-contrib (our PR is https://github.com/scikit-learn-contrib/scikit-learn-contrib/issues/62) requires that all estimators pass a check_estimators test ala https://scikit-learn.org/stable/modules/generated/sklearn.utils.estimator_checks.parametrize_with_checks.html#sklearn.utils.estimator_checks.parametrize_with_checks, which ours were not currently doing. I've been going through everything that inherits from the BaseEstimator class (or should) and making the required changes. @agoscinski would appreciate your input on the linear_models section, as I'm not sure that OrthogonalRegression can pass the estimators, as to my knowledge the predicted values may have a different shape that the fitted y, no? Please correct me if I'm wrong.


:books: Documentation preview :books:: https://scikit-matter--196.org.readthedocs.build/en/196/

rosecers commented 1 year ago

@agoscinski I am honestly perplexed by some of the testing errors coming up right now (matmul failing for correctly-shaped matrices). Would appreciate thoughts -- @ceriottm @PicoCentauri @Luthaf and others also welcome to weigh in

Running tests/test_metrics.py, I get things like:

ERROR: test_local_reconstruction_error_train_idx (__main__.ReconstructionMeasuresTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rca/source_installs/scikit-matter/tests/test_metrics.py", line 141, in test_local_reconstruction_error_train_idx
    lfre_val = pointwise_local_reconstruction_error(
  File "/Users/rca/miniconda3/lib/python3.10/site-packages/skmatter/metrics/_reconstruction_measures.py", line 456, in pointwise_local_reconstruction_error
    - 2 * X_test @ X_train.T
  File "/Users/rca/miniconda3/lib/python3.10/site-packages/numpy/ma/core.py", line 3077, in __array_wrap__
    m = reduce(mask_or, [getmaskarray(arg) for arg in input_args])
  File "/Users/rca/miniconda3/lib/python3.10/site-packages/numpy/ma/core.py", line 1757, in mask_or
    return make_mask(umath.logical_or(m1, m2), copy=copy, shrink=shrink)
ValueError: operands could not be broadcast together with shapes (15,4) (4,5) 

I've checked, it's within the @ operator, not the *. Also checked that both matrices are the same type.

For the record, this only happens for tests/test_metrics.py and tests/test_sparse_kernel_centerer.py on my end.

agoscinski commented 1 year ago

In the StandartFlexibleScaler np.ma.* is used which transforms the mean to a <class 'numpy.ma.core.MaskedArray'> and thus also the transformed array.c https://github.com/lab-cosmo/scikit-matter/blob/766e3dabc42f26727b484b37fcd696ad64a9c222/src/skmatter/preprocessing/_data.py#L152 Then in the metrics code the @ operation between masks is executed as an elementwise multiplication which results in the error because of not broadcastability of the shapes

Just replace np.ma.* with regular np.*

rosecers commented 1 year ago

Unfortunately, np.average flags an error with scikit learn (scikit learn does not like the name average, even if we rename upon import). Unless there's another function we can use, we might need to make a spoof

On Thu, May 18, 2023, 07:09 Alexander Goscinski @.***> wrote:

In the StandartFlexibleScaler np.ma.* is used which transforms the mean to a <class 'numpy.ma.core.MaskedArray'> and thus also the transformed array.c

https://github.com/lab-cosmo/scikit-matter/blob/766e3dabc42f26727b484b37fcd696ad64a9c222/src/skmatter/preprocessing/_data.py#L152 Then in the metrics code the @ operation between masks is executed as an elementwise multiplication which results in the error because of not broadcastability of the shapes

Just replace np.ma. with regular np.

— Reply to this email directly, view it on GitHub https://github.com/lab-cosmo/scikit-matter/pull/196#issuecomment-1552961173, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALKVP3TLPJJBZPYP2QLHIZDXGYGPBANCNFSM6AAAAAAYFTRLLE . You are receiving this because you authored the thread.Message ID: @.***>

agoscinski commented 1 year ago

I dont fully understand the problem with np.average, could you elaborate. Where does scikit-learn return an error?

rosecers commented 1 year ago

Everything passes! Ready for a proper review with one or two outstanding q's:

rosecers commented 1 year ago

I should note what many of the necessary fixes were:

hurricane642 commented 1 year ago

I have a question here, shouldn't we explicitly add tests with @parametrize_with_checks in our test sets? Or the requirement that it should just be possible to run such tests?

agoscinski commented 1 year ago

technically, the sample selectors inherit from BaseEstimator, but do not follow the estimator behavior to a tee (their transformed objects do not have the same number of rows as their inputs). In order to delay the selector-refactor conversation, I suggest we leave them out of test_check_estimators

If they are okay with it, sure. We can write in the doc also that it is not compatible with Pipeline yet, if this helps convincing them. In the end we need to mark this somehow programmatically. I am not sure how to do this, since scikit-learn Pipeline never checks for the type of the transformers, so changing the base class does not help here. Maybe the contributors have some useful suggestions.

Should orthogonalregression inherit from base estimator? @agoscinski

Yes it is an estimator so it should inherit from it, what is the problem with it? That it does not always agrees with the shape of the input, right? We can make mark it as private class (_OrthogonalRegression), because it is only use for reconstruction measures, then I think we can ommit it for the estimator check. But it basically has the same problems as the sample selection classes, so I would do the apply the same solution.

rosecers commented 1 year ago

I have a question here, shouldn't we explicitly add tests with @parametrize_with_checks in our test sets? Or the requirement that it should just be possible to run such tests?

Either way is fine, but I think having everything in a dedicated script makes the paper trail easier to follow

rosecers commented 1 year ago

@agoscinski and @hurricane642 can you do another formal review?