scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License
60.13k stars 25.41k forks source link

All tests in `check_estimator` are disabled if `X_types` does not include `"2darray"` #14057

Open vnmabus opened 5 years ago

vnmabus commented 5 years ago

Description

All tests in in check_estimator are disabled if X_types does not include "2darray". Some tests such as check_get_params_invariance or check_set_params do not require a fit step, and should be run no matter what.

I am currently building a library that uses its own data types, but tries to be Scikit-learn compatible. Thus, I try to make estimators with the same API as Scikit-learn, but different input and output types. It would be useful to check API conformance using check_estimator with our estimator objects.

amueller commented 5 years ago

What's your input type? We actually would ideally like to be able to fit estimators within check_estimator with more general input data. We have to fix that anyway to run checks on CountVectorizer or DictVectorizer for example.

amueller commented 5 years ago

also see #12491 #6715

vnmabus commented 5 years ago

What's your input type? We actually would ideally like to be able to fit estimators within check_estimator with more general input data. We have to fix that anyway to run checks on CountVectorizer or DictVectorizer for example.

It is a custom object, representing an array of functional observations. We want to be scikit-learn compatible to reuse the pipeline and parameter search machinery, and also because with some transformations (e.g. dimensionality reduction) our data becomes multivariate, so we can use standard scikit-learn estimators after that step in the pipeline.

amueller commented 5 years ago

For using parameter search you need to be able to do cross-validation, so sklearn needs to know how to index it. That works for any iterable but you would need to think about how that works for your data.

amueller commented 5 years ago

You can look at the tests and send a PR to allow running the tests that don't require fit.

jnothman commented 5 years ago

We can certainly benefit from help making the tests and tags fit tighter with the kinds of things that people want to test!

rth commented 5 years ago

I think the following common tests should run for any X_types:

check_parameters_default_constructible
check_no_attributes_set_in_init
check_estimators_pickle (might be difficult if we don't know how to fit the estimator)
check_get_params_invariance
check_set_params
rtavenar commented 4 years ago

Hi there,

Not sure if I should post that here or in #6715 ...

As mentioned by @rth before, we have a similar context in tslearn: our input data (X) are 3d arrays (time series datasets of shape (n_time_series, n_timestamps, n_features)). We are willing to be sklearn-compatible for the same reasons as @vnmabus .

At the moment, our solution consists in monkey-patching sklearn checks in our lib (eg. check_clustering there). One typical issue we are facing is that some of the tests performed by sklearn use hard-coded datasets. As an example, at the moment, check_clustering checks whether the estimator can actually cluster data that are generated by make_blobs and ensure a minimum level of accuracy (adjusted Rand score greater than .4). It could be very beneficial for packages that depend on sklearn to have a way to define the datasets to be used for checking.

jnothman commented 4 years ago

So we were discussing possibly having a data factory optionally associated with each estimator. There are several options for how that might look in practice. Would be interesting to see a pull request.

rtavenar commented 4 years ago

That sounds like a great idea. Is there someone working actively on it? How could I help?

adrinjalali commented 4 years ago

16756 is one of the places we're discussing it. Still a bit in brainstorming stage, but will be working on it soon.

amueller commented 4 years ago

My intention in designing this was to hard-code the types used by sklearn (but that hasn't happened so far). A factory would certainly be more generic, though I'm not entirely sure how that would work in general. Because some tests check thing about very specific samples, say duplicating a sample. Maybe we should check first for which tests this is reasonably possible.

The other question is of course how to provide the factory. The easiest would probably be to provide an argument to check_estimator that contains all the factories needed?

rtavenar commented 4 years ago

Well, that would definitely be a pretty convenient way for our use-case. Should I list the different factories that would be required?

adrinjalali commented 4 years ago

If you could go through all the tests we have and see what we create and see how that compares to what you could generate, would be nice.

rtavenar commented 4 years ago

OK, here is what I can see from estimator_checks.py:

rtavenar commented 4 years ago

OK, so I have provided an overview in the previous post. My conclusion is that most of the checks could be performed using the following datasets:

The following datasets are also useful, should we ask for them too?

NB: if we make extra assumptions, we can run even more checks:

rtavenar commented 4 years ago

Would that be of interest if I opened a PR trying to implement this idea of dataset factory?

If so, should that factory be handled as a tag?

rth commented 4 years ago

Thanks for investigating @rtavenar ! I think a prototype implementation could be useful.

The other question is of course how to provide the factory. The easiest would probably be to provide an argument to check_estimator that contains all the factories needed?

I imagine we could do something like,

  1. add dataset_factory=None parameter to all check_*
  2. define default_dataset_factory(X_types="2darray", multioutput=False) function (parameters to be determined) that would be used by default when dataset_factory==None. X_types value is obtained from estimator tags.
  3. see if we can make it work with existing common checks
  4. Document how to extend this function for dtype other than 2darray
  5. Likely add this dataset factory to check_estimator and parametrized_with_tests (this might interact with https://github.com/scikit-learn/scikit-learn/pull/17361 somewhat).

WDYT?

rtavenar commented 4 years ago

Great ! I ll work on it and come back to you as soon as I have something.