vnmabus opened 5 years ago
What's your input type? We actually would ideally like to be able to fit estimators within check_estimator with more general input data. We have to fix that anyway to run checks on CountVectorizer or DictVectorizer for example.
also see #12491 #6715
> What's your input type? We actually would ideally like to be able to fit estimators within check_estimator with more general input data.
It is a custom object, representing an array of functional observations. We want to be scikit-learn compatible to reuse the pipeline and parameter search machinery, and also because with some transformations (e.g. dimensionality reduction) our data becomes multivariate, so we can use standard scikit-learn estimators after that step in the pipeline.
To use parameter search you need to be able to do cross-validation, so sklearn needs to know how to index your data. That works for any indexable sequence, but you would need to think about how that applies to your data type.
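For example, a minimal sketch of a container that scikit-learn's splitters can work with (FunctionalData is a hypothetical name, not part of any library discussed here): it only needs __len__ and a __getitem__ that accepts the integer index arrays the splitters produce.

```python
import numpy as np
from sklearn.model_selection import KFold

class FunctionalData:
    """Hypothetical container for functional observations."""

    def __init__(self, curves):
        self.curves = np.asarray(curves)  # shape (n_samples, n_points)

    def __len__(self):
        # lets scikit-learn count samples when building CV splits
        return len(self.curves)

    def __getitem__(self, indices):
        # accepts the integer index arrays produced by CV splitters
        return FunctionalData(self.curves[indices])

X = FunctionalData(np.random.rand(10, 50))
for train_idx, test_idx in KFold(n_splits=5).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
```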
You can look at the tests and send a PR to allow running the tests that don't require fit.
We can certainly benefit from help making the tests and tags fit tighter with the kinds of things that people want to test!
I think the following common tests should run for any X_types:
- check_parameters_default_constructible
- check_no_attributes_set_in_init
- check_estimators_pickle (might be difficult if we don't know how to fit the estimator)
- check_get_params_invariance
- check_set_params
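As a stop-gap, those checks can already be invoked by hand; a minimal sketch (assuming the module-level check functions in sklearn.utils.estimator_checks, whose exact signatures may vary across versions):

```python
from sklearn.cluster import KMeans
from sklearn.utils.estimator_checks import (
    check_get_params_invariance,
    check_no_attributes_set_in_init,
    check_set_params,
)

# Each check takes the estimator's name and an instance, and raises on failure.
estimator = KMeans()
for check in (check_no_attributes_set_in_init,
              check_get_params_invariance,
              check_set_params):
    check(type(estimator).__name__, estimator)
```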
Hi there,
Not sure if I should post that here or in #6715 ...
As mentioned by @rth before, we have a similar context in tslearn: our input data (X) are 3d arrays (time series datasets of shape (n_time_series, n_timestamps, n_features)). We want to be sklearn-compatible for the same reasons as @vnmabus.
At the moment, our solution consists in monkey-patching sklearn checks in our lib (e.g. check_clustering). One typical issue we are facing is that some of the tests performed by sklearn use hard-coded datasets. As an example, at the moment, check_clustering checks whether the estimator can actually cluster data generated by make_blobs and ensures a minimum level of accuracy (adjusted Rand score greater than 0.4). It could be very beneficial for packages that depend on sklearn to have a way to define the datasets used for checking.
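For illustration, a rough paraphrase of that pattern (not the exact scikit-learn source): the check itself decides the dataset shape, so a 3d-input estimator fails at fit time, before its clustering quality is even measured.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# check_clustering-style logic: the dataset is hard-coded as a 2d array,
# so an estimator expecting (n_series, n_timestamps, n_features) input
# never gets a chance to demonstrate that it can cluster at all.
X, y_true = make_blobs(n_samples=50, centers=3, random_state=1)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
assert adjusted_rand_score(y_true, labels) > 0.4
```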
So we were discussing possibly having a data factory optionally associated with each estimator. There are several options for how that might look in practice. It would be interesting to see a pull request.
That sounds like a great idea. Is there someone working actively on it? How could I help?
My intention in designing this was to hard-code the types used by sklearn (but that hasn't happened so far). A factory would certainly be more generic, though I'm not entirely sure how that would work in general, because some tests check things about very specific samples, say duplicating a sample. Maybe we should check first for which tests this is reasonably possible.
The other question is of course how to provide the factory. The easiest would probably be to provide an argument to check_estimator that contains all the factories needed?
Well, that would definitely be a pretty convenient way for our use-case. Should I list the different factories that would be required?
If you could go through all the tests we have, see what data we create, and see how that compares to what you could generate, that would be nice.
OK, here is what I can see from estimator_checks.py:

- check_estimators_dtype (casts data to np.int32, should not be required)
- check_fit_score_takes_y, check_sample_weights_list, check_sample_weights_not_an_array, check_sample_weights_shape, check_sample_weights_invariance, check_estimators_fit_returns_self, check_dtype_object, check_pipeline_consistency, check_estimators_overwrite_params, check_estimators_pickle, check_n_features_in, check_dont_overwrite_parameters, check_methods_subset_invariance, check_fit2d_predict1d, check_outliers_fit_predict, check_estimators_unfitted, check_estimators_data_not_an_array, check_clusterer_compute_labels_predict, check_clustering, check_estimators_partial_fit_n_features, check_non_transformer_estimators_n_iter, check_transformer_data_not_an_array, check_transformer_general, check_transformers_unfitted, check_regressor_data_not_an_array, check_regressors_no_decision_function, check_supervised_y_2d, check_supervised_y_no_nan, check_classifier_data_not_an_array, check_classifiers_one_label, check_classifiers_classes, check_classifiers_train
- check_sample_weights_pandas_series (converts X to pd.DataFrame and y to pd.Series)
- check_complex_data (the + operator should be available for any dataset format)?
- check_estimators_empty_data_messages
- check_estimators_nan_inf
- check_nonsquare_error, check_outliers_train
- check_sparsify_coefficients
- check_estimator_sparse_data (converts X to sparse.csr_matrix)
- check_fit_non_negative
- check_fit_idempotent (uses ShuffleSplit)
- check_dict_unchanged (special case for RANSACRegressor)
- check_fit1d
- check_fit2d_1feature
- check_fit2d_1sample
- check_transformer_n_iter (special case for CROSS_DECOMPOSITION, for which the dataset is hard-coded, otherwise uses a standard dataset)
- check_regressors_train
- check_classifiers_regression_target
- check_regressor_multioutput
- check_class_weight_classifiers
- check_classifier_multioutput, check_classifiers_multilabel_representation_invariance
OK, so I have provided an overview in the previous post. My conclusion is that most of the checks could be performed using the following datasets:

The following datasets are also useful, should we ask for them too?

NB: if we make extra assumptions, we can run even more checks:

- if X's type makes it possible to compute X.min() (necessary in many checks for estimators that only take positive input)
- if X[0, 0] is always accessible, then check_estimators_nan_inf is OK
- if X can be involved in operations with complex numbers (e.g. X_ = (1 + 1j) * X), then check_complex_data is OK

Would that be of interest if I opened a PR trying to implement this idea of a dataset factory?
If so, should that factory be handled as a tag?
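To make those extra assumptions concrete, here is a hypothetical container (purely illustrative, not from this thread or any library) exposing the three operations listed above:

```python
import numpy as np

class FunctionalData:
    """Hypothetical non-2darray container with the extra operations."""

    def __init__(self, curves):
        self.curves = np.asarray(curves)  # e.g. shape (n_samples, n_points)

    def min(self):
        # lets checks for positive-input estimators evaluate X.min()
        return self.curves.min()

    def __getitem__(self, key):
        # makes X[0, 0] accessible, as check_estimators_nan_inf would need
        return self.curves[key]

    def __rmul__(self, scalar):
        # supports X_ = (1 + 1j) * X, as check_complex_data would need
        return FunctionalData(scalar * self.curves)
```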
Thanks for investigating @rtavenar! I think a prototype implementation could be useful.
> The other question is of course how to provide the factory. The easiest would probably be to provide an argument to check_estimator that contains all the factories needed?
I imagine we could do something like:

- add a dataset_factory=None parameter to all check_* functions
- add a default_dataset_factory(X_types="2darray", multioutput=False) function (parameters to be determined) that would be used by default when dataset_factory == None; the X_types value is obtained from estimator tags, and initially only 2darray would be supported
- expose this in check_estimator and parametrize_with_checks (this might interact with https://github.com/scikit-learn/scikit-learn/pull/17361 somewhat).

WDYT?
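As a rough illustration of that proposal (names mirror the comment above, but none of this is actual scikit-learn API), a factory could be any callable returning an (X, y) pair:

```python
import numpy as np
from sklearn.datasets import make_blobs

def default_dataset_factory(X_types="2darray", multioutput=False,
                            random_state=0):
    """Hypothetical default: builds the 2d data most checks hard-code today."""
    return make_blobs(n_samples=50, centers=3, random_state=random_state)

def time_series_dataset_factory(X_types="3darray", multioutput=False,
                                random_state=0):
    """Hypothetical downstream factory, e.g. for a library like tslearn."""
    rng = np.random.RandomState(random_state)
    X = rng.rand(50, 16, 1)  # (n_time_series, n_timestamps, n_features)
    y = rng.randint(0, 3, size=50)
    return X, y

# A check would then do something like:
#   X, y = (dataset_factory or default_dataset_factory)()
```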
Great! I'll work on it and come back to you as soon as I have something.
Description

All tests in check_estimator are disabled if X_types does not include "2darray". Some tests such as check_get_params_invariance or check_set_params do not require a fit step, and should be run no matter what.

I am currently building a library that uses its own data types, but tries to be Scikit-learn compatible. Thus, I try to make estimators with the same API as Scikit-learn, but different input and output types. It would be useful to check API conformance using check_estimator with our estimator objects.
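For context, the gating described here happens at the top of the check generator; schematically (paraphrased from estimator_checks.py, not the exact source):

```python
import warnings
from sklearn.exceptions import SkipTestWarning
from sklearn.utils.estimator_checks import check_estimators_dtype

def _yield_all_checks(name, estimator):
    # Schematic paraphrase: every check, fit-free or not, is skipped as
    # soon as the tags advertise input other than a 2d array.
    tags = estimator._get_tags()
    if "2darray" not in tags["X_types"]:
        warnings.warn("Can't test estimator {} which requires input of type {}"
                      .format(name, tags["X_types"]), SkipTestWarning)
        return
    yield check_estimators_dtype  # ... and all the other checks
```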