vnmabus opened 5 years ago
What's your input type? We actually would ideally like to be able to fit estimators within check_estimator with more general input data. We have to fix that anyway to run checks on CountVectorizer or DictVectorizer for example.
also see #12491 #6715
> What's your input type? We actually would ideally like to be able to fit estimators within check_estimator with more general input data.
It is a custom object, representing an array of functional observations. We want to be scikit-learn compatible to reuse the pipeline and parameter search machinery, and also because with some transformations (e.g. dimensionality reduction) our data becomes multivariate, so we can use standard scikit-learn estimators after that step in the pipeline.
To use parameter search you need to be able to do cross-validation, so sklearn needs to know how to index your data. That works for any indexable sequence, but you would need to think about how that applies to your data type.
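For example, a minimal sketch of a container that scikit-learn's splitters can work with (FunctionalData is a hypothetical name, not part of any library discussed here): it only needs __len__ and a __getitem__ that accepts the integer index arrays the splitters produce.

```python
import numpy as np
from sklearn.model_selection import KFold

class FunctionalData:
    """Hypothetical container for functional observations."""

    def __init__(self, curves):
        self.curves = np.asarray(curves)  # shape (n_samples, n_points)

    def __len__(self):
        # lets scikit-learn count samples when building CV splits
        return len(self.curves)

    def __getitem__(self, indices):
        # accepts the integer index arrays produced by CV splitters
        return FunctionalData(self.curves[indices])

X = FunctionalData(np.random.rand(10, 50))
for train_idx, test_idx in KFold(n_splits=5).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
```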
You can look at the tests and send a PR to allow running the tests that don't require fit.
We can certainly benefit from help making the tests and tags fit tighter with the kinds of things that people want to test!
I think the following common tests should run for any X_types:
- check_parameters_default_constructible
- check_no_attributes_set_in_init
- check_estimators_pickle (might be difficult if we don't know how to fit the estimator)
- check_get_params_invariance
- check_set_params
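As a stop-gap, those checks can already be invoked by hand; a minimal sketch (assuming the module-level check functions in sklearn.utils.estimator_checks, whose exact signatures may vary across versions):

```python
from sklearn.cluster import KMeans
from sklearn.utils.estimator_checks import (
    check_get_params_invariance,
    check_no_attributes_set_in_init,
    check_set_params,
)

# Each check takes the estimator's name and an instance, and raises on failure.
estimator = KMeans()
for check in (check_no_attributes_set_in_init,
              check_get_params_invariance,
              check_set_params):
    check(type(estimator).__name__, estimator)
```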
Hi there,
Not sure if I should post that here or in #6715 ...
As mentioned by @rth before, we have a similar context in tslearn: our input data (X) are 3d arrays (time series datasets of shape (n_time_series, n_timestamps, n_features)). We want to be sklearn-compatible for the same reasons as @vnmabus.
At the moment, our solution consists in monkey-patching sklearn checks in our lib (e.g. check_clustering). One typical issue we are facing is that some of the tests performed by sklearn use hard-coded datasets. As an example, at the moment, check_clustering checks whether the estimator can actually cluster data generated by make_blobs and ensures a minimum level of accuracy (adjusted Rand score greater than 0.4). It could be very beneficial for packages that depend on sklearn to have a way to define the datasets used for checking.
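For illustration, a rough paraphrase of that pattern (not the exact scikit-learn source): the check itself decides the dataset shape, so a 3d-input estimator fails at fit time, before its clustering quality is even measured.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# check_clustering-style logic: the dataset is hard-coded as a 2d array,
# so an estimator expecting (n_series, n_timestamps, n_features) input
# never gets a chance to demonstrate that it can cluster at all.
X, y_true = make_blobs(n_samples=50, centers=3, random_state=1)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
assert adjusted_rand_score(y_true, labels) > 0.4
```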
So we were discussing possibly having a data factory optionally associated with each estimator. There are several options for how that might look in practice. It would be interesting to see a pull request.
That sounds like a great idea. Is there someone working actively on it? How could I help?
My intention in designing this was to hard-code the types used by sklearn (but that hasn't happened so far). A factory would certainly be more generic, though I'm not entirely sure how that would work in general, because some tests check things about very specific samples, say duplicating a sample. Maybe we should check first for which tests this is reasonably possible.
The other question is of course how to provide the factory. The easiest would probably be to provide an argument to check_estimator that contains all the factories needed?
Well, that would definitely be a pretty convenient way for our use-case. Should I list the different factories that would be required?
If you could go through all the tests we have, see what data we create, and see how that compares to what you could generate, that would be nice.
OK, here is what I can see from estimator_checks.py:

- check_estimators_dtype (casts data to np.int32, should not be required)
- check_fit_score_takes_y, check_sample_weights_list, check_sample_weights_not_an_array, check_sample_weights_shape, check_sample_weights_invariance, check_estimators_fit_returns_self, check_dtype_object, check_pipeline_consistency, check_estimators_overwrite_params, check_estimators_pickle, check_n_features_in, check_dont_overwrite_parameters, check_methods_subset_invariance, check_fit2d_predict1d, check_outliers_fit_predict, check_estimators_unfitted, check_estimators_data_not_an_array, check_clusterer_compute_labels_predict, check_clustering, check_estimators_partial_fit_n_features, check_non_transformer_estimators_n_iter, check_transformer_data_not_an_array, check_transformer_general, check_transformers_unfitted, check_regressor_data_not_an_array, check_regressors_no_decision_function, check_supervised_y_2d, check_supervised_y_no_nan, check_classifier_data_not_an_array, check_classifiers_one_label, check_classifiers_classes, check_classifiers_train
- check_sample_weights_pandas_series (converts X to pd.DataFrame and y to pd.Series)
- check_complex_data (the + operator should be available for any dataset format)?
- check_estimators_empty_data_messages
- check_estimators_nan_inf
- check_nonsquare_error, check_outliers_train
- check_sparsify_coefficients
- check_estimator_sparse_data (converts X to sparse.csr_matrix)
- check_fit_non_negative
- check_fit_idempotent (uses ShuffleSplit)
- check_dict_unchanged (special case for RANSACRegressor)
- check_fit1d
- check_fit2d_1feature
- check_fit2d_1sample
- check_transformer_n_iter (special case for CROSS_DECOMPOSITION, for which the dataset is hard-coded, otherwise uses a standard dataset)
- check_regressors_train
- check_classifiers_regression_target
- check_regressor_multioutput
- check_class_weight_classifiers
- check_classifier_multioutput, check_classifiers_multilabel_representation_invariance
OK, so I have provided an overview in the previous post. My conclusion is that most of the checks could be performed using the following datasets:

The following datasets are also useful, should we ask for them too?

NB: if we make extra assumptions, we can run even more checks:

- if X's type makes it possible to compute X.min() (necessary in many checks for estimators that only take positive input)
- if X[0, 0] is always accessible, then check_estimators_nan_inf is OK
- if X can be involved in operations with complex numbers (e.g. X_ = (1 + 1j) * X), then check_complex_data is OK

Would that be of interest if I opened a PR trying to implement this idea of a dataset factory?
If so, should that factory be handled as a tag?
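To make those extra assumptions concrete, here is a hypothetical container (purely illustrative, not from this thread or any library) exposing the three operations listed above:

```python
import numpy as np

class FunctionalData:
    """Hypothetical non-2darray container with the extra operations."""

    def __init__(self, curves):
        self.curves = np.asarray(curves)  # e.g. shape (n_samples, n_points)

    def min(self):
        # lets checks for positive-input estimators evaluate X.min()
        return self.curves.min()

    def __getitem__(self, key):
        # makes X[0, 0] accessible, as check_estimators_nan_inf would need
        return self.curves[key]

    def __rmul__(self, scalar):
        # supports X_ = (1 + 1j) * X, as check_complex_data would need
        return FunctionalData(scalar * self.curves)
```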
Thanks for investigating @rtavenar! I think a prototype implementation could be useful.
> The other question is of course how to provide the factory. The easiest would probably be to provide an argument to check_estimator that contains all the factories needed?
I imagine we could do something like:

- add a dataset_factory=None parameter to all check_* functions
- add a default_dataset_factory(X_types="2darray", multioutput=False) function (parameters to be determined) that would be used by default when dataset_factory == None; the X_types value is obtained from estimator tags, and initially only 2darray would be supported
- expose this in check_estimator and parametrize_with_checks (this might interact with https://github.com/scikit-learn/scikit-learn/pull/17361 somewhat).

WDYT?
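As a rough illustration of that proposal (names mirror the comment above, but none of this is actual scikit-learn API), a factory could be any callable returning an (X, y) pair:

```python
import numpy as np
from sklearn.datasets import make_blobs

def default_dataset_factory(X_types="2darray", multioutput=False,
                            random_state=0):
    """Hypothetical default: builds the 2d data most checks hard-code today."""
    return make_blobs(n_samples=50, centers=3, random_state=random_state)

def time_series_dataset_factory(X_types="3darray", multioutput=False,
                                random_state=0):
    """Hypothetical downstream factory, e.g. for a library like tslearn."""
    rng = np.random.RandomState(random_state)
    X = rng.rand(50, 16, 1)  # (n_time_series, n_timestamps, n_features)
    y = rng.randint(0, 3, size=50)
    return X, y

# A check would then do something like:
#   X, y = (dataset_factory or default_dataset_factory)()
```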
Great! I'll work on it and come back to you as soon as I have something.
Description

All tests in check_estimator are disabled if X_types does not include "2darray". Some tests such as check_get_params_invariance or check_set_params do not require a fit step, and should be run no matter what.

I am currently building a library that uses its own data types, but tries to be Scikit-learn compatible. Thus, I try to make estimators with the same API as Scikit-learn, but different input and output types. It would be useful to check API conformance using check_estimator with our estimator objects.
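For context, the gating described here happens at the top of the check generator; schematically (paraphrased from estimator_checks.py, not the exact source):

```python
import warnings
from sklearn.exceptions import SkipTestWarning
from sklearn.utils.estimator_checks import check_estimators_dtype

def _yield_all_checks(name, estimator):
    # Schematic paraphrase: every check, fit-free or not, is skipped as
    # soon as the tags advertise input other than a 2d array.
    tags = estimator._get_tags()
    if "2darray" not in tags["X_types"]:
        warnings.warn("Can't test estimator {} which requires input of type {}"
                      .format(name, tags["X_types"]), SkipTestWarning)
        return
    yield check_estimators_dtype  # ... and all the other checks
```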