Closed · ulupo closed this 3 years ago
Yes :heart_eyes:! (I would vote for the solution based on the factory pattern.)
@wreise: @gtauzin and I had a chat about this and we agreed that, given that some aspects of the second solution are difficult to test comprehensively in a scikit-learn context (the dynamical signature and class-name changes in particular), and given also that the first solution allows the estimator type to be treated as a hyperparameter (good for some users), we'll push the first version for the time being, but look for a way of integrating something like the second one too in the future. I'll make a PR.
I think the issue encountered by @lewtun here is too bad, and it shows that we should add some utilities for the user to be able to promote scikit-learn transformers (which act on 2D data) to transformers acting on collections of 2D arrays. It would be quite easy to provide this sort of functionality to the user, and I can see at least two ways.
### A metaestimator taking a transformer instance as argument
The idea here is to provide a metaestimator class in a style similar to this example in scikit-learn: https://scikit-learn.org/stable/auto_examples/cluster/plot_inductive_clustering.html#sphx-glr-auto-examples-cluster-plot-inductive-clustering-py.

Notes:

- The `fit` method only performs input validation: hyperparameter validation is out of reach generically, because we don't know what transformer will be passed.
- The real work is done entry-wise by the wrapped transformer inside the `fit_transform` method. So `fit` can only be quite trivial, and `transform` does not exist (this is a bit like the situation in Mapper, where we don't know what it means to transform new data given a Mapper graph built in training).
- `joblib` parallelism is surely too important to give up. This proposal allows the user to choose the number of jobs at the level of the collection, as well as to give a soft hint on the backend to use (https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html).
- It is not clear whether `if_delegate_has_method` is essential here, but it does not seem to hurt.
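The code originally attached here did not survive extraction. As a minimal sketch of what such a metaestimator could look like (the class name `ForEachInput` and the exact parameters `n_jobs`/`prefer` are illustrative assumptions, not the final API):

```python
from joblib import Parallel, delayed
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.utils.validation import check_array


def _fit_transform_one(transformer, x):
    # Fit a fresh clone on a single 2D entry and transform that entry.
    return clone(transformer).fit_transform(x)


class ForEachInput(BaseEstimator, TransformerMixin):
    """Promote a transformer acting on a 2D array to one acting on a
    collection of 2D arrays (illustrative sketch only)."""

    def __init__(self, transformer=None, n_jobs=None, prefer=None):
        self.transformer = transformer
        self.n_jobs = n_jobs    # parallelism over the collection
        self.prefer = prefer    # soft hint on the joblib backend

    def fit(self, X, y=None):
        # Only input validation: we cannot validate the hyperparameters
        # of an arbitrary wrapped transformer generically.
        for x in X:
            check_array(x)
        return self

    def fit_transform(self, X, y=None):
        # Each entry is fitted and transformed independently; a generic
        # `transform` on unseen data is not well defined for this wrapper,
        # so no `transform` method is offered.
        self.fit(X, y)
        return Parallel(n_jobs=self.n_jobs, prefer=self.prefer)(
            delayed(_fit_transform_one)(self.transformer, x) for x in X
        )
```

Usage would look like `ForEachInput(PCA(n_components=2), n_jobs=2).fit_transform(list_of_arrays)`; with `n_jobs=None` the loop runs sequentially.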
### A factory for transformers, taking a scikit-learn transformer class as argument

Here, we instead provide the user with a factory producing "collection-wise" versions of arbitrary transformers. The user would e.g. set `PCACollection = apply_transformer_to_collection(PCA)`, and then instantiate it with the same parameters as `PCA`, e.g. `pca_for_collection = PCACollection(n_components=3)`. The other points are as before. Here, more work is needed to fix the `__init__` signature of the wrapper class at runtime to make it the same as `PCA`'s.
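The factory code shown originally was also lost in extraction. A rough sketch of the idea (the renaming and the runtime signature fix are schematic assumptions here, not the actual implementation):

```python
import inspect

from sklearn.base import BaseEstimator, TransformerMixin


def apply_transformer_to_collection(transformer_cls):
    """Return a 'collection-wise' version of `transformer_cls` (sketch)."""
    init_signature = inspect.signature(transformer_cls.__init__)
    defaults = {name: p.default
                for name, p in init_signature.parameters.items()
                if name != "self"}

    class ForEachInput(BaseEstimator, TransformerMixin):
        def __init__(self, **kwargs):
            # Store hyperparameters as attributes, as scikit-learn expects.
            for name, default in defaults.items():
                setattr(self, name, kwargs.get(name, default))

        def fit(self, X, y=None):
            return self

        def fit_transform(self, X, y=None):
            # Fit-transform a fresh instance of the wrapped class per entry.
            params = {name: getattr(self, name) for name in defaults}
            return [transformer_cls(**params).fit_transform(x) for x in X]

    # Fix the __init__ signature at runtime so that get_params/set_params
    # (which introspect __init__) see the wrapped class's hyperparameters,
    # and rename the wrapper class to something meaningful to the user.
    ForEachInput.__init__.__signature__ = init_signature
    ForEachInput.__name__ = "ForEachInput" + transformer_cls.__name__
    ForEachInput.__qualname__ = ForEachInput.__name__
    return ForEachInput
```

Setting `__signature__` on the generated `__init__` is one way to make scikit-learn's parameter introspection see the wrapped class's hyperparameters.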
Notes:

- `apply_transformer_to_collection` renames the wrapper class `ForEachInput` at runtime to make it more meaningful to the user: `apply_transformer_to_collection(PCA)` would have class name `ForEachInputPCA`, for instance.
- Fixing the `__init__` signature at runtime is needed for compatibility with scikit-learn's `get_params` and `set_params`.

### Main differences
With the first version, we have an easier-to-maintain, less bug-prone and more scikit-learn-looking solution. However, the second solution has a more functional flavour which could appeal to some -- having a transformer factory makes for slightly clearer thinking and code layout if one is set on which transformer should be used.

Users who want to e.g. grid-search across transformer classes, and not just across parameters of a fixed class, will find the first solution easier to work with. But if they are set on which class to use, the grid will look a little more cumbersome, as they have to access deep parameters by prepending the class name each time, as in `PCA__n_components`.
See also #108.
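On the grid-search point: treating an estimator instance itself as a grid-searchable hyperparameter is already a standard pattern with plain scikit-learn pipelines, and the first solution would behave analogously. A small self-contained illustration (using ordinary 2D data rather than collections):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The "reduce" step itself is a hyperparameter: the grid ranges over
# estimator instances as well as over deep parameters of other steps.
pipe = Pipeline([("reduce", PCA()), ("clf", LogisticRegression(max_iter=500))])
param_grid = {
    "reduce": [PCA(n_components=2), StandardScaler()],  # swap the estimator
    "clf__C": [0.1, 1.0],  # deep parameter, accessed with the `__` syntax
}
rng = np.random.RandomState(0)
X, y = rng.rand(40, 6), rng.randint(0, 2, size=40)
search = GridSearchCV(pipe, param_grid, cv=3).fit(X, y)
# search.best_params_ records both the chosen estimator and clf__C
```

This is what makes the first (metaestimator) design convenient for users who want to search over transformer classes, not just over the parameters of a fixed class.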