RFC Support for Some Developer Utilities

thomasjpfan commented 4 years ago

On the Utilities for Developers page, it states:

Warning: These utilities are meant to be used internally within the scikit-learn package. They are not guaranteed to be stable between versions of scikit-learn. Backports, in particular, will be removed as the scikit-learn dependencies evolve.

If we want to provide utilities to support third-party estimators, we should treat some of these utilities as "first class" citizens.

For example safe_indexing would be extremely useful for third parties that want to support DataFrames as input. Currently, the options for third-party developers is to build their own "safe_indexing" or depend on our private version which may not be stable.

Another example is https://github.com/scikit-learn/enhancement_proposals/pull/22, which defines a n_features_in_ contract where we will internally use private methods to cohere with the contract. Third-party estimators would need to build their own methods or functions to work with the SLEP.

TLDR: Now that much of the utilities are "private", we can make deliberate decisions about what utilities should be public and supported by us. This would mean deprecation cycles, etc. If we support some of the utils module, it will make it easier to build estimators, which will enrich the ecosystem of scikit-learn compatible estimators.

CC @scikit-learn/core-devs

GaelVaroquaux commented 4 years ago

+1 on the general idea!

NicolasHug commented 4 years ago

Warning: These utilities are meant to be used internally within the scikit-learn package. They are not guaranteed to be stable between versions of scikit-learn. Backports, in particular, will be removed as the scikit-learn dependencies evolve.

In practice, that's not true. We've been treating all these utilities as public, and we've always deprecated things smoothly with the 2 versions rules. Case in point: safe_indexing, which we just deprecated in 0.22.

It seems that the questions are:

Do we want a developer API? If yes, what are the rules regarding deprecations?
What utilities become part of the developer API (if any), which ones go private, and which ones stay public?

I'm happy to consider a proposal :)

jnothman commented 4 years ago

Yes, we should define a developer API more clearly. It should include things that make passing check_estimator easier, and that make passing new checks added to check_estimator easier (e.g. a validate_data helper helps with n_featuresin and with error messages for X.shape[1] discrepancies; and when updated to include pandas column reordering stuff, this would be seamless for 3rd party devs).

I also think that it's safe to bring back things deprecated in 0.22.0 into 0.22.1.

glemaitre commented 4 years ago

a developer API more clearly

I'm +1 with that. Then, we should ensure some deprecation cycle if we have breaking changes but we could make it as with experimental to have fast adoption (and we can use DeprecationWarning to warn the end-user :)).

I think that we can come with a list of utilities. Another way would be to ask third-party which tools they are using?

qinhanmin2014 commented 4 years ago

In practice, that's not true. We've been treating all these utilities as public, and we've always deprecated things smoothly with the 2 versions rules. Case in point: safe_indexing, which we just deprecated in 0.22.

I guess not @NicolasHug , see #14545, we change the behaviour of check_is_fitted without a deprecation cycle (You can say that there's a deprecation cycle, but I'd argue that since we no longer support the old behaviour, the deprecation cycle there is meaningless).

Perhaps we should introduce a deprecation cycle if we want to change public functions in sklearn.utils.

thomasjpfan commented 3 years ago

With 1.0, let's see if we can revisit this topic. With the comment at https://github.com/scikit-learn/scikit-learn/issues/20695#issuecomment-894531268:

I am not sure if we should either make it public or available to developers. I think that I would prefer to define a developer API with utilities that should be used by people developing scikit-learn estimators.

My current definition of public API is "always deprecate with 2 cycles before changing behavior". We have mostly been treating items in sklearn.utils as public API under that definition.

Before we decide what are utilities are developer API, I think we need to decide what is the API contract for items listed as "developer API". For me it comes down to, "how much work does a third-party estimator developer have to do to support the latest version of sklearn". I am happy with following the 2 cycle deprecation rule as we do with "public API". If we want to add features to the developer utility, we use our standard practices for ensuring backward compatibility.

adrinjalali commented 3 years ago

Another related discussion is here: https://github.com/scikit-learn/scikit-learn/pull/20657 : how do we signal something is developer API?

glemaitre commented 3 years ago

how much work does a third-party estimator developer have to do to support the latest version of sklearn

Thinking with what we encounter with imbalanced-learn, the issue is not to support the latest version of scikit-learn but instead, be compatible with the older versions. We made the choice to only support the latest version of scikit-learn because this is easy to make the change. The pain to support the older version is that we need to backport old non-backwards compatible functionalities. I don't say that we did a good choice, but it is the least painful on our side with the available workforce at hand.

I assume that the impact of 1 cycle vs. 2 cycles will be that 1 cycle allow third-party to only support scikit-learn while 2 cycles would allow supporting the 2 latest versions easily.

glemaitre commented 3 years ago

how do we signal something is developer API?

something documented publicly but starting with an underscore :)

WenjieZ commented 3 years ago

Let me weigh in here. The point of a Developer API is to tell the developers that these functionalities are quite stable and will not undergo changes without notification and consulting the public's opinion. The developers want to know what they can depend on to write their programs, just like foreign entrepreneurs want to know the local law and legislation before investing.

glemaitre commented 3 years ago

the point of a Developer API is to tell the developers that these functionalities are quite stable and will not undergo changes without notification

I think that we agree on this. What would be interesting is to know if we can shorten it to 1 cycle or be a bit more conservative and use 2 cycles as in the public API.

adrinjalali commented 3 years ago

something documented publicly but starting with an underscore :)

What I'm trying to say, is that documentation is great, but something else would be nice as well. As in, it'd be nice for the user to inspect a class and notice certain things which are private, as opposed to other things which are a part of the developer API. That's why I'm very happy with the dundar notation in https://github.com/scikit-learn/scikit-learn/pull/20657

lorentzenchr commented 1 year ago

If we agree on the usual 2 release cycle for a developer API, then there is hardly a difference to the user API except where to find it. We could collect the developer API in a (sub)module though:

If an existing one, then sklearn.utils.
If a new one, then maybe sklearn.dev.

It would be nice to resolve this issue somehow. Maybe use the next dev meeting?

glemaitre commented 1 year ago

I like the idea to have a new module which might make it more straightforward to create documentation.

Maybe use the next dev meeting?

It would be nice. I'll support any way forward there (I will be travelling during the dev meeting).

scikit-learn / scikit-learn

RFC Support for Some Developer Utilities #15801