scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rational for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License
48 stars 34 forks source link

DOC Rename 'default' to 'native' for set_output SLEP #78

Closed thomasjpfan closed 2 years ago

thomasjpfan commented 2 years ago

Based on the discussion in https://github.com/scikit-learn/scikit-learn/pull/23734#pullrequestreview-1135967664, reviewers concluded that "default" was not descriptive enough. The PR was updated to use "native" instead.

adrinjalali commented 2 years ago

So "native" means convert to numpy.ndarray or scipy.sparse? I personally find "default" more descriptive in this case. I'm not sure what's native about numpy/scipy from the perspective of a user.

glemaitre commented 2 years ago

Just for the context, refer to https://github.com/scikit-learn/scikit-learn/pull/23734#discussion_r985071192

Both "default" and "native" are not passepartout names. "Native" would imply that this is the native format given by the estimator that could optionally depend on the inputs. I would feel that "default" should mean a single type of output.

But still, both terms are really generic and are used to define something that is ill-defined (if not reading the estimators' documentation).

Bottom line, I don't have a strong opinion and I don't have a better proposal :)

adrinjalali commented 2 years ago

Would "unchanged" be a better name there? As in, we don't try to change the output from whatever's produced by the transformer.

glemaitre commented 2 years ago

It was a proposal from @thomasjpfan and I would also be fine with it.

lorentzenchr commented 2 years ago

As I already said

I'm happy with any other default value than "default", which I would consider bad design because it does not tell anything about the actual behaviour.

That means I favor a string that gives a hint at what it actually does. The meaning is - please correct me: A scikit-learn transformer outputs numpy.ndarray or scipy.sparse. This does not guarantee anything about 3rd party transformers.

How about "ndarray_or_sparse"? For sure, I‘m also ok with "unchanged ".

thomasjpfan commented 2 years ago

With the Array API PR, scikit-learn can also output cupy.array_api.Array, so the more descriptive name is "array-like_or_sparse".

I do not really like "unchanged" anymore. For example, it leads to this UX:

# By default, `est` returns an ndarray
est.transform(X_df) # this is an ndarray

est.set_output(transform="pandas")
est.transform(X_df) # this is a dataframe

est.set_output(transform="unchanged")
est.transform(X_df) # this is a ndarray??

I originally choose "default" to mean:

Maybe... "not-configured", "original", or "initial"?

lorentzenchr commented 2 years ago

Some options for the default value of set_output with the meaning "output what you get out of estimator.transform() without modifying anything". Please add your choice.

nr default value advantage disadvantage
1 "default" concise, no SLEP change needed unspecific: default of what?
2 "native" concise unspecific: native to what?
3 "unchanged" transform does change the input, e.g. a df as input may output a ndarray
4 "not-configured"
5 "original"
6 "initial"
7 "array-like_or_sparse" very specific long (maybe wrong for 3rd party estimators)
8 "estimator_default" clear meaning long
9 None pythonic default value not an option as it is used as sentinel value internally
lorentzenchr commented 2 years ago

After a good 🏃, I withdraw my concerns for "default". If this option has consensus among the core devs, let's go with it and document well what it means exactly.

amueller commented 2 years ago

so... close and leave as-is? I think in the end none of the words easily captures what's going on and folks will have to read the docs

glemaitre commented 2 years ago

Agreed with @amueller. Closing then.