Closed thomasjpfan closed 2 years ago
So "native"
means convert to numpy.ndarray or scipy.sparse? I personally find "default"
more descriptive in this case. I'm not sure what's native about numpy/scipy from the perspective of a user.
Just for the context, refer to https://github.com/scikit-learn/scikit-learn/pull/23734#discussion_r985071192
Both "default" and "native" are not passepartout names. "Native" would imply that this is the native format given by the estimator that could optionally depend on the inputs. I would feel that "default" should mean a single type of output.
But still, both terms are really generic and are used to define something that is ill-defined (if not reading the estimators' documentation).
Bottom line, I don't have a strong opinion and I don't have a better proposal :)
Would "unchanged"
be a better name there? As in, we don't try to change the output from whatever's produced by the transformer.
It was a proposal from @thomasjpfan and I would also be fine with it.
As I already said
I'm happy with any other default value than "default", which I would consider bad design because it does not tell anything about the actual behaviour.
That means I favor a string that gives a hint at what it actually does. The meaning is - please correct me: A scikit-learn transformer outputs numpy.ndarray or scipy.sparse. This does not guarantee anything about 3rd party transformers.
How about "ndarray_or_sparse"
? For sure, I‘m also ok with "unchanged "
.
With the Array API PR, scikit-learn can also output cupy.array_api.Array
, so the more descriptive name is "array-like_or_sparse"
.
I do not really like "unchanged"
anymore. For example, it leads to this UX:
# By default, `est` returns an ndarray
est.transform(X_df) # this is an ndarray
est.set_output(transform="pandas")
est.transform(X_df) # this is a dataframe
est.set_output(transform="unchanged")
est.transform(X_df) # this is a ndarray??
I originally choose "default"
to mean:
Maybe... "not-configured"
, "original"
, or "initial"
?
Some options for the default value of set_output
with the meaning "output what you get out of estimator.transform() without modifying anything". Please add your choice.
nr | default value | advantage | disadvantage |
---|---|---|---|
1 | "default" |
concise, no SLEP change needed | unspecific: default of what? |
2 | "native" |
concise | unspecific: native to what? |
3 | "unchanged" |
transform does change the input, e.g. a df as input may output a ndarray | |
4 | "not-configured" |
||
5 | "original" |
||
6 | "initial" |
||
7 | "array-like_or_sparse" |
very specific | long (maybe wrong for 3rd party estimators) |
8 | "estimator_default" |
clear meaning | long |
9 | None |
pythonic default value | not an option as it is used as sentinel value internally |
After a good 🏃, I withdraw my concerns for "default"
. If this option has consensus among the core devs, let's go with it and document well what it means exactly.
so... close and leave as-is? I think in the end none of the words easily captures what's going on and folks will have to read the docs
Agreed with @amueller. Closing then.
Based on the discussion in https://github.com/scikit-learn/scikit-learn/pull/23734#pullrequestreview-1135967664, reviewers concluded that "default" was not descriptive enough. The PR was updated to use
"native"
instead.