thomasjpfan opened 4 years ago
I kinda wanted to have it about the `DataArray` only at first, but through discussions we decided to include that in the SLEP.

It looks like the two motivations for the `InputArray` are:

- `input_feature_names_`

Would voting for SLEP 012 indirectly vote for `input_feature_names_`? If not, the SLEP vote is kind of for "infrastructure". In other words, it will enable us to work on the two items above. (BTW, I'm trying to see what we need to do to get SLEP 012 ready to vote on.)

Is there any other use case you see for `InputArray`?
I guess I need to give a bit of context. The history of the feature names "feature" (:D) seems something like the following to me (please correct me if I'm missing something, @amueller and @jorisvandenbossche):
A `get_feature_names` method is introduced in some of the estimators. In most estimators it gives the output feature names, but if implemented in a predictor, or in a pipeline whose last step is a predictor, it gives the input features to the predictor (hence kind of inconsistent). Propagating the feature names through a pipeline, in order to be able to inspect the feature names going into each step, can be done if we allow `get_feature_names` to accept an existing set of feature names (from the previous step in the pipeline); having the pipeline's `get_feature_names` do that for us would allow the user to inspect the features going into the last step. This would be done once `fit` is called. Some of the drawbacks: it's not clear whether `get_feature_names` is about input or output names, it only works in a pipeline, and estimators don't have access to the feature names at `fit` time (@amueller has an implementation of this).
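A minimal sketch of that first approach, assuming a hypothetical `get_feature_names(input_features=...)` signature (toy classes, not actual scikit-learn code):

```python
class SelectFirstK:
    """Toy transformer that keeps only the first k features."""

    def __init__(self, k):
        self.k = k

    def get_feature_names(self, input_features=None):
        # Accepting the previous step's names is the key idea:
        # output names are derived from input names.
        if input_features is None:
            input_features = [f"x{i}" for i in range(self.k)]
        return list(input_features[: self.k])


class ToyPipeline:
    """Toy pipeline that chains get_feature_names through its steps."""

    def __init__(self, steps):
        self.steps = steps

    def get_feature_names(self, input_features=None):
        names = input_features
        for step in self.steps:
            names = step.get_feature_names(names)
        return names


pipe = ToyPipeline([SelectFirstK(3), SelectFirstK(2)])
print(pipe.get_feature_names(["a", "b", "c", "d"]))  # ['a', 'b']
```

Note that names only become available after `fit` on the real pipeline; here the chaining is shown on its own.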
The pipeline propagates the feature names at fit time, getting the output feature names of each step and passing them to the next step's `fit`. This could be done either with `get_feature_names` or with `feature_names_in_` and `feature_names_out_`; the latter makes it clear which one is which. It also means estimators have access to the feature names during `fit`. However, it's quite a bit of change in the pipeline, and it also requires us to pass a special `feature_names` arg to `fit`, which doesn't seem the nicest design (@amueller has an implementation of this).
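That fit-time propagation could look roughly like this (toy classes; the `feature_names` argument to `fit` is the hypothetical part being debated):

```python
class PrefixStep:
    """Toy transformer recording names at fit time and deriving output names."""

    def __init__(self, prefix):
        self.prefix = prefix

    def fit(self, X, feature_names=None):
        # The estimator now sees the names during fit ...
        self.feature_names_in_ = feature_names
        # ... and exposes what it produces, so in/out are unambiguous.
        self.feature_names_out_ = (
            [f"{self.prefix}__{n}" for n in feature_names]
            if feature_names is not None
            else None
        )
        return self


class ToyPipeline:
    """Toy pipeline passing each step's output names to the next step's fit."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, feature_names=None):
        names = feature_names
        for step in self.steps:
            step.fit(X, feature_names=names)
            names = step.feature_names_out_
        return self


pipe = ToyPipeline([PrefixStep("scale"), PrefixStep("select")]).fit(
    X=None, feature_names=["age", "income"]
)
print(pipe.steps[1].feature_names_in_)  # ['scale__age', 'scale__income']
```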
Building on the previous step, we could use a data structure which includes the feature names as the output of `transform`, pass that to `fit`, and have estimators understand feature names from it. I had an implementation of this using `xarray`, and we seemed pretty okay with the implementation. It also doesn't touch the pipeline machinery, since it's transparent to the pipeline and other meta-estimators. The main issue with this solution is the dependency on `xarray`, and hence on `pandas`.
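The shape of that idea, using a tiny hand-rolled container as a stand-in for `xarray.DataArray` (all class names here are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class NamedData:
    """Minimal stand-in for an xarray.DataArray carrying feature names."""

    data: list
    feature_names: list


class Doubler:
    """Toy transformer that reads names from the container at fit time."""

    def fit(self, X):
        # No extra fit argument needed: the names travel with the data,
        # so the pipeline itself stays untouched.
        self.feature_names_in_ = list(X.feature_names)
        return self

    def transform(self, X):
        doubled = [[v * 2 for v in row] for row in X.data]
        # Output is again a named container, so the next step sees names too.
        return NamedData(doubled, self.feature_names_in_)


est = Doubler().fit(NamedData([[1, 2]], ["a", "b"]))
out = est.transform(NamedData([[1, 2]], ["a", "b"]))
print(out.feature_names)  # ['a', 'b']
```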
Further developing the idea, we had a few discussions about building a new data structure to keep the feature names alongside the data, and that's how the `DataArray`/`InputArray` came to be. It has the downside of introducing yet another data structure to the ecosystem, one which we'd need to maintain. We also contemplated ideally pushing it out of sklearn in the future to make it a standalone package. The upside is that we'd avoid the `xarray` and/or `pandas` dependency. It's also a more backward-compatible implementation, since any operation on our object would result in a `numpy` array.
Understanding this history makes it kinda clear (I hope) why the SLEP includes `feature_names_in_`. If you look at the SLEP's history, you'll see that I started it as the infrastructure SLEP, as you mention, but the reviewers were not happy with it, since it didn't have enough motivation, and that motivation would include the introduction of feature names.
The latest development of the idea is to have feature names implemented using `xarray` and/or `pandas`, but with a soft dependency, i.e. users would be able to have/enable the feature only if they have the dependency installed. From my conversations with a few of the core devs, it seems to be a solution which would rather easily get a consensus.
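A rough sketch of the soft-dependency idea (hypothetical helper name; the real design is still under discussion): the names ride along in a pandas object only when pandas is installed, and everything degrades gracefully otherwise.

```python
def attach_feature_names(data, names):
    """Return data with feature names attached, if pandas is available.

    Hypothetical helper: with pandas installed the names ride along in a
    DataFrame; without it, the plain array-like is returned unchanged,
    so pandas stays an optional (soft) dependency.
    """
    try:
        import pandas as pd  # only needed to enable the feature
    except ImportError:
        return data
    return pd.DataFrame(data, columns=names)


out = attach_feature_names([[1.0, 2.0]], ["age", "income"])
```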
With the above background in mind, I'm hoping that the last solution will gain consensus, and I hope the implementation is not too complicated. There are a few bits which need to be discussed, such as:

- `xarray` or `pandas`? I very much prefer `xarray`, Andy prefers `pandas`, and there are pros and cons on either side.

The parts where we seem to already have a consensus on are (I think):

- `feature_names_in_` and `feature_names_out_`
- `get_feature_names`
If we move towards the latest solution (which I support), then I'd withdraw SLEP012 in favor of the new solution.
Some additions:

1) It's right now always the output feature names, so it's consistent, though potentially unexpected.
2) This got a bit better since we can now slice pipelines, so we could do `pipe[:-1].get_feature_names()`. That doesn't solve the issue for other meta-estimators (if we want feature names there), though.
3) If we do this with `get_feature_names()`, it has similar limitations w.r.t. other meta-estimators as 2). It's true that I mixed the fit-time aspect with the `feature_names_in_` and `feature_names_out_` parts. These are somewhat orthogonal, but together they help consistently address all meta-estimators.
4) It's not entirely clear whether this requires adding `feature_names_in_`. It probably does.
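For illustration, the pipeline-slicing idea above (`pipe[:-1].get_feature_names()`) sketched with toy classes (hypothetical names, not scikit-learn's API):

```python
class NamedStep:
    """Toy step that simply reports fixed output feature names."""

    def __init__(self, names):
        self._names = names

    def get_feature_names(self):
        return self._names


class ToyPipeline:
    """Toy pipeline supporting slicing, mirroring pipe[:-1]."""

    def __init__(self, steps):
        self.steps = steps

    def __getitem__(self, idx):
        # Slicing returns a sub-pipeline of the selected steps.
        return ToyPipeline(self.steps[idx])

    def get_feature_names(self):
        # Output names of the last step in this (sub-)pipeline; for
        # pipe[:-1] these are the names going into the final predictor.
        return self.steps[-1].get_feature_names()


# Last step stands in for a predictor, which has no output names.
pipe = ToyPipeline([NamedStep(["a", "b"]), NamedStep(["pca0"]), NamedStep(None)])
print(pipe[:-1].get_feature_names())  # ['pca0']
```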
I agree with the summary. Though I think one aspect you're missing is that @jnothman (and also @thomasjpfan) have concerns that having `feature_names_in_` and `feature_names_out_` will mean potentially generating and storing a lot of strings.
Will the `HashingVectorizer` create 1M strings at fit time, independent of the input? Will they be copied at every step in a pipeline? We could decide not to have feature names for sparse data, but then we can't really deprecate `get_feature_names` in `CountVectorizer`.
Having it be opt-in might solve some of these issues, but that means you need to decide ahead of time whether you might want to look at the feature names or not.
SLEP012 suggests adding a new attribute, `input_feature_names_`, to the estimators: https://github.com/scikit-learn/enhancement_proposals/blob/master/slep012/proposal.rst#L25

Is this within the scope of the SLEP?