thomasjpfan opened 4 years ago
I kinda wanted to have it about the `DataArray` only at first, but through discussions we decided to include that in the SLEP.

It looks like the two motivations for the `InputArray` are:

- `input_feature_names_`

Would voting for SLEP 012 indirectly vote for `input_feature_names_`? If not, the SLEP vote is kind of for "infrastructure". In other words, it will enable us to work on the two items above. (BTW, I'm trying to see what we need to do to get SLEP 012 ready to vote on.)

Is there any other use case you see for `InputArray`?
I guess I need to give a bit of context. The history of the feature names "feature" (:D) seems something like the following to me (please correct me if I'm missing something, @amueller and @jorisvandenbossche):
A `get_feature_names` method is introduced in some of the estimators. In most estimators it gives the output feature names, but if implemented in a predictor, or in a pipeline whose last step is a predictor, it gives the input features to the predictor (hence kind of inconsistent). Propagating the feature names through a pipeline, in order to be able to inspect the feature names going into each step, can be done if we allow `get_feature_names` to accept an existing set of feature names (from the previous step in the pipeline); having the pipeline's `get_feature_names` do that for us would allow the user to inspect the features going into the last step. This would be done once `fit` is called. Some of the drawbacks: it's not clear whether `get_feature_names` is about input or output names, it only works in a pipeline, and estimators don't have access to the feature names at `fit` time (@amueller has an implementation of this).
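A minimal sketch of that first approach, assuming a hypothetical `get_feature_names(input_features=...)` signature (toy classes, not actual scikit-learn code):

```python
class SelectFirstK:
    """Toy transformer that keeps only the first k features."""

    def __init__(self, k):
        self.k = k

    def get_feature_names(self, input_features=None):
        # Accepting the previous step's names is the key idea:
        # output names are derived from input names.
        if input_features is None:
            input_features = [f"x{i}" for i in range(self.k)]
        return list(input_features[: self.k])


class ToyPipeline:
    """Toy pipeline that chains get_feature_names through its steps."""

    def __init__(self, steps):
        self.steps = steps

    def get_feature_names(self, input_features=None):
        names = input_features
        for step in self.steps:
            names = step.get_feature_names(names)
        return names


pipe = ToyPipeline([SelectFirstK(3), SelectFirstK(2)])
print(pipe.get_feature_names(["a", "b", "c", "d"]))  # ['a', 'b']
```

Note that names only become available after `fit` on the real pipeline; here the chaining is shown on its own.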
The pipeline propagates the feature names at fit time, getting the output feature names of each step and passing them to the next step's `fit`. This could be done either with `get_feature_names` or with `feature_names_in_` and `feature_names_out_`; the latter makes it clear which one is which. It also means estimators have access to the feature names during `fit`. However, it's quite a bit of change in the pipeline, and it also requires us to pass a special `feature_names` arg to `fit`, which doesn't seem the nicest design (@amueller has an implementation of this).
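That fit-time propagation could look roughly like this (toy classes; the `feature_names` argument to `fit` is the hypothetical part being debated):

```python
class PrefixStep:
    """Toy transformer recording names at fit time and deriving output names."""

    def __init__(self, prefix):
        self.prefix = prefix

    def fit(self, X, feature_names=None):
        # The estimator now sees the names during fit ...
        self.feature_names_in_ = feature_names
        # ... and exposes what it produces, so in/out are unambiguous.
        self.feature_names_out_ = (
            [f"{self.prefix}__{n}" for n in feature_names]
            if feature_names is not None
            else None
        )
        return self


class ToyPipeline:
    """Toy pipeline passing each step's output names to the next step's fit."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, feature_names=None):
        names = feature_names
        for step in self.steps:
            step.fit(X, feature_names=names)
            names = step.feature_names_out_
        return self


pipe = ToyPipeline([PrefixStep("scale"), PrefixStep("select")]).fit(
    X=None, feature_names=["age", "income"]
)
print(pipe.steps[1].feature_names_in_)  # ['scale__age', 'scale__income']
```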
Building on the previous step, we could use a data structure which includes the feature names as the output of `transform`, pass that to `fit`, and have estimators understand feature names from it. I had an implementation of this using `xarray`, and we seemed pretty okay with the implementation. It also doesn't touch the pipeline machinery, since it's transparent to the pipeline and other meta-estimators. The main issue with this solution is the dependency on `xarray`, and hence on `pandas`.
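The shape of that idea, using a tiny hand-rolled container as a stand-in for `xarray.DataArray` (all class names here are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class NamedData:
    """Minimal stand-in for an xarray.DataArray carrying feature names."""

    data: list
    feature_names: list


class Doubler:
    """Toy transformer that reads names from the container at fit time."""

    def fit(self, X):
        # No extra fit argument needed: the names travel with the data,
        # so the pipeline itself stays untouched.
        self.feature_names_in_ = list(X.feature_names)
        return self

    def transform(self, X):
        doubled = [[v * 2 for v in row] for row in X.data]
        # Output is again a named container, so the next step sees names too.
        return NamedData(doubled, self.feature_names_in_)


est = Doubler().fit(NamedData([[1, 2]], ["a", "b"]))
out = est.transform(NamedData([[1, 2]], ["a", "b"]))
print(out.feature_names)  # ['a', 'b']
```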
Further developing the idea, we had a few discussions about building a new data structure to keep the feature names alongside the data, and that's how the `DataArray`/`InputArray` came to be. It has the downside of introducing yet another data structure to the ecosystem, one which we'd need to maintain. We also contemplated ideally pushing it out of sklearn in the future to make it a standalone package. The upside is that we'd avoid the `xarray` and/or `pandas` dependency. It's also a more backward-compatible implementation, since any operation on our object would result in a `numpy` array.
Understanding this history makes it kinda clear (I hope) why the SLEP includes `feature_names_in_`. If you look at the SLEP's history, you'll see that I started it as the infrastructure SLEP, as you mention, but the reviewers were not happy with it, since it didn't have enough motivation, and that motivation would include the introduction of feature names.
The latest development of the idea is to have feature names implemented using `xarray` and/or `pandas`, but with a soft dependency, i.e. users would be able to have/enable the feature only if they have the dependency installed. From my conversations with a few of the core devs, it seems to be a solution which would rather easily get a consensus.
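A rough sketch of the soft-dependency idea (hypothetical helper name; the real design is still under discussion): the names ride along in a pandas object only when pandas is installed, and everything degrades gracefully otherwise.

```python
def attach_feature_names(data, names):
    """Return data with feature names attached, if pandas is available.

    Hypothetical helper: with pandas installed the names ride along in a
    DataFrame; without it, the plain array-like is returned unchanged,
    so pandas stays an optional (soft) dependency.
    """
    try:
        import pandas as pd  # only needed to enable the feature
    except ImportError:
        return data
    return pd.DataFrame(data, columns=names)


out = attach_feature_names([[1.0, 2.0]], ["age", "income"])
```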
With the above background in mind, I'm hoping that the last solution will gain consensus, and I hope the implementation is not too complicated. There are a few bits which need to be discussed, such as:

- `xarray` or `pandas`? I very much prefer `xarray`, Andy prefers `pandas`, and there are pros and cons on either side.

The parts where we seem to already have a consensus on are (I think):

- `feature_names_in_` and `feature_names_out_`
- `get_feature_names`
If we move towards the latest solution (which I support), then I'd withdraw SLEP012 in favor of the new solution.
Some additions:

1) It's right now always the output feature names, so it's consistent, though potentially unexpected.
2) This got a bit better since we can now slice pipelines, so we could do `pipe[:-1].get_feature_names()`. That doesn't solve the issue for other meta-estimators (if we want feature names there), though.
3) If we do this with `get_feature_names()`, it has similar limitations w.r.t. other meta-estimators as 2). It's true that I mixed the fit-time aspect with the `feature_names_in_` and `feature_names_out_` parts. These are somewhat orthogonal, but together they help consistently address all meta-estimators.
4) It's not entirely clear whether this requires adding `feature_names_in_`. It probably does.
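For illustration, the pipeline-slicing idea above (`pipe[:-1].get_feature_names()`) sketched with toy classes (hypothetical names, not scikit-learn's API):

```python
class NamedStep:
    """Toy step that simply reports fixed output feature names."""

    def __init__(self, names):
        self._names = names

    def get_feature_names(self):
        return self._names


class ToyPipeline:
    """Toy pipeline supporting slicing, mirroring pipe[:-1]."""

    def __init__(self, steps):
        self.steps = steps

    def __getitem__(self, idx):
        # Slicing returns a sub-pipeline of the selected steps.
        return ToyPipeline(self.steps[idx])

    def get_feature_names(self):
        # Output names of the last step in this (sub-)pipeline; for
        # pipe[:-1] these are the names going into the final predictor.
        return self.steps[-1].get_feature_names()


# Last step stands in for a predictor, which has no output names.
pipe = ToyPipeline([NamedStep(["a", "b"]), NamedStep(["pca0"]), NamedStep(None)])
print(pipe[:-1].get_feature_names())  # ['pca0']
```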
I agree with the summary. Though I think one aspect you're missing is that @jnothman (and also @thomasjpfan) have concerns that having `feature_names_in_` and `feature_names_out_` will mean potentially generating and storing a lot of strings.
Will the `HashingVectorizer` create 1M strings at fit time, independent of the input? Will they be copied at every step in a pipeline? We could decide not to have feature names for sparse data, but then we can't really deprecate `get_feature_names` in `CountVectorizer`.
Having it be opt-in might solve some of these issues, but that means you need to decide ahead of time whether you might want to look at the feature names or not.
SLEP012 suggests adding a new attribute, `input_feature_names_`, to the estimators: https://github.com/scikit-learn/enhancement_proposals/blob/master/slep012/proposal.rst#L25

Is this within the scope of the SLEP?