There was also the concern that the user may want to disable this propagation. (I think this SLEP hasn't addressed that case yet).
can you elaborate? I don't remember that part.
I think it was specifically in the context of NLP related usecases where the whole "dictionary" becomes the features and it may be very memory intensive to store them. IIRC @jnothman raised the concern.
Well our countvectorizer is currently quite naive about this, wrt ngrams, but we would be forcing this idea of storing feature names in sparse spaces on people who have been using other vectorization tools.
There's no discussion of what the vectorizers do with their input feature names, if anything. Is that even allowed?
Coming back to this, I feel like I now favor a more invasive approach.
What I'm mostly thinking about right now is how feature names can enter an estimator. I feel that if they enter any other way than fit, the user might call fit with a mismatching X and names, and the results will be inconsistent. So there would be a benefit in providing feature names only via fit. If X is a dataframe, that's easy. If X is not a dataframe, I can see two options, both of which are very invasive:
a) add a feature_names_in parameter to fit (everywhere)
b) create an ndarray subclass that has a feature_name attribute that stores the feature names.
a) requires a lot of code changes but is pretty nice otherwise, while b) requires no code changes to the fit methods, but creates a new custom class, which could be pretty confusing.
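Very roughly, option a) could look like this for a one-to-one transformer (the parameter and attribute names are only illustrative, nothing agreed on):

import numpy as np

class MyScaler:
    def fit(self, X, y=None, feature_names_in=None):
        X = np.asarray(X)
        if feature_names_in is None:
            # fall back to generated names when none are provided
            feature_names_in = ["x%d" % i for i in range(X.shape[1])]
        self.feature_names_in_ = list(feature_names_in)
        self.mean_ = X.mean(axis=0)
        # one-to-one transformer: output names equal input names
        self.feature_names_out_ = self.feature_names_in_
        return self

    def transform(self, X):
        return np.asarray(X) - self.mean_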
If we always have the feature names in fit, we can also do the transformation in fit, and so the user never has to call any method.
I feel it would be good if we can eliminate having a public method. That means if you need to change your feature names, you have to refit. An expert might use private methods to avoid this, but I think it's not that important a use-case. I think it's more important to ensure that feature names and the actual computation are in sync.
Therefore my preferred solution right now:
1) Ensure feature_names_in is available during fit
2) Set feature_names_out at the end of fit
3) Have the pipeline pass the "out" from the previous step to the "in" of the next step
4) Profit
The main question is then how to implement 1) in the case of numpy arrays, and I think the three options are setting it beforehand, passing it in as an argument, and passing it in as a custom class.
Given that 2) requires touching every fit (of a transformer) anyway, maybe passing it in as an argument is easiest? I'm unsure about adding a custom class, and I don't really like the "set it before fit" approach that is currently implemented, because it's very implicit and side-effect-y.
The downside of passing the feature names into fit as a separate argument is that it requires special-casing passthrough in the pipeline, which is a mess :-/ but well..
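For illustration, the propagation step inside a pipeline's fit could look roughly like this (a simplified sketch; feature_names_in and feature_names_out_ are the hypothetical names from the discussion above, and a real Pipeline would also have to handle the final estimator, fit params, etc.):

def _fit_transformers(steps, X, y=None, feature_names=None):
    for name, est in steps:
        if est == "passthrough":
            # the special case mentioned above: nothing to fit, names unchanged
            continue
        est.fit(X, y, feature_names_in=feature_names)
        X = est.transform(X)
        # output names of this step become input names of the next
        feature_names = getattr(est, "feature_names_out_", None)
    return X, feature_names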
I don't think special-casing passthrough is as big a deal as changing the current convention that things passed to fit are sample-aligned. That affects a whole bunch of meta-estimators, including any defined outside of sklearn, doesn't it?
Passing the names to fit is by far the most explicit design. I am curious to know what subclassing looks like... But it makes me think that if it weren't for sparse matrices, we'd be better off just using DataFrames for interchange with column names, since wrapping and unwrapping should be fast with homogeneous dtype...
@jnothman I thought there was a concern that pandas might become a column store and wrapping and unwrapping become non-trivial operations? @jorisvandenbossche surely knows more.
That would be kind of "pandas in"-"pandas out" then, but only applied to X - which would allow a lot of things already.
Yes the passthrough is not actually a big deal, I was just annoyed I actually have to think about the code ;)
Not having sample-aligned fit parameters certainly breaks with tradition. If non-sklearn meta-estimators handle them as kwargs they should assume they are sample aligned, so this will break backward-compatibility.
Which might show that allowing kwargs wasn't / isn't a good way to do sample props? In a way, passing things through fit is basically "attaching properties to features/columns", only it looks to me as if the routing questions are much easier (or just different) than with sample props.
We could go even a step further and do feature_props={'names': feature_names}, but for now that seems like unnecessary indirection.
I haven't finished writing the subclassing thing, but I think it's pretty trivial. We add a function make_named_array(X, feature_names) that creates an object of type NamedNdarray with a feature_names attribute; we expect the user to do that if they don't want to use pandas as input, and we wrap the output of transformers with it.
It's basically the same as pandas-in, pandas-out, only that we ensure there's really zero copy and it's future-proof as long as we rely on numpy.
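A minimal sketch of what that wrapper could look like (make_named_array and NamedNdarray are the names from the comment above; this is just an illustration, not an agreed implementation):

import numpy as np

class NamedNdarray(np.ndarray):
    # ndarray subclass that carries the feature names along
    def __new__(cls, data, feature_names=None):
        obj = np.asarray(data).view(cls)  # a view, so no data copy
        obj.feature_names = None if feature_names is None else list(feature_names)
        return obj

    def __array_finalize__(self, obj):
        self.feature_names = getattr(obj, "feature_names", None)

def make_named_array(X, feature_names):
    return NamedNdarray(X, feature_names=feature_names)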
BTW, any of these designs gets rid of the weird sub-estimator discovery I had to do, because now everything is done in fit, as it was meant to be ;)
Asked pandas: https://github.com/pandas-dev/pandas/issues/27211
Also asked xarray: https://github.com/pydata/xarray/issues/3077
I think the answer from pandas is as I remembered, which is that they might change to using 1d slices, and while conversion to pandas might be possible, coming back is not possible without a copy.
It seems a bit unnatural to me to produce DataArrays if someone passes in a DataFrame, but if DataFrame is not an option then we should consider it.
@jorisvandenbossche also brought up the option of natively supporting pandas in some estimators and not requiring conversion at all. That might be possible for many of the preprocessing transformers, but not for all. It's certainly desirable to avoid copies, but I don't think it'll provide a full solution to the feature names issue. Also, I assume it will require a lot of code for many of the estimators as pandas dataframes and numpy arrays likely require different codepaths even for simple things like scaling.
tldr; not casting pandas to numpy when we don't have to would be nice, but it probably won't solve the feature name issue.
If we go down the path of having a NamedNdarray, then why not keep the sample props attached to the array as well? If we do that, it starts getting closer to xarray's implementation, so to me it seems like a better solution to just support/use xarray.DataArray. (They even have a Dataset object, BTW.)
@jnothman do you see major drawbacks to using an xarray.DataArray-like object to pass around sample props? I know it'd be tricky to handle the case where we want a scorer or an estimator to ignore a sample property that is present, but that can be handled by a meta-estimator removing those attributes before feeding the data to the estimator.
@amueller are you already working on a NamedNdarray-like solution?
Stephan said in https://github.com/pydata/xarray/issues/3077 that it's a core property of xarray to do zero copy for numeric dtypes, so I think it would be a feasible candidate.
@adrinjalali attaching sample props to X wouldn't solve the routing issues, right, as in where to use sample_weights etc.? So there's definitely some benefit, but the hard problems are not actually addressed :-/
I haven't started implementing a NamedNdarray solution, and I think right now I would prefer xarray.DataArray. I'm still not convinced that fit arguments are bad, though they have backward-compatibility issues as @jnothman pointed out.
Right now, ColumnTransformer works on xarray.DataArray if we call the column dimension columns. Is that something we would want to enforce? That would allow us to write similar code to support pandas and xarray, but it might be a bit odd from an xarray perspective? From an sklearn view it would probably make the most sense to call the two dimensions samples and features, but then we would need to do more duck-typing for xarray vs pandas.
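For concreteness, this is roughly what the DataArray variant looks like (dimension and coordinate names follow the samples/features suggestion and are only illustrative):

import numpy as np
import xarray as xr

X = xr.DataArray(
    np.random.rand(100, 3),
    dims=("samples", "features"),
    coords={"features": ["age", "height", "weight"]},
)
X.data                              # the underlying numpy array (no copy)
list(X.coords["features"].values)   # the feature names
X.sel(features=["age", "weight"])   # name-based column selection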
I think xarray is a perfect fit for having a numpy array with column names; I would rather go that way than doing a custom subclass. Using xarray for this would make it a necessary dependency for the feature. Scikit-learn would start outputting DataArrays from transformers, also if your input data is pandas DataFrames (while up to now the output was always numpy arrays / scipy matrices). That would make the integration tighter compared to the duck-typing we have now with pandas, although I suppose it will certainly still be possible to code it in a way that does not make it a hard dependency. I would find it a small downside, though, that it's only an optional requirement that enables such a core feature. Also note that xarray has pandas as a dependency, implicitly making pandas a dependency as well.
There is also the problem of sparse matrices (which can eg be outputted by OneHotEncoder), which do not (yet?) fit in DataArray (I don't know if there has been work to fit pydata/sparse into an xarray.DataArray).
In general, I also like the idea of doing the work in fit. That will make different usage patterns more consistent, and it is in line with things we are starting to do with eg n_features_in_.
So trying to form an overview of the currently discussed options:

1) Handling the propagation outside fit

2) Handling everything in the fit method of all estimators
- The fit method gets the input feature names information, and is responsible for setting feature_names_in_, transforming the feature names and setting feature_names_out_.
- The feature names come from the X data: DataFrame or DataArray. In this case, the fit method is also responsible for wrapping the output in a class that embeds the feature names (in the case of a transformer). This is explored in PR https://github.com/scikit-learn/scikit-learn/pull/14315/.
- This would require little changes in Pipeline?

3) Transformation in fit, but propagation outside fit
- The fit method is responsible for transforming input -> output feature names and setting the feature_names_out_ attribute, but it is the user or the calling class (eg Pipeline) that is responsible for passing the input feature names. This means that after a first pipeline step where eg a DataFrame gets converted to arrays, this is fine, as the feature names are also passed through by the pipeline separately from the X data.
- Options for passing the feature names to fit:
  - Set feature_names_in_ beforehand, and the fit method checks for this (this is maybe in conflict with sklearn's design principles, as it would not be the estimator itself setting a "fitted" attribute, even before the estimator is actually fitted).
  - Pass them to fit: the fit method then sets feature_names_in_, and it is the pipeline that passes the feature names to the fit method based on the feature_names_out_ of the previous step (this is explored in PR https://github.com/scikit-learn/scikit-learn/pull/14238).

@jorisvandenbossche also brought up the option of natively supporting pandas in some estimators and not requiring conversion at all. That might be possible for many of the preprocessing transformers, but not for all. It's certainly desirable to avoid copies, but I don't think it'll provide a full solution to the feature names issue.
Yes, I brought it up in the discussion around avoiding copies, which can be valuable separately from the discussion in this PR. It is indeed certainly not a solution to the feature names propagation (or certainly not a realistic short term solution, even if the data copying was not a problem).
If we go this route, I think we need to make xarray, and therefore pandas, a mandatory dependency. Or at least I think this would be the most reasonable way to do it. In particular, if the user passes a pandas dataframe, we should return a DataArray. I have no idea how to do that without xarray being a hard dependency. I'll see if I can find the DataArray people. If we only have feature names in the dense case I'm not too mad, tbh, though it's annoying for text data.
I just talked with @nvictus, who is working on https://github.com/pydata/xarray/pull/3117, which adds support for the pydata sparse ndarray. It looks like it should not be too hard to add support for scipy sparse matrices. In the slightly more distant future that might be possible via the __array_function__ protocol, but that's not feasible for now because it requires a too recent numpy.
And we still do have the option of making a NamedArray. xarray uses pandas' index classes for the indexing and stuff, which is something we really don't need. Their feature set will probably also explode, since they seem to want to support the usual database-like operations that pandas does, and we really don't need/want those.
So I wouldn't mind having an internal class at all.
It looks like it should not be too hard to add support for scipy sparse matrices.
To be clear, if xarray did support SciPy sparse matrices, I suspect we would do it by internally coercing them into pydata/sparse arrays. Our APIs are really designed around N-dimensional arrays. Fortunately I think it's possible to convert back and forth without copying any data.
And we still do have the option of making a NamedArray. xarray uses pandas' index classes for the indexing and stuff, which is something we really don't need.
I think this is totally reasonable and could be a nice light-weight option, depending on the extent of your labeled-data needs. It's pretty easy to imagine automatic conversion from xarray/pandas/numpy objects into NamedArray, and support for explicit conversions in the other direction with to_pandas / to_xarray / __array__.
[..] by internally coercing into pydata/sparse arrays. Our APIs are really designed around N-dimensional arrays. Fortunately I think it's possible to convert back and forth without copying any data.
Doesn't pydata/sparse only support the COO sparse format in 2D? In scikit-learn we mostly work with CSR or CSC, and some copies (albeit of only part of the index) would then be necessary for the conversion, I think.
It looks like CSR/CSC support in pydata/sparse is a work in progress: https://github.com/pydata/sparse/pull/258
My two cents:
Maybe we could use a data structure with names of columns, both to simplify passing information around in our code base and to help reporting results to users.
In my opinion, this structure should behave as closely as possible to an ndarray, because that is really the model of scikit-learn. This is unlike pandas (and I believe xarray), which has indices and does alignment with them, and also has different indexing rules.
For usability, we do not want to confuse or surprise the user: he/she should have a clear mental picture of what comes out of transformers, and this mental picture should coincide with the way we manipulate data.
For these reasons, I think that it might be interesting to explore a very minimalistic ndarray subclass, as Andy has started. There are several dangers in this endeavour. First is feature creep: if this object gets complicated, it's definitely out of scope for scikit-learn. The simple answer to this is: it should stay simple. Second is compatibility: I frown at the idea of creating yet another container. However, I do not believe that such a container exists: pandas DataFrames or xarray DataArrays have data-manipulation logic that deviates more or less strongly from numpy arrays. I feel that such an object is direly needed in the ecosystem. Xarray itself acknowledges some inspiration from https://github.com/BIDS/datarray. The problem is that the various projects that have provided this container have specialized it further to their needs. It would be great if we could achieve some consolidation in the ecosystem.
@shoyer I much appreciate your input!
@GaelVaroquaux thanks for chiming in, and I'm pretty sure your opinion is worth more than $0.02 ;)
Re feature creep: the two other obvious things we might want to store are sample props and other feature props, i.e. row meta data and column meta data, as described on our roadmap.
Having row and column meta-data would be slimmer than the dataset class that @adrinjalali proposed (I think) but would be more complex than a thin wrapper.
I'm not sure if we should also take @jorisvandenbossche's proposal to not coerce dataframes to numpy arrays in preprocessing into account here, though it's somewhat orthogonal.
https://github.com/BIDS/datarray#prior-art seems an interesting list. larry seems closest to what we want, though is unmaintained.
Also interesting is Table from data8, but it's a pure column store: https://github.com/data-8/datascience/blob/master/datascience/tables.py
I'm not sure if we should also take @jorisvandenbossche's proposal to not coerce dataframes to numpy arrays in preprocessing into account here, though it's somewhat orthogonal.
I think that it is something worth exploring in itself, but will never be a complete solution for this discussion, so I would keep it separate.
Fun question for our custom subclass: does our class support nd-arrays? Probably, right? Or will it revert back to being a numpy array when you reshape it?
I think I have to look into the numpy internals way more to understand where we would need to hook in to make sure we update or error if the shape is changed.
In my opinion, this structure should behave as closely as possible to an ndarray, because that is really the model of scikit-learn. This is unlike pandas (and I believe xarray), which has indices and does alignment with them, and also has different indexing rules.
The biggest difference between xarray.DataArray and numpy.ndarray is how we do broadcasting: broadcasting is done by dimension name instead of by integer position. This is arguably much more intuitive, but it definitely means you cannot simply substitute NumPy -> xarray and expect everything to work.
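A tiny example of that difference (standard xarray behaviour, nothing scikit-learn specific):

import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(3), dims="samples")
b = xr.DataArray(np.arange(2), dims="features")
(a + b).dims   # ('samples', 'features'): aligned by name, the result is 2-d
# np.arange(3) + np.arange(2) would raise a ValueError: plain NumPy broadcasts by position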
Otherwise, xarray's API is very similar to NumPy, much more so than pandas:
- array[] indexing works like NumPy, with the exception of broadcasting for vectorized indexing.
- Aggregations such as array.mean() default to NaN-skipping versions.

I think there is potentially room in the ecosystem for a simpler labeled array project, e.g., one with only labeled axes/dimensions but not tick/coordinate labels. The challenge is deciding where to draw the line. My experience has been that there is a fundamental trade-off between compatibility with unlabeled array APIs (NumPy) and label-based expressiveness.
Xarray chose to go "all in" on label-based expressiveness. This makes it too heavy-weight for some use-cases, but it does provide strong rewards for specifying labels, in the form of a large (and growing) label powered API.
For return values from scikit-learn estimators, I like the light-weight model of patsy.DesignMatrix, which is just a numpy array plus metadata that gets lost in any operation:
>>> dmatrix("x1 + x2", data)
DesignMatrix with shape (8, 3)
Intercept x1 x2
1 1.76405 -0.10322
1 0.40016 0.41060
1 0.97874 0.14404
1 2.24089 1.45427
1 1.86756 0.76104
1 -0.97728 0.12168
1 0.95009 0.44386
1 -0.15136 0.33367
Terms:
'Intercept' (column 0)
'x1' (column 1)
'x2' (column 2)
>>> x + 1
array([[2. , 2.76405235, 0.89678115],
[2. , 1.40015721, 1.4105985 ],
[2. , 1.97873798, 1.14404357],
[2. , 3.2408932 , 2.45427351],
[2. , 2.86755799, 1.76103773],
[2. , 0.02272212, 1.12167502],
[2. , 1.95008842, 1.44386323],
[2. , 0.84864279, 1.33367433]])
I think I have to look into the numpy internals way more to understand where we would need to hook in to make sure we update or error if the shape is changed.
I'm not sure this is possible. NumPy has never had a complete subclassing API, which is part of the reason why classes like np.matrix have never worked particularly well. In general, we (NumPy developers) have been trying to discourage writing ndarray subclasses in favor of duck arrays. These are a lot more work to implement, but they rule out the possibility of any unexpected interactions with NumPy's internals.
@shoyer it actually looks quite simple to enforce column consistency with __array_finalize__, unless I'm missing something: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.subclassing.html
Looks like DesignMatrix is a subclass. If it's discouraged we can obviously do the duck array. The easiest way to do that would be to just implement __array__, and then everything would work but everything would lose the meta-data, right?
I'd rather have scalar operations or row-wise operations keep the column names. Basically, doing zero mean, unit variance should keep the column names. That would actually make implementing the forwarding of column names much easier for us as well. Concatenating and slicing columns should probably also keep the column names. I hadn't thought of these before, but not having them will be a pain for users, and for us. If we do this right, the feature name propagation in our preprocessing could be somewhat automatic. I think I'm gonna prototype this, probably starting with duck arrays.
it actually looks quite simple to enforce column consistency with __array_finalize__ unless I'm missing something
__array_finalize__ is called in most but not all cases. Probably the most notable exception is np.concatenate, which silently casts subclasses to base ndarrays.
A bigger issue (and there are others) is that __array_finalize__ doesn't know anything about the operation that was performed. For example, if you transpose an array, all you know is what the final shape is. So if you have a square matrix, you might blindly copy axis labels to the result even if they should be swapped.
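To make that concrete, a deliberately naive subclass (illustration only) shows both failure modes:

import numpy as np

class LabeledArray(np.ndarray):
    def __new__(cls, data, columns=None):
        obj = np.asarray(data).view(cls)
        obj.columns = columns
        return obj

    def __array_finalize__(self, obj):
        # we only see the result here, not which operation produced it
        self.columns = getattr(obj, "columns", None)

a = LabeledArray(np.zeros((3, 3)), columns=["f0", "f1", "f2"])
a.T.columns             # ['f0', 'f1', 'f2'] -- blindly copied, but they now label rows
b = np.concatenate([a, a])
# per the comment above, concatenate silently casts back to a base ndarray,
# so the labels are simply gone on b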
You could fix this particular issue by implementing your own transpose method (and/or np.transpose itself with __array_function__), but it should give you a flavor of what it's like to implement a subclass: everything in NumPy "works", but not necessarily in the right way. Even if it does work right, it's not necessarily for the right reason -- it is easy to inadvertently rely on implementation details in a subclass.
👍 for testing this out with a duck array.
The easiest way to do that would be to just implement __array__ and everything would be working but everything would lose the meta-data, right?
Yes, that would make every NumPy function work.
Thinking about it more, I came to the same conclusion, thank you for your input! __array_finalize__ indeed seems hard to make work "right".
Answering some comments from https://github.com/scikit-learn/scikit-learn/pull/14315 here, seems more appropriate:
@thomasjpfan
We can duck type this: https://www.numpy.org/devdocs/user/basics.dispatch.html
I did go through that page as well, but if you read the NEP, you see that it's pretty new. I don't think our minimum numpy version meets those requirements at all. We basically need to be on 1.17 for it to work, I think. @shoyer am I correct?
With a custom NamedArray, we can have "namedarray in -> namedarray out" for every estimator.
That is what I'm trying to achieve in this PR (at least to give users a method so that they can achieve that).
@amueller
we can actually make it "just work" for many cases and don't need to implement get_feature_names in those cases!!!
Do you mean to keep the feature names there and if the transformer doesn't do anything to mess with the columns, to keep the column names? I think it's possible, but still at the end of the transformer we need to check if the column names are still there or not, since with many operations the column names should rightfully be gone. That's why I'm not sure if it's worth the effort.
what were the issues with subclassing?
It's not straightforward at all, and there are at least two different ways of doing it, as @shoyer mentioned. I also checked some other implementations, and their authors are not happy with them either (the new duck-typing method seems much better, though it's more work).
With subclassing, I did manage to make it work as long as only one side of the operation is a NamedArray (when the two NamedArrays have mismatching column names it gets complicated), and it'd keep the column names for simple operations and slices. Also, that'd mean the slice operation is much slower since we're hooking into __getitem__, and we'd need to expose something like iloc to do fast slicing on the array itself, which means the rest of the codebase would need to use it. In the end, it seemed like what I put in this PR now (the very simple container thingy) might be a more reasonable solution.
The easiest way to do that would be to just implement __array__ and everything would be working but everything would lose the meta-data, right?
That's what I did in https://github.com/scikit-learn/scikit-learn/pull/14315/files#diff-5b184308506279093e1d9b367d5a8643, and almost all common tests now pass with it. Still dealing with some FeatureUnion issues.
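For reference, the __array__ route in a nutshell is something like this (a rough sketch of such a simple container, not the actual code in the PR):

import numpy as np

class NamedArray:
    # thin container: data plus feature names, no numpy subclassing
    def __init__(self, data, feature_names):
        self.data = np.asarray(data)
        self.feature_names = list(feature_names)

    def __array__(self, dtype=None):
        # np.asarray(named) and most numpy functions then see the plain data
        # and keep working -- but the metadata is dropped on the way in
        return self.data if dtype is None else self.data.astype(dtype)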
We can duck type this: https://www.numpy.org/devdocs/user/basics.dispatch.html
I did go through that page as well, but if you read the NEP, you see that it's pretty new. I don't think our minimum numpy requirements meet those requirements at all. We basically need to be on 1.17 for it to work, I think. @shoyer am I correct?
If you're not willing to commit to NumPy 1.17+, then it isn't possible to make either a duck array or a subclass that is preserved by all NumPy operations. Operations like np.concatenate will always coerce your objects into base NumPy arrays.
This is part of why xarray implements its own functions/methods, e.g., xarray.where vs numpy.where. The other reason is that in some cases we have intentionally chosen to implement different semantics, e.g., xarray.dot automatically sums over axes with the same name, unlike numpy.dot, which sums along the last axis of the first argument and the second-to-last axis of the second argument.
I think it's possible, but still at the end of the transformer we need to check if the column names are still there or not, since with many operations the column names should rightfully be gone. That's why I'm not sure if it's worth the effort.
Why not? I think it's weird and error-prone to have to re-implement what happens to the columns during the preprocessing. Having the same code handle the actual transformation and the transformation of the names seems way more straightforward.
Yes, we'd still need to do something at the end of fit, but we'd save implementing get_feature_names for many cases, I think.
Also, that'd mean the slice operation is much slower since we're hooking into getitem, and we'd need to expose something like iloc or something to do fast slicing on the array itself, which means the rest of the codebase would need to use it.
I think with the new duck-typing interface we could do that without using __getitem__ or iloc, but if we can't use that, it's going to be hard.
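As a sketch of what the duck-typing interface could buy here (using __array_ufunc__, available since numpy 1.13; the reattachment rule below is just illustrative):

import numpy as np

class NamedArray:
    def __init__(self, data, feature_names):
        self.data = np.asarray(data)
        self.feature_names = list(feature_names)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # unwrap, let numpy do the work, then re-wrap when the columns survive
        raw = [x.data if isinstance(x, NamedArray) else x for x in inputs]
        out = getattr(ufunc, method)(*raw, **kwargs)
        if (isinstance(out, np.ndarray) and out.ndim == 2
                and out.shape[1] == len(self.feature_names)):
            return NamedArray(out, self.feature_names)
        return out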
Alternative: create a scikit-learn 1.0 beta with feature names using duck arrays relying on numpy 1.7 ;)
Yes, we'd still need to do something at the end of fit, but we'd save implementing get_feature_names for many cases, I think.
I'm not sure if I understand correctly. ATM, feature_names_out_ is the same as the given input feature names if it's a one-to-one mapping, and the output would also include those feature names. If the output feature names are not supposed to be the same as the input feature names, I'm not sure how you'd have the NamedArray itself generate the new column names.
What I was trying to say is that we can have a simple NamedArray implementation, keep it experimental, have the pipeline and transformers work with the feature names, and then have a more complicated NamedArray if we want.
I guess at this point I'd appreciate your feedback on https://github.com/scikit-learn/scikit-learn/pull/14315 and to know if you think it's a right direction or not.
Operations like np.concatenate will always coerce your objects into base NumPy arrays.
I don't think that this is a huge deal: overall, I don't think that we can expect every piece of processing code to consistently keep the information of the names. Things will break, more or less frequently, in the code. Striving to have everything working transparently is a very difficult problem, and not the one that we need to solve. What we need is to add code, at the output of a transformer, that fixes these problems.
If the output feature names are not supposed to be the same as the input feature names, I'm not sure how you'd have the NamedArray itself generate the new column names.
SimpleImputer and feature selection drop columns, so they could work easily. Even if there's a one-to-one correspondence, we need to re-attach the feature names to the output if we want to hand them on.
I'm happy to start with a simple solution and iterate from there, though. I agree with @GaelVaroquaux that preserving the feature names is not a problem we need to fix; it would just be nice, and I hadn't realized earlier that it's a possibility.
@adrinjalali I'll check out your proposal tomorrow or early next week :)
BTW, I am offline for a week, but feel free to push updates with a description of the alternatives to this branch
My suggestion (sorry, I didn't read every comment here) is that passing feature names isn't enough: there are plenty of use cases where you want multiple pieces of feature metadata, not just the name. Since this is an enhancement, we might as well go for more extensive feature metadata support, because I don't believe it's more difficult logic-wise. As discussed in other threads, why not just pass around a feature-properties array-like (DataFrame or numpy), have it subsetted with each feature selection step in the Pipeline, and have it available in fit methods, just like sample properties can be via **fit_params?
Alternative: create a scikit-learn 1.0 beta with feature names using duck arrays relying on numpy 1.7 ;)
Late to the party (catching up): no problem on the numpy 1.7 requirement.
Alternative: create a scikit-learn 1.0 beta with feature names using duck arrays relying on numpy 1.7 ;)
Late to the party (catching up): no problem on the numpy 1.7 requirement.
I think the requirement is actually NumPy 1.17 for __array_function__?
Late to the party (catching up): no problem on the numpy 1.7 requirement.
I think the requirement is actually NumPy 1.17 for __array_function__?
Yes, at some point (after writing this), my brain reconnected and became less dumb.
I'm very much not in favor of bumping the requirement to 1.17 for a 1.0 beta.
It's important to have in mind that a core contribution of scikit-learn to the ecosystem is good numerical algorithms. In my view, it is the most important one. Sugar on top that makes these good numerical algorithms easier to use is good. However, it is no use without the good numerics.
Making the numerical algorithms harder to access (for instance with hard-to-meet requirements) for sugar on top is narrowing their impact on scientific and data applications.
I'm not sure how using __array_function__ would hamper people's access to the numerics, @GaelVaroquaux. People can still rely on older sklearn if they really want to use old numpy. Besides, if we're talking about v1.0, we should be able to think about such changes.
What's the status of this SLEP? Is it blocked by #22?
I don't think it is directly blocked by #22 (n_features_in_); there are a few overlapping aspects, eg regarding naming, but I think those do not touch the essential discussion here.
I think the status is that we should update the SLEP with the alternative being discussed above (or, in case that is preferred, write an alternative SLEP for it, but personally I think it would be good to describe the possible options in this SLEP).
I think the alternative which we mostly agree with is the one proposed in https://github.com/scikit-learn/scikit-learn/pull/14315. We either need a new SLEP, or to update this one to reflect the ideas there.
People can still rely on older sklearn if they really want to use old numpy.
Telling users that they can either use old versions of sklearn or upgrade everything makes it hard for users to have a buffer where they progressively do upgrades. In an ideal world, upgrades should be minor events that we do, like cleaning our room (I'm terrible at cleaning my room). The problem is when upgrades become large endeavors. First, we need to schedule significant time slots for them, second more things are likely to go wrong together. For instance: if upgrading scikit-learn triggers a significant upgrade in numpy, which itself triggers a significant upgrade in pandas, which itself comes with behavior changes. It's then harder for the user to audit the changes on his analysis or production tasks. The user is more likely to delay the upgrade, and the problem becomes worse.
Besides, if we're talking about v1.0, we should be able to think about such changes.
It's all a question of leaving the buffer: if v1.0 happens in two years, then yes.
I think that an ideal situation would be if we can gradually ramp up the requirements of scikit-learn. In other terms: if our requirements are versions that are old enough for users to have had time to adjust, typically because they are installed by the rolling upgrade of company policy (which is unfortunately something that varies widely).
@GaelVaroquaux My idea was to have a config option that explicitly enables the new behavior, and this config would fail with older numpy (i.e. the config option would have a soft dependency on numpy 1.17); it would definitely not trigger an update.
If you have an idea how to implement something like named arrays without numpy 1.17, I think we're all ears. The alternative would be to implement feature names via a different mechanism.
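Just to illustrate the soft-dependency idea (hypothetical function name and error message, not an existing sklearn API):

import numpy as np

def enable_feature_name_propagation():
    # hypothetical opt-in switch: fail loudly on old numpy instead of
    # raising the project-wide minimum requirement for everyone
    major, minor = (int(p) for p in np.__version__.split(".")[:2])
    if (major, minor) < (1, 17):
        raise RuntimeError(
            "feature name propagation requires numpy >= 1.17 "
            "(the __array_function__ protocol)"
        )
    # ... flip the global config flag here ...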
@hermidalc I agree that more metadata would be nice, but actually we haven't even solved this problem with **fit_params (see #16). Adding additional metadata later will probably be relatively straightforward if/when we agree on a mechanism.
What do you do with the meta-data for PolynomialFeatures or PCA?
With much delay, I did a quick clean-up of the draft I wrote at the beginning of March, at the end of the sprint. So here is an initial version of the SLEP on propagating feature names through pipelines.
The PR implementing it is https://github.com/scikit-learn/scikit-learn/pull/13307