There was also the concern that the user may want to disable this propagation. (I think this SLEP hasn't addressed that case yet).
can you elaborate? I don't remember that part.
I think it was specifically in the context of NLP related usecases where the whole "dictionary" becomes the features and it may be very memory intensive to store them. IIRC @jnothman raised the concern.
Well our countvectorizer is currently quite naive about this, wrt ngrams, but we would be forcing this idea of storing feature names in sparse spaces on people who have been using other vectorization tools.
There's no discussion of what the vectorizers do with their input feature names, if anything. Is that even allowed?
Coming back to this, I feel like I now favor a more invasive approach.
What I'm mostly thinking about right now is how feature names can enter an estimator. I feel that if they enter any other way than fit, the user might call fit with a mismatching X and names, and the results will be inconsistent. So there would be a benefit in providing feature names only via fit. If X is a dataframe, that's easy. If X is not a dataframe, I can see two options, both of which are very invasive:
a) add a feature_names_in parameter to fit (everywhere)
b) create an ndarray subclass that has a feature_name attribute that stores the feature names.
a) requires a lot of code changes but is pretty nice otherwise, while b) requires no code changes to the fit methods, but creates a new custom class, which could be pretty confusing.
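Very roughly, option a) could look like this for a one-to-one transformer (the parameter and attribute names are only illustrative, nothing agreed on):

import numpy as np

class MyScaler:
    def fit(self, X, y=None, feature_names_in=None):
        X = np.asarray(X)
        if feature_names_in is None:
            # fall back to generated names when none are provided
            feature_names_in = ["x%d" % i for i in range(X.shape[1])]
        self.feature_names_in_ = list(feature_names_in)
        self.mean_ = X.mean(axis=0)
        # one-to-one transformer: output names equal input names
        self.feature_names_out_ = self.feature_names_in_
        return self

    def transform(self, X):
        return np.asarray(X) - self.mean_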
If we always have the feature names in fit, we can also do the transformation in fit, and so the user never has to call any method.
I feel it would be good if we can eliminate having a public method. That means if you need to change your feature names, you have to refit. An expert might use private methods to avoid this, but I think it's not that important a use-case. I think it's more important to ensure that feature names and the actual computation are in sync.
Therefore my preferred solution right now:
1) Ensure feature_names_in is available during fit
2) Set feature_names_out at the end of fit
3) Have the pipeline pass the "out" from the previous step to the "in" of the next step
4) Profit
The main question is then how to implement 1) in the case of numpy arrays, and I think the three options are setting it beforehand, passing it in as an argument, and passing it in as a custom class.
Given that 2) requires touching every fit (of a transformer) anyway, maybe passing it in as an argument is easiest? I'm unsure about adding a custom class, and I don't really like the "set it before fit" approach that is currently implemented, because it's very implicit and side-effect-y.
The downside of passing the feature names into fit as a separate argument is that it requires special-casing passthrough in the pipeline, which is a mess :-/ but well..
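For illustration, the propagation step inside a pipeline's fit could look roughly like this (a simplified sketch; feature_names_in and feature_names_out_ are the hypothetical names from the discussion above, and a real Pipeline would also have to handle the final estimator, fit params, etc.):

def _fit_transformers(steps, X, y=None, feature_names=None):
    for name, est in steps:
        if est == "passthrough":
            # the special case mentioned above: nothing to fit, names unchanged
            continue
        est.fit(X, y, feature_names_in=feature_names)
        X = est.transform(X)
        # output names of this step become input names of the next
        feature_names = getattr(est, "feature_names_out_", None)
    return X, feature_names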
I don't think special-casing passthrough is as big a deal as changing the current convention that things passed to fit are sample-aligned. That affects a whole bunch of meta-estimators, including any defined outside of sklearn, doesn't it?
Passing the names to fit is by far the most explicit design. I am curious to know what subclassing looks like... But it makes me think that if it weren't for sparse matrices, we'd be better off just using DataFrames for interchange with column names, since wrapping and unwrapping should be fast with homogeneous dtype...
@jnothman I thought there was a concern that pandas might become a column store and wrapping and unwrapping become non-trivial operations? @jorisvandenbossche surely knows more.
That would be kind of "pandas in"-"pandas out" then, but only applied to X - which would allow a lot of things already.
Yes the passthrough is not actually a big deal, I was just annoyed I actually have to think about the code ;)
Not having sample-aligned fit parameters certainly breaks with tradition. If non-sklearn meta-estimators handle them as kwargs they should assume they are sample aligned, so this will break backward-compatibility.
Which might show that allowing kwargs wasn't / isn't a good way to do sample props? In a way, passing things through fit is basically "attaching properties to features/columns", only it looks to me as if the routing questions are much easier (or just different) than with sample props.
We could go even a step further and do feature_props={'names': feature_names}, but for now that seems like unnecessary indirection.
I haven't finished writing the subclassing thing, but I think it's pretty trivial. We add a function make_named_array(X, feature_names) that creates an object of type NamedNdarray with a feature_names attribute; we expect the user to do that if they don't want to use pandas as input, and we wrap the output of transformers with it.
It's basically the same as pandas-in, pandas-out, only that we ensure there's really zero copy and it's future-proof as long as we rely on numpy.
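A minimal sketch of what that wrapper could look like (make_named_array and NamedNdarray are the names from the comment above; this is just an illustration, not an agreed implementation):

import numpy as np

class NamedNdarray(np.ndarray):
    # ndarray subclass that carries the feature names along
    def __new__(cls, data, feature_names=None):
        obj = np.asarray(data).view(cls)  # a view, so no data copy
        obj.feature_names = None if feature_names is None else list(feature_names)
        return obj

    def __array_finalize__(self, obj):
        self.feature_names = getattr(obj, "feature_names", None)

def make_named_array(X, feature_names):
    return NamedNdarray(X, feature_names=feature_names)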
BTW, any of these designs gets rid of the weird sub-estimator discovery I had to do, because now everything is done in fit, as it was meant to be ;)
Asked pandas: https://github.com/pandas-dev/pandas/issues/27211
Also asked xarray: https://github.com/pydata/xarray/issues/3077
I think the answer from pandas is as I remembered, which is that they might change to using 1d slices, and while conversion to pandas might be possible, coming back is not possible without a copy.
It seems a bit unnatural to me to produce DataArrays if someone passes in a DataFrame, but if DataFrame is not an option then we should consider it.
@jorisvandenbossche also brought up the option of natively supporting pandas in some estimators and not requiring conversion at all. That might be possible for many of the preprocessing transformers, but not for all. It's certainly desirable to avoid copies, but I don't think it'll provide a full solution to the feature names issue. Also, I assume it will require a lot of code for many of the estimators as pandas dataframes and numpy arrays likely require different codepaths even for simple things like scaling.
tldr; not casting pandas to numpy when we don't have to would be nice, but it probably won't solve the feature name issue.
If we go down the path of having a NamedNdarray, then why not keep the sample props attached to the array as well? If we do that, it starts getting closer to xarray's implementation, so to me it seems like a better solution to just support/use xarray.DataArray. (They even have a Dataset object, BTW.)
@jnothman do you see major drawbacks to using an xarray.DataArray-like object to pass around sample props? I know it'd be tricky to handle the case where we want a scorer or an estimator to ignore a sample property that is present, but that can be handled by a meta-estimator removing those attributes before feeding the data to the estimator.
@amueller are you already working on a NamedNdarray-like solution?
Stephan said in https://github.com/pydata/xarray/issues/3077 that it's a core property of xarray to do zero copy for numeric dtypes, so I think it would be a feasible candidate.
@adrinjalali attaching sample props to X wouldn't solve the routing issues, right, as in where to use sample_weights etc.? So there's definitely some benefit, but the hard problems are not actually addressed :-/
I haven't started implementing a NamedNdarray solution, and I think right now I would prefer xarray.DataArray. I'm still not convinced that fit arguments are bad, though they have backward-compatibility issues as @jnothman pointed out.
Right now, ColumnTransformer works on xarray.DataArray if we call the column dimension columns. Is that something we would want to enforce? That would allow us to write similar code to support pandas and xarray, but it might be a bit odd from an xarray perspective? From an sklearn view it would probably make the most sense to call the two dimensions samples and features, but then we would need to do more duck-typing for xarray vs pandas.
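For concreteness, this is roughly what the DataArray variant looks like (dimension and coordinate names follow the samples/features suggestion and are only illustrative):

import numpy as np
import xarray as xr

X = xr.DataArray(
    np.random.rand(100, 3),
    dims=("samples", "features"),
    coords={"features": ["age", "height", "weight"]},
)
X.data                              # the underlying numpy array (no copy)
list(X.coords["features"].values)   # the feature names
X.sel(features=["age", "weight"])   # name-based column selection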
I think xarray is a perfect fit for having a numpy array with column names; I would rather go that way than doing a custom subclass. Using xarray for this would make it a necessary dependency for the feature. Scikit-learn would start outputting DataArrays from transformers, also if your input data is pandas DataFrames (while up to now the output was always numpy arrays / scipy matrices). That would make the integration tighter compared to the duck-typing we have now with pandas, although I suppose it will certainly still be possible to code it in a way that does not make it a hard dependency. I would find it a small downside, though, that it's only an optional requirement that enables such a core feature. Also note that xarray has pandas as a dependency, implicitly making pandas a dependency as well.
There is also the problem of sparse matrices (which can eg be outputted by OneHotEncoder), which do not (yet?) fit in DataArray (I don't know if there has been work to fit pydata/sparse into an xarray.DataArray).
In general, I also like the idea of doing the work in fit. That will make different usage patterns more consistent, and it is in line with things we are starting to do with eg n_features_in_.
So trying to form an overview of the currently discussed options:

1) Handling the propagation outside fit

2) Handling everything in the fit method of all estimators
- The fit method gets the input feature names information, and is responsible for setting feature_names_in_, transforming the feature names and setting feature_names_out_.
- The feature names come from the X data: DataFrame or DataArray. In this case, the fit method is also responsible for wrapping the output in a class that embeds the feature names (in the case of a transformer). This is explored in PR https://github.com/scikit-learn/scikit-learn/pull/14315/.
- This would require little changes in Pipeline?

3) Transformation in fit, but propagation outside fit
- The fit method is responsible for transforming input -> output feature names and setting the feature_names_out_ attribute, but it is the user or the calling class (eg Pipeline) that is responsible for passing the input feature names. This means that after a first pipeline step where eg a DataFrame gets converted to arrays, this is fine, as the feature names are also passed through by the pipeline separately from the X data.
- Options for passing the feature names to fit:
  - Set feature_names_in_ beforehand, and the fit method checks for this (this is maybe in conflict with sklearn's design principles, as it would not be the estimator itself setting a "fitted" attribute, even before the estimator is actually fitted).
  - Pass them to fit: the fit method then sets feature_names_in_, and it is the pipeline that passes the feature names to the fit method based on the feature_names_out_ of the previous step (this is explored in PR https://github.com/scikit-learn/scikit-learn/pull/14238).

@jorisvandenbossche also brought up the option of natively supporting pandas in some estimators and not requiring conversion at all. That might be possible for many of the preprocessing transformers, but not for all. It's certainly desirable to avoid copies, but I don't think it'll provide a full solution to the feature names issue.
Yes, I brought it up in the discussion around avoiding copies, which can be valuable separately from the discussion in this PR. It is indeed certainly not a solution to the feature names propagation (or certainly not a realistic short term solution, even if the data copying was not a problem).
If we go this route, I think we need to make xarray, and therefore pandas, a mandatory dependency. Or at least I think this would be the most reasonable way to do it. In particular, if the user passes a pandas dataframe, we should return a DataArray. I have no idea how to do that without xarray being a hard dependency. I'll see if I can find the DataArray people. If we only have feature names in the dense case I'm not too mad, tbh, though it's annoying for text data.
I just talked with @nvictus, who is working on https://github.com/pydata/xarray/pull/3117, which adds support for the pydata sparse ndarray. It looks like it should not be too hard to add support for scipy sparse matrices. In the slightly more distant future that might be possible via the __array_function__ protocol, but that's not feasible for now because it requires a too recent numpy.
And we still do have the option of making a NamedArray. xarray uses pandas' index classes for the indexing and stuff, which is something we really don't need. Their feature set will probably also explode, since they seem to want to support the usual database-like operations that pandas does, and we really don't need/want those.
So I wouldn't mind having an internal class at all.
It looks like it should not be too hard to add support for scipy sparse matrices.
To be clear, if xarray did support SciPy sparse matrices, I suspect we would do it by internally coercing them into pydata/sparse arrays. Our APIs are really designed around N-dimensional arrays. Fortunately I think it's possible to convert back and forth without copying any data.
And we still do have the option of making a NamedArray. xarray uses pandas' index classes for the indexing and stuff, which is something we really don't need.
I think this is totally reasonable and could be a nice light-weight option, depending on the extent of your labeled-data needs. It's pretty easy to imagine automatic conversion from xarray/pandas/numpy objects into NamedArray, and support for explicit conversions in the other direction with to_pandas / to_xarray / __array__.
[..] by internally coercing into pydata/sparse arrays. Our APIs are really designed around N-dimensional arrays. Fortunately I think it's possible to convert back and forth without copying any data.
Doesn't pydata/sparse only support the COO sparse format in 2D? In scikit-learn we mostly work with CSR or CSC, and some copies (albeit of only part of the index) would then be necessary for the conversion, I think.
It looks like CSR/CSC support in pydata/sparse is a work in progress: https://github.com/pydata/sparse/pull/258
My two cents:
Maybe we could use a data structure with names of columns, both to simplify passing information around in our code base and to help reporting results to users.
In my opinion, this structure should behave as closely as possible to an ndarray, because that is really the model of scikit-learn. This is unlike pandas (and I believe xarray), which has indices and does alignment with them, and also has different indexing rules.
For usability, we do not want to confuse or surprise the user: he/she should have a clear mental picture of what comes out of transformers, and this mental picture should coincide with the way we manipulate data.
For these reasons, I think that it might be interesting to explore a very minimalistic ndarray subclass, as Andy has started. There are several dangers in this endeavour. First is feature creep: if this object gets complicated, it's definitely out of scope for scikit-learn. The simple answer to this is: it should stay simple. Second is compatibility: I frown at the idea of creating yet another container. However, I do not believe that such a container exists: pandas DataFrames or xarray DataArrays have data-manipulation logic that deviates more or less strongly from numpy arrays. I feel that such an object is direly needed in the ecosystem. Xarray itself acknowledges some inspiration from https://github.com/BIDS/datarray. The problem is that the various projects that have provided this container have specialized it further to their needs. It would be great if we could achieve some consolidation in the ecosystem.
@shoyer I much appreciate your input!
@GaelVaroquaux thanks for chiming in, and I'm pretty sure your opinion is worth more than $0.02 ;)
Re feature creep: the two other obvious things we might want to store are sample props and other feature props, i.e. row meta data and column meta data, as described on our roadmap.
Having row and column meta-data would be slimmer than the dataset class that @adrinjalali proposed (I think) but would be more complex than a thin wrapper.
I'm not sure if we should also take @jorisvandenbossche's proposal to not coerce dataframes to numpy arrays in preprocessing into account here, though it's somewhat orthogonal.
https://github.com/BIDS/datarray#prior-art seems an interesting list. larry seems closest to what we want, though is unmaintained.
Also interesting is Table from data8, but it's a pure column store: https://github.com/data-8/datascience/blob/master/datascience/tables.py
I'm not sure if we should also take @jorisvandenbossche's proposal to not coerce dataframes to numpy arrays in preprocessing into account here, though it's somewhat orthogonal.
I think that it is something worth exploring in itself, but will never be a complete solution for this discussion, so I would keep it separate.
Fun question for our custom subclass: does our class support nd-arrays? Probably, right? Or will it revert back to being a numpy array when you reshape it?
I think I have to look into the numpy internals way more to understand where we would need to hook in to make sure we update or error if the shape is changed.
In my opinion, this structure should behave as closely as possible to an ndarray, because that is really the model of scikit-learn. This is unlike pandas (and I believe xarray), which has indices and does alignment with them, and also has different indexing rules.
The biggest difference between xarray.DataArray and numpy.ndarray is how we do broadcasting: broadcasting is done by dimension name instead of by integer position. This is arguably much more intuitive, but it definitely means you cannot simply substitute NumPy -> xarray and expect everything to work.
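A tiny example of that difference (standard xarray behaviour, nothing scikit-learn specific):

import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(3), dims="samples")
b = xr.DataArray(np.arange(2), dims="features")
(a + b).dims   # ('samples', 'features'): aligned by name, the result is 2-d
# np.arange(3) + np.arange(2) would raise a ValueError: plain NumPy broadcasts by position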
Otherwise, xarray's API is very similar to NumPy, much more so than pandas:
- array[] indexing works like NumPy, with the exception of broadcasting for vectorized indexing.
- Aggregations such as array.mean() default to NaN-skipping versions.

I think there is potentially room in the ecosystem for a simpler labeled array project, e.g., one with only labeled axes/dimensions but not tick/coordinate labels. The challenge is deciding where to draw the line. My experience has been that there is a fundamental trade-off between compatibility with unlabeled array APIs (NumPy) and label-based expressiveness.
Xarray chose to go "all in" on label-based expressiveness. This makes it too heavy-weight for some use-cases, but it does provide strong rewards for specifying labels, in the form of a large (and growing) label powered API.
For return values from scikit-learn estimators, I like the light-weight model of patsy.DesignMatrix, which is just a numpy array plus metadata that gets lost in any operation:
>>> dmatrix("x1 + x2", data)
DesignMatrix with shape (8, 3)
Intercept x1 x2
1 1.76405 -0.10322
1 0.40016 0.41060
1 0.97874 0.14404
1 2.24089 1.45427
1 1.86756 0.76104
1 -0.97728 0.12168
1 0.95009 0.44386
1 -0.15136 0.33367
Terms:
'Intercept' (column 0)
'x1' (column 1)
'x2' (column 2)
>>> x + 1
array([[2. , 2.76405235, 0.89678115],
[2. , 1.40015721, 1.4105985 ],
[2. , 1.97873798, 1.14404357],
[2. , 3.2408932 , 2.45427351],
[2. , 2.86755799, 1.76103773],
[2. , 0.02272212, 1.12167502],
[2. , 1.95008842, 1.44386323],
[2. , 0.84864279, 1.33367433]])
I think I have to look into the numpy internals way more to understand where we would need to hook in to make sure we update or error if the shape is changed.
I'm not sure this is possible. NumPy has never had a complete subclassing API, which is part of the reason why classes like np.matrix have never worked particularly well. In general, we (NumPy developers) have been trying to discourage writing ndarray subclasses in favor of duck arrays. These are a lot more work to implement, but they rule out the possibility of any unexpected interactions with NumPy's internals.
@shoyer it actually looks quite simple to enforce column consistency with __array_finalize__, unless I'm missing something: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.subclassing.html
Looks like DesignMatrix is a subclass. If it's discouraged we can obviously do the duck array. The easiest way to do that would be to just implement __array__, and then everything would work but everything would lose the meta-data, right?
I'd rather have scalar operations or row-wise operations keep the column names. Basically, doing zero mean, unit variance should keep the column names. That would actually make implementing the forwarding of column names much easier for us as well. Concatenating and slicing columns should probably also keep the column names. I hadn't thought of these before, but not having them will be a pain for users, and for us. If we do this right, the feature name propagation in our preprocessing could be somewhat automatic. I think I'm gonna prototype this, probably starting with duck arrays.
it actually looks quite simple to enforce column consistency with __array_finalize__ unless I'm missing something
__array_finalize__ is called in most but not all cases. Probably the most notable exception is np.concatenate, which silently casts subclasses to base ndarrays.
A bigger issue (and there are others) is that __array_finalize__ doesn't know anything about the operation that was performed. For example, if you transpose an array, all you know is what the final shape is. So if you have a square matrix, you might blindly copy axis labels to the result even if they should be swapped.
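To make that concrete, a deliberately naive subclass (illustration only) shows both failure modes:

import numpy as np

class LabeledArray(np.ndarray):
    def __new__(cls, data, columns=None):
        obj = np.asarray(data).view(cls)
        obj.columns = columns
        return obj

    def __array_finalize__(self, obj):
        # we only see the result here, not which operation produced it
        self.columns = getattr(obj, "columns", None)

a = LabeledArray(np.zeros((3, 3)), columns=["f0", "f1", "f2"])
a.T.columns             # ['f0', 'f1', 'f2'] -- blindly copied, but they now label rows
b = np.concatenate([a, a])
# per the comment above, concatenate silently casts back to a base ndarray,
# so the labels are simply gone on b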
You could fix this particular issue by implementing your own transpose method (and/or np.transpose itself with __array_function__), but it should give you a flavor of what it's like to implement a subclass: everything in NumPy "works", but not necessarily in the right way. Even if it does work right, it's not necessarily for the right reason -- it is easy to inadvertently rely on implementation details in a subclass.
👍 for testing this out with a duck array.
The easiest way to do that would be to just implement __array__ and everything would be working but everything would lose the meta-data, right?
Yes, that would make every NumPy function work.
Thinking about it more, I came to the same conclusion, thank you for your input! __array_finalize__ indeed seems hard to make work "right".
Answering some comments from https://github.com/scikit-learn/scikit-learn/pull/14315 here, seems more appropriate:
@thomasjpfan
We can duck type this: https://www.numpy.org/devdocs/user/basics.dispatch.html
I did go through that page as well, but if you read the NEP, you see that it's pretty new. I don't think our minimum numpy version meets those requirements at all. We basically need to be on 1.17 for it to work, I think. @shoyer am I correct?
With a custom NamedArray, we can have "namedarray in -> namedarray out" for every estimator.
That is what I'm trying to achieve in this PR (at least to give users a method so that they can achieve that).
@amueller
we can actually make it "just work" for many cases and don't need to implement get_feature_names in those cases!!!
Do you mean to keep the feature names there and if the transformer doesn't do anything to mess with the columns, to keep the column names? I think it's possible, but still at the end of the transformer we need to check if the column names are still there or not, since with many operations the column names should rightfully be gone. That's why I'm not sure if it's worth the effort.
what were the issues with subclassing?
It's not straightforward at all, and there are at least two different ways of doing it, as @shoyer mentioned. I also checked some other implementations, and their authors are not happy with them either (the new duck-typing method seems much better, though it's more work).
With subclassing, I did manage to make it work as long as only one side of the operation is a NamedArray (when the two NamedArrays have mismatching column names it gets complicated), and it'd keep the column names for simple operations and slices. Also, that'd mean the slice operation is much slower since we're hooking into __getitem__, and we'd need to expose something like iloc to do fast slicing on the array itself, which means the rest of the codebase would need to use it. In the end, it seemed like what I put in this PR now (the very simple container thingy) might be a more reasonable solution.
The easiest way to do that would be to just implement __array__ and everything would be working but everything would lose the meta-data, right?
That's what I did in https://github.com/scikit-learn/scikit-learn/pull/14315/files#diff-5b184308506279093e1d9b367d5a8643, and almost all common tests now pass with it. Still dealing with some FeatureUnion issues.
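For reference, the __array__ route in a nutshell is something like this (a rough sketch of such a simple container, not the actual code in the PR):

import numpy as np

class NamedArray:
    # thin container: data plus feature names, no numpy subclassing
    def __init__(self, data, feature_names):
        self.data = np.asarray(data)
        self.feature_names = list(feature_names)

    def __array__(self, dtype=None):
        # np.asarray(named) and most numpy functions then see the plain data
        # and keep working -- but the metadata is dropped on the way in
        return self.data if dtype is None else self.data.astype(dtype)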
We can duck type this: https://www.numpy.org/devdocs/user/basics.dispatch.html
I did go through that page as well, but if you read the NEP, you see that it's pretty new. I don't think our minimum numpy requirements meet those requirements at all. We basically need to be on 1.17 for it to work, I think. @shoyer am I correct?
If you're not willing to commit to NumPy 1.17+, then it isn't possible to make either a duck array or a subclass that is preserved by all NumPy operations. Operations like np.concatenate will always coerce your objects into base NumPy arrays.
This is part of why xarray implements its own functions/methods, e.g., xarray.where vs numpy.where. The other reason is that in some cases we have intentionally chosen to implement different semantics, e.g., xarray.dot automatically sums over axes with the same name, unlike numpy.dot, which sums along the last axis of the first argument and the second-to-last axis of the second argument.
I think it's possible, but still at the end of the transformer we need to check if the column names are still there or not, since with many operations the column names should rightfully be gone. That's why I'm not sure if it's worth the effort.
Why not? I think it's weird and error-prone to have to re-implement what happens to the columns during the preprocessing. Having the same code handle the actual transformation and the transformation of the names seems way more straightforward.
Yes, we'd still need to do something at the end of fit, but we'd save implementing get_feature_names for many cases, I think.
Also, that'd mean the slice operation is much slower since we're hooking into getitem, and we'd need to expose something like iloc or something to do fast slicing on the array itself, which means the rest of the codebase would need to use it.
I think with the new duck-typing interface we could do that without using __getitem__ or iloc, but if we can't use that, it's going to be hard.
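As a sketch of what the duck-typing interface could buy here (using __array_ufunc__, available since numpy 1.13; the reattachment rule below is just illustrative):

import numpy as np

class NamedArray:
    def __init__(self, data, feature_names):
        self.data = np.asarray(data)
        self.feature_names = list(feature_names)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # unwrap, let numpy do the work, then re-wrap when the columns survive
        raw = [x.data if isinstance(x, NamedArray) else x for x in inputs]
        out = getattr(ufunc, method)(*raw, **kwargs)
        if (isinstance(out, np.ndarray) and out.ndim == 2
                and out.shape[1] == len(self.feature_names)):
            return NamedArray(out, self.feature_names)
        return out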
Alternative: create a scikit-learn 1.0 beta with feature names using duck arrays relying on numpy 1.7 ;)
Yes, we'd still need to do something at the end of fit, but we'd save implementing get_feature_names for many cases, I think.
I'm not sure if I understand correctly. ATM, feature_names_out_ is the same as the given input feature names if it's a one-to-one mapping, and the output would also include those feature names. If the output feature names are not supposed to be the same as the input feature names, I'm not sure how you'd have the NamedArray itself generate the new column names.
What I was trying to say is that we can have a simple NamedArray implementation, keep it experimental, have the pipeline and transformers work with the feature names, and then have a more complicated NamedArray if we want.
I guess at this point I'd appreciate your feedback on https://github.com/scikit-learn/scikit-learn/pull/14315 and to know if you think it's a right direction or not.
Operations like np.concatenate will always coerce your objects into base NumPy arrays.
I don't think that this is a huge deal: overall, I don't think that we can expect every piece of processing code to consistently keep the information of the names. Things will break, more or less frequently, in the code. Striving to have everything working transparently is a very difficult problem, and not the one that we need to solve. What we need is to add code, at the output of a transformer, that fixes these problems.
If the output feature names are not supposed to be the same as the input feature names, I'm not sure how you'd have the NamedArray itself generate the new column names.
SimpleImputer and feature selection drop columns, so they could work easily. Even if there's a one-to-one correspondence, we need to re-attach the feature names to the output if we want to hand them on.
I'm happy to start with a simple solution and iterate from there, though. I agree with @GaelVaroquaux that preserving the feature names is not a problem we need to fix; it would just be nice, and I hadn't realized earlier that it's a possibility.
@adrinjalali I'll check out your proposal tomorrow or early next week :)
BTW, I am offline for a week, but feel free to push updates with a description of the alternatives to this branch
My suggestion (sorry, I didn't read every comment here) is that passing feature names isn't enough: there are plenty of use cases where you want multiple pieces of feature metadata, not just the name. Since this is an enhancement, we might as well go for more extensive feature metadata support, because I don't believe it's more difficult logic-wise. As discussed in other threads, why not just pass around a feature-properties array-like (DataFrame or numpy), have it subsetted with each feature selection step in the Pipeline, and have it available in fit methods, just like sample properties can be via **fit_params?
Alternative: create a scikit-learn 1.0 beta with feature names using duck arrays relying on numpy 1.7 ;)
Late to the party (catching up): no problem on the numpy 1.7 requirement.
Alternative: create a scikit-learn 1.0 beta with feature names using duck arrays relying on numpy 1.7 ;)
Late to the party (catching up): no problem on the numpy 1.7 requirement.
I think the requirement is actually NumPy 1.17 for __array_function__?
Late to the party (catching up): no problem on the numpy 1.7 requirement.
I think the requirement is actually NumPy 1.17 for __array_function__?
Yes, at some point (after writing this), my brain reconnected and became less dumb.
I'm very much not in favor of bumping the requirement to 1.17 for a 1.0 beta.
It's important to have in mind that a core contribution of scikit-learn to the ecosystem is good numerical algorithms. In my view, it is the most important one. Sugar on top that makes these good numerical algorithms easier to use is good. However, it is no use without the good numerics.
Making the numerical algorithms harder to access (for instance with hard-to-meet requirements) for sugar on top is narrowing their impact on scientific and data applications.
I'm not sure how using __array_function__ would hamper people's access to the numerics, @GaelVaroquaux. People can still rely on older sklearn if they really want to use old numpy. Besides, if we're talking about v1.0, we should be able to think about such changes.
What's the status of this SLEP? Is it blocked by #22?
I don't think it is directly blocked by #22 (n_features_in_); there are a few overlapping aspects, eg regarding naming, but I think those do not touch the essential discussion here.
I think the status is that we should update the SLEP with the alternative being discussed above (or, in case that is preferred, write an alternative SLEP for it, but personally I think it would be good to describe the possible options in this SLEP).
I think the alternative which we mostly agree with is the one proposed in https://github.com/scikit-learn/scikit-learn/pull/14315. We either need a new SLEP, or to update this one to reflect the ideas there.
People can still rely on older sklearn if they really want to use old numpy.
Telling users that they can either use old versions of sklearn or upgrade everything makes it hard for users to have a buffer where they progressively do upgrades. In an ideal world, upgrades should be minor events that we do, like cleaning our room (I'm terrible at cleaning my room). The problem is when upgrades become large endeavors. First, we need to schedule significant time slots for them, second more things are likely to go wrong together. For instance: if upgrading scikit-learn triggers a significant upgrade in numpy, which itself triggers a significant upgrade in pandas, which itself comes with behavior changes. It's then harder for the user to audit the changes on his analysis or production tasks. The user is more likely to delay the upgrade, and the problem becomes worse.
Besides, if we're talking about v1.0, we should be able to think about such changes.
It's all a question of leaving the buffer: if v1.0 happens in two years, then yes.
I think that an ideal situation would be if we can gradually ramp up the requirements of scikit-learn. In other terms: if our requirements are versions that are old enough for users to have had time to adjust, typically because they are installed by the rolling upgrade of company policy (which is unfortunately something that varies widely).
@GaelVaroquaux My idea was to have a config option that explicitly enables the new behavior, and this config would fail with older numpy (i.e. the config option would have a soft dependency on numpy 1.17); it would definitely not trigger an update.
If you have an idea how to implement something like named arrays without numpy 1.17, I think we're all ears. The alternative would be to implement feature names via a different mechanism.
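Just to illustrate the soft-dependency idea (hypothetical function name and error message, not an existing sklearn API):

import numpy as np

def enable_feature_name_propagation():
    # hypothetical opt-in switch: fail loudly on old numpy instead of
    # raising the project-wide minimum requirement for everyone
    major, minor = (int(p) for p in np.__version__.split(".")[:2])
    if (major, minor) < (1, 17):
        raise RuntimeError(
            "feature name propagation requires numpy >= 1.17 "
            "(the __array_function__ protocol)"
        )
    # ... flip the global config flag here ...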
@hermidalc I agree that more metadata would be nice, but actually we haven't even solved this problem with **fit_params (see #16). Adding additional metadata later will probably be relatively straightforward if/when we agree on a mechanism.
What do you do with the meta-data for PolynomialFeatures or PCA?
With much delay, I did a quick clean-up of the draft I wrote at the beginning of March, at the end of the sprint. So here is an initial version of the SLEP on propagating feature names through pipelines.
The PR implementing it is https://github.com/scikit-learn/scikit-learn/pull/13307