Should this SLEP stick to feature names, or should it discuss other meta-data?
I think it should give examples of other potential things we might include.
FeatureArray?
Scope 1) I would say we leave the route open to include feature props in the future.
DataArray is the name of the xarray class, so I'd say -1 on that. FeatureArray seems better, but if we at some point in the future want to include sample weights or other sample props, it's going to be wrong.
We could do something non-informative like SklArray. InputArray? InfoArray? ugh...
Scope 2) I agree with @jnothman, you should give examples that we might care about in the future.
Implementation details: Yes, it should include implementation details and alternatives.
I want to record something that we discussed in person yesterday, which is enabling simple conversion to pandas dataframes. Ideally users should be able to do `pd.DataFrame(our_thing)` to get a dataframe (using something like the `__array__` protocol for numpy). For example we'd like seaborn to work natively on our new data structure.

It's not entirely clear whether either of these goals is achievable in a non-hacky way right now, but I think they are important considerations. A less elegant solution would be a `toframe()` method, but that probably wouldn't allow us to integrate with existing tools like seaborn.

@jorisvandenbossche says nothing like the `__array__` protocol exists for pandas.
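For illustration, a minimal sketch of the numpy side of this (the class name and API here are hypothetical, and whether `pd.DataFrame(our_thing)` works directly is exactly the part that is unclear, so the example goes through `np.asarray` explicitly):

```python
import numpy as np
import pandas as pd

class InputArray:
    """Hypothetical wrapper carrying feature names alongside a 2-d array."""

    def __init__(self, data, feature_names):
        self._data = np.asarray(data)
        self.feature_names = list(feature_names)

    def __array__(self, dtype=None):
        # np.asarray(obj) uses this protocol to obtain the raw ndarray
        return self._data if dtype is None else self._data.astype(dtype)

X = InputArray(np.zeros((3, 2)), ["a", "b"])
df = pd.DataFrame(np.asarray(X), columns=X.feature_names)  # explicit round-trip
```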
> For example we'd like seaborn to work natively on our new data structure.
Then seaborn would still need to convert inputs to DataFrames, and not assume it already is one (I didn't check seaborn's internals). So even if there were such a pandas protocol, all downstream libraries that accept dataframes might need to be adapted before natively working on such a new data structure (not impossible of course, and one needs to start with it at some point, but certainly an extra hurdle).
Indeed, the downstream libraries would need work. But this would be an enabling feature that allows downstream libraries to implement this conversion without knowing anything about the input format. Seaborn has some `isinstance(data, pd.DataFrame)` checks which would require adjustment. There's also some `hasattr(x, 'shape')`.
Hi, I found this very interesting. I'm not sure my comment will be relevant, but what about simple integration of numpy structured arrays?
Also, integration with Pipelines would be interesting, like FeatureUnion retaining feature names.
Struct arrays would meet some but not all of our needs, yet are quite tricky to work with: e.g. very limited support for object dtype, and it would be impossible to offer backwards compatibility (we would be dealing with 1d struct arrays rather than 2d arrays).
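To make the dimensionality point concrete, a small example of why structured arrays are awkward as a drop-in X:

```python
import numpy as np

# a "table" as a structured array is a 1-d array of records, not a 2-d array
rec = np.array([(1.0, 2.0), (3.0, 4.0)],
               dtype=[("feature_a", "f8"), ("feature_b", "f8")])

print(rec.shape)         # (2,) -- one dimension, not (2, 2)
print(rec["feature_a"])  # column access by name works...
# ...but 2-d indexing like rec[:, 0] raises IndexError, so code paths
# that assume a 2-d X would all need rewriting.
```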
Dear core devs, I admire your hard work to get feature names in scikit-learn done. This is a huge step forward for usability. As this isn't the first proposal, you have already put a lot of thought into it. I don't know where to put these, but here are a few thoughts from a user:
1. There are already several data structures in Python, to name a few: numpy.ndarray, pandas.DataFrame, xarray.DataArray, pyarrow.Table, datatable.Frame (h2oai), ... I worry about fragmentation and interoperability issues between data structures, as they form the very foundation of any process (ETL, ML, visualization, ...) on top of it.
2. Algorithms in scikit-learn are implemented on numpy.ndarray (homogeneous n-dimensional data). But usually, my data comes as heterogeneous tabular data (strings, categorical, floats). How deep should the integration between those two go? ColumnTransformer is already a great help.
3. R settled on its data.frame structure (or data.table or tibble, which are interoperable). This enables data.frame-in and data.frame-out processes on which many libraries and functions rely, especially in the tidyverse. R formulae are another means to work directly on dataframes instead of matrices/arrays.
4. At least in my work, I usually start with a tabular data structure, something like this:

```python
import pandas as pd

df = pd.read_parquet("myfile.parquet")
```
After some feature preprocessing/engineering, I need to pass the features to ColumnTransformer before I can use an estimator:
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.X import XRegressor  # placeholder for any estimator

y = df['target']
X = df[['feature_a', 'feature_b', ...]]

col_trans = ColumnTransformer(
    [('feature_a_ohe', OneHotEncoder(), ['feature_a']), ...])
model = Pipeline(steps=[('col_trans', col_trans), ('regressor', XRegressor())])
model.fit(X, y)
```
After that, I'd like to inspect my model with respect to the original input features ('feature_a', 'feature_b', i.e. the ones before the ColumnTransformer).
Thanks for your input @lorentzenchr. These concerns are definitely on our mind. Let me briefly reply with my thoughts:
1) Yes this is a big concern, but we have not found a better solution.
2) That's a good question. I think @jorisvandenbossche suggested working on pandas dataframes where possible, though this is a tiny fraction of sklearn (some of the preprocessing module, maybe feature selection?). I'd be in favor of this but the reach will necessarily be quite limited.
3) R does a memory copy for every input and output as far as I know. We opted not to do that; it was one of the strongest arguments against using pandas. Maybe memory copies are worth it if the usability improvements are big enough? Though if pandas implemented https://github.com/pandas-dev/pandas/issues/30218 we could integrate with other libraries easily without having to force a memory copy on our end.
4) yes that's exactly what I want.
Do you want another review or do you want to address the comments first? I can also send a PR.
• row alignment does take time, and we have discussed that it's something we definitely don't want. @GaelVaroquaux may have a clearer argument on this one
Not only does it take time, it is in practice a significant source of bugs for naive users who do not know that it is happening.
@amueller I tried to address the comments. I'm happy to take more comments, or for you to either change this PR directly or open a PR against my fork if you think it needs discussion and things are not clear.
I also updated the NamedArray PR with the latest prototype I had for sparse arrays (https://github.com/scikit-learn/scikit-learn/pull/14315).
Can someone expand on the row alignment issue? I don’t see any case where row labels would be used by an estimator.
Adrin attempted an explanation of my "row-label alignment" confusion. The concern may not be internal to scikit-learn, which I think will disregard row labels. Rather, it is for users who might receive this named array from a `.transform`. If they're unfamiliar with pandas / xarray, they'll be surprised by the alignment. In code:
```python
>>> import numpy as np
>>> import xarray as xr
>>> from sklearn.preprocessing import StandardScaler
>>> trn = StandardScaler()
>>> a = xr.DataArray(np.random.randn(3, 2), coords={"dim_0": ['a', 'b', 'c']})
>>> b = np.random.randn(3, 2)
>>> trn.fit(a)
>>> trn.transform(a) + trn.transform(b)  # ? + ?, does this align?
```
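To make the potential surprise concrete, here is what label alignment already does in pandas today (xarray also aligns, by default keeping only the overlapping labels):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.ones((3, 2)), index=["a", "b", "c"])
b = pd.DataFrame(np.ones((3, 2)), index=["b", "c", "d"])

# pandas aligns on the index: rows "a" and "d" have no partner and are
# silently filled with NaN; no error is raised.
print(a + b)
```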
This comes down to a fundamental question: what is the output type from `.transform`? I worry that we're conflating two things: some data structure for scikit-learn to use internally for propagating feature names, and some data structure for users to provide feature names to scikit-learn, and to use after getting a result back from scikit-learn (via `transform`).
My initial preferences are for

a. Column labels (e.g. `.columns` for a DataFrame) match `estimator.output_features_names_` (may have the attribute name wrong)
b. (possibly) Row labels, if any, match the input row labels

Applying those principles gives the following table:
| Fit Input | Transform Input | Transform Output |
|---|---|---|
| ndarray | ndarray | ndarray |
| ndarray | DataArray | DataArray |
| ndarray | DataFrame | DataFrame |
| DataArray | ndarray | ndarray |
| DataArray | DataArray | DataArray |
| DataArray | DataFrame | DataFrame |
| DataFrame | ndarray | ndarray |
| DataFrame | DataArray | DataArray |
| DataFrame | DataFrame | DataFrame |
Some objections to this approach are that round-trips like `pd.DataFrame(np.asarray(df))` have two memory copies today (a refactor of pandas' internals could avoid that). But that refactor is uncertain, and is some years off at a minimum.

@TomAugspurger what does "internally" mean here? An estimator doesn't know whether it's in a pipeline or not, and so if you'd pass a dataframe into a pipeline, you'd convert between dataframe and numpy array (which is used within the estimator) between every step. Previously I think the agreement was not to pay this cost. I'm not 100% sure any more, though. That would only be a cost if there was this refactor, though.
There's one more concern with your proposal: it's not backward compatible. That could be solved by a global flag `sklearn.set_config(preserve_frames=True)` or something like that.
what does "internally" mean here?
Essentially, whatever `check_array(X)` returns: the array / matrix scikit-learn actually works with inside `.fit` and `.transform`. That may or may not be a copy for a homogeneous DataFrame today.
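Concretely, today's public `check_array` already defines this boundary:

```python
import pandas as pd
from sklearn.utils import check_array

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

arr = check_array(df)  # the "internal" representation
print(type(arr))       # <class 'numpy.ndarray'>; the labels are gone
```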
> it's not backward compatible.
Right. If you're making a new InputArray you wouldn't have the legacy behavior concerns.
@TomAugspurger ok, I think so far anything we discussed used ndarray/scipy sparse internally, and we were only discussing input/output format.
Should we do a big list of the pros and cons of the three solutions we considered? One is pandas-in, pandas-out, one is DataArray, and one is passing meta-data through the pipeline and meta-estimators. This discussion has really been pretty gnarly. It could go in the alternatives section of this SLEP or in a separate document. We don't really have a framework for making a decision between three options (other than discussion).
@adrinjalali @GaelVaroquaux can you maybe recap the arguments against pandas-in pandas-out? I used to be convinced by the memory copy argument, but honestly I think the usage advantage outweighs it, given that it's a potential future concern, that we make so many copies already, and that users can easily avoid the copies by converting to numpy if needed.
There were/are some arguments against pandas/xarray; I'll try to recap:
• operation semantics that differ from numpy's
• a pandas dependency
• row alignment (pandas, xarray), which in turn could mean understanding multi-indices (among other pandas features)
On the plus side, at one point I did have a working implementation of the feature names with xarray. When it came down to it, it was the operation semantics, the pandas dependency, and row alignment which resulted in us working on a different data structure (IIRC).
Can you elaborate on row alignment and what the issue is?
And why are different semantics a problem? It means an incompatible change but when users provide pandas dataframes as inputs, I would suppose they are familiar with pandas semantics. One could argue that transformers currently changing the semantics from pandas to numpy is an issue ;) - certainly having a breaking change with a config flag is some annoyance, though.
> pandas in pandas out would kinda imply supporting pandas, which in turn could mean understanding multi-indices (among other pandas features).
I'm not sure what you mean. We are currently supporting pandas as input types quite explicitly.
The API point seems to be a repeat of the different semantics point, right?
The reason I didn't like xarray was the axis semantics, which are not really natural for us.
I guess writing this down well requires separate lists of pros/cons for xarray and for pandas.
The point that @TomAugspurger made was more about preserving data types than using one or the other. From a user perspective that makes sense, since we'd want users to work with the API that they are familiar with. Having a pandas user work with xarray might not be super familiar (though xarray users are probably familiar with pandas to some degree?).
If we would want to preserve both types, that would probably require some custom code, and more code for each additional type we'd want to support (if any). As @TomAugspurger alluded to, we could try to define an abstraction for that, but that would be pretty far in the future.
I'd be happy to just preserve type for pandas for now.
Meaning there are actually 5 solutions: a) preserve the input type (from a white-list of types which could start with just `pd.DataFrame`), b) always use `pd.DataFrame`, c) always use `xarray.DataArray`, d) use our own DataArray, e) pass information through in meta-estimators.
The semantic differences would be an argument against b) and c) but not against a) imho.
Apologies for muddling the DataArray discussion with the "pandas (or xarray) in / pandas (or xarray) out" discussion. But I think they're related, since the need for a custom DataArray is lessened if pandas in / pandas out becomes the default behavior, and if feature names are limited to arrays that already have names.
> row alignment (pandas, xarray), which in turn could mean understanding multi-indices (among other pandas features)
Again, I really think that row-labels are a non-issue for scikit-learn internally :) And as for users, if they're passing in a DataFrame then they probably are familiar with pandas' alignment. The extent of scikit-learn's interaction with row labels would be something like
```python
def transform(self, X, y=None):
    X, row_labels, input_type = check_array(X)
    # rest of transform, operating on an ndarray
    result = ...
    # some hypothetical function that recreates a DataFrame / DataArray,
    # preserving row labels, attaching new feature names
    result = construct_result(result, row_labels, feature_names, input_type)
    return result
```
If the issue with multi-indices is for the columns, then I'd say scikit-learn shouldn't try to support those. I think / hope that you have a requirement that `feature_names_in_` and `feature_names_out_` be a sequence of strings.
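A sketch of what the hypothetical `construct_result` above could look like (the three-value `check_array` return and the `input_type` strings are assumptions carried over from that snippet, not existing scikit-learn API):

```python
import numpy as np
import pandas as pd
import xarray as xr

def construct_result(data, row_labels, feature_names, input_type):
    """Re-wrap an ndarray in the container type the caller passed in."""
    # assumes the hypothetical check_array recovered labels for labelled inputs
    if input_type == "dataframe":
        return pd.DataFrame(data, index=row_labels, columns=list(feature_names))
    if input_type == "dataarray":
        return xr.DataArray(
            data,
            dims=("index", "columns"),
            coords={"index": list(row_labels), "columns": list(feature_names)},
        )
    # plain ndarray in, plain ndarray out; row labels are ignored
    return np.asarray(data)
```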
If both sample_weight and X are indexed, should sklearn try to align them in operations?
@adrinjalali that's a good question. Right now, if `X`, `y` and `sample_weight` are indexed, we just drop the index. Given that we have explicit support for pandas in `ColumnTransformer`, that could be considered a bug.
For columns we are enforcing that they must be exactly the same between fit and transform. I could see us taking a similar stance here, i.e. asserting that the index is the same for `X`, `y` and `sample_weight`, and raising an error if not.
Given that we decided against aligning columns, I think it makes sense to also not align rows. This issue is already present in the code base right now, though. I don't really follow why you think it relates to changing the output type.
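A hypothetical helper (the name and exact behavior are assumptions, not existing scikit-learn code) illustrating the raise-instead-of-align stance:

```python
import pandas as pd

def check_consistent_index(X, y=None, sample_weight=None):
    """Raise if indexed inputs disagree, instead of silently aligning."""
    if not isinstance(X, pd.DataFrame):
        return
    for name, other in (("y", y), ("sample_weight", sample_weight)):
        if isinstance(other, pd.Series) and not X.index.equals(other.index):
            raise ValueError(
                f"The index of {name} does not match the index of X; "
                "they will not be aligned automatically."
            )
```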
I kinda get the feeling that we could go back to using xarray as a return type for transformers if the user provides feature names through either an xarray or a pandas data structure.
We could have a global conf to enable/disable the behavior, and have it disabled by default for a few releases.
I agree with @adrinjalali, except that I would actually use pandas, not xarray. I think having fewer dependencies and types is better. The main reason I had for xarray over pandas was the zero-copy issue, and I've come to give that less weight (given that the future of pandas is unclear and that we copy a lot anyway).
Sorry for going back and forth on this. Main question: should we have a separate SLEP for using an existing "array" type (pandas is not really an array, I know)? That will make voting and discussion harder and harder. I wonder if maybe discussing this on a call would be easier so we can sync up? I would love to hear @GaelVaroquaux's, @jnothman's and @jorisvandenbossche's take.
My gut feeling is that if we implement the machinery to have this optional, we could have a global config option to set it to either xarray or pandas, and it shouldn't be too much work.
Oh, and on the SLEP issue, I'd say we should work on a SLEP with an existing data type, and if we get the feeling that there's some consensus there, I'd withdraw this SLEP (I wish it was merged :D ).
I'm happy to merge it as it is. And doing pandas or xarray is substantially more work I think. Though I guess if we could compartmentalize it into some helper functions for wrapping and unwrapping it might not be so bad.
> And doing pandas or xarray is substantially more work I think. Though I guess if we could compartmentalize it into some helper functions for wrapping and unwrapping it might not be so bad.
I'd be happy to have a prototype implementation once `n_features_in_` is in. I'm almost tempted to base my PR on @NicolasHug's implementation and start before it's merged.
Maybe coordinate with @thomasjpfan who I think is working on a new SLEP.
@amueller would you like to merge? :D
Sure. I think the proposal is reasonably complete, even though I don't expect a vote on it in the current form very soon.
Can someone remind me where we had the discussion that concluded with making the output type globally constant instead of it depending on the input type?
I don't remember where we had the discussion, but I thought we agreed that not knowing what the output is by looking at a piece of code is bad practice?
Also, that policy means the same code which before would take a `pd.DataFrame` and return an `ndarray` would now return a `pd.DataFrame`, which is confusing and not backward compatible (which we could argue is not essential for v1.0 ;) ). So I think I'd really prefer not to make the output type depend on the input type.
This SLEP discusses a new data structure which would carry some meta-data along with the data itself. The feature names are the first set of meta-data we would implement, but it could grow from that point.
Things we need to decide to continue the write-up of this SLEP:
• whether a `NamedArray` with only feature names should be another SLEP?
• the name: `NamedArray`, but we seem to lean towards including other meta-data than feature names, therefore we probably should use another name. `DataArray` is a placeholder. Happy to change.