scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

SLEP012 - DataArray #25

Closed adrinjalali closed 4 years ago

adrinjalali commented 4 years ago

This SLEP discusses a new data structure which would carry some meta-data along with the data itself. Feature names are the first piece of meta-data we would implement, but it could grow from that point.

Things we need to decide to continue the write-up of this SLEP:

jnothman commented 4 years ago

should this slep stick to feature names, or should it discuss other meta-data

I think it should give examples of other potential things we might include.

jnothman commented 4 years ago

FeatureArray?

amueller commented 4 years ago

Scope 1) I would say we leave the route open to include feature props in the future.

DataArray is the name of the xarray class, so I'd say -1 on that. FeatureArray seems better, but if we at some point in the future want to include sample weights or other sample props, it's going to be wrong.

We could do something non-informative like SklArray. InputArray? InfoArray? ugh...

Scope 2) I agree with @jnothman, you should give examples that we might care about in the future.

Implementation details: Yes, it should include implementation details and alternatives.

amueller commented 4 years ago

I want to record something that we discussed in person yesterday, which is enabling simple conversion to pandas dataframes. Ideally users should be able to do pd.DataFrame(our_thing) to get a dataframe (using something like the __array__ protocol for numpy). For example, we'd like seaborn to work natively on our new data structure. It's not entirely clear whether either of these goals is achievable in a non-hacky way right now, but I think they are important considerations.

A less elegant solution would be a toframe() method, but that probably wouldn't allow us to integrate with existing tools like seaborn.
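
For context, the numpy __array__ protocol mentioned above is the hook that lets np.asarray consume an arbitrary container. A minimal sketch, using InfoArray as a hypothetical name for the wrapper (not an existing class):

```python
import numpy as np

class InfoArray:
    """Hypothetical container carrying feature names alongside the data."""

    def __init__(self, data, feature_names):
        self._data = np.asarray(data)
        self.feature_names = list(feature_names)

    def __array__(self, dtype=None, copy=None):
        # numpy calls this hook whenever it needs a plain ndarray view
        return np.asarray(self._data, dtype=dtype)

arr = InfoArray([[1.0, 2.0]], feature_names=["a", "b"])
print(np.asarray(arr).shape)  # (1, 2)
```

Note that this only covers the numpy side; pandas has no equivalent construction hook, so pd.DataFrame(arr) would not pick this up automatically.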

amueller commented 4 years ago

@jorisvandenbossche says nothing like the __array__ protocol exists for pandas.

jorisvandenbossche commented 4 years ago

For example we'd like seaborn to work natively on our new data structure.

Then seaborn would still need to convert inputs to DataFrames, rather than assume the input already is one (I didn't check seaborn's internals). So even if there were such a pandas protocol, all downstream libraries that accept dataframes might need to be adapted before natively working on such a new data structure (not impossible of course, and one needs to start with it at some point, but certainly an extra hurdle).

amueller commented 4 years ago

Indeed, the downstream libraries would need work. But this would be an enabling feature that allows downstream libraries to implement this conversion without knowing anything about the input format. Seaborn has some isinstance(data, pd.DataFrame) which would require adjustment. There's also some hasattr(x, 'shape').

alexgarel commented 4 years ago

Hi, I found this very interesting. I'm not sure my comment will be relevant, but what about simple integration of numpy structured arrays?

Also an integration with Pipelines would be interesting, like FeatureUnion (retaining feature names).

jnothman commented 4 years ago

Struct arrays would meet some but not all of our needs, yet are quite tricky to work with, e.g. very limited support for object dtype, and they make it impossible to offer backwards compatibility (we would be dealing with 1d struct arrays rather than 2d arrays).
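
The 1d-vs-2d point can be seen directly: a structured array stores one record per sample, so it is not a drop-in replacement for a 2d feature matrix.

```python
import numpy as np

# Two samples with two named fields: the array is 1-d, one record per sample
X = np.array([(1.0, 2.0), (3.0, 4.0)],
             dtype=[("feature_a", "f8"), ("feature_b", "f8")])
print(X.shape)        # (2,) -- not (2, 2)
print(X["feature_a"]) # columns are selected by field name
```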

lorentzenchr commented 4 years ago

Dear core devs, I admire your hard work to get feature names in scikit-learn done. This is a huge step forward for usability. As this isn't the first proposal, you have already put a lot of thought into it. I don't know where to put these, but here are a few thoughts from a user:

1. Yet another data structure

There are already several data structures in Python, to name a few: numpy.ndarray, pandas.DataFrame, xarray.DataArray, pyarrow.Table, datatable.Frame (h2oai), ... I worry about fragmentation and interoperability issues between data structures, as they form the very foundation of any process (ETL, ML, visualization, ...) built on top of them.

2. User interface and underlying data structure

Algorithms in scikit-learn are implemented on numpy.ndarray (homogeneous n-dimensional data). But usually, my data comes as heterogeneous tabular data (strings, categorical, floats). How deep should the integration between those two go? ColumnTransformer is already a great help.

3. Compare to/Learn from R

R settled on its data.frame structure (or data.table or tibble, which are interoperable). This enables data.frame-in and data.frame-out processes on which many libraries and functions rely, especially in the tidyverse. R formulae are another means to work directly on dataframes instead of matrices/arrays.

4. Standard use case

At least in my work, I usually start with a tabular data structure, something like this.

import pandas as pd
df = pd.read_parquet("myfile.parquet")

After some feature preprocessing/engineering, I need to pass them to ColumnTransformer, before I can use an estimator:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.X import XRegressor
y = df['target']
X = df.loc[:, ['feature_a', 'feature_b', ...]]
col_trans = ColumnTransformer(
    [('feature_a_ohe', OneHotEncoder(), ['feature_a']), ...])
model = Pipeline(steps=[('col_trans', col_trans), ('regressor', XRegressor())])
model.fit(X, y)

After that, I'd like to inspect my model with respect to the original input features ('feature_a', 'feature_b', i.e. the ones before the ColumnTransformer).

amueller commented 4 years ago

Thanks for your input @lorentzenchr. These concerns are definitely on our mind. Let me briefly reply with my thoughts

1) Yes this is a big concern, but we have not found a better solution.

2) That's a good question. I think @jorisvandenbossche suggested working on pandas dataframes where possible, though this is a tiny fraction of sklearn (some of the preprocessing module, maybe feature selection?). I'd be in favor of this but the reach will necessarily be quite limited.

3) R does a memory copy for every input and output as far as I know. We opted not to do that; that was one of the strongest arguments against using pandas. Maybe memory copies are worth it if the usability improvements are large enough? Though if pandas implemented https://github.com/pandas-dev/pandas/issues/30218 we could integrate with other libraries easily without having to force a memory copy on our end.

4) yes that's exactly what I want.
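
The copy concern in (3) can be made concrete: converting a heterogeneous DataFrame to a single ndarray must materialize a new object-dtype array, whereas a homogeneous frame can hand over a single float block.

```python
import pandas as pd

# Homogeneous frame: a single float block, to_numpy() stays float64
homog = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
print(homog.to_numpy().dtype)    # float64

# Heterogeneous frame: conversion has to build a fresh object array
heterog = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})
print(heterog.to_numpy().dtype)  # object
```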

amueller commented 4 years ago

Do you want another review or do you want to address the comments first? I can also send a PR.

GaelVaroquaux commented 4 years ago

• row alignment does take time, and we have discussed that it's something we definitely don't want. @GaelVaroquaux may have a clearer argument on this one

Not only does it take time, it is in practice a significant source of bugs for naive users who do not know that it is happening.

adrinjalali commented 4 years ago

@amueller I tried to address the comments. I'm happy to take more comments, or for you to either change directly this PR or a PR to my fork if you think it needs discussion and things are not clear.

I also updated the NamedArray PR with the latest prototype I had for sparse arrays (https://github.com/scikit-learn/scikit-learn/pull/14315)

TomAugspurger commented 4 years ago

Can someone expand on the row alignment issue? I don’t see any case where row labels would be used by an estimator.

TomAugspurger commented 4 years ago

Adrin attempted an explanation to my "row-label alignment" confusion. The concern may not be internal to scikit-learn, which I think will disregard row labels. Rather, it was for users who might receive this named array from a .transform. If they're unfamiliar with pandas / xarray, they'll be surprised by the alignment. In code,

>>> import numpy as np
>>> import xarray as xr
>>> from sklearn.preprocessing import StandardScaler
>>> trn = StandardScaler()
>>> a = xr.DataArray(np.random.randn(3, 2), coords={"dim_0": ['a', 'b', 'c']})
>>> b = np.random.randn(3, 2)
>>> trn.fit(a)
>>> trn.transform(a) + trn.transform(b)  # ? + ?, does this align?
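
The surprise being described here is ordinary pandas index alignment; the same effect is visible with plain DataFrames:

```python
import pandas as pd

a = pd.DataFrame({"x": [1.0, 2.0]}, index=[0, 1])
b = pd.DataFrame({"x": [10.0, 20.0]}, index=[1, 2])

# Arithmetic aligns on the index union {0, 1, 2}: only row 1 overlaps,
# so rows 0 and 2 come out as NaN
s = a + b
print(s)
```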

This comes down to a fundamental question: what is the output type from .transform? I worry that we're conflating two things: Some data structure for scikit-learn to use internally for propagating feature names, and some data structure for users to provide feature names to scikit-learn, and use after getting a result back from scikit-learn (via transform).

My initial preferences are for

  1. Scikit-Learn to continue only using NumPy ndarrays / scipy sparse matrices internally. Arguments are converted to ndarrays / matrices explicitly.
  2. Estimator.transform to return an object with the same type as the input.
     a. Feature names (.columns for a DataFrame) match estimator.output_features_names_ (may have the attribute name wrong)
     b. (possibly) Row labels, if any, match the input row labels

Applying those principles gives the following table

Fit Input   Transform Input   Transform Output
ndarray     ndarray           ndarray
ndarray     DataArray         DataArray
ndarray     DataFrame         DataFrame
DataArray   ndarray           ndarray
DataArray   DataArray         DataArray
DataArray   DataFrame         DataFrame
DataFrame   ndarray           ndarray
DataFrame   DataArray         DataArray
DataFrame   DataFrame         DataFrame

Some objections to this approach are that

  1. Feature names are only available to people using xarray or pandas. IMO that's acceptable. Adding a new data container to the ecosystem should have a high bar. Scikit-Learn could define an interface for other containers to meet.
  2. A potential pandas refactor might make pd.DataFrame(np.asarray(df)) incur two memory copies. But that refactor is uncertain, and is some years off at a minimum.

amueller commented 4 years ago

@TomAugspurger what does "internally" mean here? An estimator doesn't know whether it's in a pipeline or not, so if you passed a dataframe into a pipeline, you'd convert between dataframe and numpy array (which is used within the estimator) at every step. Previously I think the agreement was not to pay this cost. I'm not 100% sure any more, though. That would only be a cost if there was this refactor, though.

amueller commented 4 years ago

There's one more concern with your proposal: it's not backward compatible. That could be solved by a global flag sklearn.set_config(preserve_frames=True) or something like that.

TomAugspurger commented 4 years ago

what does "internally" mean here?

Essentially, whatever check_array(X) returns. The array / matrix scikit-learn actually works with inside .fit and .transform.

That may or may not be a copy for a homogeneous DataFrame today.
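
Concretely, "internally" here means the plain array that comes back from validation; check_array drops the pandas metadata today:

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
X = check_array(df)

# The column labels are gone: scikit-learn works on a bare ndarray
print(type(X), X.shape)
```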

it's not backward compatible.

Right. If you're making a new InputArray you wouldn't have the legacy behavior concerns.

amueller commented 4 years ago

@TomAugspurger ok, I think so far anything we discussed used ndarray/scipy sparse internally, and we were only discussing input/output format.

amueller commented 4 years ago

Should we do a big list of the pros and cons of the three solutions we considered? One is pandas-in, pandas-out, one is DataArray, and one is passing meta-data through the pipeline and meta-estimators. This discussion has really been pretty gnarly. This could go in the alternatives section of this SLEP or in a separate document. We don't really have a framework for making a decision between three options (other than discussion).

amueller commented 4 years ago

@adrinjalali @GaelVaroquaux can you maybe recap the arguments against pandas-in pandas-out? I used to be convinced by the memory copy argument, but honestly I think the usability advantage outweighs it, given that it's only a potential future concern and that we make so many copies already (and that they can easily be avoided by the user by converting to numpy if needed).

adrinjalali commented 4 years ago

there were/are some arguments against pandas/xarray, I'll try to recap:

On the plus side, at one point I did have a working implementation of feature names with xarray. When it came down to it, it was the operation semantics, the pandas dependency, and row alignment that led us to work on a different data structure (IIRC).

amueller commented 4 years ago

Can you elaborate on row alignment and what the issue is?

And why are different semantics a problem? It means an incompatible change but when users provide pandas dataframes as inputs, I would suppose they are familiar with pandas semantics. One could argue that transformers currently changing the semantics from pandas to numpy is an issue ;) - certainly having a breaking change with a config flag is some annoyance, though.

pandas in pandas out would kinda imply supporting pandas which in turn could mean understanding multi-indices (among other pandas features).

I'm not sure what you mean. We are currently supporting pandas as input types quite explicitly.

The API point seems to be a repeat of the different semantics point, right?

The reason I didn't like xarray was the axis semantics which is not really natural for us.

I guess writing this down well requires a different list of pro/cons for xarray and pandas each.

The point that @TomAugspurger made was more about preserving data types than using one or the other. From a user perspective that makes sense, since we'd want users to work with the API that they are familiar with. Having a pandas user work with xarray might not be super familiar (though xarray users are probably familiar with pandas to some degree?).

If we would want to preserve both types, that would probably require some custom code, and more code for each additional type we'd want to support (if any). As @TomAugspurger alluded to, we could try to define an abstraction for that, but that would be pretty far future.

I'd be happy to just preserve type for pandas for now.

Meaning there are actually 5 solutions:
a) preserve type (from a white-list of types which could start with just pd.DataFrame)
b) always use pd.DataFrame
c) always use xarray.DataArray
d) use DataArray
e) pass information through in meta-estimators.

The semantic differences would be an argument against b) and c) but not against a) imho.

TomAugspurger commented 4 years ago

Apologies for muddling the DataArray discussion with the "pandas (or xarray) in / pandas (or xarray) out" discussion. But I think they're related, since the need for a custom DataArray is lessened if pandas in / pandas out becomes the default behavior, and if feature names are limited to arrays that already have names.

row alignment (pandas, xarray) which in turn could mean understanding multi-indices (among other pandas features),

Again, I really think that row-labels are a non-issue for scikit-learn internally :) And as for users, if they're passing in a DataFrame then they probably are familiar with pandas' alignment. The extent of scikit-learn's interaction with row labels would be something like

def transform(self, X, y=None):
    X, row_labels, input_type = check_array(X)
    # rest of transform, operating on an ndarray
    result = ...
    # some hypothetical function that recreates a DataFrame / DataArray,
    # preserving row labels, attaching new feature names.
    result = construct_result(result, row_labels, feature_names, input_type)
    return result

If the issue with multi-indices is for the columns, then I'd say scikit-learn shouldn't try to support those. I think / hope that you have a requirement that feature_names_in_ and feature_names_out_ be a sequence of strings.

adrinjalali commented 4 years ago

If both sample_weight and X are indexed, should sklearn try to align them in operations?

amueller commented 4 years ago

@adrinjalali that's a good question. Right now, if X, y and sample_weight are indexed, we just drop the index. Given that we have explicit support for pandas in ColumnTransformer that could be considered a bug.

For columns we are enforcing that they must be exactly the same between fit and transform. I could see us taking a similar stance here, i.e. asserting that the index is the same for X, y, sample_weights and raise an error if not.

Given that we decided against aligning columns, I think it makes sense to also not align rows. This issue is already present in the code base right now, though. I don't really follow why you think it relates to changing the output type.
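
The "assert the index is the same, raise otherwise" stance could look like the following; check_matching_index is a hypothetical helper for illustration, not an existing scikit-learn function:

```python
import pandas as pd

def check_matching_index(X, **named):
    """Hypothetical validator: raise instead of silently aligning (or
    silently dropping) mismatched indices on X, y and sample_weight."""
    if not isinstance(X, (pd.DataFrame, pd.Series)):
        return
    for name, arr in named.items():
        if isinstance(arr, (pd.DataFrame, pd.Series)) and not X.index.equals(arr.index):
            raise ValueError(f"index of {name!r} does not match the index of X")

X = pd.DataFrame({"a": [1, 2]}, index=[10, 11])
w_ok = pd.Series([0.5, 0.5], index=[10, 11])
w_bad = pd.Series([0.5, 0.5], index=[11, 12])
check_matching_index(X, sample_weight=w_ok)    # passes silently
# check_matching_index(X, sample_weight=w_bad) # would raise ValueError
```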

adrinjalali commented 4 years ago

I kinda get the feeling that we can go back to using xarray as a return type for transformers if the user provides feature names through either an xarray or a pandas data structure.

We could have a global conf to enable/disable the behavior, and have it disabled by default for a few releases.

amueller commented 4 years ago

I agree with @adrinjalali, except that I would actually use pandas, not xarray. I think having fewer dependencies and types is better. The main reason I had for preferring xarray over pandas was the zero-copy issue, and I've come to give that less weight (given that the future of pandas is unclear and that we copy a lot anyway).

Sorry for going back and forth on this. Main question: should we have a separate SLEP for using an existing "array" type (pandas is not really an array, I know)? That will make voting and discussion harder and harder. I wonder if maybe discussing this on a call would be easier so we can sync up? I would love to hear @GaelVaroquaux's, @jnothman's and @jorisvandenbossche's take.

adrinjalali commented 4 years ago

my gut feeling is that if we implement the machinery to make this optional, we could have a global config option to set the output to either xarray or pandas, and it shouldn't be too much work.

adrinjalali commented 4 years ago

Oh, and on the SLEP issue, I'd say we should work on a SLEP with an existing data type, and if we get the feeling that there's some consensus there, I'd withdraw this SLEP (I wish it was merged :D)

amueller commented 4 years ago

I'm happy to merge it as it is. And doing pandas or xarray is substantially more work I think. Though I guess if we could compartmentalize it into some helper functions for wrapping and unwrapping it might not be so bad.

adrinjalali commented 4 years ago

And doing pandas or xarray is substantially more work I think. Though I guess if we could compartmentalize it into some helper functions for wrapping and unwrapping it might not be so bad.

I'd be happy to have a prototype implementation once the n_features_in_ is in. I'm almost tempted to base my PR on @NicolasHug 's implementation and start before it's merged.

amueller commented 4 years ago

maybe coordinate with @thomasjpfan who I think is working on a new SLEP.

adrinjalali commented 4 years ago

@amueller would you like to merge? :D

amueller commented 4 years ago

Sure. I think the proposal is reasonably complete, even though I don't expect a vote on it in the current form very soon.

amueller commented 4 years ago

Can someone remind me where we had the discussion that concluded with making the output type globally constant instead of it depending on the input type?

adrinjalali commented 4 years ago

I don't remember where we had the discussion, but I thought we agreed that not knowing what the output is by looking at a piece of code is bad practice?

Also, that policy means the same code which before would take a pd.DataFrame and return an ndarray now would return a pd.DataFrame, which is confusing and not backward compatible (which we could argue is not essential for v1.0 ;) )

So I think I'd really prefer not to make the output type depend on the input type.