scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

SLEP 014 Pandas in Pandas out #37

Closed thomasjpfan closed 1 year ago

thomasjpfan commented 4 years ago

This SLEP proposes pandas in pandas out as an alternative to SLEP 012 InputArray.

rth commented 4 years ago

I am a bit concerned in the sparse case, because that will mean a memory copy from my understanding, right?

It's much worse than a memory copy for sparse cf https://github.com/scikit-learn/enhancement_proposals/pull/37#issuecomment-596577388 (at least in high dimensionality)

rth commented 4 years ago

I wonder if mapping Transform Input: scipy.sparse to Transform Output: scipy.sparse might be more effective.

It could be, but then one couldn't get feature names for a simple pipeline of OneHotEncoder + LinearModel since OHE returns sparse data by default. Which is also likely not ideal.

amueller commented 4 years ago

@rth Yes, sorry, it's a conversion + copy + loads of overhead. Suggestions? ;)

NicolasHug commented 4 years ago

@NicolasHug what would you measure?

please see https://github.com/scikit-learn/enhancement_proposals/pull/37#issuecomment-598714355

@TomAugspurger laid out a zero-copy strategy

Everything is very hypothetical still, and I'd rather base my decision on facts and numbers than on hypothetical behaviors.

I'm not against calling for a vote, but I don't have enough info to make up my mind yet

TomAugspurger commented 4 years ago

Everything is very hypothetical still, and I'd rather base my decision on facts and numbers than on hypothetical behaviors.

Just to clarify, the behavior today is exactly what scikit-learn wants. For homogeneous, numeric data, asarray(DataFrame(ndarray)) is zero copy:

In [10]: a = np.random.randn(100, 10)

In [11]: df = pd.DataFrame(a)

In [12]: b = np.asarray(df)

In [13]: b.base is a
Out[13]: True

The "hypothetical" bit is going to have to remain hypothetical, since it's uncertain whether that refactor will actually happen.

amueller commented 4 years ago

@NicolasHug

Everything is very hypothetical still, and I'd rather base my decision on facts and numbers than on hypothetical behaviors.

You suggest evaluating a worst-case bound based on a forced single memory copy. That gives you a bound for one very specific worst case. If that bound is "not that bad" and the scenario is likely, it would indeed be helpful. However, the situation for sparse is already much worse than your worst-case scenario would imply. So an "even worse" case would be to have similar overhead for dense data, which your benchmark would not cover. In other words, your benchmark doesn't actually provide a bound, as it relies on a very specific hypothetical future ;)

ps: I'm not saying that I expect pandas to introduce that kind of overhead, my point is more that we're not gonna get very far without making assumptions.

NicolasHug commented 4 years ago

Sorry @amueller I don't completely follow: we can run the benchmarks for both dense and sparse data? I don't see yet how my comment describes a "very specific worst case".

What I'm trying to say is, I'd be happy to vote yes if the benchmarks show that the worst case scenario isn't that bad after all. (as long as we can simulate the actual worst case).

For now, IMHO, there is no tangible element on which to base a decision (hypothetical future behavior of pandas + hypothetical solution for that hypothetical behavior).

I'm basically following your own words:

I don't like making an argument using something for which I haven't seen measurements

But like I said, if others want to vote, let's vote.

amueller commented 4 years ago

@NicolasHug

We can certainly run a benchmark for what's there currently, but we also know the outcome: for dense data there is no overhead, and for sparse data there is unacceptably large overhead.

What you also proposed is a hypothetical future scenario in which there is a single copy in the case of dense data (and an additional copy in the case of sparse data?). What I tried to say is that at least theoretically the future could be even worse than that, and so this benchmark also makes assumptions about the future implementation.

amueller commented 4 years ago

I don't know what the current proposal is for sparse data so I would like to clarify that first, given the issues that @rth pointed out above.

jorisvandenbossche commented 4 years ago

[@GaelVaroquaux] Memory copies really bother me. Memory usage is a weak point of our ecosystem. ... Would having 2 data structures in pandas, with 2 different memory layouts, be an option: one optimized for relational algebra, which would induce memory copies, and one optimized for "primal-space analytics", i.e. row-wise addressing, which would not induce memory copies?

Memory copies are also a big problem in pandas, and actually one of the reasons to go towards a column store in pandas (to avoid things like "consolidation" of same-dtype columns etc).

So while I personally am pushing for a column-store in pandas, I still think there is value in having a kind of homogeneously typed DataFrame backed by a single 2D array (we talked about this briefly in the call last week). So it would be a worthwhile discussion to think about where in the PyData ecosystem that fits. Should people just use xarray for that use case? Or can we have such a single-dtype DataFrame class in pandas that can share some infrastructure with the column-store DataFrame? Those are interesting discussions, but let's have them somewhere else ;)

But, as @amueller / @jnothman already argued, this is not really a concern right now, and as @TomAugspurger showed, it seems even possible to hack together a zero-copy roundtrip even if DataFrame would become a column-store.

(@jorisvandenbossche can confirm that you can create pyarrow arrays as a view on an ndarray with no missing values for primitive types like int64?)

Yes, that is possible (even for floats with NaNs, if you keep them as NaNs and don't convert to missing values).

I am a bit concerned in the sparse case, because that will mean a memory copy from my understanding, right?

It's much worse than a memory copy for sparse cf #37 (comment) (at least in high dimensionality)

@rth I just commented in https://github.com/pandas-dev/pandas/issues/32196 that this can be sped up rather easily. But of course, it will always stay a lot slower since it is creating many 1D sparse arrays from the 2D array. In principle, I think this can even be zero-copy as well (although it will still be slow).
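The round trip being discussed can be sketched with the pandas sparse accessor (shapes and density below are arbitrary): converting a scipy matrix to a DataFrame creates one 1-D SparseArray per column, which is exactly where the slowness comes from, even though the values survive intact.

```python
import pandas as pd
import scipy.sparse as sp

# Round-trip a CSR matrix through pandas; shapes/density are arbitrary.
X = sp.random(1000, 50, density=0.01, format="csr", random_state=0)

df = pd.DataFrame.sparse.from_spmatrix(X)  # one 1-D SparseArray per column
X_back = df.sparse.to_coo().tocsr()        # back to scipy

assert (X != X_back).nnz == 0              # values survive the round trip
```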

rth commented 4 years ago

I just commented in pandas-dev/pandas#32196 that this can be sped up rather easily.

Cool, thanks @jorisvandenbossche ! That would indeed resolve most of the concerns with sparse DataFrame.

jorisvandenbossche commented 4 years ago

I am not sure that resolves most of the concerns regarding sparse. First, it's only possible in a future release. Moreover, even if we can give that conversion a factor-10 speed-up in pandas, it is still much slower than a CSR->CSC conversion.

For sparse output of transformers, it seems still best to output scipy matrices, but not sure how to deal with features names then.

rth commented 4 years ago

Well, as far as I know there is currently no good way to have sparse arrays with labels in general. The alternative with xarray is to use pydata/sparse. That requires numba/llvmlite (i.e. we need to add optional dependencies), and there are some performance concerns for basic operations https://github.com/pydata/sparse/issues/331 if users actually want to do anything with the returned objects.

So given that this is going to be an opt-in feature and only available in ~7 months in 0.24, as far as I understand, saying that you need the latest pandas for best performance with sparse by then could be reasonable IMO. I think we should compare to the cost of a typical workflow rather than to a CSR->CSC conversion. Currently the performance blocker only appears with a very large number of features (e.g. 100k). As mentioned in the previous benchmark, for n_samples=1M, n_features=100 it is 5x slower than a CSR->CSC conversion but in the end only 200ms, so likely not that noticeable relative to the overall pipeline cost (particularly if we can make the conversion 10x faster).

rth commented 4 years ago

What I'm trying to say is, I'd be happy to vote yes if the benchmarks show that the worst case scenario isn't that bad after all.

This discussion shows that we currently don't have benchmarks for a few typical pipelines, and cannot easily evaluate the performance impact of proposed changes (in general, not specific to this PR). It would be good to merge ASV benchmarks created by @jeremiedbb into scikit-learn (https://github.com/scikit-learn/scikit-learn/issues/16723), use them as a reference and extend them as needed. I really like how this is done and documented in pandas. Not a blocker for this SLEP but would be nice to have.

amueller commented 4 years ago

I think we should ask ourselves whether we're ok with having a solution that doesn't address sparse. It would certainly mean not covering some important use-cases. The other option is to go back to having our own datatype. Or our own datatype for sparse. Or using xarray+pydata/sparse. hm...

adrinjalali commented 4 years ago

Is it not covering some use cases, or is it having some non-trivial overhead when the data is sparse?

I think right now there's some overhead which we're not comfortable with, but we could see if pandas could at least improve the conversions, maybe the same way it's done in pydata/sparse (https://github.com/pydata/sparse/pull/11). WDYT @TomAugspurger, @jorisvandenbossche? Or do we already have this in pandas?

In the mean time, we could develop the solution using pandas, expecting to have these overheads for sparse, and by the time we're ready to release this feature, it could also be improved in pandas.

TomAugspurger commented 4 years ago

we could see if pandas could at least improve on the conversions maybe the same way that it's done in pydata/sparse

@jorisvandenbossche and @rth have been doing some work on pandas to make converting between a scipy.sparse matrix and pandas' format faster (hopefully they can report results). But even with these improvements we're still going to bump against the fact that pandas' SparseArray is 1-D, so when you have a sparse result with ~10,000s of columns, you're creating ~10,000s of Python objects, which is slow, even if creating each individual object is cheap. pandas' extension arrays being 1-D is a fundamental limitation that isn't likely to change soon.
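The one-object-per-column point can be seen directly (the column count below is arbitrary): every column of a sparse DataFrame is backed by its own 1-D SparseArray instance.

```python
import pandas as pd
import scipy.sparse as sp

# pandas extension arrays are 1-D, so a wide sparse DataFrame holds one
# SparseArray object per column.
X = sp.random(100, 1000, density=0.01, format="csr", random_state=0)
df = pd.DataFrame.sparse.from_spmatrix(X)

arrays = [df[c].array for c in df.columns]
print(len(arrays))                    # 1000: one Python object per column
print(type(arrays[0]).__name__)       # SparseArray
```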

So there does seem to be a choice between "pandas in / pandas out" and "fast sparse transformed results" which may warrant its own configuration.

thomasjpfan commented 4 years ago

I am currently prototyping this feature, so we can measure what the overhead looks like for the various configuration combinations of xarray/pandas/sparse. I am planning to get this out this week.

amueller commented 4 years ago

@TomAugspurger I'm a bit hesitant to have two separate configuration options to have some form of feature names. Or are you saying if we enable pandas output we would also go to pandas for sparse, and then have another option to move to a different format? And what would that look like?

TomAugspurger commented 4 years ago

I just worry that because pandas' support for very wide sparse datasets is so poor, people will be surprised / annoyed at set_config(pandas_in_out=True) slowing down things returning sparse results. I'm sure there's a bit more we can do on the pandas side to optimize constructing a DataFrame from a sparse object, but it'll always have substantial overhead over scipy as long as we're creating one 1-D array per column.

rth commented 4 years ago

it'll always have substantial overhead over scipy as long as we're creating one 1-D array per column.

Let's wait for benchmark results from https://github.com/scikit-learn/scikit-learn/pull/16772 to decide. The question is not so much the performance difference between pandas sparse and scipy sparse constructors, but how much this would affect real-world pipelines. <10% overhead to get feature names might be OK; 10x in some use cases is definitely not.

people will be surprised / annoyed at set_config(pandas_in_out=True) slowing down things returning spare results.

I think returning some custom object, or a pydata/sparse array wrapped in xarray, would result in just as much surprise / annoyance for users not familiar with either.

amueller commented 4 years ago

One thing that I didn't realize before [maybe it was discussed above and I just forgot?] is that deciding output type based on input type is not possible for CountVectorizer, which will be one of the main use-cases. @thomasjpfan implemented the "we always output <format>" option, which produces array/dataframe/xarray depending on the configuration setting, independent of the input.

jnothman commented 4 years ago

What are the next steps here?

thomasjpfan commented 4 years ago

Given the developments in https://github.com/scikit-learn/scikit-learn/pull/16772, this would amount to a new SLEP which describes array_out, which is different from "pandas in pandas out". It should describe the limits that were discovered in https://github.com/scikit-learn/scikit-learn/pull/16772, i.e. sparse matrices are better with xarray, etc.

ageron commented 2 years ago

I understand that this SLEP is aimed at transformers, but why not consider predictors as well? Getting a Pandas DataFrame out is convenient, it clarifies the meaning of the output columns, and it preserves the index (row names).

For classifiers, we already have the classes_ attribute, so predict_proba() and decision_function() could return a DataFrame with these class names (or something like "proba(<classname>)" or "decision_function(<classname>)" if in verbose mode?). And predict() would just have "class" (or "predicted_class") as the feature name.

For regressors, the output feature names could be the same as the input target names. This would require storing the target names when calling fit(), e.g., in target_names_ (or target_feature_names_).

Even when a method has a single output feature, the benefits are still 1. convenience, 2. clarity, and 3. preserving the index.
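The suggestion above can be sketched by hand; this is NOT current scikit-learn behavior, just an illustration of labeling predict_proba columns with classes_ and preserving the input index.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hand-rolled version of the proposal: wrap predict_proba in a DataFrame.
X = pd.DataFrame({"f0": [0.0, 1.0, 2.0, 3.0]}, index=["a", "b", "c", "d"])
y = [0, 0, 1, 1]

clf = LogisticRegression().fit(X, y)
proba = pd.DataFrame(clf.predict_proba(X),
                     columns=clf.classes_,   # class labels as column names
                     index=X.index)          # row names preserved

print(proba.shape)                           # (4, 2)
```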

amueller commented 1 year ago

This should be moved to rejected as we accepted SLEP 18, right?

thomasjpfan commented 1 year ago

In 653e476 (#37), this SLEP is now rejected.

adrinjalali commented 1 year ago

Let's merge this then.