scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

SLEP 014 Pandas in Pandas out #37

Closed thomasjpfan closed 1 year ago

thomasjpfan commented 4 years ago

This SLEP proposes pandas in pandas out as an alternative to SLEP 012 InputArray.

NicolasHug commented 4 years ago

I think this should be slep 14

adrinjalali commented 4 years ago

Another point is to talk about pandas being a soft dependency in the SLEP, I guess; I'm not sure if we need to talk about how.

TomAugspurger commented 4 years ago

One of the reasons I think we should consider xarray, is that we can attach arbitrary feature/sample/data props to the data

pandas 1.0 added a DataFrame.attrs dictionary that behaves the same as DataArray.attrs, so xarray and pandas should be the same in this regard.
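
Roughly what that looks like on both sides (the attribute name here is arbitrary, and DataFrame.attrs is still marked experimental):

import numpy as np
import pandas as pd
import xarray as xr

df = pd.DataFrame({"a": [1, 2]})
df.attrs["source"] = "openml"   # pandas >= 1.0, currently experimental

da = xr.DataArray(np.zeros((2, 1)))
da.attrs["source"] = "openml"   # the long-standing xarray equivalent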

and we may be able to convince them to remove a hard pandas dependency (correct me if I'm wrong @TomAugspurger)

In theory, I think it's possible to have a Variable (which backs a DataArray) without any coordinates (and so no need for a pandas Index to store the labels). I don't have a good idea of how open the xarray devs are to extracting that / making it possible to use xarray without pandas.

thomasjpfan commented 4 years ago

Should we consider the new pandas NA in the consideration section? I think if transformers return pandas dataframes, it becomes relevant.

The pd.NA feature is considered experimental. As stated in their docs:

Experimental: the behaviour of pd.NA can still change without warning.

It also seems like DataFrame.attrs is experimental as well.

In both cases, I think we should wait for the features to stabilize before designing with them in mind.

jorisvandenbossche commented 4 years ago

Should we consider the new pandas NA in the consideration section? I think if transformers return pandas dataframes, it becomes relevant.

I personally think it's interesting that scikit-learn would consider supporting it, but IMO it is somewhat orthogonal to this SLEP. Also in the current situation of converting DataFrames to arrays on input to estimators, the question of supporting pd.NA already comes up.

amueller commented 4 years ago

I think the scope is only transform. If that's not clear from the SLEP it needs to be clarified. Or are you saying it should also be predict etc? I'm not sure how I feel about that ;)

jorisvandenbossche commented 4 years ago

I think transform is the most important one to start with, so that seems the best scope. But yes, I mainly wanted to indicate that this should be mentioned more clearly in the SLEP then ;)

amueller commented 4 years ago

Do we want to have an in-depth discussion of xarray vs pandas in this slep? In this issue? In #35?

My brief opinion:

pandas pros

xarray pros

I assume @adrinjalali's pro and con list is different from mine since his preference is xarray ;)

glemaitre commented 4 years ago

xarray cons

  • homogeneous data only

Isn't it possible to have it with an xarray Dataset?

amueller commented 4 years ago

That's true. I thought we were considering DataArray, not Dataset. If we're using Dataset we have to basically reimplement the pandas block manager, right?

glemaitre commented 4 years ago

If we're using Dataset we have to basically reimplement the pandas block manager, right?

I don't know enough about it but it seems that you are right.

amueller commented 4 years ago

I also don't know much about it, that was an educated guess based on the docstring ;)

glemaitre commented 4 years ago

Same educated guess then :)

Sent from my phone - sorry to be brief and potential misspell.

jnothman commented 4 years ago

Is the copying issue in pandas a real one? I would expect that we could petition pandas to make the data F-contiguous and export a non-copied values array, in the case that the frame is being constructed anew with a homogeneous dtype. This would be the case for almost all of our transformer outputs.

Do we need to choose One format? It seems that a big challenge in this feature altogether is trying to support different formats (non ndarray) and their proliferation. Should we be providing for adapters to support different types?

PS: I've not read much of the above

amueller commented 4 years ago

Is the copying issue in pandas a real one?

It's unclear to me at least. I talked to @WillAyd a couple of days ago and he seemed to think a zero-copy round-trip is feasible even when moving to a column store.

Do we need to choose One format?

We don't need to choose "One Format", but for each input type we need to choose one output format; also see https://github.com/scikit-learn/enhancement_proposals/pull/25#issuecomment-571640817.

And I guess then the question is which ones should we implement.

Should we be providing for adapters to support different types?

Are adapters meta-estimators? I think that would introduce too much complexity. They could be a stop-gap but I'd rather have a better solution.

So far I don't think any of the proposed duck-array protocols would help us here. If we want to support arbitrary types, we'd need a protocol that basically allows us to separate the numpy array from the meta-data and then recreate a new duck array with a new numpy array and the old meta-data. I'm not entirely sure in how far that's possible, and it seems a little bit overkill to me for now.

jorisvandenbossche commented 4 years ago

a zero-copy round-trip is feasible even when moving to a column store.

If you want a 2D numpy array, then a zero-copy roundtrip is not possible with a column store AFAIK (it needs to combine multiple 1D arrays into one 2D array, which always requires a copy?).

amueller commented 4 years ago

I'm not saying that it's a good idea, but in theory one could find out that the columns are sequential in memory and make a view of them together, right? (you'd probably also keep around the separate views so no-one deallocates it)

import numpy as np

# Make an F-contiguous array so that each column is a contiguous chunk of memory.
X = np.random.normal(size=(100, 3)).copy("F")

# A "column store": one 1D view per column, all backed by the same buffer.
asdf = {'a': X[:, 0], 'b': X[:, 1], 'c': X[:, 2]}

# Column 'b' starts exactly where column 'a' ends, i.e. the columns are sequential in memory.
asdf['b'].__array_interface__['data'][0] == asdf['a'].__array_interface__['data'][0] + asdf['a'].itemsize * asdf['a'].size

True

lorentzenchr commented 4 years ago

Another 5 cents to this discussion.

In my view, there are 2 important design choices for a ML library concerning data structures:

  1. Internal data structure: What kind of object is passed inside a pipeline, from step to step? So far, this is a numpy array, a homogeneous data structure.
  2. Connectivity to other data structures at the very beginning of a pipeline. So far, you can pass a pandas dataframe (heterogeneous), e.g. to a column transformer. In the first step, it is converted to the internal data structure.

This SLEP is about the first point, the internal data structure, and therefore touches the very foundation and goes far beyond feature names (at least in this pandas in pandas out version). This brings me to the following (provocative¹) questions:

¹ So some questions better stay unanswered :smirk:

thomasjpfan commented 4 years ago

This SLEP is about the first point, the internal data structure, and therefore touches the very foundation and goes far beyond feature names

One of the motivation goals is to get feature names. This SLEP describes an approach to get there and what the ramifications are for this approach.

From my point of view, we can not really separate the data structure into internal and external (in terms of being in a pipeline). When a third party constructs a custom transformer, transform would need to output a data structure based on a configuration flag. In all cases, we will be putting more requirements on third party estimators.

The most efficient thing to do is to have our own data structure, where we have complete control of it (i.e. InputArray SLEP). Users would need to convert this data structure into a pandas dataframe to be able to use it with other libraries. Since this is a new data structure, third parties will need to learn about how to construct and use it.

This SLEP focuses on using pandas, which most developers in the PyData ecosystem know how to use. The biggest downside is that we do not have control over it and it will most likely not be as efficient.

A mixture between the two approaches would be great:

  1. When a transformer is in a pipeline, output the internal data structure so we can optimize it.
  2. When the transformer is by itself, output a pandas dataframe so end users can use it with their favorite framework.

We would create helper functions for developers so they can more easily support this.
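
As a very rough sketch of what such a helper could look like (the name, the "inside a pipeline" signal, and the behaviour are all hypothetical):

import pandas as pd

def wrap_transform_output(Xt, feature_names, *, inside_pipeline=False):
    # Hypothetical helper: inside a Pipeline, keep the lightweight internal
    # structure (here just the raw array); standalone, hand back a labelled
    # DataFrame so end users can continue with their favourite framework.
    if inside_pipeline:
        return Xt
    return pd.DataFrame(Xt, columns=feature_names)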

amueller commented 4 years ago

We could distinguish between optimized internal and external and it's something I have thought about. I don't think we really gain a lot by making this distinction as it would complicate the interface with a negligible impact on performance. We would still have to implement the "external" interface within the pipeline to support third-party estimators, as @thomasjpfan says.

I don't really think this is about efficiency, it's about interfaces. As I said, I'm not really concerned with a potential copy right now. But even if we consider that, I don't think there's a clear answer to what the "most efficient" approach would be. The "most efficient" data structure for OneHotEncoding, gradient boosting and logistic regression might also not be the same.

I'm answering some of @lorentzenchr's questions below.

If all estimators swallow (only) numpy arrays, why aren't they in scipy? Example: linear models.

Why should they be? Ideally scipy should be split up. The only reason it exists the way it does is because packaging code with fortran dependencies used to be hard before we had conda and wheels. Today, I hope no-one would create a monster like that.

There are estimators like tree-based ones that could easily deal directly with heterogeneous data structures (as long as each column can be sorted). Therefore, would it make sense to use a heterogeneous internal data structure?

That is likely going to be quite inefficient. What works best in the trees is to use binning, and then everything will be an integer, which allows using homogeneous arrays and everything will be very efficient. Right now pandas doesn't have a C interface, so it's basically a non-starter (though it might get one at some point). Even if there was a C interface, dealing with variable column types in a strongly typed language is likely to be an inefficient nightmare.

If you were to build scikit-learn again next year, which solution would you choose?

With a global config flag we can make relatively arbitrary backward-incompatible changes. So we're not really bound to anything, and that is the main question I think we're trying to answer.
adrinjalali commented 4 years ago

I quite agree with @amueller 's assessment on internal vs external data structures. I really don't think we should go down that road.

I had another talk with @TomAugspurger , and I think:

Whatever we choose should stay there for a while. This is kind of a big change, and I don't think it'd be nice to introduce the change and then realize we should change again in two years; this is why I would take the upcoming changes in pandas into account when considering the options.

That said, there are two aspects which make me more in favour of xarray over pandas. One is the changes which are [most probably] coming to pandas; the other is that xarray is much more lightweight and much closer to numpy arrays, hence less change would be required on our side.

On the change side, we have the columnar representation which would mean two guaranteed copies on the pd.DataFrame(np.asarray(dataframe)) operation, even if they're all floats. I do believe this is not a desired behavior if the user wants to have feature names.

Another change is the NA vs nan distinction coming to pandas. I do think it'd be nice to distinguish between the two, i.e. one being missing and the other the output of 0/0 for instance, but the fact that numpy doesn't have support for missing values would mean a further gap in what operations should do on those values. For instance, would the imputer then fill the NA values but not the nans, since nans are not missing values anymore? It may be a good idea, but I'm not sure if we want to go down that road now. Also, we could have support in the imputers for them to understand the distinction and accept a pandas DataFrame, but I'm not sure if we'd want to expand that to the rest of the library.
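
To make the distinction concrete (pandas' nullable dtypes use pd.NA, while np.nan remains an ordinary float that e.g. 0/0 produces):

import numpy as np
import pandas as pd

s = pd.Series([1, pd.NA], dtype="Int64")   # nullable integer column
print(s.isna().tolist())                   # [False, True] -- pd.NA marks a missing value
print(pd.isna(np.nan), pd.isna(pd.NA))     # True True -- today both count as "missing"
print(np.nan is pd.NA)                     # False -- they are different objects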

On the other hand, xarray data structures are pretty much numpy arrays with some metadata attached to them. It'd be much easier for us to work with them since we've been doing that all along, and not much changes. It's also pretty lightweight, and my gut feeling is that they'd be open to some changes if need be.

Another aspect is that xarray containers have been supporting other metadata than just the feature/row names, and it may be useful for us to attach them to the data (another way to do sample/feature/data props for instance).

When I think of these two data structures, I really feel the xarray ones are closer to what we'd like to have and work with. And to be clear, by no means am I saying users should give us xarray containers. They can, but the way I imagine the workflow is:

df = fetch_openml(..., as_frame=True)
# do whatever needed with the dataframe

pipe = Pipeline(...).fit(df) # predictor pipeline
pipe.predict(df) # works and feature names are all set and passed through

Xt = Transformer().fit(df).transform(df) # returns an xarray
df_t = sklearn.utils.dataframe(Xt) # returns a pandas DataFrame

Also, I think there's value in supporting the xarray.Dataset for heterogeneous data. But we need to experiment with it a bit further.
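
For reference, a small illustration of the difference: a DataArray is a single homogeneous labelled array, while a Dataset holds one array per variable and therefore can mix dtypes (which is what triggers the block-manager comparison above):

import numpy as np
import xarray as xr

# Homogeneous: one 2D block with labelled dimensions.
da = xr.DataArray(np.zeros((3, 2)), dims=("sample", "feature"),
                  coords={"feature": ["age", "height"]})

# Heterogeneous: one 1D array per variable, dtypes can differ per "column".
ds = xr.Dataset({
    "age": ("sample", np.array([20, 31, 45])),
    "name": ("sample", np.array(["a", "b", "c"], dtype=object)),
})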

thomasjpfan commented 4 years ago

@adrinjalali In the imagined workflow, it seems like the transformer needs to understand pandas and xarray as input. Specifically, the output of the transformer is an xarray and if it is placed in a pipeline, the next transformer would need to take it as input.

This means third party estimators would need to do, xarray or pandas in -> xarray out?

TomAugspurger commented 4 years ago

Just my bias, but I'd like to see type(X) in -> type(X) out, at least for ndarray, pandas.DataFrame, and xarray.DataArray (possibly xarray.Dataset, not sure). I'm not sure if that makes sense for every transformer, but if it worked I think people would be happy. I think it also satisfies some of Adrin's concerns.

On the change side, we have the columnar representation which would mean two guaranteed copies on the pd.DataFrame(np.asarray(dataframe)) operation, even if they're all floats. I do believe this is not a desired behavior if the user wants to have feature names.

If / when pandas switches to a column store (which isn't guaranteed at this point), users will face that double-copy when providing a DataFrame at the start of the pipeline. But if both DataFrames and xarray objects are accepted, users can reduce that to a single copy up front with a DataFrame.to_xarray() at the start of the pipeline.
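
For what that single up-front conversion could look like (whether it actually avoids copies depends on pandas/xarray internals, so this only shows the API):

import numpy as np
import pandas as pd
import xarray as xr

df = pd.DataFrame(np.random.normal(size=(5, 3)), columns=["a", "b", "c"])

ds = df.to_xarray()      # xarray.Dataset: one 1D variable per column
da = xr.DataArray(df)    # 2D DataArray with the index/columns attached as coordinates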

Another change is the NA vs nan distinction coming to pandas.

I think this is mostly orthogonal to the container discussion. This can happen today when a user provides a DataFrame with NAs, and scikit-learn converts it to an object-dtype ndarray (which will presumably cause errors). Pandas adding support for nullable floating & datetime dtypes will increase the frequency with which users hit this, but that's a difference in degree, not kind. It'd be nice for scikit-learn to natively understand this "NA thing", but I think that's true regardless of whether it accepts & returns DataFrames. It'll require coordination across projects (including NumPy and possibly Python itself).

Another aspect is that xarray containers have been supporting other metadata than just the feature/row names, and it may be useful for us to attach them to the data (another way to do sample/feature/data props for instance).

Pandas recently added .attrs, which mirrors xarray's and H5py's. It's currently experimental but hasn't had any bug reports, so we'll move it to stable soon. I think transformers can reasonably rely on these behaving the same between pandas and xarray.

amueller commented 4 years ago

First, I think it's important to capture all these arguments in a SLEP, so they are nicely summarized.

Second, I think @thomasjpfan makes an important point, which also was my first thought reading your message: we need to distinguish input and output. I agree that handling pandas input does make things tricky (I think more so than pandas output). However, I think we do want to handle pandas input. So using xarray as output will not really save us from that. @TomAugspurger made this great table in https://github.com/scikit-learn/enhancement_proposals/pull/25#issuecomment-571640817 ( I think the table has a typo though, the last row appears twice and ndarray input doesn't appear for dataframe fit)

On the change side, we have the columnar representation which would mean two guaranteed copies on the pd.DataFrame(np.asarray(dataframe)) operation, even if they're all floats. I do believe this is not a desired behavior if the user wants to have feature names.

Shouldn't that be only a single copy?

To a first approximation, I think we can ignore the input to fit to make the table a bit easier, and look at transform input vs transform output. So @adrinjalali's proposal is

Transform Input    Transform Output
ndarray            ndarray
DataArray          DataArray
DataFrame          DataArray
AnythingElse       ndarray

@TomAugspurger's proposal is

Transform Input    Transform Output
ndarray            ndarray
DataArray          DataArray
DataFrame          DataFrame
AnythingElse       ndarray

and my proposal was

Transform Input    Transform Output
ndarray            ndarray
DataFrame          DataFrame
AnythingElse       ndarray

I'm not opposed to @tomaugspurger's proposal either. I feel like now we're slowly going into duck-array territory. At some point for duck arrays / NEP 37 etc we might get type(X) in -> type(X) out for some types for some estimators and then that table becomes much more tricky.

amueller commented 4 years ago

@adrinjalali can you maybe rephrase your arguments in terms of how xarray is better for outputs? And/or argue why pandas special behaviors are not an issue for inputs?

amueller commented 4 years ago

This means third party estimators would need to do, xarray or pandas in -> xarray out?

They could always fallback to ndarray for either of them and things would "work" but it would swallow feature names and other meta-data. It's a good question of whether whatever we decide here will be enforced in some way through estimator checks. I guess that depends a bit on whether we want to "just" have a config flag, or also aim to make this the default behavior.

TomAugspurger commented 4 years ago

Shouldn't that be only a single copy?

Too in the weeds for this discussion, but it's at least 1, possibly two (depending on if we automatically go to Arrow memory 😄). Safe to say it's one or more copies.

They could always fallback to ndarray for either of them and things would "work" but it would swallow feature names and other meta-data.

That's the hardest part for me... It essentially ties feature names to the ability of a transformer to return a DataFrame (or DataArray). Is that an acceptable limitation for scikit-learn?

amueller commented 4 years ago

That's the hardest part for me... It essentially ties feature names to the ability of a transformer to return a DataFrame (or DataArray). Is that an acceptable limitation for scikit-learn?

I think it is acceptable. In any case it will always be tied to whether the transformer implements how to transform feature names. So they will never be entirely free for someone implementing an estimator (unless you automatically generate them using some inherited property but that's probably still not automatic)

adrinjalali commented 4 years ago

So @adrinjalali's proposal is ...

Not really. My proposal is that the output only depends on the global flag with which the user sets whether or not feature names are enabled. If yes, always return a DataArray (or a Dataset); if not, always return an ndarray, independent of the input.
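
A minimal sketch of that logic, assuming a hypothetical config key (nothing here is an agreed-on name):

import numpy as np
import xarray as xr
from sklearn import get_config

def wrap_output(Xt, feature_names):
    # 'feature_names_out' is a made-up key, used only for illustration.
    if get_config().get("feature_names_out", False):
        return xr.DataArray(Xt, dims=("sample", "feature"),
                            coords={"feature": feature_names})
    return np.asarray(Xt)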

amueller commented 4 years ago

@rth we're having a meeting discussing the SLEP right now :) https://www.google.com/url?q=https://anaconda.zoom.us/j/158272910?pwd%3DNURzbWhTSXNYNFhRMVYxRFVybjUrdz09&sa=D&usd=2&usg=AOvVaw0d6Fs7-D2t5Pyz28va20Wd

amueller commented 4 years ago

@adrinjalali all of them are conditional on a flag, so I would summarize your proposal as

Transform Input    Transform Output
ndarray            DataArray
DataArray          DataArray
DataFrame          DataArray
AnythingElse       DataArray

rth commented 4 years ago

Thanks for working on this it's certainly a challenging topic!

To add to the xarray vs pandas discussion on the subject of sparse arrays; below are quick benchmark results for an input CSR matrix with n_samples=1000, n_features=100k, sparse density=0.01 (i.e. 1M non zero elements):

Benchmark code can be found here.

TomAugspurger commented 4 years ago

FYI, pandas has our monthly dev call today at 18:00 UTC (~3 hours from this post). I've added some items that came up here to the agenda: https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing. All are welcome to join if anyone has anything to add.

GaelVaroquaux commented 4 years ago

Another FYI: I've kickstarted a discussion on a dataframe protocol, which I find very insightful in terms of the discussion that we are having here: https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267/30

If the discussion above pans out, it would help in general for the input, but also for the output (giving us more freedom on the output choice).

@TomAugspurger memory copies really bother me. Memory usage is a weak point of our ecosystem. Going out-of-core solves it, but brings in a very large complexity. Would having 2 data structures in pandas, with 2 different memory layouts, be an option: one optimized for relational algebra, which would induce memory copies, and one optimized for "primal-space analytics", i.e. row-wise addressing, which would not induce memory copies? This tension between two use cases, and the related computing trade-offs in the data structures, is a classic. It is present for instance with sparse matrices, for which there are different layouts, each optimal for different types of algorithms.

amueller commented 4 years ago

I'd really like to have a strategy for moving forward with this. I don't think the outcome of the dataframe discussion is a prerequisite for that.

@adrinjalali suggested that @thomasjpfan writes a proposal / implementation with xarray.

I'm not sure how helpful that will be. I think we need to either agree on a list of evaluation criteria for a proposal, or at least some requirements, and a way to actually make a decision. As mentioned on the call, I'm not super in favor of forcing a vote, but I also don't want to keep discussing and making prototypes forever. There is probably no simple answer. Also, if we vote on individual SLEPs (like this one) we can end up in the situation that we reject all proposals, which also doesn't help things.

@GaelVaroquaux re memory copies: sklearn already makes tons of copies, so honestly it's not what I'm most concerned about. If this was a major concern for you, I would have expected this to show up on the roadmap somewhere. We don't even have a process to measure how many copies are being made, right? Also, if it's an opt-in procedure it might not impact existing workflows as much.

Given a standard pipeline with scaling, imputation and OneHotEncoder, how much memory do we need right now? 1x data size? 2x data size? 10x data size? I honestly don't know the answer (though I would be surprised if it's as small as 1 or 2).

GaelVaroquaux commented 4 years ago

@GaelVaroquaux re memory copies: sklearn already makes tons of copies, so honestly it's not what I'm most concerned about. If this was a major concern for you, I would have expected this to show up on the roadmap somewhere.

The French team has been working on reducing them, and on support for float32 over the last year.

Also, there has been a lot of work in joblib / loky to memmap and try to share as much as possible across workers.

So I think that yes, my comment is consistent with our strategy. These efforts amount to several person-years of work.

We don't even have a process to measure how many copies are being made, right?

memory_profiler, and its integration in sphinx-gallery (both also pushed forward by us).

Given a standard pipeline with scaling, imputation and OneHotEncoder, how much memory do we need right now? 1x data size? 2x data size? 10x data size?

Well, the more I study missing values, the more I am convinced that any sophisticated imputation shouldn't be part of a "standard pipeline" :).

I would say, typically x3. I'd like to get this down, not up.

amueller commented 4 years ago

I know your team has worked on some cool float32 stuff and I don't want to belittle that work at all. However, I haven't seen any benchmarks for a pipeline; maybe you've done them but not shared them?

And I agree, we probably don't want fancy imputation, but we probably still need mean imputation.

Generally I'd love to get the memory copies down, but I don't like making an argument using something for which I haven't seen measurements. I simply don't know if there is a common case where this copy would make or break a workflow.

GaelVaroquaux commented 4 years ago

Fair enough. We should prototype and measure before making any choice. I agree with you.

amueller commented 4 years ago

Anyway, my main point is that we should make a plan on how we will make a decision.

amueller commented 4 years ago

So what would you want to prototype and measure and how would you make decisions based on the outcome?

I'm pretty sure any outcomes would depend on the choice of pipeline that's used and I can imagine that under some conditions pandas-in pandas-out will require less memory, while it will probably require slightly more in most cases (assuming a memory copy there which we can't really benchmark because that implementation of pandas doesn't exist).

I'm ok with setting up goals and criteria, but I don't think just writing up one more prototype (we have like 5 or so right now?) will help us make a decision.

Is there an amount of memory overhead we're willing to accept? And how would we trade memory overhead vs user interface cost?

jnothman commented 4 years ago

I also think that worrying about a memory copy that might happen in the future only if pandas implements things a certain way, only if the user provides pandas input (while ndarray input continues to be cheap) and while the status quo and many valuable DataFrame use cases already entail a copy, is probably premature optimisation. We should likely be biased towards pandas because that's what users already expect to work.

TomAugspurger commented 4 years ago

I share Joel's outlook: the potential pandas refactor that would induce the additional memory copy is uncertain. It's some years off at a minimum, and I wouldn't be surprised if it never happened. We discussed this at our dev call on Wednesday (which @thomasjpfan joined) but aren't at the point where we can give any guidance; things are just too uncertain.


Not to derail things with a new proposal at this late hour, but... we have two potential contenders for <labeled array in> -> <labeled array out>. Which means a protocol! Scikit-learn could define something like the following (with dummy implementations for DataFrame).

def __sklearn_feature_names__(self) -> List[str]:
    return list(self.columns)

def __sklearn_data__(self, ...):  # dtype? Anything else?
    """Return the values for the estimator."""
    return self.to_numpy()

def __sklearn_transformed_result__(self, original, transformed, feature_names):  # original?
    """Wrap the result"""
    return pd.DataFrame(transformed, index=original.index, columns=feature_names)

I'm sure pandas (and likely xarray) would happily implement that protocol. (We're also more comfortable diving into our internals to avoid memory copies, but that's another story). However, I think it's premature to set down a protocol. I personally don't have a good opinion on what sorts of arguments would be helpful in these methods. I think we might want to learn a bit first, before freezing the API in a protocol.
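
To see how the consuming side might look, here is a rough, non-authoritative sketch of a wrapper using such dunder methods (the helper name and the fallbacks are made up):

import numpy as np

def transform_with_protocol(estimator, X):
    # Pull a plain ndarray out of the container if it speaks the protocol.
    data = X.__sklearn_data__() if hasattr(X, "__sklearn_data__") else np.asarray(X)
    Xt = estimator.transform(data)          # the usual ndarray-in/ndarray-out work
    names = getattr(estimator, "get_feature_names", lambda: None)()
    if names is not None and hasattr(X, "__sklearn_transformed_result__"):
        # Let the container rebuild itself around the transformed values.
        return X.__sklearn_transformed_result__(X, Xt, names)
    return Xt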

NicolasHug commented 4 years ago

So what would you want to prototype and measure

We set up a bunch of realistic pipelines and check the memory usage with memory_profiler. We can also set up fake pipelines with dummy transformers that never/always copy for a best/worst case scenario. We can simulate pandas potential future behavior by artificially introducing copies, e.g. with a wrapper for dataframes.
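
For example, something along these lines (a sketch, not an agreed benchmark suite):

import numpy as np
from memory_profiler import memory_usage
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.normal(size=(100_000, 100))
pipe = make_pipeline(StandardScaler(), PCA(n_components=10))

# memory_usage samples the process while pipe.fit(X) runs; max gives the peak in MiB.
peak = max(memory_usage((pipe.fit, (X,), {})))
print(f"peak memory: {peak:.0f} MiB")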


Now about copies, dumb question: how bad can it be?

Say we pass X_orig to a pipeline of transformers that, for the sake of the argument, makes copies of the input at every single step.

X_orig -> STEP1 (copy) -> ... STEPN (copy)

Each transformer looks like


def fit_transform(self, X):
    X_copy = validate(X)  # make a copy
    X_out = ...  # allocate output
    # do the work ... 
    return X_out

So at any point during the execution, we have at most 4 datasets in memory:

  • X_orig (still referenced by the caller)
  • X, the input to the current step (i.e. the output of the previous step)
  • X_copy
  • X_out

In a best case scenario where we never copy, we still have at least 3 datasets (we just don't have X_copy).

That's only 1 extra dataset for the worst case scenario then?

TomAugspurger commented 4 years ago

I had a realization last night (which I think was already known to others like @jreback). I think that the memory copy concern may not be an issue even if / when pandas goes to a column store in the future. The rest of this post is going to go into some internal pandas details.

Consider a user with a DataFrame df and a pipeline make_pipeline(StandardScaler(), PCA())

There's two places where we need to watch out for memory copies:

  1. DataFrame -> ndarray (inside check_array say)
  2. 2D ndarray -> DataFrame (inside the DataFrame constructor)

Let's actually start with the second case, which is pretty easy and sets up the first. In a column-store future, pandas will "split" this 2D ndarray into a sequence of 1D arrays. Each of these 1D arrays will be views on the original memory, so no memory copies.

In [2]: a = np.ones((10, 5))

In [3]: slices = {f'{i}': a[:, i] for i in range(a.shape[1])}

In [4]: df = pd.DataFrame(slices)
In [5]: df._data.blocks  # 2D array split into 5 "blocks". This doesn't happen today, but may in the future
Out[5]:
(FloatBlock: slice(0, 1, 1), 1 x 10, dtype: float64,
 FloatBlock: slice(1, 2, 1), 1 x 10, dtype: float64,
 FloatBlock: slice(2, 3, 1), 1 x 10, dtype: float64,
 FloatBlock: slice(3, 4, 1), 1 x 10, dtype: float64,
 FloatBlock: slice(4, 5, 1), 1 x 10, dtype: float64)

In [6]: df._data.blocks[0].values.base  # each block is a view on 'a'
Out[6]:
array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

(@jorisvandenbossche can confirm that you can create pyarrow arrays as a view on an ndarray with no missing values for primitive types like int64?)

Now, for potential copy 1, which is where pandas can get creative in DataFrame.__array__. For the special case of np.asarray(DataFrame(2d_array)), we can check that the .values.base of each block is the same object. In this case, we can just return that .base, no memory copy!

In [19]: b0 = df._data.blocks[0].values.base

In [20]: if all(blk.values.base is b0 for blk in df._data.blocks):
    ...:     arr = b0
    ...:

In [21]: arr
Out[21]:
array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

So tldr: I think we're able to do things without memory copies from 2D ndarray -> DataFrame -> 2D ndarray, even in a future where pandas has only 1D blocks.

jnothman commented 4 years ago

There seems to be some consensus here that we should be sticking to pandas out for now. Any strong objections? Can we progress on resolving the comments and moving towards vote?

adrinjalali commented 4 years ago

+1

thomasjpfan commented 4 years ago

Okay I will move this forward.

The only thing I am concerned about is the sparse performance (as noted in https://github.com/scikit-learn/enhancement_proposals/pull/37#issuecomment-596577388).

Should we hedge a little and adjust the configuration flag to:

set_config(array_in_out='pandas')

where the default is 'ndarray'. This leaves open the possibility of supporting other dataframe-like objects.
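
Hypothetical usage (neither the flag name nor the behaviour exists in scikit-learn today, so this will not run on current releases):

import numpy as np
import sklearn
from sklearn.preprocessing import StandardScaler

sklearn.set_config(array_in_out='pandas')   # proposed flag, not a current option

X = np.random.normal(size=(10, 2))
Xt = StandardScaler().fit_transform(X)
# Under the proposal, Xt would be a pandas DataFrame even for ndarray input.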

NicolasHug commented 4 years ago

Following up on https://github.com/scikit-learn/enhancement_proposals/pull/37#issuecomment-598449861, I too would be more comfortable if I could inform my voting decision with benchmark results.

amueller commented 4 years ago

@NicolasHug what would you measure? @TomAugspurger laid out a zero-copy strategy. So would you want to do a prototype implementation of that and then measure if there's any unforeseen consequences? Or do you mean with the current implementation that we also expect to have no memory copy to confirm that there is indeed none?

amueller commented 4 years ago

@thomasjpfan just to be clear, your proposal is to always create a pandas dataframe for array_in_out='pandas', even if the input is an ndarray or sparse matrix, right?

I mean we could have an option 'pandas' and an option 'pandas_if_pandas_in' but I'm not sure how user-friendly that is.

If we always return pandas, I am a bit concerned in the sparse case, because that will mean a memory copy from my understanding, right?

Btw, @adrinjalali in your proposal, did you also suggest always returning a DataArray for sparse data?

My understanding of @thomasjpfan's proposal is to have an option to do

Transform Input    Transform Output
ndarray            DataFrame
scipy.sparse       DataFrame
DataArray          DataFrame
DataFrame          DataFrame
AnythingElse       DataFrame

with the potential to have the same for DataArray in the future by doing array_out='xarray'.

I wonder if

Transform Input    Transform Output
ndarray            ndarray
scipy.sparse       scipy.sparse
AnythingElse       DataFrame

or even

Transform Input    Transform Output
ndarray            DataFrame
scipy.sparse       scipy.sparse
AnythingElse       DataFrame

might be more effective.
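
For concreteness, a rough sketch of the first of those two mappings (the helper name is made up):

import numpy as np
import pandas as pd
from scipy import sparse

def wrap_output(X_in, Xt, feature_names):
    # ndarray in -> ndarray out, sparse in -> sparse out, anything else -> DataFrame.
    if isinstance(X_in, np.ndarray) or sparse.issparse(X_in):
        return Xt
    return pd.DataFrame(Xt, columns=feature_names)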