Closed thomasjpfan closed 1 year ago
I think this should be slep 14
Another point is to talk about pandas being a soft dependency in the SLEP, I guess; I'm not sure if we need to talk about how.
One of the reasons I think we should consider xarray, is that we can attach arbitrary feature/sample/data props to the data
pandas 1.0 added a DataFrame.attrs dictionary that behaves the same as DataArray.attrs, so xarray and pandas should be the same in this regard.
and we may be able to convince them to remove a hard pandas dependency (correct me if I'm wrong @TomAugspurger)
In theory, I think it's possible to have a Variable (which backs a DataArray) without any coordinates (and so no need for a pandas Index to store the labels). I don't have a good idea of how open the xarray devs are to extracting that / making it possible to use xarray without pandas.
Should we consider the new pandas nan in the consideration section? I think if transformers return pandas dataframes, it becomes relevant.
The pd.NA feature is considered experimental. As stated in their docs:
Experimental: the behaviour of pd.NA can still change without warning.
It also seems like DataFrame.attrs is experimental as well.
In both cases, I think we should wait for the features to stabilize before designing with them in mind.
Should we consider the new pandas nan in the consideration section? I think if transformers return pandas dataframes, it becomes relevant.
I personally think it's interesting that scikit-learn would consider supporting it, but IMO it is somewhat orthogonal to this SLEP. Also in the current situation of converting DataFrames to arrays on input to estimators, the question of supporting pd.NA already comes up.
I think the scope is only transform. If that's not clear from the SLEP it needs to be clarified. Or are you saying it should also be predict etc? I'm not sure how I feel about that ;)
I think transforms is the most important to start with, so that seems the best scope. But yes, so I mainly wanted to indicate that this should be mentioned more clearly in the SLEP then ;)
Do we want to have an in-depth discussion of xarray vs pandas in this slep? In this issue? In #35?
My brief opinion:
I assume @adrinjalali's pro and con list is different from mine since his preference is xarray ;)
xarray cons
- homogeneous data only
Isn't it possible to have it with an xarray Dataset?
That's true. I thought we were considering DataArray, not Dataset. If we're using Dataset we have to basically reimplement the pandas block manager, right?
If we're using Dataset we have to basically reimplement the pandas block manager, right?
I don't know enough about it but it seems that you are right.
I also don't know much about it, that was an educated guess based on the docstring ;)
Same educated guess then :)
Is the copying issue in pandas a real one? I would expect that we could petition pandas to make the data F-contiguous and export a non-copied values array, in the case that the frame is being constructed anew with a homogeneous dtype. This would be the case for almost all of our transformer outputs.
Do we need to choose One format? It seems that a big challenge in this feature altogether is trying to support different formats (non ndarray) and their proliferation. Should we be providing for adapters to support different types?
PS: I've not read much of the above
Is the copying issue in pandas a real one?
It's unclear to me at least. I talked to @WillAyd a couple of days ago and he seems to think a zero-copy round-trip is feasible even when moving to a column store.
Do we need to choose One format? We don't need to choose "One Format" but for each input type, we need to choose one output format, also see https://github.com/scikit-learn/enhancement_proposals/pull/25#issuecomment-571640817.
And I guess then the question is which ones should we implement.
Should we be providing for adapters to support different types? Are adapter meta-estimators? I think that would introduce too much complexity. They could be a stop-gap but I'd rather have a better solution.
So far I don't think any of the proposed duck-array protocols would help us here. If we want to support arbitrary types, we'd need a protocol that basically allows us to separate the numpy array from the meta-data and then recreate a new duck array with a new numpy array and the old meta-data. I'm not entirely sure in how far that's possible, and it seems a little bit overkill to me for now.
a zero-copy round-trip is feasible even when moving to a column store.
If you want a 2D numpy array, then a zero-copy roundtrip is not possible with a column store AFAIK (it needs to combine multiple 1D arrays into one 2D array, which always requires a copy?).
I'm not saying that it's a good idea, but in theory one could find out that the columns are sequential in memory and make a view of them together, right? (you'd probably also keep around the separate views so no-one deallocates it)
import numpy as np
X = np.random.normal(size=(100, 3)).copy("F")
asdf = {'a': X[:, 0], 'b': X[:, 1], 'c': X[:, 2]}
asdf['b'].__array_interface__['data'][0] == asdf['a'].__array_interface__['data'][0] + asdf['a'].itemsize * asdf['a'].size
True
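To make the idea concrete, here is a minimal NumPy-only sketch (not how pandas does or would do it). It assumes all columns share one dtype and sit back to back in memory; `are_sequential` is a hypothetical helper name. It contrasts the generic path, which copies, with a strided 2D view over the sequential columns.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def are_sequential(cols):
    """True if every 1D column is contiguous and starts where the previous ends."""
    if not all(c.flags.c_contiguous for c in cols):
        return False
    for prev, nxt in zip(cols, cols[1:]):
        prev_start = prev.__array_interface__["data"][0]
        nxt_start = nxt.__array_interface__["data"][0]
        if nxt.dtype != prev.dtype or nxt_start != prev_start + prev.nbytes:
            return False
    return True


rng = np.random.default_rng(0)
X = np.asfortranarray(rng.normal(size=(100, 3)))
cols = [X[:, j] for j in range(X.shape[1])]

# Generic path: stacking separate 1D arrays allocates a new 2D array.
stacked = np.column_stack(cols)

# Special case: the columns are back to back, so one 2D view can span them.
# as_strided does no bounds checking, hence the explicit check first.
view = None
if are_sequential(cols):
    n, k = len(cols[0]), len(cols)
    itemsize = cols[0].itemsize
    view = as_strided(cols[0], shape=(n, k), strides=(itemsize, n * itemsize))
```

The view is only safe while something keeps the original buffer alive, which is the "keep around the separate views" caveat mentioned above.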
Another 5 cents to this discussion.
In my view, there are 2 important design choices for a ML library concerning data structures:
This SLEP is about the first point, the internal data structure, and therefore touches the very foundation and goes far beyond feature names (at least in this pandas in pandas out version). This brings me to the following (provocative1) questions:
1 So some questions better stay unanswered:smirk:
This SLEP is about the first point, the internal data structure, and therefore touches the very foundation and goes far beyond feature names
One of the motivation goals is to get feature names. This SLEP describes an approach to get there and what the ramifications are for this approach.
From my point of view, we cannot really separate the data structure into internal and external (in terms of being in a pipeline). When a third party constructs a custom transformer, transform would need to output a data structure based on a configuration flag. In all cases, we will be putting more requirements on third party estimators.
The most efficient thing to do is to have our own data structure, where we have complete control of it (i.e. the InputArray SLEP). Users would need to convert this data structure into a pandas dataframe to be able to use it with other libraries. Since this is a new data structure, third parties will need to learn how to construct and use it.
This SLEP focuses on using pandas, which most developers in the PyData ecosystem know how to use. The biggest downside is that we do not have control over it and it will most likely not be as efficient.
A mixture between the two approaches would be great:
We would create helper functions for developers so they can more easily support this.
We could distinguish between optimized internal and external and it's something I have thought about. I don't think we really gain a lot by making this distinction as it would complicate the interface with a negligible impact on performance. We would still have to implement the "external" interface within the pipeline to support third-party estimators, as @thomasjpfan says.
I don't really think this is about efficiency, it's about interfaces. As I said, I'm not really concerned with a potential copy right now. But even if we consider that, I don't think there's a clear answer to what the "most efficient" approach would be. The "most efficient" data structure for OneHotEncoding, gradient boosting and logistic regression might also not be the same.
I'm answering some of @lorentzenchr's questions below.
I quite agree with @amueller 's assessment on internal vs external data structures. I really don't think we should go down that road.
I had another talk with @TomAugspurger , and I think:
Whatever we choose, should stay there for a while. This is kind of a big change and I don't think it'd be nice to introduce the change, and then realize we should change again in two years, and this is why I would consider the upcoming changes in pandas into account when considering the options.
That said, there are two aspects which make me more in favour of xarray than pandas. One is the changes which are [most probably] coming to pandas, the other one is that xarray is much more lightweight and much closer to numpy arrays, hence less change would be required from our side.
On the change side, we have the columnar representation which would mean two guaranteed copies on the pd.DataFrame(np.asarray(dataframe)) operation, even if they're all floats. I do believe this is not a desired behavior if the user wants to have feature names.
Another change is the NA vs nan distinction coming to pandas. I do think it'd be nice to distinguish between the two, i.e. one missing and one the output of 0/0 for instance, but the fact that numpy doesn't have support for missing values would mean further distance between what operations should do on those values. For instance, would the imputer then fill the NA values but not nans, since nans are not missing values anymore? It may be a good idea, but I'm not sure if we want to go down that road now. Also, we could have the support in imputers for them to understand the distinction and accept a pandas DataFrame, but I'm not sure if we'd want to expand that to the rest of the library.
On the other hand, xarray data structures are pretty much numpy arrays with some metadata attached to them. It'd be much easier for us to work with them since we've been doing it all along, and not much changes. It's also pretty lightweight, and my gut feeling is that they'd be open to some changes if need be.
Another aspect is that xarray containers have been supporting other metadata than just the feature/row names, and it may be useful for us to attach them to the data (another way to do sample/feature/data props for instance).
When I think of these two data structures, I really feel the xarray ones are closer to what we'd like to have and work with. And to be clear, by no means I'm saying users should give us xarray containers. They can, but the way I imagine the workflow is:
df = fetch_openml(..., as_frame=True)
# do whatever needed with the dataframe
pipe = Pipeline(...).fit(df) # predictor pipeline
pipe.predict(df) # works and feature names are all set and passed through
Xt = Transformer().fit(df).transform(df) # returns an xarray
df_t = sklearn.utils.dataframe(Xt) # returns a pandas DataFrame
Also, I think there's value in supporting xarray.Dataset for heterogeneous data, but we need to experiment with it a bit further.
@adrinjalali In the imagined workflow, it seems like the transformer needs to understand pandas and xarray as input. Specifically, the output of the transformer is an xarray and if it is placed in a pipeline, the next transformer would need to take it as input.
This means third party estimators would need to do, xarray or pandas in -> xarray out?
Just my bias, but I'd like to see type(X) in -> type(X) out, at least for ndarray, pandas.DataFrame, and xarray.DataArray (possibly xarray.Dataset, not sure). I'm not sure if that makes sense for every transformer, but if it worked I think people would be happy. I think it also satisfies some of Adrin's concerns.
On the change side, we have the columnar representation which would mean two guaranteed copies on the pd.DataFrame(np.asarray(dataframe)) operation, even if they're all floats. I do believe this is not a desired behavior if the user wants to have feature names.
If / when pandas switches to a column store (which isn't guaranteed at this point), users will face that double-copy when providing a DataFrame at the start of the pipeline. But if both DataFrames and xarray objects are accepted, users can reduce that to a single copy up front with a DataFrame.to_xarray() at the start of the pipeline.
Another change is the NA vs nan distinction coming to pandas.
I think this is mostly orthogonal to the container discussion. This can happen today when a user provides a DataFrame with NAs, and scikit-learn converts it to an object-dtype ndarray (which will presumably cause errors). Pandas adding support for nullable floating & datetime dtypes will increase the frequency with which users hit this, but that's a difference in degree, not kind. It'd be nice for scikit-learn to natively understand this "NA thing", but I think that's true regardless of whether it accepts & returns DataFrames. It'll require coordination across projects (including NumPy and possibly Python itself).
Another aspect is that xarray containers have been supporting other metadata than just the feature/row names, and it may be useful for us to attach them to the data (another way to do sample/feature/data props for instance).
Pandas recently added .attrs, which mirrors xarray's and h5py's. It's currently experimental but hasn't had any bug reports, so we'll move it to stable soon. I think transformers can reasonably rely on these behaving the same between pandas and xarray.
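For readers who haven't used it, a minimal sketch of what attaching props through `DataFrame.attrs` could look like. The keys (`feature_props`, `source`) are made up for illustration and are not a proposed convention; how `attrs` propagates through pandas operations is not shown here.

```python
import pandas as pd

df = pd.DataFrame({"height": [1.0, 2.0], "width": [3.0, 4.0]})

# attrs is a plain dict of dataset-level metadata; the keys are hypothetical.
df.attrs["feature_props"] = {"height": {"unit": "cm"}, "width": {"unit": "cm"}}
df.attrs["source"] = "toy example"

# A consumer could read the per-feature props back by column name.
units = {name: props["unit"] for name, props in df.attrs["feature_props"].items()}
print(units)
```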
First, I think it's important to capture all these arguments in a SLEP, so they are nicely summarized.
Second, I think @thomasjpfan makes an important point, which also was my first thought reading your message: we need to distinguish input and output. I agree that handling pandas input does make things tricky (I think more so than pandas output). However, I think we do want to handle pandas input. So using xarray as output will not really save us from that. @TomAugspurger made this great table in https://github.com/scikit-learn/enhancement_proposals/pull/25#issuecomment-571640817 (I think the table has a typo though, the last row appears twice and ndarray input doesn't appear for dataframe fit)
On the change side, we have the columnar representation which would mean two guaranteed copies on the pd.DataFrame(np.asarray(dataframe)) operation, even if they're all floats. I do believe this is not a desired behavior if the user wants to have feature names.
Shouldn't that be only a single copy?
To a first approximation, I think we can ignore the input to fit to make the table a bit easier, and look at transform input vs transform output. So @adrinjalali's proposal is
Transform Input | Transform Output |
---|---|
ndarray | ndarray |
DataArray | DataArray |
DataFrame | DataArray |
AnythingElse | ndarray |
@TomAugspurger's proposal is
Transform Input | Transform Output |
---|---|
ndarray | ndarray |
DataArray | DataArray |
DataFrame | DataFrame |
AnythingElse | ndarray |
and my proposal was
Transform Input | Transform Output |
---|---|
ndarray | ndarray |
DataFrame | DataFrame |
AnythingElse | ndarray |
I'm not opposed to @tomaugspurger's proposal either. I feel like now we're slowly going into duck-array territory. At some point for duck arrays / NEP 37 etc we might get type(X) in -> type(X) out for some types for some estimators and then that table becomes much more tricky.
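As a rough illustration of the smallest table above (ndarray in, ndarray out; DataFrame in, DataFrame out; anything else falls back to ndarray), a sketch could look like this. `wrap_like_input` and `toy_scale` are hypothetical names, and the scaling step just stands in for a real transformer.

```python
import numpy as np
import pandas as pd


def wrap_like_input(X_in, values, feature_names):
    """Return a DataFrame only when the input was a DataFrame, else an ndarray."""
    if isinstance(X_in, pd.DataFrame):
        return pd.DataFrame(values, index=X_in.index, columns=feature_names)
    return np.asarray(values)


def toy_scale(X):
    """Stand-in transformer: standardize columns, then wrap like the input."""
    values = np.asarray(X, dtype=float)
    scaled = (values - values.mean(axis=0)) / values.std(axis=0)
    names = list(X.columns) if isinstance(X, pd.DataFrame) else None
    return wrap_like_input(X, scaled, names)


frame = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 6.0, 8.0]}, index=[10, 11, 12])
out_frame = toy_scale(frame)              # DataFrame, index and names kept
out_array = toy_scale(frame.to_numpy())   # ndarray
out_list = toy_scale([[1, 2], [3, 5]])    # "anything else" falls back to ndarray
```

Adding a DataArray row would mean one more `isinstance` branch, which is where the table starts to grow towards the duck-array question.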
@adrinjalali can you maybe rephrase your arguments in terms of how xarray is better for outputs? And/or argue why pandas special behaviors are not an issue for inputs?
This means third party estimators would need to do, xarray or pandas in -> xarray out?
They could always fallback to ndarray for either of them and things would "work" but it would swallow feature names and other meta-data. It's a good question of whether whatever we decide here will be enforced in some way through estimator checks. I guess that depends a bit on whether we want to "just" have a config flag, or also aim to make this the default behavior.
Shouldn't that be only a single copy?
Too in the weeds for this discussion, but it's at least 1, possibly two (depending on if we automatically go to Arrow memory 😄). Safe to say it's one or more copies.
They could always fallback to ndarray for either of them and things would "work" but it would swallow feature names and other meta-data.
That's the hardest part for me... It's essentially tying feature names to the ability of a transformer to return a DataFrame (or DataArray). Is that an acceptable limitation for scikit-learn?
That's the hardest part for me... It's essentially tying feature names to the ability of a transformer to return a DataFrame (or DataArray). Is that an acceptable limitation for scikit-learn?
I think it is acceptable. In any case it will always be tied to whether the transformer implements how to transform feature names. So they will never be entirely free for someone implementing an estimator (unless you automatically generate them using some inherited property but that's probably still not automatic)
So @adrinjalali's proposal is ...
Not really. My proposal is that the output only depends on the global flag the user sets for whether or not feature names are enabled. If yes, always return a DataArray (or a Dataset); if not, always return an ndarray, independent of the input.
@rth we're having a meeting discussing the SLEP right now :) https://www.google.com/url?q=https://anaconda.zoom.us/j/158272910?pwd%3DNURzbWhTSXNYNFhRMVYxRFVybjUrdz09&sa=D&usd=2&usg=AOvVaw0d6Fs7-D2t5Pyz28va20Wd
@adrinjalali all of them are conditional on a flag, so I would summarize your proposal as
Transform Input | Transform Output |
---|---|
ndarray | DataArray |
DataArray | DataArray |
DataFrame | DataArray |
AnythingElse | DataArray |
Thanks for working on this it's certainly a challenging topic!
To add to the xarray vs pandas discussion on the subject of sparse arrays; below are quick benchmark results for an input CSR matrix with n_samples=1000, n_features=100k, sparse density=0.01 (i.e. 1M non zero elements):
scipy: copy (1 ms), CSR->COO (3 ms), CSR->CSC (11 ms)
pandas: using pd.DataFrame.sparse.from_spmatrix (17 s, i.e. >2000x slower than the CSR->COO conversion). While things are better in low dimensional space (e.g. with OneHotEncoder), for any application where feature dimensionality is large (e.g. text vectorization) this is IMO not acceptable. (In a different regime, with n_samples=1M, n_features=100, this conversion is ~5x slower than the CSR->COO conversion @ 40 ms.) pandas stores each column as a pd.arrays.SparseArray as far as I understand (i.e. will create n_features such objects), which is not fully equivalent to using CSC, and the reduced performance in high dimensionality seems to be expected https://github.com/pandas-dev/pandas/issues/32196 (not sure if there is a way to address it).
xarray: doesn't have built-in sparse array support, however it works with objects that expose an ndarray interface, and in particular pydata/sparse. It only supports COO but conversion times with xarray.DataArray(sparse.COO.from_scipy_sparse(X)) are reasonable (20 ms with the original data). It would have also worked with scipy.sparse directly if someone wrote a sparse ndarray class (https://github.com/scipy/scipy/issues/8162). As a thin wrapper around the data with added labels I do like the xarray approach. Pure speculation, but I imagine if there was an ndarray object that supported a categorical dtype it shouldn't be overly complex to make it work with xarray. The downside of xarray for sparse (at least for now) is that it needs to install pydata/sparse, which in turn depends on numba and llvmlite. It may have some performance benefits in the long term, but it's still 3 extra dependencies.
Benchmark code can be found here.
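For anyone who wants to reproduce the flavour of the comparison locally, here is a scaled-down sketch (this is not the linked benchmark). The matrix size is deliberately tiny so it runs quickly; `time_call` is a hypothetical helper, and absolute timings will depend on the machine and library versions.

```python
import time

import pandas as pd
from scipy import sparse


def time_call(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Much smaller than the 1000 x 100k case above, same density.
X = sparse.random(1000, 200, density=0.01, format="csr", random_state=0)

coo, t_coo = time_call(X.tocoo)
csc, t_csc = time_call(X.tocsc)
df, t_df = time_call(lambda: pd.DataFrame.sparse.from_spmatrix(X))

print(f"CSR->COO: {t_coo:.4f}s  CSR->CSC: {t_csc:.4f}s  from_spmatrix: {t_df:.4f}s")
```

Increasing the number of columns while keeping the number of non-zeros fixed is the regime where the per-column overhead described above should become visible.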
FYI, pandas has our monthly dev call today at 18:00 UTC (~3 hours from this post). I've added some items that came up here to the agenda: https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing. All are welcome to join if anyone has anything to add.
Another FYI: I've kickstarted a discussion on a dataframe protocol, which I find very insightful in terms of the discussion that we are having here: https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267/30
If the discussion above would pan-out, it would help in general for the input, but also for the output (giving us more freedom on the output choice).
@TomAugspurger memory copies really bother me. Memory usage is a weak point of our ecosystem. Going out-of-core solves it, but brings in a very large complexity. Would it be an option to have 2 data structures in pandas, with 2 different memory layouts: one optimized for relational algebra, which would induce memory copies, and one optimized for "primal-space analytics", i.e. row-wise addressing, which would not induce memory copies? This tension between two use cases, and the related computing tradeoffs in the data structures, is a classic. It is present for instance with sparse matrices, for which there are different layouts, optimal for different types of algorithms.
I'd really like to have a strategy for moving forward with this. I don't think the outcome of the dataframe discussion is a prerequisite for that.
@adrinjalali suggested that @thomasjpfan writes a proposal / implementation with xarray.
I'm not sure how helpful that will be. I think we need to either agree on a list of evaluation criteria for a proposal or at least some requirements, and a way to actually make a decision. As mentioned on the call, I'm not super in favor of forcing a vote, but also I don't want to keep discussing and making prototypes forever. There is probably no simple answer. Also, if we vote on individual SLEPs (like this one) we can also end up in the situation that we reject all proposals, which also doesn't help things.
@GaelVaroquaux re memory copies: sklearn already makes tons of copies, so honestly it's not what I'm most concerned about. If this was a major concern for you, I would have expected this to show up on the roadmap somewhere. We don't even have a process to measure how many copies are being made, right? Also, if it's an opt-in procedure it might not impact existing workflows as much.
Given a standard pipeline with scaling, imputation and OneHotEncoder, how much memory do we need right now? 1x data size? 2x data size? 10x data size? I honestly don't know the answer (though I would be surprised if it's as small as 1 or 2).
@GaelVaroquaux re memory copies: sklearn already makes tons of copies, so honestly it's not what I'm most concerned about. If this was a major concern for you, I would have expected this to show up on the roadmap somewhere.
The French team has been working on reducing them, and on support for float32 over the last year.
Also, there has been a lot of work in joblib / loky to memmap and try to share as much as possible across workers.
So I think that yes, my comment is consistent with our strategy. These efforts amount to several person-years of work.
We don't even have a process to measure how many copies are being made, right?
memory_profiler, and integration in sphinx-gallery (both also pushed forward by us).
Given a standard pipeline with scaling, imputation and OneHotEncoder, how much memory do we need right now? 1x data size? 2x data size? 10x data size?
Well, the more I study missing values, the more I am convinced that any sophisticated imputation shouldn't be part of a "standard pipeline" :).
I would say, typically x3. I'd like to get this down, not up.
I know your team has worked on some cool float32 stuff and I don't want to belittle that work at all. However I haven't seen any benchmarks for a pipeline; maybe you've done them but not shared them?
And I agree, we probably don't want fancy imputation, but we probably still need mean imputation.
Generally I'd love to get the memory copies down, but I don't like making an argument using something for which I haven't seen measurements. I simply don't know if there is a common case where this copy would make or break a workflow.
Fair enough. We should prototype and measure before making any choice. I agree with you.
Anyway, my main point is that we should make a plan on how we will make a decision.
So what would you want to prototype and measure and how would you make decisions based on the outcome?
I'm pretty sure any outcomes would depend on the choice of pipeline that's used and I can imagine that under some conditions pandas-in pandas-out will require less memory, while it will probably require slightly more in most cases (assuming a memory copy there which we can't really benchmark because that implementation of pandas doesn't exist).
I'm ok with setting up goals and criteria but I think just writing up one more prototype (we have like 5 or so right now?) will help us make a decision.
Is there an amount of memory overhead we're willing to accept? And how would we trade memory overhead vs user interface cost?
I also think that worrying about a memory copy that might happen in the future only if pandas implements things a certain way, only if the user provides pandas input (while ndarray input continues to be cheap) and while the status quo and many valuable DataFrame use cases already entail a copy, is probably premature optimisation. We should likely be biased towards pandas because that's what users already expect to work.
I share Joel's outlook: the potential pandas refactor that would induce the additional memory copy is uncertain. It's some years off at a minimum, and I wouldn't be surprised if it never happened. We discussed this at our dev call on Wednesday (which @thomasjpfan joined) but aren't at the point where we can give any guidance; things are just too uncertain.
Not to derail things with a new proposal at this late hour, but... we have two potential contenders for <labeled array in> -> <labeled array out>. Which means a protocol! Scikit-learn could define something like the following (with dummy implementations for DataFrame).
def __sklearn_feature_names__(self) -> List[str]:
    return list(self.columns)

def __sklearn_data__(self, ...):  # dtype? Anything else?
    """Return the values for the estimator."""
    return self.to_numpy()

def __sklearn_transformed_result__(self, original, transformed, feature_names):  # original?
    """Wrap the result"""
    return pd.DataFrame(transformed, index=original.index, columns=feature_names)
I'm sure pandas (and likely xarray) would happily implement that protocol. (We're also more comfortable diving into our internals to avoid memory copies, but that's another story). However, I think it's premature to set down a protocol. I personally don't have a good opinion on what sorts of arguments would be helpful in these methods. I think we might want to learn a bit first, before freezing the API in a protocol.
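To get a feel for how such a protocol might be consumed, here is a runnable sketch under stated assumptions. Since DataFrame does not implement these hooks, a hypothetical `FrameAdapter` carries them; because the adapter already holds the original frame, the `original` argument is dropped from the wrap hook. `transform_with_protocol` is likewise an illustrative name, not a scikit-learn function.

```python
import numpy as np
import pandas as pd


class FrameAdapter:
    """Hypothetical adapter giving a DataFrame the three protocol hooks."""

    def __init__(self, frame):
        self.frame = frame

    def __sklearn_feature_names__(self):
        return list(self.frame.columns)

    def __sklearn_data__(self):
        return self.frame.to_numpy()

    def __sklearn_transformed_result__(self, transformed, feature_names):
        # Re-attach the original row labels and the new feature names.
        return pd.DataFrame(transformed, index=self.frame.index, columns=feature_names)


def transform_with_protocol(X, func, name_func):
    """Apply func to the raw values and re-wrap the result if X supports the protocol."""
    if not hasattr(X, "__sklearn_data__"):
        return func(np.asarray(X))  # plain fallback: no labels to carry
    names = name_func(X.__sklearn_feature_names__())
    out = func(X.__sklearn_data__())
    return X.__sklearn_transformed_result__(out, names)


frame = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=["r0", "r1"])
doubled = transform_with_protocol(
    FrameAdapter(frame),
    func=lambda values: values * 2,
    name_func=lambda names: [f"{n}_x2" for n in names],
)
plain = transform_with_protocol([[1.0, 2.0]], lambda v: v * 2, lambda names: names)
```

The transformer only ever sees an ndarray; everything container-specific lives behind the three hooks, which is what would let pandas and xarray each supply their own implementation.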
So what would you want to prototype and measure
We set up a bunch of realistic pipelines and check the memory usage with memory_profiler. We can also set up fake pipelines with dummy transformers that never/always copy for a best/worst case scenario. We can simulate pandas potential future behavior by artificially introducing copies, e.g. with a wrapper for dataframes.
Now about copies, dumb question: how bad can it be?
Say we pass X_orig to a pipeline of transformers that, for the sake of the argument, makes copies of the input at every single step.
X_orig -> STEP1 (copy) -> ... STEPN (copy)
Each transformer looks like
def fit_transform(self, X):
    X_copy = validate(X)  # make a copy
    X_out = ...  # allocate output
    # do the work ...
    return X_out
So at any point during the execution, we have at most 4 datasets in memory: X_orig, X, X_copy and X_out. In a best case scenario where we never copy, we still have at least 3 datasets (we just don't have X_copy).
That's only 1 extra dataset for the worst case scenario then?
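One way to sanity-check this counting argument is a small `tracemalloc` experiment. This is only a sketch with dummy steps: `step` and `peak_in_units_of_X` are hypothetical names, `X_orig` is allocated before tracing starts so it is not counted, and it assumes NumPy reports its buffer allocations to `tracemalloc`.

```python
import tracemalloc

import numpy as np


def step(X, copy):
    """Dummy transformer: optionally copy the input, then allocate an output."""
    X_in = X.copy() if copy else X  # stands in for input validation
    X_out = X_in * 2.0              # allocate output
    return X_out


def peak_in_units_of_X(n_steps, copy, shape=(200_000, 5)):
    """Peak traced memory during the pipeline, expressed in multiples of X_orig."""
    X_orig = np.ones(shape)  # allocated before tracing, so not counted
    tracemalloc.start()
    X = X_orig
    for _ in range(n_steps):
        X = step(X, copy)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / X_orig.nbytes


peak_with_copy = peak_in_units_of_X(3, copy=True)
peak_without_copy = peak_in_units_of_X(3, copy=False)
print(f"always copy: {peak_with_copy:.2f}x, never copy: {peak_without_copy:.2f}x")
```

If the reasoning above holds, the always-copy pipeline should peak about one dataset higher than the never-copy one, regardless of the number of steps.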
I had a realization last night (which I think was already known to others like @jreback). I think that the memory copy concern may not be an issue even if / when pandas goes to a column store in the future. The rest of this post is going to go into some internal pandas details.
Consider a user with a DataFrame df and a pipeline make_pipeline(StandardScaler(), PCA()). There are two places where we need to watch out for memory copies:
1. DataFrame to ndarray (in check_array, say)
2. ndarray to DataFrame (when wrapping the transformed result)
Let's actually start with the second case, which is pretty easy and sets up the first. In a column-store future, pandas will "split" this 2D ndarray into a sequence of 1D arrays. Each of these 1D arrays will be views on the original memory, so no memory copies.
In [2]: a = np.ones((10, 5))
In [3]: slices = {f'{i}': a[:, i] for i in range(a.shape[1])}
In [4]: df = pd.DataFrame(slices)
In [5]: df._data.blocks # 2D array split into 5 "blocks". This doesn't happen today, but may in the future
Out[5]:
(FloatBlock: slice(0, 1, 1), 1 x 10, dtype: float64,
FloatBlock: slice(1, 2, 1), 1 x 10, dtype: float64,
FloatBlock: slice(2, 3, 1), 1 x 10, dtype: float64,
FloatBlock: slice(3, 4, 1), 1 x 10, dtype: float64,
FloatBlock: slice(4, 5, 1), 1 x 10, dtype: float64)
In [6]: df._data.blocks[0].values.base # each block is a view on 'a'
Out[6]:
array([[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]])
(@jorisvandenbossche can confirm that you can create pyarrow arrays as a view on an ndarray with no missing values for primitive types like int64?)
Now, for potential copy 1, which is where pandas can get creative in DataFrame.__array__.
For the special case of np.asarray(DataFrame(2d_array)), we can check that the .values.base of each array is the same object. In this case, we can just return that .base, no memory copy!
In [19]: b0 = df._data.blocks[0].values.base
In [20]: if all(blk.values.base is b0 for blk in df._data.blocks):
...: arr = b0
...:
In [21]: arr
Out[21]:
array([[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]])
So tldr: I think we're able to do things without memory copies from 2D ndarray -> DataFrame -> 2D ndarray, even in a future where pandas has only 1D blocks.
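The same base check can be sketched without touching pandas internals, using a plain list of 1D views in place of the future 1D blocks. `to_2d_without_copy` is a hypothetical helper; besides comparing `.base`, it also verifies that the views are exactly the parent's columns, in order, since a shared base alone does not guarantee that.

```python
import numpy as np


def to_2d_without_copy(columns):
    """Return the shared 2D parent if `columns` are exactly its columns, else None."""
    base = columns[0].base
    if base is None or base.ndim != 2 or base.shape[1] != len(columns):
        return None
    for j, col in enumerate(columns):
        expected = base[:, j]
        same_start = (
            col.__array_interface__["data"][0]
            == expected.__array_interface__["data"][0]
        )
        if (
            col.base is not base
            or col.shape != expected.shape
            or col.strides != expected.strides
            or not same_start
        ):
            return None
    return base


a = np.ones((10, 5))
columns = [a[:, i] for i in range(a.shape[1])]  # views on `a`, like the blocks above

arr = to_2d_without_copy(columns)                            # the parent, no copy
fallback = to_2d_without_copy([c.copy() for c in columns])   # independent arrays
shuffled = to_2d_without_copy(columns[::-1])                 # wrong column order
```

When the helper returns None, the caller would fall back to the ordinary copying path.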
There seems to be some consensus here that we should be sticking to pandas out for now. Any strong objections? Can we progress on resolving the comments and moving towards vote?
+1
Okay I will move this forward.
The only thing I am concerned about is the sparse performance (as noted in https://github.com/scikit-learn/enhancement_proposals/pull/37#issuecomment-596577388).
Should we hedge a little and adjust the configuration flag to:
set_config(array_in_out='pandas')
where the default is 'ndarray'. This leaves open the possibility of supporting other dataframe-like objects.
Following up on https://github.com/scikit-learn/enhancement_proposals/pull/37#issuecomment-598449861, I too would be more comfortable if I could inform my voting decision with benchmark results.
@NicolasHug what would you measure? @TomAugspurger laid out a zero-copy strategy. So would you want to do a prototype implementation of that and then measure if there's any unforeseen consequences? Or do you mean with the current implementation that we also expect to have no memory copy to confirm that there is indeed none?
@thomasjpfan just to be clear, your proposal is to always create a pandas dataframe for array_in_out='pandas', even if the input is an ndarray or sparse matrix, right?
I mean we could have an option 'pandas' and an option 'pandas_if_pandas_in' but I'm not sure how user-friendly that is.
If we always return pandas, I am a bit concerned in the sparse case, because that will mean a memory copy from my understanding, right?
Btw, @adrinjalali in your proposal, did you also suggest always returning a DataArray for sparse data?
My understanding of @thomasjpfan's proposal is have an option to do
Transform Input | Transform Output |
---|---|
ndarray | DataFrame |
scipy.sparse | DataFrame |
DataArray | DataFrame |
DataFrame | DataFrame |
AnythingElse | DataFrame |
with the potential to have the same for DataArray in the future by doing array_out='xarray'.
I wonder if
Transform Input | Transform Output |
---|---|
ndarray | ndarray |
scipy.sparse | scipy.sparse |
AnythingElse | DataFrame |
or even
Transform Input | Transform Output |
---|---|
ndarray | DataFrame |
scipy.sparse | scipy.sparse |
AnythingElse | DataFrame |
might be more effective.
This SLEP proposes pandas in pandas out as an alternative to SLEP 012 InputArray.