
Enable zero-copy `to_dataframe` #9792

Open · rabernat opened this issue 4 days ago

rabernat commented 4 days ago

What is your issue?

Calling Dataset.to_dataframe() currently always makes an in-memory copy of every array. For large datasets, or for workflows where the DataFrame should alias the original data, this is wasteful. We should make it possible to convert Xarray objects to Pandas objects without a memory copy.

This behavior may depend on Pandas version. As of 2.2, here are the relevant Pandas docs: https://pandas.pydata.org/docs/user_guide/copy_on_write.html

Here's the key point:

Constructors now copy NumPy arrays by default

The Series and DataFrame constructors will now copy NumPy arrays by default when not otherwise specified. This was changed to avoid mutating a pandas object when the NumPy array is changed inplace outside of pandas. You can set copy=False to avoid this copy.

When we construct DataFrames in Xarray, we do it like this:

https://github.com/pydata/xarray/blob/d5f84dd1ef4c023cf2ea0a38866c9d9cd50487e7/xarray/core/dataset.py#L7386-L7388
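For reference, that call boils down to something like the standalone sketch below (variable names here are illustrative, not the actual xarray internals): a dict of flattened column arrays is passed to pd.DataFrame with no copy argument, and for dict input Pandas behaves like copy=True.

import numpy as np
import pandas as pd

# Illustrative stand-ins for the pieces xarray assembles internally.
columns = ["foo"]
data = [np.ones(1_000_000)]          # flattened column values
index = pd.RangeIndex(1_000_000, name="x")

# No copy= argument: for dict input, Pandas defaults to copying each array.
df = pd.DataFrame(dict(zip(columns, data)), index=index)
print(np.shares_memory(df["foo"].values, data[0]))  # -> False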

Here's a minimal example at the Xarray level:

import numpy as np
import pandas as pd
import xarray as xr
ds = xr.DataArray(np.ones(1_000_000), dims=('x',), name="foo").to_dataset()
df = ds.to_dataframe()
print(np.shares_memory(df.foo.values, ds.foo.values))  # -> False

# can see the memory locations
print(ds.foo.values.__array_interface__)
print(df.foo.values.__array_interface__)

# compare to this
df2 = pd.DataFrame(
    {
        "foo": ds.foo.values,
    },
    copy=False
)
print(np.shares_memory(df2.foo.values, ds.foo.values))  # -> True
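That shared buffer is exactly the hazard the Pandas docs mention: an in-place change to the NumPy array outside of Pandas becomes visible in the DataFrame. Continuing the example above (assuming df2 was built with copy=False as shown):

# Because df2 shares its buffer with ds, an in-place write on the xarray
# side is visible when reading through the DataFrame.
ds.foo.values[0] = -1.0
print(df2.foo.iloc[0])  # -> -1.0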

Solution

I propose we add a copy keyword option to Dataset.to_dataframe() (and similarly for DataArray) which defaults to True (the current copying behavior) but lets users pass copy=False when they want a zero-copy conversion.
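Concretely, the signature could look something like the sketch below (hypothetical, not actual xarray code; the existing dim_order keyword is kept, and copy would simply be forwarded to the pd.DataFrame call shown earlier):

# Hypothetical sketch of the proposed keyword, not the actual xarray API.
def to_dataframe(self, dim_order=None, *, copy=True):
    # copy=True keeps today's behavior (columns are copied);
    # copy=False forwards the request to pd.DataFrame(..., copy=False)
    # so columns can reuse the existing NumPy buffers where possible.
    ...

# Hypothetical usage once the keyword exists:
df = ds.to_dataframe(copy=False)
print(np.shares_memory(df.foo.values, ds.foo.values))  # hoped-for result: True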