Calling Dataset.to_dataframe() currently always produces a memory copy of all arrays. This is definitely not optimal for all scenarios. We should make it possible to convert Xarray objects to Pandas objects without a memory copy.
The key point, from the Pandas copy-on-write docs:

> The Series and DataFrame constructors will now copy NumPy arrays by default when not otherwise specified. This was changed to avoid mutating a pandas object when the NumPy array is changed in place outside of pandas. You can set `copy=False` to avoid this copy.
When we construct DataFrames in Xarray, we do it like this:
```python
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.DataArray(np.ones(1_000_000), dims=("x",), name="foo").to_dataset()
df = ds.to_dataframe()
print(np.shares_memory(df.foo.values, ds.foo.values))  # -> False

# inspect the memory locations
print(ds.foo.values.__array_interface__)
print(df.foo.values.__array_interface__)

# compare to constructing the DataFrame directly with copy=False
df2 = pd.DataFrame({"foo": ds.foo.values}, copy=False)
print(np.shares_memory(df2.foo.values, ds.foo.values))  # -> True
```
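The mutation hazard that Pandas' copying default guards against can be sketched directly. This is a minimal illustration, assuming Pandas ≥ 2.x constructor semantics:

```python
import numpy as np
import pandas as pd

arr = np.zeros(4)

# With copy=False the DataFrame reuses arr's buffer, so an in-place change
# made outside pandas shows up in the DataFrame as well.
df_shared = pd.DataFrame({"foo": arr}, copy=False)
arr[0] = 42.0
print(df_shared.foo.iloc[0])  # -> 42.0

# With the default (copying) constructor, the DataFrame is insulated
# from later in-place changes to arr.
df_copied = pd.DataFrame({"foo": arr})
arr[1] = 99.0
print(df_copied.foo.iloc[1])  # -> 0.0
```

This is exactly the trade-off a `copy=False` conversion would expose to users: zero-copy speed in exchange for shared, mutable buffers.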
Solution
I propose we add a `copy` keyword option to `Dataset.to_dataframe()` (and similar for `DataArray`) which defaults to `True` (the current behavior) but allows users to select `False` if that's what they want.
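In the meantime, the zero-copy path can be approximated in user code. This is a hypothetical sketch — the helper name `to_dataframe_nocopy` is made up, and it only handles the simplest case of 1-D data variables along a single dimension:

```python
import numpy as np
import pandas as pd
import xarray as xr

def to_dataframe_nocopy(ds: xr.Dataset) -> pd.DataFrame:
    """Hypothetical zero-copy variant of Dataset.to_dataframe().

    Sketches only the simplest case: every data variable is 1-D along
    the Dataset's single dimension.
    """
    (dim,) = ds.sizes  # assumes exactly one dimension
    index = ds.indexes[dim] if dim in ds.indexes else pd.RangeIndex(ds.sizes[dim])
    return pd.DataFrame(
        {name: var.values for name, var in ds.data_vars.items()},
        index=index,
        copy=False,  # reuse the underlying NumPy buffers instead of copying
    )

ds = xr.DataArray(np.ones(1_000_000), dims=("x",), name="foo").to_dataset()
df = to_dataframe_nocopy(ds)
print(np.shares_memory(df.foo.values, ds.foo.values))  # -> True
```

With the proposed keyword this would simply be `ds.to_dataframe(copy=False)`; the helper only demonstrates that Pandas can already wrap the existing buffers without copying.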
Additional context

This behavior may depend on the Pandas version. As of 2.2, here are the relevant Pandas docs: https://pandas.pydata.org/docs/user_guide/copy_on_write.html

The DataFrame construction in question is here in the Xarray source:
https://github.com/pydata/xarray/blob/d5f84dd1ef4c023cf2ea0a38866c9d9cd50487e7/xarray/core/dataset.py#L7386-L7388