modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

BUG: Writes to `DataFrame.attrs` are not preserved #7401

Open noloerino opened 6 days ago

noloerino commented 6 days ago

Modin version checks

Reproducible Example

import modin.pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})  # any small frame reproduces the issue
df.attrs["x"] = 1
df.attrs  # attrs dict is still empty

Issue Description

DataFrame.attrs lets users attach metadata to a frame; pandas deep-copies that metadata to the new dataframes produced when operations are performed. In Modin, attrs defaults to pandas, which means that writes to it are not reflected in the original frame, much less propagated through other operations.

When a write to attrs is attempted, it only modifies the attrs field of the native pandas.DataFrame that's produced within DataFrame._default_to_pandas, and the modin.pandas.DataFrame has no knowledge of this operation.
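To illustrate the mechanism, here is a toy model in plain Python. It is not Modin's code: `NativeFrame`, `DistributedFrame`, and `_to_native` are hypothetical stand-ins for the native pandas frame, the Modin frame, and the default-to-pandas conversion. The only assumption is the one described above, namely that each attribute access builds a fresh native frame.

```python
import copy


class NativeFrame:
    """Hypothetical stand-in for a native pandas.DataFrame."""

    def __init__(self, data):
        self.data = data
        self.attrs = {}


class DistributedFrame:
    """Hypothetical stand-in for modin.pandas.DataFrame."""

    def __init__(self, data):
        self._partitions = data

    def _to_native(self):
        # Every call materializes a brand-new native frame.
        return NativeFrame(copy.deepcopy(self._partitions))

    @property
    def attrs(self):
        # Fallback path: read the attribute off a temporary native frame.
        return self._to_native().attrs


df = DistributedFrame({"a": [1, 2, 3]})
df.attrs["x"] = 1  # mutates the temporary frame's dict, which is then discarded
lost = df.attrs    # comes from a new temporary frame with a new, empty dict
```

In this model the write succeeds without raising, which is why the problem shows up as silently missing metadata rather than as an error.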

Expected Behavior

Writes to attrs are reflected in subsequent read operations, and propagated across operations.


Installed Versions

INSTALLED VERSIONS
------------------
commit : 1c4d173d3b2c44a1c1b5d5516552c7717b26de32
python : 3.10.13.final.0
python-bits : 64
OS : Darwin
OS-release : 23.6.0
Version : Darwin Kernel Version 23.6.0: Mon Jul 29 21:13:04 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

Modin dependencies
------------------
modin : 0.32.0+6.g1c4d173d
ray : 2.34.0
dask : 2024.8.1
distributed : 2024.8.1

pandas dependencies
-------------------
pandas : 2.2.2
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.3
Cython : None
pytest : 8.3.2
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.3.0
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.4
IPython : 8.17.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.5.0
fsspec : 2024.6.1
gcsfs : None
matplotlib : 3.9.2
numba : None
numexpr : 2.10.1
odfpy : None
openpyxl : 3.1.5
pandas_gbq : 0.23.1
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.6.1
scipy : 1.14.1
sqlalchemy : 2.0.32
tables : 3.10.1
tabulate : None
xarray : 2024.7.0
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

noloerino commented 6 days ago

See pandas discussion: https://github.com/pandas-dev/pandas/issues/52166

Though attrs is not fully mature, it seems to be used pretty frequently in downstream libraries to track metadata for use cases like plot generation, and the feature seems to be here to stay.

pandas supports propagation of attrs through __finalize__, which Modin vacuously defaults to pandas. I think the least intrusive approach for us would be to keep attrs as a non-distributed, regular Python dict and track it at the query compiler level. However, it may be better to track attrs through __finalize__ like native pandas does, though this would require changing almost every frontend method to call __finalize__ before returning.
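A rough sketch of the first option, again as a plain-Python toy rather than actual Modin code. `ToyQueryCompiler`, `ToyFrame`, and `_derive` are hypothetical names, and `head` stands in for any operation. The idea is that the compiler object owns one ordinary dict, the frontend exposes that same dict on every access, and each derived result receives a deep copy.

```python
import copy


class ToyQueryCompiler:
    """Hypothetical stand-in for a query compiler that owns data and attrs."""

    def __init__(self, data, attrs=None):
        self.data = data
        # A regular, non-distributed Python dict.
        self.attrs = {} if attrs is None else attrs

    def _derive(self, new_data):
        # Each result gets its own deep copy of the metadata.
        return ToyQueryCompiler(new_data, copy.deepcopy(self.attrs))

    def head(self, n):
        return self._derive({k: v[:n] for k, v in self.data.items()})


class ToyFrame:
    """Hypothetical frontend that delegates attrs to its compiler."""

    def __init__(self, query_compiler):
        self._query_compiler = query_compiler

    @property
    def attrs(self):
        # Return the same dict every time, so in-place writes persist.
        return self._query_compiler.attrs

    @attrs.setter
    def attrs(self, value):
        self._query_compiler.attrs = dict(value)

    def head(self, n=5):
        return ToyFrame(self._query_compiler.head(n))


df = ToyFrame(ToyQueryCompiler({"a": [1, 2, 3]}))
df.attrs["x"] = [1]         # write persists on the original frame
small = df.head(2)          # metadata is deep-copied to the result
small.attrs["x"].append(2)  # does not leak back into df
```

With this layout the frontend methods stay unchanged, since propagation happens wherever the compiler builds a result. The cost is that the compiler layer has to decide, per operation, whether metadata should carry over.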