pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: from_dict() hidden (correct) behavior not aligned with documentation and typing: accepts and processes lists of dicts #58862

Open pmaier-bhs opened 5 months ago

pmaier-bhs commented 5 months ago

Reproducible Example

# %%
import pandas as pd

# %%
b = [
    {"key1": "value1", "key2": 42},
    {"key1": "value2", "key2": 123},
]
df = pd.DataFrame.from_dict(b)  # type: ignore
print(df)

Issue Description

According to the documentation, and also the signatures defined in pandas-stubs, from_dict should not accept b, which is a list of dicts rather than a dict. But it does parse it. Is this maybe deprecated behavior?

See also https://github.com/pandas-dev/pandas-stubs/issues/929 and https://github.com/pandas-dev/pandas-stubs/issues/928

Expected Behavior

In accordance with the documentation and type signatures (pandas-stubs), from_dict should reject b instead of processing it.

Installed Versions

/Users/pmaier/Desktop/projects/machine-data-layer/dist/export/python/virtualenvs/python-default/3.10.13/lib/python3.10/site-packages/_distutils_hack/__init__.py:26: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.10.13.final.0
python-bits : 64
OS : Darwin
OS-release : 23.5.0
Version : Darwin Kernel Version 23.5.0: Wed May 1 20:14:38 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 70.0.0
pip : 23.0.1
Cython : None
pytest : 8.2.1
hypothesis : 6.103.0
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.24.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.9.0
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 14.0.1
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Dr-Irv commented 5 months ago

It's unclear whether this is a documentation issue, where sequences of dicts should be documented as allowed in from_dict(), or an implementation issue, where sequences of dicts should be rejected.

chaoyihu commented 5 months ago

I agree with @Dr-Irv.

The function signature and type hints of from_dict() indicate that data should be a dict, but in the code sample it also works when data is a list of dicts.

The discrepancy occurs because when executing the code example, from_dict() passes data to the DataFrame class constructor, which accepts data as a list of dicts.

This does not raise an error at execution time because the Python runtime does not enforce type hints, but it may cause issues with third-party tools.
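To make the pass-through described above concrete, here is a minimal sketch that compares the two code paths on the data from the original report. It assumes the default orient="columns" and the behavior observed with pandas 2.2.2; other versions or orient values may behave differently.

```python
import pandas as pd

# Same data as in the reproducible example: a list of dicts, not a dict.
b = [
    {"key1": "value1", "key2": 42},
    {"key1": "value2", "key2": 123},
]

# With the default orient="columns", from_dict() hands the data to the
# DataFrame constructor (assumption: behavior as observed in pandas 2.2.2).
via_from_dict = pd.DataFrame.from_dict(b)  # type: ignore

# The constructor itself accepts a list of dicts.
via_constructor = pd.DataFrame(b)

# Both paths should produce the same frame.
print(via_from_dict.equals(via_constructor))
```

If the two frames compare equal, the list of dicts was never inspected by from_dict() itself, which matches the explanation above.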

I think there are two ways to fix this:

  1. If it is intended that from_dict() should accept a list of dicts, the function's type hints and signature should be corrected to fit its behavior.
  2. Otherwise, a type check can be added to from_dict() to warn or raise an error when data is passed as a list of dicts.
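As a rough illustration of option 2, a guard along these lines could run before the data reaches the constructor. This is only a sketch: check_from_dict_input and its strict parameter are hypothetical names, not part of pandas, and whether to warn or raise would be up to the maintainers.

```python
import warnings


def check_from_dict_input(data, *, strict=True):
    """Hypothetical guard: accept only dict input, as the docs describe.

    With strict=True a TypeError is raised for non-dict input;
    with strict=False a UserWarning is emitted instead.
    """
    if isinstance(data, dict):
        return
    message = f"from_dict() expects a dict, got {type(data).__name__}"
    if strict:
        raise TypeError(message)
    warnings.warn(message, UserWarning, stacklevel=2)


# A dict passes silently.
check_from_dict_input({"key1": ["value1", "value2"], "key2": [42, 123]})

# A list of dicts is rejected in strict mode.
try:
    check_from_dict_input([{"key1": "value1", "key2": 42}])
except TypeError as exc:
    print(exc)
```

The warning variant (strict=False) would allow a deprecation period before a list of dicts is rejected outright.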