pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: DataFrame.from_dict doesn't work with collections.UserDict objects #59737

Open mesvam opened 2 months ago

mesvam commented 2 months ago

Reproducible Example

import pandas as pd
from collections import UserDict

class CustomDict(UserDict):
    pass

cd = CustomDict(col_1=[1, 2], col_2=[3, 4])
pd.DataFrame.from_dict(cd)

output:

       0
0  col_1
1  col_2

Issue Description

pandas does not treat UserDict and other custom dict-like objects as dicts when creating a DataFrame. Subclassing dict instead of UserDict is a workaround for this example, but for a variety of complicated reasons (examples: 1, 2, 3) it is sometimes undesirable to subclass dict.

Expected Behavior

The output should be the same as

pd.DataFrame.from_dict(dict(cd))

output:

   col_1  col_2
0      1      3
1      2      4

Installed Versions

INSTALLED VERSIONS
------------------
commit                : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python                : 3.11.9.final.0
python-bits           : 64
OS                    : Windows
OS-release            : 10
Version               : 10.0.22000
machine               : AMD64
processor             : Intel64 Family 6 Model 151 Stepping 5, GenuineIntel
byteorder             : little
LC_ALL                : None
LANG                  : None
LOCALE                : English_United States.1252

pandas                : 2.2.2
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.9.0.post0
setuptools            : 72.1.0
pip                   : 24.2
Cython                : None
pytest                : None
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : 5.2.1
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.4
IPython               : 8.25.0
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : 1.3.7
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : 3.8.4
numba                 : 0.60.0
numexpr               : 2.8.7
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : None
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.13.1
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : 0.22.0
tzdata                : 2023.3
qtpy                  : None
pyqt5                 : None

rhshadrach commented 2 months ago

Thanks for the report. I agree that this is unintuitive, but that is unfortunately because UserDict is not a subclass of dict. Because of this, it appears to me that pandas is behaving as documented: cd is not a dict, so pandas treats it as the iterable that it is, creating a single-column DataFrame. So I've reworked this issue as a feature request.

I'm supportive of treating UserDict as a dict as proposed here.
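
For readers following along, the type relationships described above can be checked with the standard library alone. This is a minimal sketch that reuses the CustomDict class from the report; it only inspects the object and does not call pandas.

```python
from collections import UserDict
from collections.abc import Mapping


class CustomDict(UserDict):
    pass


cd = CustomDict(col_1=[1, 2], col_2=[3, 4])

# UserDict wraps a dict instead of subclassing it, so a dict check fails.
is_dict = isinstance(cd, dict)

# UserDict derives from MutableMapping, so a Mapping check succeeds.
is_mapping = isinstance(cd, Mapping)

# Iterating over any mapping yields its keys, not its values.
keys_when_iterated = list(cd)

print(is_dict, is_mapping, keys_when_iterated)
# prints: False True ['col_1', 'col_2']
```

The keys produced by iteration match the single column of values shown in the reported output.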

rhshadrach commented 2 months ago

As a workaround, you can always pass cd.data instead.
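
A short sketch of why this workaround behaves like the dict(cd) call in the report: UserDict keeps its contents in a real dict on the data attribute. The snippet below uses only the standard library, and the variable name plain is just for illustration.

```python
from collections import UserDict


class CustomDict(UserDict):
    pass


cd = CustomDict(col_1=[1, 2], col_2=[3, 4])

# .data is the underlying built-in dict that UserDict wraps (no copy is made).
plain = cd.data

print(type(plain) is dict)  # True
print(plain == dict(cd))    # True: same keys and values as the dict(cd) copy
```

One caveat, stated as a general Python point rather than anything pandas-specific: if a UserDict subclass overrides __getitem__ or similar methods to transform values, cd.data will not reflect those overrides, whereas dict(cd) goes through the subclass's own methods.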

mesvam commented 1 month ago

Thanks.

I searched around and found a similar issue at https://github.com/pandas-dev/pandas/issues/34257. I don't know whether the two are related, that is, whether this is a more general issue with pandas or one specific to pandas.DataFrame.from_dict.

Also, I can kind of understand if pd.DataFrame(dict_like) didn't work, but pd.DataFrame.from_dict(dict_like) feels like it should work with anything that inherits from collections.abc.Mapping.

rhshadrach commented 1 month ago

> Also, I can kind of understand if pd.DataFrame(dict_like) didn't work...

Why is that?

> but pd.DataFrame.from_dict(dict_like) feels like it should work with anything that inherits from collections.abc.Mapping

Thanks - I did initially miss that the OP was about from_dict and not the constructor. Agreed here as well.

mesvam commented 1 month ago

> > Also, I can kind of understand if pd.DataFrame(dict_like) didn't work...
>
> Why is that?

I think it would be ideal if it worked with pd.DataFrame as well as pd.DataFrame.from_dict. My intuition, though, was that since pd.DataFrame takes in a variety of formats, it would be easier for it to get confused by a non-standard input type. For example, I could probably create a weird class that can be accessed both like an array and like a dict, and I wouldn't really expect pd.DataFrame to correctly guess how I want it to access the object.

Whereas from_dict should sort of implicitly assume that the input is a dict-like object and parse it as such, even if it's not the exact type it was expecting.
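
To make the request concrete, here is a rough user-side sketch of the behavior being asked for. Both as_plain_dict and frame_from_mapping are hypothetical helper names invented for this illustration; they are not part of pandas, and this is not how pandas implements or would implement the change. The idea is simply to coerce any collections.abc.Mapping to a built-in dict before handing it to from_dict.

```python
from collections import UserDict
from collections.abc import Mapping


def as_plain_dict(data):
    """Hypothetical helper: return a built-in dict for any Mapping input."""
    if isinstance(data, dict):
        return data  # already a dict (or dict subclass), pass through as-is
    if isinstance(data, Mapping):
        return dict(data)  # shallow copy made via the mapping's own methods
    raise TypeError(f"expected a mapping, got {type(data).__name__}")


def frame_from_mapping(data, **kwargs):
    """Hypothetical wrapper around pd.DataFrame.from_dict (not a pandas API)."""
    import pandas as pd  # imported lazily so as_plain_dict works without pandas

    return pd.DataFrame.from_dict(as_plain_dict(data), **kwargs)


class CustomDict(UserDict):
    pass


cd = CustomDict(col_1=[1, 2], col_2=[3, 4])
converted = as_plain_dict(cd)
print(type(converted) is dict, converted)
# prints: True {'col_1': [1, 2], 'col_2': [3, 4]}
```

With pandas installed, frame_from_mapping(cd) passes a real dict to from_dict, so it should produce the two-column frame given under Expected Behavior. Raising TypeError for non-mappings is a design choice of this sketch, reflecting the point that from_dict can assume dict-like input in a way the general constructor cannot.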

thyripian commented 1 month ago

Could I take a go at this?