pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: `Series` as first item of a list passed to a new DataFrame disallows `dict`s with the same structure #56322

Open NikosGour opened 11 months ago

NikosGour commented 11 months ago

Pandas version checks

Reproducible Example

import pandas as pd

list = [
        pd.Series({'Title': 'The Godfather', 'Year': '1972', 'Rated': 'R', 'Released': '24 Mar 1972', 'Runtime': '175 min'}),
        {'Title': 'The Godfather', 'Year': '1972', 'Rated': 'R', 'Released': '24 Mar 1972', 'Runtime': '175 min'}
        ]

df = pd.DataFrame(list, columns=['Title', 'Year', 'Rated', 'Released', 'Runtime'])

Issue Description

Description

This bug occurs when a DataFrame is initialized from a list that mixes Series and dict objects and the first item is a Series, as shown in the example above.

Weird Observation

If we flip the order of the objects in the list so that the dict comes first, the expected behaviour occurs: the DataFrame is created successfully with both objects inside. Example code and output:

import pandas as pd

list = [
        {'Title': 'The Godfather', 'Year': '1972', 'Rated': 'R', 'Released': '24 Mar 1972', 'Runtime': '175 min'},
        pd.Series({'Title': 'The Godfather', 'Year': '1972', 'Rated': 'R', 'Released': '24 Mar 1972', 'Runtime': '175 min'})
        ]

df = pd.DataFrame(list, columns=['Title', 'Year', 'Rated', 'Released', 'Runtime'])
print(df)

Output:

           Title  Year Rated     Released  Runtime
0  The Godfather  1972     R  24 Mar 1972  175 min
1  The Godfather  1972     R  24 Mar 1972  175 min


Expected Behavior

Expected behaviour

A DataFrame containing both objects (which have the same fields) is created.

Actual

The program crashes with the following stack trace:

  df = pd.DataFrame(list, columns=['Title', 'Year', 'Rated', 'Released', 'Runtime'])
  File "/home/ledrake/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 806, in __init__
    arrays, columns, index = nested_data_to_arrays(
  File "/home/ledrake/.local/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 520, in nested_data_to_arrays
    arrays, columns = to_arrays(data, columns, dtype=dtype)
  File "/home/ledrake/.local/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 839, in to_arrays
    arr, columns = _list_of_series_to_arrays(data, columns)
  File "/home/ledrake/.local/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 885, in _list_of_series_to_arrays
    aligned_values.append(algorithms.take_nd(values, indexer))
  File "/home/ledrake/.local/lib/python3.10/site-packages/pandas/core/array_algos/take.py", line 97, in take_nd
    fill_value = na_value_for_dtype(arr.dtype, compat=False)
AttributeError: 'dict' object has no attribute 'dtype'
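
A possible workaround, offered as a minimal sketch rather than a fix: convert every row to a plain dict before calling the constructor, so the list no longer mixes types. The helper name `to_plain_dict` and the variable names are hypothetical and chosen for illustration only; `Series.to_dict()` is the only pandas call used beyond the constructor.

```python
import pandas as pd

columns = ['Title', 'Year', 'Rated', 'Released', 'Runtime']
record = {'Title': 'The Godfather', 'Year': '1972', 'Rated': 'R',
          'Released': '24 Mar 1972', 'Runtime': '175 min'}

# Same mixed input as in the report: a Series first, then a dict.
rows = [pd.Series(record), record]


def to_plain_dict(row):
    """Hypothetical helper: return a plain dict for a Series or a dict row."""
    if isinstance(row, pd.Series):
        return row.to_dict()
    return dict(row)


# Normalize first, so the constructor only ever sees dicts.
normalized = [to_plain_dict(row) for row in rows]
df = pd.DataFrame(normalized, columns=columns)
print(df)
```

Because the rows are normalized up front, their order in the list no longer matters.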

Installed Versions

INSTALLED VERSIONS
------------------
commit : 2a953cf80b77e4348bf50ed724f8abc0d814d9dd
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.90.1-microsoft-standard-WSL2
Version : #1 SMP Fri Jan 27 02:56:13 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.3
numpy : 1.26.1
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 59.6.0
pip : 22.0.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.16.1
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

DarthKitten2130 commented 11 months ago

While I'm no expert, this may be because when you pass Series objects within a list, the DataFrame constructor tries to read the dtype of every item, under the assumption that every value in the list is a Series. When it tries to access the dtype attribute of the dictionary (which doesn't exist), it raises an AttributeError.

NikosGour commented 11 months ago

Yeah, I suspect the same, but if you read the "Weird Observation" part, the opposite order works for some reason.
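
To make the suspected mechanism concrete, here is a toy model in pure Python. It is not pandas code: `FakeSeries`, `rows_from_series`, `rows_from_dicts` and `to_rows` are hypothetical stand-ins, and the assumption that the handler is chosen from the type of the first item alone is inferred from the traceback, not confirmed against the pandas source. Under that assumption, a Series-first list is sent down a path that reads `.dtype` on every item, while a dict-first list is sent down a path that coerces each item to a dict, which would explain why only one order fails.

```python
class FakeSeries:
    """Hypothetical stand-in for a Series: labeled data plus a dtype."""

    def __init__(self, data):
        self.data = dict(data)
        self.dtype = "object"


def rows_from_series(rows):
    # Assumes every row is Series-like and reads .dtype from each one.
    return [(row.dtype, row.data) for row in rows]


def rows_from_dicts(rows):
    # Coerces every row to a plain dict, so Series-like rows are accepted too.
    return [dict(getattr(row, "data", row)) for row in rows]


def to_rows(data):
    # Assumed behaviour: the handler is picked from the first item only.
    if isinstance(data[0], FakeSeries):
        return rows_from_series(data)
    return rows_from_dicts(data)


record = {"Title": "The Godfather", "Year": "1972"}

# Dict first: the dict handler runs and tolerates the Series-like row.
ok = to_rows([record, FakeSeries(record)])

# Series first: the Series handler runs and trips over the plain dict.
error = None
try:
    to_rows([FakeSeries(record), record])
except AttributeError as exc:
    error = str(exc)

print(ok)
print(error)
```

In this sketch the failure surfaces as the same kind of AttributeError as in the report, because a plain dict has no `dtype` attribute.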