pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.7k stars 17.93k forks

BUG: ValueError when accessing DataFrame with array attribute #59196

Open Zybulon opened 4 months ago

Zybulon commented 4 months ago

Pandas version checks

Reproducible Example

import pandas as pd
import numpy as np

attrs = {"A": "B", "G": np.array([1.2, 2.4])}

# This one works: 60 rows, attrs contain an array
arr = np.random.rand(60, 1)
df_named = pd.DataFrame(arr)
df_named.attrs = attrs
print(df_named[0])

# This one works: 61 rows, attrs contain only strings
arr = np.random.rand(61, 1)
df_named = pd.DataFrame(arr)
df_named.attrs = {"A": "B", "G": "A"}
print(df_named[0])

# This one does not work: 61 rows, attrs contain an array
arr = np.random.rand(61, 1)
df_named = pd.DataFrame(arr)
df_named.attrs = attrs
print(df_named)  # This works
print(df_named[0])  # This does not works

Issue Description

Hello,

I have a DataFrame of shape (61, 1) with two attributes, one of which is a NumPy array, and I cannot print the first Series of the DataFrame. I get the following error:

Traceback (most recent call last):

  File ~\miniforge-pypy3\envs\h5pandas_dev\Lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File d:\documents\perso\travail\mbda\pandas_extension\h5pandas\tests\debug.py:23
    print(df_named[0])  # This does not works

  File ~\miniforge-pypy3\envs\h5pandas_dev\Lib\site-packages\pandas\core\series.py:1784 in __repr__
    return self.to_string(**repr_params)

  File ~\miniforge-pypy3\envs\h5pandas_dev\Lib\site-packages\pandas\core\series.py:1871 in to_string
    formatter = fmt.SeriesFormatter(

  File ~\miniforge-pypy3\envs\h5pandas_dev\Lib\site-packages\pandas\io\formats\format.py:225 in __init__
    self._chk_truncate()

  File ~\miniforge-pypy3\envs\h5pandas_dev\Lib\site-packages\pandas\io\formats\format.py:247 in _chk_truncate
    series = concat((series.iloc[:row_num], series.iloc[-row_num:]))

  File ~\miniforge-pypy3\envs\h5pandas_dev\Lib\site-packages\pandas\core\reshape\concat.py:395 in concat
    return op.get_result()

  File ~\miniforge-pypy3\envs\h5pandas_dev\Lib\site-packages\pandas\core\reshape\concat.py:650 in get_result
    return result.__finalize__(self, method="concat")

  File ~\miniforge-pypy3\envs\h5pandas_dev\Lib\site-packages\pandas\core\generic.py:6273 in __finalize__
    have_same_attrs = all(obj.attrs == attrs for obj in other.objs[1:])

  File ~\miniforge-pypy3\envs\h5pandas_dev\Lib\site-packages\pandas\core\generic.py:6273 in <genexpr>
    have_same_attrs = all(obj.attrs == attrs for obj in other.objs[1:])

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

However, printing the whole DataFrame does not raise the ValueError. The error also disappears if the DataFrame has no array attribute, or if it has only 60 rows.
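The last frame of the traceback compares two attrs dictionaries with ==. The following is a minimal sketch of that comparison outside pandas. It assumes the two Series slices carry separate copies of the array; the deepcopy below only stands in for that and is not taken from the pandas source.

```python
import copy

import numpy as np

attrs = {"A": "B", "G": np.array([1.2, 2.4])}
# Assumption: mimic two objects holding equal but distinct copies of attrs.
attrs_copy = copy.deepcopy(attrs)

# Same array object on both sides: the dict comparison short-circuits on
# identity, so the arrays are never compared element-wise.
same_object_equal = attrs == attrs

# Distinct array objects: the arrays are compared element-wise, and the
# resulting boolean array cannot be reduced to a single True/False.
try:
    attrs == attrs_copy
    raised_on_copy = False
except ValueError as exc:
    raised_on_copy = True
    print(exc)
```

If this assumption holds, it would explain why the error depends on the array attribute and not on the data in the DataFrame.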

Expected Behavior

Printing the Series should not raise a ValueError.

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : AMD64 Family 23 Model 1 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en
LOCALE : fr_FR.cp1252

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
setuptools : 70.1.1
pip : 24.0
Cython : None
pytest : 8.2.2
hypothesis : None
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.26.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : 3.9.2
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : 0.22.0
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None

crspencer11 commented 4 months ago

take

Anurag-Varma commented 3 months ago

I did some debugging.

By default, display.max_rows in pandas is set to 60.

With more than 60 rows, the Series repr is truncated, and that is where it fails, as in your example above.

To avoid the error, raise the limit. For example, to allow up to 100 rows:

pd.set_option("display.max_rows", 100)

Printing then works. For any other limit, replace 100 with that value.
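If changing the option globally is not wanted, the same workaround can be scoped to a single print. This is a minimal sketch, assuming the failure only occurs when the Series repr is truncated; the variable names are illustrative and it does not fix the underlying comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(61, 1))
df.attrs = {"A": "B", "G": np.array([1.2, 2.4])}

# Raise display.max_rows only inside this block so the repr is not truncated;
# the previous value is restored when the block exits.
with pd.option_context("display.max_rows", len(df)):
    text = repr(df[0])

print(text)
```

pd.option_context restores the previous setting on exit, so other output keeps the default limit of 60 rows.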

Anurag-Varma commented 3 months ago

take