pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.59k stars 17.57k forks

BUG: #59060

Open jonmooser opened 1 week ago

jonmooser commented 1 week ago

Pandas version checks

Reproducible Example

import pandas as pd

try:
    s = pd.Series([1, 2, 'x', 4, 5], dtype=int)
    print("1. No exception")
except Exception as ex:
    print("1. Exception thrown", ex)

s = pd.Series([1, 2, 'x', 4, 5])
try:
    s2 = pd.Series(s, dtype=int)
    print("2. No exception")
except Exception as ex:
    print("2. Exception thrown", ex)

print("s2 dtype", s2.dtype)

Issue Description

The behavior of the above code doesn't seem right. If I try to create a Series with dtype=int using a list that contains a string, it throws a ValueError, as expected.

But if the input to the Series constructor is another Series that contains a string, there is no exception and it quietly returns a new Series with dtype=object.

Is this expected? If so, it's very counterintuitive. Perhaps the recommended way is s.astype(int), but it seems like the constructor should produce the same behavior.

Expected Behavior

In both of the cases above, it should throw a ValueError. Instead, only the first case does:

1. Exception thrown invalid literal for int() with base 10: 'x'
2. No exception
s2 dtype object
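
Until the constructor behaves consistently, one way to surface the silent fallback is to compare the resulting dtype with the requested one. The following is a minimal sketch, not part of pandas: `checked_series` is a hypothetical helper name, and raising `TypeError` on a mismatch is an arbitrary choice made for illustration.

```python
import pandas as pd
from pandas.api.types import pandas_dtype


def checked_series(data, dtype):
    """Hypothetical helper (not a pandas API): build a Series and verify
    that the requested dtype was actually applied."""
    result = pd.Series(data, dtype=dtype)
    expected = pandas_dtype(dtype)  # normalize e.g. int -> the platform integer dtype
    if result.dtype != expected:
        # The constructor returned without raising but kept another dtype.
        raise TypeError(f"requested dtype {expected}, got {result.dtype}")
    return result


# Both inputs from the report now fail loudly instead of one passing silently.
for label, data in [
    ("list", [1, 2, "x", 4, 5]),
    ("Series", pd.Series([1, 2, "x", 4, 5])),
]:
    try:
        checked_series(data, int)
        print(label, "-> no exception")
    except (ValueError, TypeError) as ex:
        print(label, "->", type(ex).__name__, ex)
```

The check does not depend on which code path the constructor takes, so it should keep working if a later pandas release starts raising in the Series-input case as well.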

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.11.6.final.0
python-bits : 64
OS : Darwin
OS-release : 23.1.0
Version : Darwin Kernel Version 23.1.0: Mon Oct 9 21:27:24 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.2
numpy : 1.23.4
pytz : 2022.5
dateutil : 2.8.2
setuptools : 65.5.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
chaoyihu commented 2 days ago

@jonmooser Thanks for raising this issue! If I understand it correctly, you are trying to change the dtype of a Series that is already initialized.

I agree it is counterintuitive that s2 = pd.Series(s, dtype=int) silently ignores the requested dtype instead of raising.

However, I don't think the constructor should be used for casting. As you mentioned, a better way to do this is Series.astype, which raises a ValueError as expected:

>>> s = pd.Series([1, 2, 'x', 4, 5])  # dtype of s: object
>>> s3 = s.astype('int64')
ValueError: invalid literal for int() with base 10: 'x'
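
The same idea can be wrapped so that list and Series inputs go through one code path. This is only a sketch of the workaround described above: `to_int_series` is a hypothetical name, and the fixed `int64` target is an assumption chosen to match the example.

```python
import pandas as pd


def to_int_series(data):
    """Hypothetical helper: construct without a dtype, then cast explicitly,
    so invalid values raise regardless of the input type."""
    return pd.Series(data).astype("int64")


for label, data in [
    ("list", [1, 2, "x", 4, 5]),
    ("Series", pd.Series([1, 2, "x", 4, 5])),
]:
    try:
        to_int_series(data)
        print(label, "-> no exception")
    except ValueError as ex:
        print(label, "-> ValueError:", ex)
```

Separating construction from casting leaves the conversion to astype, which is the method documented for casting.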
jonmooser commented 1 day ago

Thanks for looking into this. Glad there is a workaround, but it would be great if future versions were more consistent and intuitive.