pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.91k stars 18.03k forks source link

BUG: Pandas.DataFrame.str.replace function fails silently for mixed data mixing strings and float/int and replaces with NaN #45372

Open JulianKlug opened 2 years ago

JulianKlug commented 2 years ago

Pandas version checks

Reproducible Example

import pandas as pd
import numpy as np
list_of_random_numbers = np.random.randint(0,100,size=10)
list_of_random_numbers_as_string = [str(x) for x in np.random.randint(0,100,size=10)]
list_of_random_numbers_as_string[0] = list_of_random_numbers_as_string[0] + "'"

df = pd.concat(
    [pd.DataFrame({'value':list_of_random_numbers}),
    pd.DataFrame({'value':list_of_random_numbers_as_string})], ignore_index=True)

df['value'] = df['value'].str.replace("'", "")
df.notnull().sum()

Issue Description

When replacing a string in a column with mixed data, this replaces all non-string data with NaN without a warning (silent failure).

See also: https://stackoverflow.com/q/43187436/3903778

Expected Behavior

Expected behaviour would either be an error or a warning, as this is the case when applying str.replace to data of non-string type (see example below).

import pandas as pd
import numpy as np
list_of_random_numbers = np.random.randint(0,100,size=10)
number_only_df = pd.DataFrame({'value': list_of_random_numbers})
number_only_df['value'] = number_only_df['value'].str.replace("'", "")

Fails with: AttributeError: Can only use .str accessor with string values!

Installed Versions

INSTALLED VERSIONS ------------------ commit : 73c68257545b5f8530b7044f56647bd2db92e2ba python : 3.8.11.final.0 python-bits : 64 OS : Darwin OS-release : 21.1.0 Version : Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64 machine : x86_64 processor : i386 byteorder : little LC_ALL : en_US.UTF-8 LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 1.3.3 numpy : 1.20.3 pytz : 2021.3 dateutil : 2.8.2 pip : 21.0.1 setuptools : 52.0.0.post20210125 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.0.1 IPython : 7.26.0 pandas_datareader: None bs4 : None bottleneck : 1.3.2 fsspec : None fastparquet : None gcsfs : None matplotlib : 3.4.2 numexpr : 2.7.3 odfpy : None openpyxl : 3.0.7 pandas_gbq : None pyarrow : None pyxlsb : None s3fs : None scipy : 1.7.1 sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None numba : None
JulianKlug commented 2 years ago

Gist with more complete example: https://gist.github.com/JulianKlug/065c7f9918b5e5551e78b39e72375282

JulianKlug commented 2 years ago

For everyone simply looking the solution for this issue, it can be prevented by converting to string first, and then converting back to float, or only converting elements of type str.

df['value'] = df['value'].astype(str).str.replace("'", "").astype(float)
# OR
df['value'] = df['value'].apply(lambda x: x.replace("'",'') if type(x) is str else x)
nik1097 commented 2 years ago

Hey @JulianKlug could I please try working on this fix?

JulianKlug commented 2 years ago

@nik1097 : Sure! Although I am not part of the maintainers, any PR is welcome and we can then discuss the proposal from there