BUG: Pandas.DataFrame.str.replace function fails silently for mixed data mixing strings and float/int and replaces with NaN

JulianKlug commented 2 years ago

Pandas version checks

[x] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
list_of_random_numbers = np.random.randint(0,100,size=10)
list_of_random_numbers_as_string = [str(x) for x in np.random.randint(0,100,size=10)]
list_of_random_numbers_as_string[0] = list_of_random_numbers_as_string[0] + "'"

df = pd.concat(
    [pd.DataFrame({'value':list_of_random_numbers}),
    pd.DataFrame({'value':list_of_random_numbers_as_string})], ignore_index=True)

df['value'] = df['value'].str.replace("'", "")
df.notnull().sum()

Issue Description

When replacing a string in a column with mixed data, this replaces all non-string data with NaN without a warning (silent failure).

Expected Behavior

Expected behaviour would either be an error or a warning, as this is the case when applying str.replace to data of non-string type (see example below).

import pandas as pd
import numpy as np
list_of_random_numbers = np.random.randint(0,100,size=10)
number_only_df = pd.DataFrame({'value': list_of_random_numbers})
number_only_df['value'] = number_only_df['value'].str.replace("'", "")

Fails with: AttributeError: Can only use .str accessor with string values!

Installed Versions

INSTALLED VERSIONS ------------------ commit : 73c68257545b5f8530b7044f56647bd2db92e2ba python : 3.8.11.final.0 python-bits : 64 OS : Darwin OS-release : 21.1.0 Version : Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64 machine : x86_64 processor : i386 byteorder : little LC_ALL : en_US.UTF-8 LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 1.3.3 numpy : 1.20.3 pytz : 2021.3 dateutil : 2.8.2 pip : 21.0.1 setuptools : 52.0.0.post20210125 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.0.1 IPython : 7.26.0 pandas_datareader: None bs4 : None bottleneck : 1.3.2 fsspec : None fastparquet : None gcsfs : None matplotlib : 3.4.2 numexpr : 2.7.3 odfpy : None openpyxl : 3.0.7 pandas_gbq : None pyarrow : None pyxlsb : None s3fs : None scipy : 1.7.1 sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None numba : None

JulianKlug commented 2 years ago

Gist with more complete example: https://gist.github.com/JulianKlug/065c7f9918b5e5551e78b39e72375282

JulianKlug commented 2 years ago

For everyone simply looking the solution for this issue, it can be prevented by converting to string first, and then converting back to float, or only converting elements of type str.

df['value'] = df['value'].astype(str).str.replace("'", "").astype(float)
# OR
df['value'] = df['value'].apply(lambda x: x.replace("'",'') if type(x) is str else x)

nik1097 commented 2 years ago

Hey @JulianKlug could I please try working on this fix?

JulianKlug commented 2 years ago

@nik1097 : Sure! Although I am not part of the maintainers, any PR is welcome and we can then discuss the proposal from there

pandas-dev / pandas