Open wchengit opened 2 years ago
Simplifying the example, agreed that it would be nice It would be better behavior if In[7]
was a no-op
In [1]: df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...: born=[pd.NaT, pd.Timestamp('1939-05-27'),
...: pd.Timestamp('1940-04-25')],
...: name=['Alfred', 'Batman', ''],
...: toy=[None, 'Batmobile', 'Joker']))
In [3]: df = df.replace({np.nan: None})
In [5]: df
Out[5]:
age born name toy
0 5.0 None Alfred None
1 6.0 1939-05-27 00:00:00 Batman Batmobile
2 None 1940-04-25 00:00:00 Joker
In [7]: df = df.replace({np.nan: None})
In [8]: df
Out[8]:
age born name toy
0 5.0 NaT Alfred None
1 6.0 1939-05-27 Batman Batmobile
2 NaN 1940-04-25 Joker
Thanks for looking into it. The issue is more than df.replace({np.nan: None}) not working. It acutally has an odd behavior that for the same dataframe, if replaced by odd times, the result is correct. If replaced by even times, it's back the the original, with NaT and NaN.
I just ran into this problem today, which took a while to track down. Glad it's not just me being affected by this
Perhaps this has to do with the pandas feature and is not a problem. In Python, NaN is of the float type, and None is of the NoneType type. For the data frame column age of the float type that contains NaN, the type of the age column changes from float to object after the df.replace({np.nan: None}) command is executed successfully. When the command is executed again, pandas automatically identifies the type in the age column, determines that the type is float, and fills the value of None with NaN by default. The preceding is the reason why NaN is replaced with None for odd times and then is replaced with NaN again for even times.
This a bit annoying if the df = df.replace({np.nan: None})
needs to be run a second time because new columns have been added to the dataframe, example:
In [1]: import pandas as pd
...: import numpy as np
...: import random
In [2]: df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...: born=[pd.NaT, pd.Timestamp('1939-05-27'), pd.Timestamp('1940-04-25')],
...: name=['Alfred', 'Batman', ''],
...: toy=[None, 'Batmobile', 'Joker']))
In [3]: df = df.replace({np.nan: None})
...: df
Out[3]:
age born name toy
0 5.0 None Alfred None
1 6.0 1939-05-27 00:00:00 Batman Batmobile
2 None 1940-04-25 00:00:00 Joker
New columns added with apply:
In [4]: def do_stuff(age):
...: return age
...:
...: df['new_col'] = df['age'].apply(do_stuff)
In [5]: df
Out[5]:
age born name toy new_col
0 5.0 None Alfred None 5.0
1 6.0 1939-05-27 00:00:00 Batman Batmobile 6.0
2 None 1940-04-25 00:00:00 Joker NaN
running df = df.replace({np.nan: None})
the second time flips the original None to NaN while changing the new NaNs to None so the values can't ever be in sync
In [6]: df = df.replace({np.nan: None})
...: df
Out[6]:
age born name toy new_col
0 5.0 NaT Alfred None 5.0
1 6.0 1939-05-27 Batman Batmobile 6.0
2 NaN 1940-04-25 Joker None
running it a third time flips eveything again:
In [7]: df = df.replace({np.nan: None})
...: df
Out[7]:
age born name toy new_col
0 5.0 None Alfred None 5.0
1 6.0 1939-05-27 00:00:00 Batman Batmobile 6.0
2 None 1940-04-25 00:00:00 Joker NaN
I found that, in this df = df.where(pd.notnull(df), None) pattern, if the nan in columns is type of 'float' then the pattern works, it the nan is type of numpy.float64 then it not works. I just want to replace all the nan (what ever the type) to None to insert this df to database.
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[x] I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
Issue Description
As shown in the example above: dataframe.replace({np.nan: None}) works when it's applied to a dataframe once, 3 times, but not working (or "reverse" the change) when run twice, 4 times, etc.
Expected Behavior
It should consistently have the result dataframe containing no np.nan, doesn't matter how many replace attempts have been made.
Installed Versions
INSTALLED VERSIONS
commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5 python : 3.8.12.final.0 python-bits : 64 OS : Linux OS-release : 4.18.0-240.1.1.el8_3.x86_64 Version : #1 SMP Fri Oct 16 13:36:46 EDT 2020 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : None LOCALE : en_US.UTF-8
pandas : 1.3.4 numpy : 1.21.2 pytz : 2021.3 dateutil : 2.8.2 pip : 21.2.4 setuptools : 58.0.4 Cython : None pytest : 6.2.4 hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : 4.6.3 html5lib : None pymysql : None psycopg2 : None jinja2 : 3.0.2 IPython : None pandas_datareader: None bs4 : None bottleneck : 1.3.2 fsspec : None fastparquet : None gcsfs : None matplotlib : None numexpr : 2.7.3 odfpy : None openpyxl : 3.0.9 pandas_gbq : None pyarrow : None pyxlsb : None s3fs : None scipy : 1.7.1 sqlalchemy : 1.4.22 tables : None tabulate : None xarray : None xlrd : None xlwt : None numba : None