BUG: `Series.combine_first` replaces `None` with `NaN` at index not provided by `other`

michaelpradel commented 5 months ago

Pandas version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

s1 = pd.Series([None, None, None], index=['a', 'b', 'c'])
s2 = pd.Series([None, None, None], index=['b', 'c', 'd'])

result = s1.combine_first(s2)
print(result)

Issue Description

The documentation of Series.combine_first states:

Update null elements with value in the same location in ‘other’.

In the above example, s2 doesn't provide any value at index 'a', so combine_first should not affect this index. However, the result with pandas 2.2.2 is the following, which unexpectedly changes the value at 'a' from None to NaN:

a     NaN
b    None
c    None
d    None
dtype: object

Pandas 2.1.4 behaves as expected, so this looks like a regression. The behavior was changed with this PR: https://github.com/pandas-dev/pandas/pull/57034

Expected Behavior

The expected output is:

a    None
b    None
c    None
d    None
dtype: object

Installed Versions

INSTALLED VERSIONS ------------------ commit : 6dbeeb4009bbfac5ea1ae2111346f5e9f05b81f4 python : 3.10.8.final.0 python-bits : 64 OS : Linux OS-release : 6.5.0-35-generic Version : #35-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 26 11:23:57 UTC 2024 machine : x86_64 processor : byteorder : little LC_ALL : None LANG : C.UTF-8 LOCALE : en_US.UTF-8 pandas : 2.2.0rc0+28.g6dbeeb4009 numpy : 1.26.4 pytz : 2024.1 dateutil : 2.9.0.post0 setuptools : 63.2.0 pip : 24.0 Cython : 3.0.10 pytest : 8.2.0 hypothesis : 6.100.2 sphinx : 7.3.7 blosc : None feather : None xlsxwriter : 3.2.0 lxml.etree : 5.2.1 html5lib : 1.1 pymysql : 1.4.6 psycopg2 : 2.9.9 jinja2 : 3.1.3 IPython : 8.24.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.3 bottleneck : 1.3.8 fastparquet : 2024.2.0 fsspec : 2024.3.1 gcsfs : 2024.3.1 matplotlib : 3.8.4 numba : 0.59.1 numexpr : 2.10.0 odfpy : None openpyxl : 3.1.2 pyarrow : 16.0.0 pyreadstat : 1.2.7 python-calamine : None pyxlsb : 1.0.10 s3fs : 2024.3.1 scipy : 1.13.0 sqlalchemy : 2.0.29 tables : 3.9.2 tabulate : 0.9.0 xarray : 2024.3.0 xlrd : 2.0.1 zstandard : 0.22.0 tzdata : 2024.1 qtpy : None pyqt5 : None

pandyah5 commented 5 months ago

I would like to take this up. Working on a fix now

pandyah5 commented 5 months ago

I can reproduce the issue locally and the problem seems to originate from:

https://github.com/pandas-dev/pandas/blob/bbe0e531383358b44e94131482e122bda43b33d7/pandas/core/series.py#L3149

The root cause of the issue:

After the outer join s1 turns into : None, None, None, NaN and s2 turns into: NaN, None, None, None.
combine_first replaces all values in s1 that satisfy isna(). Since all of them do, it replaces every value with that of s2.
After the join the default value is NaN and not None, as seen in point 1. So when s2 is copied over s1 we get NaN, None, None, None

My fix is going to be: Do not replace None with NaN. Or for that matter NaN with None. They should be treated one and the same.

pandyah5 commented 5 months ago

@mroeschke I am tagging you here because you have previously reviewed the code for the combine_first function and I assume you might be maintaining it. I read quite a bit about Pandas convention of representing NA values for this issue and I stumbled upon this StackOverflow which is inspired by the official docs.

The problem arises because Pandas considers None and NaN the same, take for example the isna() or isnull() utility. Based on the resources I read above, Pandas has decided to use NaN as its default NA representation. Based on these points, I feel the best course of action is to convert the None values to NaN and then invoke the combine_first function. This will add O(N) computational overhead given the fillna() operation.

@michaelpradel If you have a better alternative please let me know!

michaelpradel commented 5 months ago

@pandyah5: Your proposed fix sounds good to me, at least for series with (inferred) dtype "object". Note that I'm not a pandas developers, just a user; so please take my opinion on this with a grain of salt.

pandyah5 commented 4 months ago

Hi @michaelpradel thanks for your feedback! Its been a while since this issue has been opened so I'll bring this up in the community and get some maintainers attention here to have their final verdict :)

Update: I have notified the pandas-dev slack channel for some help in reviewing this PR

pandas-dev / pandas