Open michaelpradel opened 4 months ago
I would like to take this up. I am working on a fix now.
I can reproduce the issue locally and the problem seems to originate from:
The root cause of the issue:
My fix is going to be: do not replace None with NaN, or, for that matter, NaN with None. They should be treated as one and the same.
@mroeschke I am tagging you here because you have previously reviewed the code for the combine_first function and I assume you might be maintaining it. I read quite a bit about the Pandas convention of representing NA values for this issue, and I stumbled upon this StackOverflow post, which is inspired by the official docs.
The problem arises because Pandas considers None and NaN the same; take, for example, the isna() or isnull() utility. Based on the resources I read above, Pandas has decided to use NaN as its default NA representation. Given these points, I feel the best course of action is to convert the None values to NaN and then invoke the combine_first function. This will add O(N) computational overhead because of the fillna() operation.
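To make the proposal above concrete, here is a minimal sketch of the idea. It is not the actual patch: the data and variable names are made up for illustration, and whether fillna(np.nan) turns every None into NaN on an object Series is an assumption that may depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical object-dtype Series holding both kinds of missing value.
s1 = pd.Series([None, np.nan, "x"], index=["a", "b", "c"], dtype=object)
s2 = pd.Series(["y"], index=["b"], dtype=object)

# isna() (and its alias isnull()) flags None and NaN alike.
mask = s1.isna().tolist()
print(mask)

# Proposed approach: normalize missing values to NaN first (one O(N) pass),
# then combine. Assumption: fillna(np.nan) rewrites None as NaN here.
normalized = s1.fillna(np.nan)
combined = normalized.combine_first(s2)
print(combined)
```

Only index 'b' should pick up a value from s2 in this sketch; 'a' stays missing because s2 has nothing to offer there, and 'c' keeps its original value.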
@michaelpradel If you have a better alternative please let me know!
@pandyah5: Your proposed fix sounds good to me, at least for series with (inferred) dtype "object". Note that I'm not a pandas developer, just a user, so please take my opinion on this with a grain of salt.
Hi @michaelpradel, thanks for your feedback! It's been a while since this issue was opened, so I'll bring this up in the community and get some maintainers' attention here for their final verdict :)
Update: I have notified the pandas-dev Slack channel for some help in reviewing this PR.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
The documentation of Series.combine_first states:
In the above example, s2 doesn't provide any value at index 'a', so combine_first should not affect this index. However, the result with pandas 2.2.2 is the following, which unexpectedly changes the value at 'a' from None to NaN:
Pandas 2.1.4 behaves as expected, so this looks like a regression. The behavior was changed with this PR: https://github.com/pandas-dev/pandas/pull/57034
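For readers who want to check their own installation, the situation described above can be sketched roughly as follows. This is not the original reproducer from the report: the Series contents are hypothetical, and what gets printed for index 'a' depends on the installed pandas version.

```python
import pandas as pd

# Hypothetical data: s1 has None at 'a' and 'b'; s2 only provides 'b'.
s1 = pd.Series([None, None], index=["a", "b"], dtype=object)
s2 = pd.Series(["filled"], index=["b"], dtype=object)

result = s1.combine_first(s2)

# 'b' is taken from s2; 'a' has no counterpart in s2 and stays missing.
# The question in this issue is whether it is still the None object.
value_at_a = result.loc["a"]
print(pd.__version__, repr(value_at_a), value_at_a is None)
```

The identity check (is None) is needed here because isna() cannot tell the two apart: it reports True for both None and NaN.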
Expected Behavior
The expected output is:
Installed Versions