Open jessestone7 opened 1 month ago
Thanks for the report, confirmed on main. PRs to fix are welcome!
It's because the dtype changed: int64 -> float64 -> int64.
To combine with 'b', NaN needs to be added to 'a'. To hold NaN, 'int64' is promoted to 'float64' through the call flow below:
combine_first core/frame.py:8785
combine core/frame.py:8644
align core/generic.py:9447
_align_frame core/generic.py:9524
_reindex_with_indexers core/generic.py:5423
reindex_indexer core/internals/managers.py:832
take_nd core/internals/blocks.py:1020
take_nd core/array_algos/take.py:115
_take_nd_ndarray core/array_algos/take.py:131
_take_preprocess_indexer_and_fill_value core/array_algos/take.py:531
maybe_promote core/dtypes/cast.py:589
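The precision loss in an int64 -> float64 -> int64 round trip can be reproduced without pandas. The following is a minimal sketch in plain Python, assuming only that float64 carries a 53-bit significand; it uses the value from the report and mirrors the dtype changes, not the pandas code path itself.

```python
import math

original = 1666880195890293744  # value from the report; it fits in int64

# float64 has a 53-bit significand, so not every integer above 2**53 is representable.
as_float = float(original)    # mirrors the promotion to float64
round_trip = int(as_float)    # mirrors the cast back to int64

print(round_trip)             # 1666880195890293760
print(round_trip - original)  # 16
print(math.ulp(as_float))     # 256.0, the gap between neighbouring float64 values here
```

At this magnitude float64 can only represent multiples of 256, so the value is rounded to the nearest one as soon as it becomes a float, and casting back cannot recover it.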
IMHO, this promotion is general practice, and converting the 'float64' back to 'int64' does not seem natural.
So, I think making the result 'float64' could be a solution. WDYT?
Just for reference, if the missing value exists from the start, it can be handled as 'int':
import pandas as pd

a = pd.Series([1666880195890293744, 5, pd.NA]).to_frame()
b = pd.Series([6, 7, 8]).to_frame()
a.combine_first(b)
> So, I think making the result 'float64' could be a solution. WDYT?
It seems to me we should be able to carry out this operation without passing through floats.
The cause of 'passing through floats' is that it tries to insert NaN while extending 'a' from length 2 to length 3. In this case, should we insert another value (like 0) to keep the int64 type?
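One way to picture an alternative to a NaN fill value is to track missing positions in a separate mask, so the data never has to hold NaN. The sketch below is plain Python, not pandas internals: `combine_first_masked` is a hypothetical helper, Python lists stand in for int64 columns, and `None` stands in for a missing entry.

```python
def combine_first_masked(a, b):
    """Take values from `a` where present, else from `b`, with no float step.

    Hypothetical helper: `a` and `b` are lists aligned on positions 0..n-1.
    A position past the end of a list, or holding None, counts as missing.
    """
    length = max(len(a), len(b))
    # The mask records which positions `a` really has, instead of a NaN placeholder.
    a_present = [i < len(a) and a[i] is not None for i in range(length)]
    result = []
    for i in range(length):
        if a_present[i]:
            result.append(a[i])
        elif i < len(b):
            result.append(b[i])
        else:
            result.append(None)  # missing in both inputs
    return result


a = [1666880195890293744, 5]
b = [6, 7, 8]
print(combine_first_masked(a, b))  # [1666880195890293744, 5, 8]
```

Because the integers are never converted to float, the large value survives unchanged. A placeholder such as 0 would need the same kind of mask anyway, to distinguish it from a genuine 0 in the data.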
I am merely a user of Pandas, and the underlying code is far over my head, so maybe I should not be commenting here, but I wonder would it be possible to use here whatever solution was used to solve issue #39051 ?
The above patch could solve this issue (when all the columns are 'int64'), but it does not cover a mixed case like the one below:
import pandas as pd

a = pd.DataFrame({0: pd.Series([1666880195890293744, 5]), 1: pd.Series([1.0, 2.0])})
b = pd.Series([6, 7, 8]).to_frame()
a.combine_first(b)
Currently, NDFrame.align takes just one fill_value as a parameter. IMHO, to solve the case above, we need to pass more context as parameters. Is this the right way?
It seems this needs a more general approach, which would not be suitable for a newbie like me. So I'd like to close the PR, hoping someone will take up this issue.
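To make the "more context" idea concrete, here is a rough sketch of padding columns with a per-column fill value rather than one shared value. It is plain Python and purely illustrative: `extend_columns` and the `fill_values` mapping are hypothetical names, not part of the pandas API, and lists stand in for columns.

```python
def extend_columns(columns, new_length, fill_values):
    """Pad each column to `new_length` using that column's own fill value.

    Hypothetical helper: `columns` maps a column name to a list of values and
    `fill_values` maps a column name to the placeholder used for new rows.
    """
    extended = {}
    for name, values in columns.items():
        padding = [fill_values[name]] * (new_length - len(values))
        extended[name] = list(values) + padding
    return extended


columns = {0: [1666880195890293744, 5], 1: [1.0, 2.0]}
# The integer column gets an integer placeholder; the float column gets NaN.
fill_values = {0: 0, 1: float("nan")}
extended = extend_columns(columns, 3, fill_values)
print(extended[0])  # [1666880195890293744, 5, 0]
```

With a single shared NaN fill value, column 0 would have to become float; with a per-column value it can stay integer, although the placeholder rows would still need to be tracked separately so they can be overwritten by values from 'b'.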
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
I tried this on Pandas version 2.2.2 and I see that there is a loss of precision. This could be related to issue #51777.
Expected Behavior
1666880195890293744 should not get changed to 1666880195890293760
Installed Versions