pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.91k stars 18.03k forks source link

BUG: Series.combine_first loss of precision #60128

Open jessestone7 opened 1 month ago

jessestone7 commented 1 month ago

Pandas version checks

Reproducible Example

>>> a = pd.Series([1666880195890293744, 5]).to_frame()
>>> b = pd.Series([6, 7, 8]).to_frame()
>>> a
                     0
0  1666880195890293744
1                    5
>>> b
   0
0  6
1  7
2  8
>>> a.combine_first(b)
                     0
0  1666880195890293760
1                    5
2                    8

Issue Description

I tried this on Pandas version 2.2.2 and I see that there is a loss of precision. This could be related to issue #51777.

Expected Behavior

1666880195890293744 should not get changed to 1666880195890293760

Installed Versions

INSTALLED VERSIONS ------------------ commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140 python : 3.10.9.final.0 python-bits : 64 OS : Darwin OS-release : 24.0.0 Version : Darwin Kernel Version 24.0.0: Tue Sep 24 23:37:36 PDT 2024; root:xnu-11215.1.12~1/RELEASE_ARM64_T6020 machine : arm64 processor : arm byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 2.2.2 numpy : 1.26.4 pytz : 2022.7.1 dateutil : 2.8.2 setuptools : 67.3.2 pip : 24.2 Cython : 0.29.33 pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : 4.9.3 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.12.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.0 bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : None gcsfs : None matplotlib : 3.6.3 numba : None numexpr : 2.8.4 odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None python-calamine : None pyxlsb : None s3fs : None scipy : 1.10.0 sqlalchemy : None tables : 3.8.0 tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2024.1 qtpy : None pyqt5 : None
rhshadrach commented 1 month ago

Thanks for the report, confirmed on main. PRs to fix are welcome!

cooolheater commented 4 weeks ago

It's because the dtype changed : int64->float64->int64

To combine with 'b', NaN need to be added to 'a', so to include NaN, 'int64' was promoted to 'float64' by below flow:

combine_first core/frame.py:8785 combine core/frame.py:8644 align core/generic.py:9447 _align_frame core/generic.py:9524 _reindex_with_indexers core/generic.py:5423 reindex_indexer core/internals/managers.py:832 take_nd core/internals/blocks.py:1020 take_nd core/array_algos/take.py:115 _take_nd_ndarray core/array_algos/take.py:131 _take_preprocess_indexer_and_fill_value core/array_algos/take.py:531 maybe_promote core/dtypes/cast.py:589

IMHO, it's a genaral practice, and converting this 'float64' back to 'int64' seems not natural.

So, I think making the result 'float64' could be a solution. WDYT?

Just for reference> if NaN exists from the first, it could be handled as 'int' :

a = pd.Series([1666880195890293744, 5,pd.NA]).to_frame() b = pd.Series([6, 7, 8]).to_frame() a.combine_first(b)

rhshadrach commented 4 weeks ago

So, I think making the result 'float64' could be a solution. WDYT?

It seems to me we should be able to carry out this operation without passing through floats.

cooolheater commented 4 weeks ago

The cause of 'passing through floats' is, it tries to insert NaN , while converting 'a' from len:2 to len:3. In this case, should we insert other values (like 0 ) to keep the int64 type ?

jessestone7 commented 4 weeks ago

I am merely a user of Pandas, and the underlying code is far over my head, so maybe I should not be commenting here, but I wonder would it be possible to use here whatever solution was used to solve issue #39051 ?

cooolheater commented 4 weeks ago

The above patch could solve this issue ( when all the columns are 'int64' ), but could not cover mixed case like below:

a = pd.DataFrame({0:pd.Series([1666880195890293744, 5]),1:pd.Series([1.0,2.0])})
b = pd.Series([6, 7, 8]).to_frame()
a.combine_first(b)

Currently, NDFrame::align simply get just 1 fill_value as a parameter. IMHO, to solve above case, we need to pass more context as parameters. Is this a right way?

cooolheater commented 1 week ago

It seems this needs more general approach, and it would not suitable for a newbie like me. So, I'd like to close the PR, hoping someone would take this issue.