pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.2k stars 17.77k forks source link

BUG: merge_asof on float does not detect float equality #59187

Open RaphaelRosset opened 2 months ago

RaphaelRosset commented 2 months ago

Pandas version checks

Reproducible Example

left_df = pd.DataFrame({"CONTINUOUS": list(np.arange(100.0, 106.0, 0.8))})
data = [
    (98.0, 103.4, "A", "a"),
    (104.0, 106.8, "B", "b"),
]
right_df = pd.DataFrame(data, columns=["START", "END", "C1", "C2"])
output = pd.merge_asof(
    left_df.sort_values(by="CONTINUOUS"),
    right_df.sort_values(by="START"),
    left_on="CONTINUOUS",
    right_on="START",
    direction="backward",
)
print(output.to_string())

Issue Description

When merge_asof is done with On key being a float number sometimes error due to float precision issue can be seen

CONTINUOUS START END C1 0 100.0 98.0 103.4 A 1 100.8 98.0 103.4 A 2 101.6 98.0 103.4 A 3 102.4 98.0 103.4 A 4 103.2 98.0 103.4 A 5 104.0 98.0 103.4 A <=== 104.0 mapped to A is wrong 6 104.8 104.0 106.8 B 7 105.6 104.0 106.8 B

Note that changing the way CONTINOUS list of values is created makes the output as expected. For example on my side, these 2 alternative give correct output. [103.2, 104.0, 104.8] ==> correct output list(np.arange(103.2, 106.0, 0.8)) ==> correct output

Expected Behavior

CONTINUOUS START END C1 0 100.0 98.0 103.4 A 1 100.8 98.0 103.4 A 2 101.6 98.0 103.4 A 3 102.4 98.0 103.4 A 4 103.2 98.0 103.4 A 5 104.0 98.0 103.4 B <=== 104.0 mapped to B is correct 6 104.8 104.0 106.8 B 7 105.6 104.0 106.8 B

Installed Versions

INSTALLED VERSIONS ------------------ commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140 python : 3.12.2.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : Norwegian Bokmål_Norway.1252 pandas : 2.2.2 numpy : 2.0.0 pytz : 2024.1 dateutil : 2.9.0.post0 setuptools : None pip : 24.1.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : None pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : None bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None python-calamine : None pyxlsb : None s3fs : None scipy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2024.1 qtpy : None pyqt5 : None
abhinav-thimma commented 1 month ago

Hi @RaphaelRosset , I was triaging your issue on the main branch. Here are my findings:

To make debugging easy, I've set a larger precision for printing numpy.float64's using: np.set_printoptions(precision=15)

Using the Reproducible Example:

(Pdb) left_df
   CONTINUOUS
0       100.0
1       100.8
2       101.6
3       102.4
4       103.2
5       104.0
6       104.8
7       105.6
(Pdb) left_df.iloc[5]
CONTINUOUS    104.0
Name: 5, dtype: float64
(Pdb) left_df.iloc[5]["CONTINUOUS"]
103.99999999999999

If you are not able to control the float numbers in your dataframe, might I recommend setting direction=nearest. Nearest direction produces the following merged dataframe in your case:

   CONTINUOUS  START    END C1 C2
0       100.0   98.0  103.4  A  a
1       100.8   98.0  103.4  A  a
2       101.6  104.0  106.8  B  b
3       102.4  104.0  106.8  B  b
4       103.2  104.0  106.8  B  b
5       104.0  104.0  106.8  B  b
6       104.8  104.0  106.8  B  b
7       105.6  104.0  106.8  B  b

numpy.finfo can be used to understand how floats are stored on your machine.

RaphaelRosset commented 1 month ago

Hi @abhinav-thimma, yes I fully understand that this issue comes from floating point precision handling and then 104.0 is stored as 103.99999999999999 creating then the issue when looking for closest value less than or equal.

In my use case I cannot switch to nearest because if my value was really 103.99... then mapping to key 98.0 would be correct. So the result you get using nearest is not the one I expect (see my expected behavior output)

I have currently implemented a workaround to fix this by creating new versions of CONTINUOUS and START columns of the 2 dataframes into integer columns as precision I need which are then used by merge_asof. This is the code of same example using this workaround:

left_df = pd.DataFrame({"CONTINUOUS": list(np.arange(100.0, 106.0, 0.8))})
data = [
    (98.0, 103.4, "A", "a"),
    (104.0, 106.8, "B", "b"),
]
right_df = pd.DataFrame(data, columns=["START", "END", "C1", "C2"])

left_df["CONTINUOUS_integer"] = left_df["CONTINUOUS"].round(6) * 1000000
right_df["START_integer"] = right_df["START"].round(6) * 1000000

output = pd.merge_asof(
    left_df.sort_values(by="CONTINUOUS_integer"),
    right_df.sort_values(by="START_integer"),
    left_on="CONTINUOUS_integer",
    right_on="START_integer",
    direction="backward",
)

output = output.drop(
    columns=[
        "CONTINUOUS_integer",
        "START_integer",
    ]
)

print(output.to_string())