Open RaphaelRosset opened 2 months ago
Hi @RaphaelRosset , I was triaging your issue on the main branch. Here are my findings:
To make debugging easy, I've set a larger precision for printing numpy.float64's using: np.set_printoptions(precision=15)
Using the Reproducible Example:
(Pdb) left_df
CONTINUOUS
0 100.0
1 100.8
2 101.6
3 102.4
4 103.2
5 104.0
6 104.8
7 105.6
(Pdb) left_df.iloc[5]
CONTINUOUS 104.0
Name: 5, dtype: float64
(Pdb) left_df.iloc[5]["CONTINUOUS"]
103.99999999999999
< 104.0
. (This is a well know behavior of floating-point arithmetic by numpy)direction=backward
, merge_asof will try to find a key in right_df which is less than or equal to the above value (103.99999999999999
) which in this case would be 98.0
from (98.0, 103.4, "A", "a")
If you are not able to control the float numbers in your dataframe, might I recommend setting direction=nearest
. Nearest direction produces the following merged dataframe in your case:
CONTINUOUS START END C1 C2
0 100.0 98.0 103.4 A a
1 100.8 98.0 103.4 A a
2 101.6 104.0 106.8 B b
3 102.4 104.0 106.8 B b
4 103.2 104.0 106.8 B b
5 104.0 104.0 106.8 B b
6 104.8 104.0 106.8 B b
7 105.6 104.0 106.8 B b
numpy.finfo
can be used to understand how floats are stored on your machine.
Hi @abhinav-thimma,
yes I fully understand that this issue comes from floating point precision handling and then 104.0
is stored as 103.99999999999999
creating then the issue when looking for closest value less than or equal.
In my use case I cannot switch to nearest because if my value was really 103.99...
then mapping to key 98.0
would be correct.
So the result you get using nearest is not the one I expect (see my expected behavior output)
I have currently implemented a workaround to fix this by creating new versions of CONTINUOUS and START columns of the 2 dataframes into integer columns as precision I need which are then used by merge_asof. This is the code of same example using this workaround:
left_df = pd.DataFrame({"CONTINUOUS": list(np.arange(100.0, 106.0, 0.8))})
data = [
(98.0, 103.4, "A", "a"),
(104.0, 106.8, "B", "b"),
]
right_df = pd.DataFrame(data, columns=["START", "END", "C1", "C2"])
left_df["CONTINUOUS_integer"] = left_df["CONTINUOUS"].round(6) * 1000000
right_df["START_integer"] = right_df["START"].round(6) * 1000000
output = pd.merge_asof(
left_df.sort_values(by="CONTINUOUS_integer"),
right_df.sort_values(by="START_integer"),
left_on="CONTINUOUS_integer",
right_on="START_integer",
direction="backward",
)
output = output.drop(
columns=[
"CONTINUOUS_integer",
"START_integer",
]
)
print(output.to_string())
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
When merge_asof is done with On key being a float number sometimes error due to float precision issue can be seen
CONTINUOUS START END C1 0 100.0 98.0 103.4 A 1 100.8 98.0 103.4 A 2 101.6 98.0 103.4 A 3 102.4 98.0 103.4 A 4 103.2 98.0 103.4 A 5 104.0 98.0 103.4 A <=== 104.0 mapped to A is wrong 6 104.8 104.0 106.8 B 7 105.6 104.0 106.8 B
Note that changing the way CONTINOUS list of values is created makes the output as expected. For example on my side, these 2 alternative give correct output. [103.2, 104.0, 104.8] ==> correct output list(np.arange(103.2, 106.0, 0.8)) ==> correct output
Expected Behavior
CONTINUOUS START END C1 0 100.0 98.0 103.4 A 1 100.8 98.0 103.4 A 2 101.6 98.0 103.4 A 3 102.4 98.0 103.4 A 4 103.2 98.0 103.4 A 5 104.0 98.0 103.4 B <=== 104.0 mapped to B is correct 6 104.8 104.0 106.8 B 7 105.6 104.0 106.8 B
Installed Versions