pandas-dev / pandas-stubs

Public type stubs for pandas
BSD 3-Clause "New" or "Revised" License
234 stars 125 forks source link

TypeIs annotation for `pands.isna` is not quite accurate #1009

Open steve-mavens opened 1 month ago

steve-mavens commented 1 month ago

Describe the bug

Recently https://github.com/pandas-dev/pandas-stubs/pull/972 introduced a TypeIs annotation for pd.isna. The annotation states that the function returns True if and only if the input value is one of the types NaTType | NAType | None. However, this is not correct, since the function also returns True for the value float('nan'), which is of type float.

To Reproduce

  1. Provide a minimal runnable pandas example that is not properly checked by the stubs.
from pandas import isna

def main() -> None:
   foo: float = float('nan')
   if isna(foo) and 1 == 1:
       print('not a number')
   else:
       print('is a number')

if __name__ == '__main__':
    main()

The output is not a number, but because of the TypeIs annotation, mypy thinks that line is unreachable

  1. Indicate which type checker you are using (mypy or pyright).

mypy 1.10.1

  1. Show the error message received from that type checker while checking your example.

type_test.py:6: error: Right operand of "and" is never evaluated [unreachable] if isna(foo) and 1 == 1: ^~ type_test.py:7: error: Statement is unreachable [unreachable] print('not a number') ^~~~~ Found 2 errors in 1 file (checked 1 source file)

Please complete the following information:

Additional context Add any other context about the problem here.

gandhis1 commented 1 month ago

Python's type system does not provide a way to discriminate between a regular float and a NaN float. I would argue that the current implementation does the most intuitive thing in most cases at the expense of a relatively rare use case. Anecdotally, I have found None, np.nan, and pd.NA to be more common.

We could update to include the Literal[float('NaN')] in the overloads, but (1) I'm not sure an instantiation like that is actually supported and (2) it may not even solve the issue, if type checkers infer it as a regular float.

steve-mavens commented 1 month ago

Yes, I think the basic problem here is that TypeIs requires a clean division of types. It doesn't allow for a function that takes input type A | B | C and returns True for all A and some values of C, and False for all B and the remaining values of C.

Dr-Irv commented 1 month ago

Thanks for the report. I'm not sure what we can do here, because we can't specify float("nan") or np.nan as a Literal.

steve-mavens commented 1 month ago

One workaround for users would be math.isnan(x) if isinstance(x, float) else pd.isna(x), but obviously there's some performance implications there. It's probably better to suppress the unreachable warning than take the hit.

Dr-Irv commented 1 month ago

We could update to include the Literal[float('NaN')] in the overloads, but (1) I'm not sure an instantiation like that is actually supported and (2) it may not even solve the issue, if type checkers infer it as a regular float.

I thought that might work, but the typing system doesn't support it. Found this discussion: https://github.com/python/typing/issues/1160

Yes, I think the basic problem here is that TypeIs requires a clean division of types. It doesn't allow for a function that takes input type A | B | C and returns True for all A and some values of C, and False for all B and the remaining values of C.

Well, it might work if we could do Literal[float("nan")], since that would "divide" the types, although I'm not sure if the type checkers are handling that.

steve-mavens commented 1 month ago

Also, there are allowed to be float NaN values distinct from float('nan'), so it could still go wrong in even rarer cases.

Is there a way to import pandas.api.typing.NAType (or pandas.libs._missing.NAType) that the stubs approve of? The reason I found this quirk of the type hint is actually because of a bug in my real code. I have a variable taken from an unpacked row of a DataFrame, and I typed it as float, when it could also be pd.NA.

steve-mavens commented 1 month ago

So, another workaround is to make sure variables/expressions that might contain a float NaN are typed as float | NAType, or float | None, so that mypy doesn't think the return value of pd.isna is guaranteed False by the input type.

Dr-Irv commented 1 month ago

Is there a way to import pandas.api.typing.NAType (or pandas.libs._missing.NAType) that the stubs approve of?

Right now, if you do from pandas._libs.missing import NAType , that should work.

We're behind on updating the stubs for all 2.x support, which means that pandas.api.typing doesn't exist yet in the stubs.

steve-mavens commented 1 month ago

Great, thanks for your help. I'll probably use None until NAType is public in the stubs, since either way I just need to tell mypy that pd.isna might return True.

gboeing commented 1 month ago

So, another workaround is to make sure variables/expressions that might contain a float NaN are typed as float | NAType, or float | None, so that mypy doesn't think the return value of pd.isna is guaranteed False by the input type.

I ran into the same issue today, and resolved it by typing my variable like value: float | None