pandas-dev / pandas-stubs

Public type stubs for pandas
BSD 3-Clause "New" or "Revised" License
233 stars 123 forks source link

read_csv: parameter usecols has wrong type hint #963

Open clo-vis opened 3 months ago

clo-vis commented 3 months ago

To Reproduce

from collections.abc import Sequence
from pandas import read_csv
cols1: Sequence[str] = ["a"]
def cols2(x: set[float]) -> bool:  return sum(x) < 1.0
read_csv("file.csv", usecols=cols1) # mypy: "No overload variant of "read_csv" matches argument types "str", "Sequence[str]"
read_csv("file.csv", usecols=cols2) # no error from mypy

Expected behavior: No error for usecols=cols1, but an error for usecols=cols2

Actual behavior: cols1is not accepted, even though Sequence[str] is "list like" (at least I think so; the term is nowhere defined) and its elements are "strings that correspond to column names" cols2is accepted, even though it is not a callable that can be "evaluated against the column names, returning names where the callable function evaluates to True"

Please complete the following information:

Dr-Irv commented 3 months ago

There are 2 different issues here:

  1. With respect to cols1, when you declare that cols1 is Sequence[str], that means that cols1 could be a list of strings OR a single string. We explicitly disallow a single string as the usecols argument. So you're better off making cols1 a list[str] .
  2. With respect to cols2, we allow a Callable[[HashableT], bool] and sets are hashable. So that is why your code is accepted.

It may be that the solution is to change this line: https://github.com/pandas-dev/pandas-stubs/blob/48ca4b0d70f0695da6d452f7e203f05ad983913a/pandas-stubs/_typing.pyi#L529

to use SequenceNotStr[str | int] and Callable[[str | int], bool], but I'd like to hear the opinion of @twoertwein on this.

clo-vis commented 3 months ago
>>> print(hash({1.0}))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'

sets are not hashable.

clo-vis commented 3 months ago

typing.py

class Hashable(Protocol, metaclass=ABCMeta):
    # TODO: This is special, in that a subclass of a hashable class may not be hashable
    #   (for example, list vs. object). It's not obvious how to represent this. This class
    #   is currently mostly useless for static checking.
    @abstractmethod
    def __hash__(self) -> int: ...
clo-vis commented 3 months ago
from collections.abc import Sequence
from pandas import read_csv

cols1: Sequence[int] = [ 1 ]
read_csv("file_csv", usecols=cols1)

mypy: _error: No overload variant of "readcsv" matches argument types "str", "Sequence[int]" documentation: If list-like, all elements must either be positional (i.e. integer indices into the document columns) or ...

Dr-Irv commented 3 months ago
from collections.abc import Sequence
from pandas import read_csv

cols1: Sequence[int] = [ 1 ]
read_csv("file_csv", usecols=cols1)

mypy: _error: No overload variant of "readcsv" matches argument types "str", "Sequence[int]" documentation: If list-like, all elements must either be positional (i.e. integer indices into the document columns) or ...

That seems to be a mypy error. pyright accepts the above code.

Having said that, pyright accepts the example you had with a Callable, so we'll have to see what (if anything) we can do about that.

clo-vis commented 3 months ago

In strict mode, pyright says _Type of "readcsv" is partially unknown. So if pyright doesn't know everything about read_csv, perhaps it cannot issue an error message either.

Dr-Irv commented 3 months ago

In strict mode, pyright says _Type of "readcsv" is partially unknown. So if pyright doesn't know everything about read_csv, perhaps it cannot issue an error message either.

We only test and support pyright in basic mode. Strict is really hard to make work.

If I change the Callable[[HashableT], bool] to Callable[[Hashable], bool], then pyright in basic mode picks up the error. So a PR with that would solve that problem, at least for pyright. But mypy still doesn't flag the error.

Not sure what to do when mypy has a bug.

clo-vis commented 3 months ago

If I change the Callable[[HashableT], bool] to Callable[[Hashable], bool], then pyright in basic mode picks up the error. So a PR with that would solve that problem,

It would probably create another problem, as

def cols(x: str) -> bool: return x in ["A", "B", "C"]
read_csv("file_csv", usecols=cols)

should be acceptable (and colsis not a Callable[[Hashable], bool])

Dr-Irv commented 3 months ago

should be acceptable (and colsis not a Callable[[Hashable], bool])

Good point. Seems like we need to change it to: Callable[[int], bool] | Callable[[str], bool]

twoertwein commented 3 months ago

to use SequenceNotStr[str | int] and Callable[[str | int], bool], but I'd like to hear the opinion of @twoertwein on this.

I think that could be too strict (but might cover typical usage).

It seems to work as expected on pyright playground when inlining the TypeAlias (I think the issue is that we don't set the generic type of the TypeAlias when using it in the read_csv definition)

Dr-Irv commented 3 months ago

It seems to work as expected on pyright playground when inlining the TypeAlias (I think the issue is that we don't set the generic type of the TypeAlias when using it in the read_csv definition)

So something is going on that is different with what is in pandas-stubs versus the simple test that you wrote.

twoertwein commented 3 months ago

I think it might be sufficient to replace with UsecolArgs with UsecolArgs[HashableT] (pyright in strict mode points this out and says that it therefore assumes "unknown" as the input argument for the Callable part, which explains no error for set).

Dr-Irv commented 3 months ago

I think it might be sufficient to replace with UsecolArgs with UsecolArgs[HashableT] (pyright in strict mode points this out and says that it therefore assumes "unknown" as the input argument for the Callable part, which explains no error for set).

I tried that out locally and it works. So a PR that changes UseColArgsType in pandas-stubs\io\parsers\readers.pyi to UseColArgsType[HashableT] would fix the second issue reported here.

byildiz commented 2 months ago

I have the similar issue with usecols as the type of list of a string.

import pandas as pd
df = pd.read_csv("test.csv", usecols=["a", "b"]) # this line causes type conflict

Pyright gives the following diagnostic message, but according to the pandas docs usecols can be a list of strings. I guess the type annotation of usecols is not accurate.

Diagnostics:
1. No overloads for "read_csv" match the provided arguments [reportCallIssue]
2. Argument of type "list[str]" cannot be assigned to parameter "usecols" of type "UsecolsArgType[Unknown]" in function "read_csv"
     Type "list[str]" is incompatible with type "UsecolsArgType[Unknown]"
       "list[str]" is incompatible with protocol "SequenceNotStr[Hashable]"
         "index" is an incompatible type
           Type "(value: str, start: SupportsIndex = 0, stop: SupportsIndex = sys.maxsize, /) -> int" is incompatible with type "(value: Any, /, start: int = 0, stop: int = ...) -> int"
             Missing keyword parameter "start"
             Missing keyword parameter "stop"
       "list[str]" is incompatible with "range"
       "list[str]" is incompatible with "ExtensionArray"
     ... [reportArgumentType]
Dr-Irv commented 2 months ago

I have the similar issue with usecols as the type of list of a string.

import pandas as pd
df = pd.read_csv("test.csv", usecols=["a", "b"]) # this line causes type conflict

Pyright gives the following diagnostic message, but according to the pandas docs usecols can be a list of strings. I guess the type annotation of usecols is not accurate.

I cannot reproduce this. We explicitly test for lists of strings with usecols, so maybe you are using strict pyright, which might be the cause of your issue.

byildiz commented 2 months ago

@Dr-Irv Thank you for your reply.

When I looked at the Pyright configuration docs, it is stated that for reportArgumentType, all three modes (basic, standard, and strict) have the same reporting severity, which is error. Therefore, I don't think Pyright's type-checking mode is the problem.

On the other hand, when I checked the SequenceNotStr definition, I noticed that the version I have is slightly different from the latest version. I also found this change done 8 months ago:

-    def index(self, value: Any, /, start: int = 0, stop: int = ...) -> int: ...
+    def index(self, value: Any, start: int = ..., stop: int = ..., /) -> int: ...

So, the issue seems to be the version I have. I have still the old version. However, what I don't understand is that I am using the latest version (2.2.2) of Pandas, and I still don't have the change mentioned above. I downloaded the wheel file and inspected it, and it indeed contains the old version of the stub file. This is quite surprising.

I have probed some of the latest pandas packages from both pypi and conda and seen that all of them still contain old version of SequenceNotStr which causes the problem with latest version of pyright.

Dr-Irv commented 2 months ago

I have probed some of the latest pandas packages from both pypi and conda and seen that all of them still contain old version of SequenceNotStr which causes the problem with latest version of pyright.

You need to install pandas-stubs . That will properly type-check your code. It is a separate package.