pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Type Annotation Inconsistency in read_sql_* Functions #60085

Open FakeNameSE opened 2 weeks ago

FakeNameSE commented 2 weeks ago

Reproducible Example

import pandas as pd 
import sqlite3

date_params = {"date_col": {"utc": True}}

with sqlite3.connect("blah") as con:
    # Fails type check.
    df = pd.read_sql_query("SELECT * FROM tablename", con, parse_dates=date_params)
    print(df)

Issue Description

The pandas type annotations for the parse_dates argument of read_sql_table() and read_sql_query() are overly restrictive. They cause type checkers to reject a call that uses parse_dates to pass keyword arguments to to_datetime(), even though the documentation describes this usage.

To solve this problem, the annotated type for parse_dates just needs to be widened from list[str] | dict[str, str] | None to list[str] | dict[str, str] | dict[str, dict[str, Any]] | None.
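As a rough sketch of the proposed change, the two annotations can be written side by side. The alias names and the read_sql_query_stub function below are hypothetical stand-ins for illustration only; they are not part of pandas, and the stub never touches a database.

```python
from typing import Any, Optional, Union

# Hypothetical alias names, used only for this illustration.
# The annotation the issue reports as too narrow:
ParseDatesCurrent = Optional[Union[list[str], dict[str, str]]]
# The widened annotation proposed in the issue:
ParseDatesProposed = Optional[
    Union[list[str], dict[str, str], dict[str, dict[str, Any]]]
]


def read_sql_query_stub(
    sql: str, parse_dates: ParseDatesProposed = None
) -> ParseDatesProposed:
    """Stand-in for the real function: it only echoes parse_dates back."""
    return parse_dates


# Inferred as dict[str, dict[str, bool]], the type from the report.
date_params = {"date_col": {"utc": True}}

# With the widened annotation, a type checker should accept this call,
# because dict[str, bool] is compatible with dict[str, Any].
result = read_sql_query_stub("SELECT * FROM tablename", parse_dates=date_params)
```

Swapping ParseDatesProposed for ParseDatesCurrent in the stub's signature should reproduce the kind of complaint described in this report when the file is run through a type checker.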

This problem is not always visible because the corresponding pandas-stubs package already uses the wider type. The inconsistency does appear, however, in type checkers for which the additional stubs are not installed or configured.

To illustrate, take the provided (valid) example and run pyright on it (with no arguments). It will output the following.

(bug_venv)$ pyright example.py
/home/user/Code/pandas_bug/example.py
  /home/user/Code/pandas_bug/example.py:8:10 - error: No overloads for "read_sql_query" match the provided arguments (reportCallIssue)
  /home/user/Code/pandas_bug/example.py:8:72 - error: Argument of type "dict[str, dict[str, bool]]" cannot be assigned to parameter "parse_dates" of type "list[str] | dict[str, str] | None" in function "read_sql_query"
    Type "dict[str, dict[str, bool]]" is not assignable to type "list[str] | dict[str, str] | None"
      "dict[str, dict[str, bool]]" is not assignable to "list[str]"
      "dict[str, dict[str, bool]]" is not assignable to "dict[str, str]"
        Type parameter "_VT@dict" is invariant, but "dict[str, bool]" is not the same as "str"
        Consider switching from "dict" to "Mapping" which is covariant in the value type
      "dict[str, dict[str, bool]]" is not assignable to "None" (reportArgumentType)
2 errors, 0 warnings, 0 informations
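The invariance hint in that output can be illustrated in isolation. The following is a minimal sketch with hypothetical function names; it is unrelated to the pandas code base and only shows why a type checker treats dict and Mapping differently in the value type.

```python
from collections.abc import Mapping


def takes_dict(options: dict[str, object]) -> int:
    """Accepts only dict[str, object]: dict is invariant in its value type."""
    return len(options)


def takes_mapping(options: Mapping[str, object]) -> int:
    """Accepts any Mapping[str, T]: Mapping is covariant in its value type."""
    return len(options)


flags: dict[str, bool] = {"utc": True}

# A type checker is expected to reject the next line, because
# dict[str, bool] is not the same type as dict[str, object]:
# takes_dict(flags)

# The read-only Mapping view is accepted, since bool is a subtype of object.
count = takes_mapping(flags)
```

Both calls would succeed at runtime; the difference shows up only during static analysis.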

I am more than happy to submit a pull request for this if desired, but thought it best to open this issue first in case I am missing something.

Expected Behavior

import pandas as pd
import sqlite3

date_params = {"date_col": {"utc": True}}

with sqlite3.connect("blah") as con:
    # Type checks correctly
    df = pd.read_sql_query("SELECT * FROM tablename", con, parse_dates=date_params)
    print(df)

Installed Versions

INSTALLED VERSIONS
------------------
commit                : 0691c5cf90477d3503834d983f69350f250a6ff7
python                : 3.12.6
python-bits           : 64
OS                    : Linux
OS-release            : 6.11.2-arch1-1
Version               : #1 SMP PREEMPT_DYNAMIC Fri, 04 Oct 2024 21:51:11 +0000
machine               : x86_64
processor             :
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 2.2.3
numpy                 : 2.1.2
pytz                  : 2024.2
dateutil              : 2.9.0.post0
pip                   : 24.2
Cython                : None
sphinx                : None
IPython               : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
blosc                 : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : None
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
psycopg2              : None
pymysql               : None
pyarrow               : None
pyreadstat            : None
pytest                : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.2
qtpy                  : None
pyqt5                 : None

rhshadrach commented 6 days ago

Thanks for the report!

This problem is not always visible because the corresponding pandas-stubs package already uses the wider type. The inconsistency does appear, however, in type checkers for which the additional stubs are not installed or configured.

It seems to me this is not appropriate. PEP 561 makes this quite clear I think:

Package maintainers who wish to support type checking of their code MUST add a marker file named py.typed to their package supporting typing.

Since pandas does not have a py.typed file, its type-hints should not be considered public. I only mention this to say that I think pandas should not be obligated to spend unnecessary effort in order to support third parties that use its internal type-hints.

Of course, in cases where the change would benefit pandas internal typing (as is the case here I believe), PRs are welcome!
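For readers who want to check whether an installed package advertises its inline type hints through the PEP 561 marker discussed above, here is a minimal sketch. The helper name has_py_typed is hypothetical, and the check is deliberately simple: it looks only for a py.typed file in the top-level package directory and ignores stub-only distributions.

```python
from importlib.util import find_spec
from pathlib import Path


def has_py_typed(package_name: str) -> bool:
    """Return True if the top-level package directory contains py.typed.

    Hypothetical helper: pass a top-level package name such as "pandas",
    not a dotted submodule path.
    """
    spec = find_spec(package_name)
    if spec is None or spec.submodule_search_locations is None:
        # Not installed, or a single-file module rather than a package.
        return False
    return any(
        (Path(location) / "py.typed").is_file()
        for location in spec.submodule_search_locations
    )
```

Calling has_py_typed("pandas") in an environment where pandas is installed shows whether that particular installation ships the marker file.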