pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.57k stars 17.56k forks source link

BUG: to_sql does gives incorrect column name for index when callable passed in to method #59112

Open hb2638 opened 4 days ago

hb2638 commented 4 days ago

Pandas version checks

Reproducible Example

import math
import typing
import uuid

import pandas
import sqlalchemy
import sqlalchemy.engine

def insert_callback(table: pandas.io.sql.SQLTable, conn: sqlalchemy.engine.base.Connection, keys: list[str], data_iter: typing.Iterable[tuple]) -> None:
    data = [list(d) for d in data_iter]
    print(f"{data=}")
    print(f"{keys=}")
    print(f"{[c.name for c in table.table.columns]=}")
    sql = f"""
INSERT INTO [{table.name}]({", ".join(f"[{k}]" for k in keys)})
VALUES ({", ".join(f":{i}" for i, _ in enumerate(keys))})
""".strip()
    for values in data:
        conn.execute(sqlalchemy.text(sql), {str(i): v for i, v in enumerate(values)})

input = {
    1: {('A', 'X'): math.nan, ('A', 'Y'): math.nan, ('B', 'X'): math.nan, ('B', 'Y'): math.nan},
    2: {('A', 'X'): math.nan, ('A', 'Y'): math.nan, ('B', 'X'): math.nan, ('B', 'Y'): math.nan},
    3: {('A', 'X'): math.nan, ('A', 'Y'): math.nan, ('B', 'X'): math.nan, ('B', 'Y'): math.nan},
    4: {('A', 'X'): math.nan, ('A', 'Y'): math.nan, ('B', 'X'): math.nan, ('B', 'Y'): math.nan},
    5: {('A', 'X'): math.nan, ('A', 'Y'): math.nan, ('B', 'X'): math.nan, ('B', 'Y'): math.nan},
}
df = pandas.DataFrame.from_dict(input, "index")
conn_url = sqlalchemy.engine.URL.create("mssql+pyodbc", query={"odbc_connect": "DRIVER={ODBC Driver 18 for SQL Server};SERVER=.;Trusted_Connection=yes;TrustServerCertificate=yes"})
engine = sqlalchemy.create_engine(conn_url, echo=True)
with engine.connect() as conn:
    with conn.begin():
        df.to_sql(f"insert_callback_{str(uuid.uuid4())}", conn, index=True, method=insert_callback)
        conn.rollback()

Issue Description

When calling pandas.DataFrame.to_sql with index set to True, it gives the the wrong name for the SQL column used for the index.

E.x.: In the code I provided, it creates a column called index as the index but it passed ('index', '') as the index column name in the list of columns.

Below is the output from sql alchemy

2024-06-26 16:02:23,136 INFO sqlalchemy.engine.Engine SELECT CAST(SERVERPROPERTY('ProductVersion') AS VARCHAR)
2024-06-26 16:02:23,136 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-06-26 16:02:23,137 INFO sqlalchemy.engine.Engine SELECT schema_name()
2024-06-26 16:02:23,137 INFO sqlalchemy.engine.Engine [generated in 0.00014s] ()
2024-06-26 16:02:23,138 INFO sqlalchemy.engine.Engine SELECT CAST('test max support' AS NVARCHAR(max))
2024-06-26 16:02:23,138 INFO sqlalchemy.engine.Engine [generated in 0.00013s] ()
2024-06-26 16:02:23,138 INFO sqlalchemy.engine.Engine SELECT 1 FROM fn_listextendedproperty(default, default, default, default, default, default, default)
2024-06-26 16:02:23,138 INFO sqlalchemy.engine.Engine [generated in 0.00011s] ()
2024-06-26 16:02:23,139 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-06-26 16:02:23,151 INFO sqlalchemy.engine.Engine SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] 
FROM [INFORMATION_SCHEMA].[TABLES] 
WHERE ([INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max)) OR [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max))) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max))
2024-06-26 16:02:23,151 INFO sqlalchemy.engine.Engine [generated in 0.00032s] ('BASE TABLE', 'VIEW', 'insert_callback_5ff921a3-15b6-474b-af16-db1d7e243f06', 'dbo')
2024-06-26 16:02:23,153 INFO sqlalchemy.engine.Engine 
CREATE TABLE [insert_callback_5ff921a3-15b6-474b-af16-db1d7e243f06] (
    [index] BIGINT NULL, 
    [('A', 'X')] FLOAT(53) NULL, 
    [('A', 'Y')] FLOAT(53) NULL, 
    [('B', 'X')] FLOAT(53) NULL, 
    [('B', 'Y')] FLOAT(53) NULL
)

2024-06-26 16:02:23,153 INFO sqlalchemy.engine.Engine [no key 0.00014s] ()
2024-06-26 16:02:23,155 INFO sqlalchemy.engine.Engine CREATE INDEX [ix_insert_callback_5ff921a3-15b6-474b-af16-db1d7e243f06_index] ON [insert_callback_5ff921a3-15b6-474b-af16-db1d7e243f06] ([index])
2024-06-26 16:02:23,155 INFO sqlalchemy.engine.Engine [no key 0.00012s] ()
data=[[1, None, None, None, None], [2, None, None, None, None], [3, None, None, None, None], [4, None, None, None, None], [5, None, None, None, None]]
keys=["('index', '')", "('A', 'X')", "('A', 'Y')", "('B', 'X')", "('B', 'Y')"]
[c.name for c in table.table.columns]=['index', "('A', 'X')", "('A', 'Y')", "('B', 'X')", "('B', 'Y')"]
2024-06-26 16:02:23,158 INFO sqlalchemy.engine.Engine INSERT INTO [insert_callback_5ff921a3-15b6-474b-af16-db1d7e243f06]([('index', '')], [('A', 'X')], [('A', 'Y')], [('B', 'X')], [('B', 'Y')])
VALUES (?, ?, ?, ?, ?)
2024-06-26 16:02:23,158 INFO sqlalchemy.engine.Engine [generated in 0.00016s] (1, None, None, None, None)
2024-06-26 16:02:23,160 INFO sqlalchemy.engine.Engine ROLLBACK

Expected Behavior

The sql column name of the index should be in the list of keys/columns provided to the call back.

Installed Versions

import pandas as pd pd.show_versions() INSTALLED VERSIONS ------------------ commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140 python : 3.12.1.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19044 machine : AMD64 processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_United States.1252 pandas : 2.2.2 numpy : 1.26.4 pytz : 2024.1 dateutil : 2.9.0.post0 setuptools : 70.1.0 pip : 24.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : 3.2.0 lxml.etree : 5.2.2 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.1.4 IPython : None pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.3 bottleneck : None dataframe-api-compat : None fastparquet : 2024.5.0 fsspec : 2024.6.0 gcsfs : None matplotlib : 3.8.4 numba : None numexpr : None odfpy : None openpyxl : 3.1.3 pandas_gbq : None pyarrow : 16.1.0 pyreadstat : None python-calamine : None pyxlsb : None s3fs : None scipy : 1.11.4 sqlalchemy : 2.0.30 tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2024.1 qtpy : None pyqt5 : None