pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: read_sql_query duplicates column names in cells in pandas v2.0.0 #52437

Open raj-patra opened 1 year ago

raj-patra commented 1 year ago

Pandas version checks

Reproducible Example

import pandas as pd
import pymssql

conn = pymssql.connect(user=user, password=password, host=host, database=db, as_dict=True, autocommit=True)

result_set = pd.read_sql_query("select * from table", conn)
print(result_set.head())

Issue Description

In the latest version of pandas (v2.0.0), the read_sql_query function returns incorrect results.

I am using a pymssql connection, and when executing a query the result set is fetched with the number of rows intact, but the column names are duplicated into every cell value.

The functionality works as expected in v1.5.3.

The result set looks like this with v2.0.0 (Unexpected Behavior)

Column 1 Column 2 Column 3
Column 1 Column 2 Column 3
Column 1 Column 2 Column 3
Column 1 Column 2 Column 3

Expected Behavior


Column 1 Column 2 Column 3
1 4 7
2 5 8
3 6 9

Installed Versions

INSTALLED VERSIONS
------------------
commit : 478d340667831908b5b4bf09a2787a11a14560c9
python : 3.10.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_India.1252
pandas : 2.0.0
numpy : 1.24.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 58.1.0
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.9
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : 4.12.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

mroeschke commented 1 year ago

Could you give a reproducible example using a sqlite ":memory:" database?

raj-patra commented 1 year ago

Looks like the bug is not reproducible with sqlite connections. I tested with a sqlite in-memory database, and the result was the expected behavior.

My guess is that any DBAPI2 connection other than sqlite will show the same unexpected behavior.

import sqlite3
import pandas as pd

conn = sqlite3.connect('file:cachedb?mode=memory&cache=shared', uri=True)  # uri=True is needed for the file: URI syntax

df = pd.DataFrame([{"A":1,"B":2}, {"A":3,"B":4}, {"A":1,"B":2}, {"A":3,"B":4}])
print(df)

df.to_sql("test", conn, if_exists="replace")

resp = pd.read_sql("select * from test", conn)
print(resp)

phofl commented 1 year ago

Could you also provide a reproducible example for mysql/postgres?

raj-patra commented 1 year ago

Okay, so after an eternity of digging in, I found that the issue is with the as_dict=True parameter of the pymssql package.

With the following connection, pandas v2.0.0 gives the unexpected results described above.

conn = pymssql.connect(user=user, password=password, host=host, database=db, as_dict=True, autocommit=True)

With the following connection, pandas v2.0.0 gives the expected results.

conn = pymssql.connect(user=user, password=password, host=host, database=db, autocommit=True)

I have no clue as to why that might affect the result set, but now we know exactly what the issue is.
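
For reference, a sketch of what the pymssql cursor returns in each mode (per pymssql's documented as_dict behavior; the values shown are illustrative):

cur = conn.cursor()
cur.execute("select * from table")
rows = cur.fetchall()
# as_dict=True -> [{'Column 1': 1, 'Column 2': 4, 'Column 3': 7}, ...]  (list of dicts)
# default      -> [(1, 4, 7), ...]                                      (list of tuples)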

I will leave it up to the maintainers to decide whether we have to change the pandas code or mention in the documentation that as_dict=True will produce unexpected results.

I rest my case. ✌️

raj-patra commented 1 year ago

I can add a few more examples of what works and what does not...

WORKS

conn = pymssql.connect(user=user, password=password, host=host, database=db, as_dict=True, autocommit=True)
with conn.cursor() as cursor:
    cursor.execute(QUERY)
    res = pd.DataFrame(cursor.fetchall())

conn = pymssql.connect(user=user, password=password, host=host, database=db, autocommit=True)
res = pd.read_sql_query(QUERY, conn)

DOESN'T WORK

conn = pymssql.connect(user=user, password=password, host=host, database=db, autocommit=True)
with conn.cursor() as cursor:
    cursor.execute(QUERY)
    res = pd.DataFrame(cursor.fetchall())

conn = pymssql.connect(user=user, password=password, host=host, database=db, as_dict=True, autocommit=True)
res = pd.read_sql_query(QUERY, conn)
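
As the investigation below suggests, the pattern is consistent with pd.DataFrame handling a list of dicts natively (keys become columns) while read_sql_query's fallback path assumes tuple rows. A plain-Python sketch of that contrast:

import pandas as pd

dict_rows = [{"A": 1, "B": 2}, {"A": 3, "B": 4}]
tuple_rows = [(1, 2), (3, 4)]

print(pd.DataFrame(dict_rows))   # columns A, B inferred from the dict keys
print(pd.DataFrame(tuple_rows))  # columns 0, 1 -- no names available positionally
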

raj-patra commented 1 year ago

@phofl @mroeschke Any updates on this?

Is the team gonna look at the issue? Let me know if there is any more clarification needed.

phofl commented 1 year ago

Investigations are welcome

raj-patra commented 1 year ago

@phofl

Upon some investigation, I found that the bug might be in the following function: https://github.com/pandas-dev/pandas/blob/bd5ed2f6e2ef8713567b5c731a00e74d93caaf64/pandas/io/sql.py#L166

In v1.5.x, the _wrap_result function wrapped the list of dicts into a DataFrame directly with from_records: https://github.com/pandas-dev/pandas/blob/2e218d10984e9919f0296931d92ea851c6a6faf5/pandas/io/sql.py#L146

But in v2.0.0 it calls a newly added function, _convert_arrays_to_dataframe, on the following line: https://github.com/pandas-dev/pandas/blob/bd5ed2f6e2ef8713567b5c731a00e74d93caaf64/pandas/io/sql.py#L176

This function uses lib.to_object_array_tuples, whose documentation says: "Convert a list of tuples into an object array. Any subclass of tuple in rows will be casted to tuple." https://github.com/pandas-dev/pandas/blob/bd5ed2f6e2ef8713567b5c731a00e74d93caaf64/pandas/io/sql.py#L148

What it fails to consider is that the data this lib function receives might not be a list of tuples; it could be a list of dicts. It therefore returns incorrect data when the list of dicts coming from cursor.fetchall() here is passed to it: https://github.com/pandas-dev/pandas/blob/bd5ed2f6e2ef8713567b5c731a00e74d93caaf64/pandas/io/sql.py#L2303
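
A quick illustration of the failure mode in plain Python: iterating over a dict yields its keys, so tuple-style conversion of a dict row produces the column names instead of the values.

row = {"Column 1": 1, "Column 2": 4, "Column 3": 7}
print(tuple(row))           # ('Column 1', 'Column 2', 'Column 3') -- the keys
print(tuple(row.values()))  # (1, 4, 7) -- the values we actually wanted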

I am almost certain this occurs in the SQLiteDatabase class, which is also the fallback when the SQLAlchemy connector is missing (for example, DBAPI2 connections with result sets as lists of dicts).

Let me know if this paints a correct picture because I have never done this before. I hope this helps. 🙂

pwillen commented 1 year ago

Hi, just wanted to let you know this is still an issue with sqlite as well. I usually use a row factory to return each row as a dict. Reproducible example using a sqlite ":memory:" database:

import sqlite3
from typing import Any

import pandas as pd

def sqlite3_factory(cursor: Any, row: Any) -> Any:
    """ Convert SQL into a Dict rather than Tuple
    :param cursor:
    :param row:
    :return:
    """
    d = {}
    for idx, col in enumerate(cursor.description):
        d[col[0]] = row[idx]
    return d

conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3_factory

my_dict = [{'name': 'foo'}, {'name': 'bar'}]

df = pd.DataFrame(my_dict)
df.to_sql(name='example', con=conn, if_exists='replace', index=False)
df = pd.read_sql_query("SELECT * FROM example", conn)
Expected Output:

name
foo
bar

Actual Output:

name
name
name

Note: Using sqlite3.Row is an option, but it loses dictionary functionality
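
For reference, a minimal sketch of the sqlite3.Row alternative mentioned in the note (rows keep key access via indexing and .keys(), but are not plain dicts):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row  # tuple-like rows with named access
conn.execute("CREATE TABLE example (name TEXT)")
conn.executemany("INSERT INTO example VALUES (?)", [("foo",), ("bar",)])
row = conn.execute("SELECT * FROM example").fetchone()
print(row["name"], row.keys())  # foo ['name']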

macllc2021 commented 5 months ago

Has this been resolved? Stumbled upon the same problem this morning.

fabbber commented 4 months ago

Still open? I'm facing an issue with this as well.

samueldy commented 4 months ago

Want to report that I am currently facing the same issue with libsqlite=3.46.0 and pandas=2.2.2 from the conda-forge channel.

asishm commented 4 months ago

@fabbber @samueldy Are you both using cursors/connections that return dicts? Can you please confirm whether you're using the stdlib sqlite3 library for the connection or a sqlalchemy string/connectable? The docs mention:

Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported.

If not, then please create a separate issue. The issue is still open. PRs welcome to fix.

see also: #53028

samueldy commented 4 months ago

My use cases are all with the standard sqlite3 library (import sqlite3) and with pd.read_sql. I am operating with a sqlite3 database stored on disk and am not using pd.read_sql_table or SQLAlchemy.

asishm commented 4 months ago

Are you using a dict factory/cursor with the sqlite connection? If not, can you produce a copy-pastable example that reproduces what you're seeing? @samueldy

samueldy commented 4 months ago

Yes, now that I look at it, using a dict factory does produce the error. I'm wrapping sqlite3.connect like this:

import sqlite3

# Convenience context handler for operations on the database.
# Returns records as dictionaries.
# Thanks to https://stackoverflow.com/a/3300514

def dict_factory(cursor, row):
    d = {}
    for idx, col in enumerate(cursor.description):
        d[col[0]] = row[idx]
    return d

class SqliteDatabase:
    def __init__(self, db_path: str):
        self.db_path = db_path

        self.connection = sqlite3.connect(database=db_path)
        self.connection.row_factory = dict_factory
        self.cursor = self.connection.cursor()

    def __enter__(self):
        return self.connection, self.cursor

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.connection.commit()
        self.connection.close()

then fetch data like this:

with SqliteDatabase(db_path=my_db_path) as (
    conn,
    cur,
):
    ase_interface.show_entire_df(
        pd.read_sql("SELECT * from STOP ORDER BY StopID DESC LIMIT 5", con=conn)
    )

With the connection row factory enabled, I get the column names as values:

StopID Name DivisionID
0 StopID Name DivisionID
1 StopID Name DivisionID

Omitting the self.connection.row_factory = dict_factory line above (or using the same code with pandas=1.1.3) gives the expected result:

StopID Name DivisionID
0 2 Octave 4 2
1 1 Flute 8 1

samueldy commented 4 months ago

Yet having the row factory enabled in pandas=2.2.2 still returns the correct results with cursor.fetchall:

with SqliteDatabase(db_path=my_db_path) as (
    conn,
    cur,
):
    cur.execute("SELECT * from STOP ORDER BY StopID DESC LIMIT 5")
    print(cur.fetchall())
[{'StopID': 2, 'Name': 'Octave 4', 'DivisionID': 2}, {'StopID': 1, 'Name': 'Flute 8', 'DivisionID': 1}]

asishm commented 4 months ago

@samueldy Since you're already using a dict factory, you could replace your pd.read_sql call with pd.DataFrame.from_records(cur.fetchall())
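
For example, a sketch reusing the SqliteDatabase wrapper and query from earlier in the thread:

with SqliteDatabase(db_path=my_db_path) as (conn, cur):
    cur.execute("SELECT * from STOP ORDER BY StopID DESC LIMIT 5")
    df = pd.DataFrame.from_records(cur.fetchall())  # dict rows keep their column names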

The change was introduced in #50048, changing

frame = DataFrame.from_records(data, columns=columns, coerce_float=coerce_float)

to

frame = _convert_arrays_to_dataframe(data, columns, coerce_float, dtype_backend)

which assumes a list of tuples (and iterating over a dict gives you the keys instead of the values).

One way to address this might be to revert that change and add frame.convert_dtypes(dtype_backend=...) to keep the dtype_backend enhancement, but that would likely be less performant.

Another alternative could be checking for a list of dicts and handling it, but you could theoretically replace the row_factory with anything. Maybe we restrict to assuming lists of tuples/dicts and state that in the docs, as in the sketch below.
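
A minimal sketch of that second alternative (hypothetical; _wrap_rows is an illustrative name, not the actual pandas code):

import pandas as pd

def _wrap_rows(data, columns, coerce_float=True):
    # Hypothetical dispatch: dict rows (from a dict row_factory or
    # pymssql's as_dict=True) go through from_records, which reads
    # values by key; tuple rows keep the positional constructor.
    if data and isinstance(data[0], dict):
        return pd.DataFrame.from_records(data, columns=columns, coerce_float=coerce_float)
    return pd.DataFrame(data, columns=columns)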

@phofl if you have any thoughts

samueldy commented 4 months ago

Sounds good. As an end user I have no issue with pd.read_sql/pd.read_sql_query assuming a list of tuples as long as it's in the docs.

asishm commented 4 months ago

To be a bit more specific, the current behavior only applies to DBAPI2 connections used directly. If a SQLAlchemy connection/string is used, it follows a different path.
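
For illustration, a sketch of the SQLAlchemy route (assumes sqlalchemy is installed; the in-memory URL is just an example):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")
pd.DataFrame({"name": ["foo", "bar"]}).to_sql("example", engine, index=False)
print(pd.read_sql_query("SELECT * FROM example", engine))  # column values come back correctly, not duplicated headers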