raj-patra opened 1 year ago
Could you give a reproducible example using a sqlite `:memory:` database?
Looks like the bug is not reproducible with sqlite connections. I tested with a sqlite in-memory database, and the result was the expected behavior.
My guess is that any DBAPI2 connection other than sqlite will show the same unexpected behavior.
```python
import sqlite3
import pandas as pd

# uri=True is required for sqlite3 to interpret the file: URI
conn = sqlite3.connect('file:cachedb?mode=memory&cache=shared', uri=True)

df = pd.DataFrame([{"A": 1, "B": 2}, {"A": 3, "B": 4}, {"A": 1, "B": 2}, {"A": 3, "B": 4}])
print(df)

df.to_sql("test", conn, if_exists="replace")
resp = pd.read_sql("select * from test", conn)
print(resp)
```
You could also provide a reproducible example for MySQL/Postgres.
Okay, so after an eternity of digging, I found that the issue is with the `as_dict=True` parameter of the `pymssql` package.
With the following connection, pandas v2.0.0 gives the unexpected results described in the issue:

```python
conn = pymssql.connect(user=user, password=password, host=host, database=db, as_dict=True, autocommit=True)
```

With the following connection, pandas v2.0.0 gives the expected results:

```python
conn = pymssql.connect(user=user, password=password, host=host, database=db, autocommit=True)
```
I have no clue as to why that affects the result set, but now we know exactly what the issue is.
I will leave it up to the maintainers to decide whether we have to change the pandas code or mention somewhere in the documentation that `as_dict=True` will produce unexpected results.

I rest my case. ✌️
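For anyone unfamiliar with the flag: `as_dict=True` changes the shape of what the cursor's `fetchall()` returns, from a list of tuples to a list of dicts. A rough illustration with made-up values (no real server involved):

```python
# Hypothetical pymssql fetchall() result shapes (values invented):
rows_default = [(1, 2), (3, 4)]                      # as_dict=False (default)
rows_as_dict = [{"A": 1, "B": 2}, {"A": 3, "B": 4}]  # as_dict=True
```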
I can add a few more examples of what works and what does not...
**WORKS**

```python
# Dict cursor + manual fetch: pd.DataFrame handles a list of dicts correctly
conn = pymssql.connect(user=user, password=password, host=host, database=db, as_dict=True, autocommit=True)
with conn.cursor() as cursor:
    cursor.execute(QUERY)
    res = pd.DataFrame(cursor.fetchall())
```

```python
# Plain (tuple-returning) connection + read_sql_query
conn = pymssql.connect(user=user, password=password, host=host, database=db, autocommit=True)
res = pd.read_sql_query(QUERY, conn)
```
**DOESN'T WORK**

```python
# Plain cursor + manual fetch: rows are tuples, so pd.DataFrame gets no column names
conn = pymssql.connect(user=user, password=password, host=host, database=db, autocommit=True)
with conn.cursor() as cursor:
    cursor.execute(QUERY)
    res = pd.DataFrame(cursor.fetchall())
```

```python
# Dict connection + read_sql_query: the bug described in this issue
conn = pymssql.connect(user=user, password=password, host=host, database=db, as_dict=True, autocommit=True)
res = pd.read_sql_query(QUERY, conn)
```
@phofl @mroeschke Any updates on this?
Is the team going to look at the issue? Let me know if any more clarification is needed.
Investigations are welcome
@phofl
Upon some investigation, I found that the bug might be in the following function: https://github.com/pandas-dev/pandas/blob/bd5ed2f6e2ef8713567b5c731a00e74d93caaf64/pandas/io/sql.py#L166

In version v1.5.x, the `_wrap_result` function wrapped the list of dicts directly into a DataFrame with the `from_records` function:
https://github.com/pandas-dev/pandas/blob/2e218d10984e9919f0296931d92ea851c6a6faf5/pandas/io/sql.py#L146
But with version v2.0.0 it calls a newly added function, `_convert_arrays_to_dataframe`, in the following line:
https://github.com/pandas-dev/pandas/blob/bd5ed2f6e2ef8713567b5c731a00e74d93caaf64/pandas/io/sql.py#L176
This function uses the `lib.to_object_array_tuples` function, which according to its documentation does the following:

> Convert a list of tuples into an object array. Any subclass of tuple in rows will be casted to tuple.

https://github.com/pandas-dev/pandas/blob/bd5ed2f6e2ef8713567b5c731a00e74d93caaf64/pandas/io/sql.py#L148
What it fails to consider is that the data this lib function receives might not be a list of tuples; it could instead be a list of dicts. It therefore returns incorrect data when it is handed a list of dicts coming from `cursor.fetchall()` here:
https://github.com/pandas-dev/pandas/blob/bd5ed2f6e2ef8713567b5c731a00e74d93caaf64/pandas/io/sql.py#L2303
I am almost certain this occurs in the `SQLiteDatabase` class, which is also the fallback when the SQLAlchemy connector is missing (for example, for DBAPI2 connections with result sets as lists of dicts).

Let me know if this paints a correct picture, because I have never done this before. I hope this helps. 🙂
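To make the suspected failure mode concrete, here is a minimal plain-Python illustration (no database needed) of what happens when dict rows are treated like tuple rows, which is my reading of the investigation above:

```python
# Rows as a dict-returning cursor would produce them:
rows = [{"A": 1, "B": 2}, {"A": 3, "B": 4}]

# A tuples-only conversion iterates each row; iterating a dict yields
# its *keys*, so every cell ends up holding a column name:
print([tuple(row) for row in rows])           # [('A', 'B'), ('A', 'B')]

# Extracting the values explicitly gives the intended cells:
print([tuple(row.values()) for row in rows])  # [(1, 2), (3, 4)]
```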
Hi, just wanted to let you know this is still an issue with sqlite as well. I usually use a row factory to return each row as a dict.

Reproducible example using a sqlite `:memory:` database:
```python
import sqlite3
from typing import Any

import pandas as pd

def sqlite3_factory(cursor: Any, row: Any) -> Any:
    """Convert each SQL row into a dict rather than a tuple."""
    d = {}
    for idx, col in enumerate(cursor.description):
        d[col[0]] = row[idx]
    return d

conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3_factory

my_dict = [{'name': 'foo'}, {'name': 'bar'}]
df = pd.DataFrame(my_dict)
df.to_sql(name='example', con=conn, if_exists='replace', index=False)
df = pd.read_sql_query("SELECT * FROM example", conn)
print(df)
```
Expected Output:

| name |
|---|
| foo |
| bar |

Actual Output:

| name |
|---|
| name |
| name |
> [!NOTE]
> Using `sqlite3.Row` is an option but loses dictionary functionality.
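A minimal sketch of that workaround (my assumption being that `sqlite3.Row` avoids the bug because its rows iterate over values rather than keys):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row  # rows support positional *and* name-based access

conn.execute("CREATE TABLE example (name TEXT)")
conn.executemany("INSERT INTO example VALUES (?)", [('foo',), ('bar',)])

row = conn.execute("SELECT * FROM example").fetchone()
print(row['name'], row.keys())  # name-based access still works

# But sqlite3.Row is not a real dict: no .get(), .items(), etc.
df = pd.read_sql_query("SELECT * FROM example", conn)
print(df)  # expected: foo / bar
```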
Has this been resolved? Stumbled upon the same problem this morning.
Still open? I'm facing this issue as well.
Want to report that I am currently facing the same issue with `libsqlite=3.46.0` and `pandas=2.2.2` from the `conda-forge` channel.
@fabbber @samueldy Are you both using cursors/connections that return dicts? Can you please confirm whether you're using the stdlib `sqlite3` library for the connection or a SQLAlchemy string/connectable? The docs mention:

> Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported.

If not, then please create a separate issue. The issue is still open. PRs welcome to fix.
see also: #53028
My use cases are all with the standard sqlite3 library (`import sqlite3`) and with `pd.read_sql`. I am operating with a sqlite3 database stored on disk and am not using `pd.read_sql_table` or SQLAlchemy.
Are you using a dict factory/cursor with the sqlite connection? If not, can you produce a copy-pastable example that reproduces what you're seeing? @samueldy
Yes, now that I look at it, using a dict factory does produce the error. I'm wrapping `sqlite3.connect` like this:
```python
import sqlite3

# Convenience context handler for operations on the database.
# Return records as dictionaries.
# Thanks to https://stackoverflow.com/a/3300514
def dict_factory(cursor, row):
    d = {}
    for idx, col in enumerate(cursor.description):
        d[col[0]] = row[idx]
    return d

class SqliteDatabase:
    def __init__(self, db_path: str):
        self.db_path = db_path
        self.connection = sqlite3.connect(database=db_path)
        self.connection.row_factory = dict_factory
        self.cursor = self.connection.cursor()

    def __enter__(self):
        return self.connection, self.cursor

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.connection.commit()
        self.connection.close()
```
then fetch data like this:

```python
with SqliteDatabase(db_path=my_db_path) as (
    conn,
    cur,
):
    ase_interface.show_entire_df(
        pd.read_sql("SELECT * from STOP ORDER BY StopID DESC LIMIT 5", con=conn)
    )
```
With the connection row factory enabled, I get the column names as values:

|   | StopID | Name | DivisionID |
|---|---|---|---|
| 0 | StopID | Name | DivisionID |
| 1 | StopID | Name | DivisionID |
Omitting the `self.connection.row_factory = dict_factory` line above (or using the same code with `pandas=1.1.3`) gives the expected result:

|   | StopID | Name | DivisionID |
|---|---|---|---|
| 0 | 2 | Octave 4 | 2 |
| 1 | 1 | Flute 8 | 1 |
Yet with the row factory enabled on `pandas=2.2.2`, `cursor.fetchall` still returns the correct results:

```python
with SqliteDatabase(db_path=my_db_path) as (
    conn,
    cur,
):
    cur.execute("SELECT * from STOP ORDER BY StopID DESC LIMIT 5")
    print(cur.fetchall())
```

```
[{'StopID': 2, 'Name': 'Octave 4', 'DivisionID': 2}, {'StopID': 1, 'Name': 'Flute 8', 'DivisionID': 1}]
```
@samueldy Since you're already using a dict factory, you could replace your `pd.read_sql` call with `pd.DataFrame.from_records(cur.fetchall())`.
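For completeness, a sketch of that workaround reusing the `SqliteDatabase` helper from the earlier comment (same query, just bypassing `pd.read_sql`):

```python
# Workaround sketch: fetch via the dict-returning cursor and build the
# frame from records; from_records handles a list of dicts correctly.
with SqliteDatabase(db_path=my_db_path) as (conn, cur):
    cur.execute("SELECT * from STOP ORDER BY StopID DESC LIMIT 5")
    df = pd.DataFrame.from_records(cur.fetchall())
```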
The change was introduced in #50048, changing

```python
frame = DataFrame.from_records(data, columns=columns, coerce_float=coerce_float)
```

to

```python
frame = _convert_arrays_to_dataframe(data, columns, coerce_float, dtype_backend)
```

which assumes a list of tuples (and iterating over a dict gives you the keys instead of the values).
One way to address this might be to revert and add `frame.convert_dtypes(dtype_backend=...)` to keep the `dtype_backend` enhancement, but that would likely be less performant.

Another alternative could be checking for a list of dicts and handling those, but you could theoretically replace the row_factory with anything. Maybe we restrict to assuming lists of tuples/dicts and state that in the docs.

@phofl if you have any thoughts
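Purely as an illustration of the second alternative (not actual pandas code), the dict-handling check might look something like this, with `_rows_to_frame` being a hypothetical stand-in for the internal conversion:

```python
import pandas as pd

def _rows_to_frame(data, columns, coerce_float=True):
    # Hypothetical sketch: if the cursor handed back dict rows, pull out
    # the values explicitly before building the frame, since iterating a
    # dict yields keys, not values. Assumes each dict's value order
    # matches the column order, as DBAPI2 dict cursors typically ensure.
    if data and isinstance(data[0], dict):
        data = [tuple(row.values()) for row in data]
    return pd.DataFrame.from_records(data, columns=columns, coerce_float=coerce_float)
```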
Sounds good. As an end user I have no issue with `pd.read_sql`/`pd.read_sql_query` assuming a list of tuples as long as it's in the docs.
To be a bit more specific, the current behavior only applies to DBAPI2 connections used directly. If a SQLAlchemy connection/string is used, it follows a different path.
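For example (a sketch, assuming SQLAlchemy is installed), routing the same sqlite work through an engine takes the SQLAlchemy path rather than the DBAPI2 fallback:

```python
import pandas as pd
import sqlalchemy

# An engine (or a connection string) routes read_sql_query through SQLAlchemy:
engine = sqlalchemy.create_engine("sqlite:///:memory:")
pd.DataFrame([{"name": "foo"}, {"name": "bar"}]).to_sql("example", engine, index=False)
print(pd.read_sql_query("SELECT * FROM example", engine))  # foo / bar
```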
Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
In the latest version of pandas (v2.0.0), the `read_sql_query` function does not seem to work. I am using a `pymssql` connection, and when executing a query the result set is fetched and the number of rows is intact, but the column names are duplicated in the cell values. The functionality works as expected in the v1.5.3 version. The result set looks like this with v2.0.0 (unexpected behavior):
Expected Behavior
Installed Versions