BUG: Pandas converts int to float while reading from database if column has Null in it

mkleehammer / pyodbc

Python ODBC bridge

https://github.com/mkleehammer/pyodbc/wiki

MIT No Attribution

2.94k stars 562 forks source link

BUG: Pandas converts int to float while reading from database if column has Null in it #961

Closed AltafHussain4748 closed 2 years ago

AltafHussain4748 commented 3 years ago

I am trying to read a stored procedure by using pandas.read_sql method in Python by using the pyodbc driver. The issue I have is when Pandas reads data, it converts int columns to float which have Null values in it.

v-chojas commented 3 years ago

Could you post an ODBC trace and more information about your environment?

gordthompson commented 3 years ago

Also being discussed on Stack Overflow here.

gordthompson commented 3 years ago

I am unable to reproduce your issue using ODBC Driver 17 for SQL Server:

df = pd.read_sql_query("SELECT CAST(1 AS int) AS foo", engine)
print(df.info())
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   foo     1 non-null      int64
dtypes: int64(1)
memory usage: 136.0 bytes
None
"""

What ODBC driver are you using? Can you supply a minimal reproducible example?

AltafHussain4748 commented 3 years ago

@gordthompson

import pandas as pd li = [{'rowid': 976, 'INSTRPERMID': 8590928689}, {'rowid': 952, 'INSTRPERMID': None}] print(pd.DataFrame.from_dict(li))

Gives

rowid INSTRPERMID 0 976 8.590929e+09 1 952 NaN

v-chojas commented 3 years ago

You have yet to mention any details about your environment. What DBMS, OS, ODBC driver, Python version, etc.? If you don't supply that information we will be unable to assist.

AltafHussain4748 commented 3 years ago

@v-chojas
DBMS: MYSQL OS: UBUNTU ODBC DRIVER: PYODBC Python: >3.7

v-chojas commented 3 years ago

See https://github.com/mkleehammer/pyodbc/wiki/Troubleshooting-%E2%80%93-Generating-an-ODBC-trace-log and attach an ODBC trace.

gordthompson commented 3 years ago

I am able to reproduce the issue, but it is not a problem with pyodbc per se; it is a pandas issue. I get the same result if I use mysqlclient, which is SQLAlchemy's preferred DBAPI layer for working with MySQL:

connection_uri = 'mysql+mysqldb://scott:tiger@192.168.0.199/test'
engine = sa.create_engine(connection_uri, echo=True)

sql = """\
SELECT 1 AS foo
UNION ALL
SELECT NULL AS foo
"""
df = pd.read_sql_query(sql, engine)
print(df)
"""
   foo
0  1.0
1  NaN
"""
print(df.info())
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   foo     1 non-null      float64
dtypes: float64(1)
memory usage: 144.0 bytes
None
"""

sebarti commented 2 years ago

This is not a bug but normal behavior of pandas. Pandas reflects missing data as np.nan which itself is of dtype float64. This is why pandas casts series of integers that is missing values to float64 dtype (link).