pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.91k stars 18.03k forks source link

BUG: pd.read_sql_query with chunksize = 0 should be treated like chunksize = None and it should return a DataFrame #35573

Open paolobellomo opened 4 years ago

paolobellomo commented 4 years ago

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

# the details of the query are immaterial
from sqlalchemy import create_engine
_ENGINE = create_engine(
    "mssql+pymssql://{user}:{password}@{server}/{database}".format(
        user="my_username",
        password="my_password",
        server="serve.my_company.com",
        database="database_name",
    )
)
query = "SELECT SomeCol FROM SomeTable" 

# chunksize=None, it returns a DataFrame. GOOD
test_df = pd.read_sql_query(small_query, _ENGINE, chunksize=None)
print()
print("chunksize = None")
print(type(test_df))
print(test_df.shape)

# chunksize=100, it returns a Generator. GOOD
test_df = pd.read_sql_query(small_query, _ENGINE, chunksize=100)
print()
print("chunksize = 100")
print(type(test_df))
test_df = pd.concat(test_df, axis=0)
print(test_df.shape)

# chunksize=0, it returns a Generator. BAD
# pd.concat fails because the generator fails to return anything
test_df = pd.read_sql_query(small_query, _ENGINE, chunksize=0)
print()
print("chunksize = 0")
print(type(test_df))
test_df = pd.concat(test_df, axis=0)
print(test_df.shape)

Problem description

read_sql_query: chunksize can be either None or int. The case chunksize = 0 should be treated like chunksize = None and it should return a DataFrame. At the moment it returns a generator which fails to return any object when pd.concat is used Incidentally: the DocString of read_sql_query says it will return an Iterator. Not so.

Expected Output

Output

chunksize = None (1854, 8) chunksize = 100 (1854, 8) chunksize = 0 --------------------------------------------------------------------------- ValueError Traceback (most recent call last) in 16 print("chunksize = 0") 17 print(type(test_df)) ---> 18 test_df = pd.concat(test_df, axis=0) 19 print(test_df.shape) /usr/local/venvs/algovenv9/lib/python3.6/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy) 226 keys=keys, levels=levels, names=names, 227 verify_integrity=verify_integrity, --> 228 copy=copy, sort=sort) 229 return op.get_result() 230 /usr/local/venvs/algovenv9/lib/python3.6/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort) 260 261 if len(objs) == 0: --> 262 raise ValueError('No objects to concatenate') 263 264 if keys is None: ValueError: No objects to concatenate
mathurk1 commented 4 years ago

@paolobellomo can you make the title more descriptive please?

mathurk1 commented 4 years ago

Able to reproduce the BUG in master.

def main():

    sql_stmt = f"""SELECT * FROM public.pandas_test"""

    print(pd.__version__)
    chunk_none_df = pd.read_sql(sql_stmt, con=db_engine(), chunksize=None)
    print(type(chunk_none_df))
    chunk_1_df = pd.read_sql(sql_stmt, con=db_engine(), chunksize=1)
    print(type(chunk_1_df))
    chunk_0_df = pd.read_sql(sql_stmt, con=db_engine(), chunksize=0)
    print(type(chunk_0_df))

if __name__ == "__main__":
    main()

produces:

1.1.0.dev0+2073.g280efbfcc
<class 'pandas.core.frame.DataFrame'>
<class 'generator'>
<class 'generator'>
mathurk1 commented 4 years ago

Happy to look into this if it needs to be implemented and contributions are welcome