Pandas.DataFrame.to_sql() Default Insert Behavior Assumes Table Creation

pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

https://pandas.pydata.org

BSD 3-Clause "New" or "Revised" License

43.74k stars 17.95k forks source link

Pandas.DataFrame.to_sql() Default Insert Behavior Assumes Table Creation #29138

Open JordanPavlic opened 5 years ago

JordanPavlic commented 5 years ago

Problem description

Pandas.DataFrame.to_sql() assumes that if no table exists it should create one. In the specific case where a user wants to simply insert records, but not have to_sql() generate a table if it doesn't exist this causes problems. There is also no way to currently override the default behavior of creating a table according to the documentation.

I'm running into this problem because I have an application that has to dynamically generate database tables for storage as part of an ETL pipeline where I'm using to_sql() to insert records into the tables. If for any number of reasons table creation fails during this process and no table is created, then when to_sql() is called the default behavior of Pandas.DataFrame.to_sql() is to create a new table with whatever types the fields of the dataframe are currently in. This table created by Pandas is missing all the indexes, primary/foreign keys, etc that the dynamic table generation process creates.

A fix might be to add a new parameter, if_not_exists, that allows a user to set the behavior of to_sql() if a table is not detected.

WillAyd commented 5 years ago

Have you tried if_exists?

JordanPavlic commented 5 years ago

If_exists only determines the behavior of what to_sql() will do if a table exists. It does not however allow a user to set the behavior when no table exists i.e. if it should create a table or simply fail by raising an exception.

This could be controlled with a boolean that allows a user to specify whether pandas should create a table or not when a table doesn't exist in the database.

if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’ How to behave if the table already exists.

fail: Raise a ValueError. replace: Drop the table before inserting new values. append: Insert new values to the existing table.

zbrookle commented 5 years ago

I think that rather than having options for behaviors when a table doesn't exist, maybe a table just shouldn't be created by default when it doesn't exist and a user tries to insert records. I'd be happy to work on finding and taking out/changing the piece of code that's creating a table when it doesn't exist for inserts.

kjford commented 4 years ago

Possible solution:

# assuming sqlalchemy `Engine` as engine
if engine.has_table('my_table'):
    df.to_sql('my_table', engine, if_exists='append')

I disagree with changing the default behavior of creating a table since calling to_sql is (most?) commonly called with the intent of creating a table.

The current logic handles the various options for if_exists or creates the table if none exists, then inserts data. Currently, the only option that doesn't (attempt to) insert data is fail which raises a ValueError. I interpret this as defining the scope of the to_sql method as insert data into the database or else raise an error.

mhueser commented 2 years ago

There is another related issue that I faced recently, the new table schema is a modification of the old one, for example new columns are introduced. In this case, would it be possible to make the 'replace' implementation use ALTER commands instead of DROP? Or provide such option for the parameter 'if_exists". This avoids cascading delete of dependent views.

mhueser commented 2 years ago

TRUNCATE will fail if the table schema is changed.