mkleehammer / pyodbc

Python ODBC bridge
https://github.com/mkleehammer/pyodbc/wiki
MIT No Attribution

Inserting Bulk Data of a dataframe into SQL Server Quickly #812

Closed nipunraibansal closed 3 years ago

nipunraibansal commented 4 years ago

Issue

I have been trying to insert data from a DataFrame in Python into a table that already exists in SQL Server. The DataFrame has 90K rows and I am looking for the fastest way to insert the data into the table. I only have read, write, and delete permissions on the server, and I cannot create any tables.

Below is the code I am using to insert the data, but it is very slow. Please advise.

Code Used as Below

import pandas as pd
import xlsxwriter
import pyodbc

df = pd.read_excel(r"Url path\abc.xlsx")
conn = pyodbc.connect(
    'Driver={ODBC Driver 11 for SQL Server};'
    'SERVER=Server Name;'
    'Database=Database Name;'
    'UID=User ID;'
    'PWD=Password;'
    'Trusted_Connection=no;'
)
cursor = conn.cursor()

Deleting existing data in SQL Table:-

cursor.execute("DELETE FROM datbase.schema.TableName") conn.commit()

Inserting data in SQL Table:-

# row-by-row insert, one execute per row
for index, row in df.iterrows():
    cursor.execute(
        "INSERT INTO TableName ([A],[B],[C]) VALUES (?,?,?)",
        row['A'], row['B'], row['C']
    )
conn.commit()
cursor.close()
conn.close()

gordthompson commented 4 years ago

This is the (more or less) standard recipe for what you have described:

import pandas as pd
import sqlalchemy as sa

df = pd.read_excel(r"path\to\abc.xlsx")

connection_string = (
    'Driver=ODBC Driver 17 for SQL Server;'
    'Server=Server Name;'
    'Database=Database Name;'
    'UID=User ID;'
    'PWD=Password;'
    'Trusted_Connection=no;'
)
connection_url = sa.engine.URL.create(
    "mssql+pyodbc", 
    query=dict(odbc_connect=connection_string)
)
engine = sa.create_engine(connection_url, fast_executemany=True)

# Deleting existing data in SQL Table:-
with engine.begin() as conn:
    conn.exec_driver_sql("DELETE FROM SchemaName.TableName")

# upload the DataFrame
df.to_sql("TableName", engine, schema="SchemaName", if_exists="append", index=False)
tesseract1919 commented 3 years ago

FWIW, I did some testing of my own on a few methods of inserting into SQL Server. I actually got the fastest results by batching the inserts into SQL Server batches and running them with pyodbc Cursor.execute() statements. I did not test pandas DataFrame.to_sql() / SQLAlchemy; I wonder how it compares. Will SQLAlchemy send batches, or does it execute row by row?

Here's my blog on the testing I did: http://jonmorisissqlblog.blogspot.com/2021/05/python-pyodbc-and-batch-inserts-to-sql.html
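For illustration, a rough sketch of what batching inserts with plain Cursor.execute() can look like; the table dbo.TableName and columns A, B, C are just placeholders, and conn, cursor, and df are assumed to already exist:

rows = list(df[["A", "B", "C"]].itertuples(index=False, name=None))
batch_size = 500  # keep each statement under SQL Server's 1000-row and 2100-parameter limits
for start in range(0, len(rows), batch_size):
    batch = rows[start:start + batch_size]
    placeholders = ", ".join("(?, ?, ?)" for _ in batch)
    params = [value for row in batch for value in row]
    # one multi-row INSERT per batch instead of one INSERT per row
    cursor.execute(
        f"INSERT INTO dbo.TableName ([A], [B], [C]) VALUES {placeholders}",
        params,
    )
conn.commit()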

gordthompson commented 3 years ago

Will sqlalchemy send batches or is it executing 1 by 1?

With fast_executemany=True pyodbc will prepare the SQL statement (sp_prepare) and then use a mechanism called a "parameter array" to assemble the parameter values for multiple rows into a memory structure and send them all at once. SQL Profiler shows the activity as an sp_prepare followed by an sp_exec for each row, but that does not necessarily mean that the values are sent over the network separately for each row.
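For anyone who wants to see the same mechanism without pandas/SQLAlchemy, a minimal sketch with plain pyodbc might look like this (assuming conn is an open pyodbc connection and the placeholder table SchemaName.TableName with columns A, B, C exists):

cursor = conn.cursor()
cursor.fast_executemany = True  # build a parameter array instead of one round trip per row
params = list(df[["A", "B", "C"]].itertuples(index=False, name=None))
cursor.executemany(
    "INSERT INTO SchemaName.TableName ([A], [B], [C]) VALUES (?, ?, ?)",
    params,
)
conn.commit()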

tesseract1919 commented 3 years ago

Will sqlalchemy send batches or is it executing 1 by 1?

With fast_executemany=True pyodbc will prepare the SQL statement (sp_prepare) and then use a mechanism called a "parameter array" to assemble the parameter values for multiple rows into a memory structure and send them all at once. SQL Profiler shows the activity as an sp_prepare followed by an sp_exec for each row, but that does not necessarily mean that the values are sent over the network separately for each row.

SQL Server profiler was showing fast_executemany running prepared statement inserts one by one in my testing.

gordthompson commented 3 years ago

SQL Server profiler was showing fast_executemany running prepared statement inserts one by one in my testing.

Yes, because that's how prepared statements behave at such a low level. At some point each row has to be inserted. The difference between fast_executemany=True and fast_executemany=False is that with True the parameter values for all of the rows are packed into a parameter array and sent to the server together, instead of one round trip per row.

In case you're interested, I did a quick comparison of to_sql() vs. bcp for a DataFrame with 5 columns x 1_000_000 rows:

to_sql() with fast_executemany=True

from time import perf_counter

t0 = perf_counter()
df.to_sql(table_name, remote_engine, index=False, if_exists="append")
print(f"to_sql(): {(perf_counter() - t0):.0f} seconds")
# to_sql(): 252 seconds

bcp

import subprocess

t0 = perf_counter()
csv_file_name = r"C:\__tmp\df_dump.csv"
df.to_csv(csv_file_name, sep="\t", index=False)

uid = "sa"
pwd = "_whatever_"
bcp_command = (
    f"bcp dbo.destination IN {csv_file_name} "
    f"-S 192.168.0.199 -U {uid} -P {pwd} -d test -F 2 -c"
)
subprocess.call(bcp_command, stdout=subprocess.DEVNULL)
print(f"bcp: {(perf_counter() - t0):.0f} seconds")
# bcp: 65 seconds

Note that this test was performed on a remote (LAN-connected) database. Given the relatively small performance boost from fast_executemany=True in your blog post I suspect that you were testing against a local instance of SQL Server (so no network lag issues).

manjeetkaur2195 commented 2 years ago

pyodbc's cursor provides a method, executemany(), that runs the same DB query once for every row of parameters.

First, create a list of values from the DataFrame:

records_to_insert = df.values.tolist()

Make sure the DB connection has been established successfully before running the insert query.

Then create the insert-query function as follows (e.g., for a table with two columns, Name and Age):

def insertion_query(dict_db):
    insert_query = """
    INSERT INTO {0} 
    (
    [Name],[Age]
    ) 
    VALUES
    (?,?)
    """.format(str(dict_db["DB_PythonTransactions_TableName"]))
    return insert_query

After that, the cursor's executemany() can be used to insert all of the rows into the database:

cursor.executemany(insert_query, records_to_insert)
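Putting those pieces together, a minimal end-to-end sketch might look like this (assuming conn is an already-open pyodbc connection and dict_db holds the table name as above):

records_to_insert = df.values.tolist()
insert_query = insertion_query(dict_db)

cursor = conn.cursor()
cursor.executemany(insert_query, records_to_insert)
conn.commit()
cursor.close()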

For reference: YouTube video "DataFrame to DataBase Using Python"

gordthompson commented 2 years ago

In addition to the "standard" method in the comment above (normal to_sql() with fast_executemany=True), users of SQL Server 2016+ can also use the custom to_sql() "method" here

https://gist.github.com/gordthompson/1fb0f1c3f5edbf6192e596de8350f205

e.g.,

    df.to_sql(
        "table_name",
        engine,
        index=False,
        if_exists="append",
        method=mssql_insert_json,
    )

which does not use fast_executemany=True and in some cases can provide even faster performance.
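For the curious, the general idea behind such a JSON-based insert method is roughly the following; this is only a sketch (every column is passed as NVARCHAR and converted implicitly by SQL Server), not the actual code from the gist:

import json
import sqlalchemy as sa

def mssql_insert_json_sketch(pd_table, conn, keys, data_iter):
    # pandas calls a custom to_sql() "method" with the target table, an open
    # SQLAlchemy connection, the column names, and an iterator of row tuples
    data = [dict(zip(keys, row)) for row in data_iter]
    json_doc = json.dumps(data, default=str)
    columns = ", ".join(f"[{k}]" for k in keys)
    # treat every column as NVARCHAR(MAX); a real implementation would map the
    # actual column types in the WITH clause
    with_clause = ", ".join(f"""[{k}] NVARCHAR(MAX) '$."{k}"'""" for k in keys)
    sql = (
        f"INSERT INTO {pd_table.table.fullname} ({columns}) "
        f"SELECT {columns} FROM OPENJSON(:json_doc) WITH ({with_clause})"
    )
    conn.execute(sa.text(sql), {"json_doc": json_doc})

The whole DataFrame (or each to_sql() chunk) is sent to the server as a single JSON document, so there is one round trip per chunk rather than one per row.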

valiahmad commented 5 months ago

import pandas as pd
from sqlalchemy import create_engine

# Define the DataFrame
df = pd.DataFrame({
    'column1': [1, 2, 3],
    'column2': ['A', 'B', 'C']
})

# Define the connection string
server = 'your_server'
database = 'your_database'
username = 'your_username'
password = 'your_password'
connection_string = (
    f'mssql+pyodbc://{username}:{password}@{server}/{database}'
    '?driver=ODBC+Driver+17+for+SQL+Server'
)

# Create the SQLAlchemy engine
engine = create_engine(connection_string)

# Write DataFrame to SQL Server
df.to_sql('table_name', con=engine, if_exists='replace', index=False)

valiahmad commented 5 months ago

The if_exists parameter can be set to:

'fail': Raise a ValueError if the table already exists.
'replace': Drop the table before inserting new values.
'append': Insert new values into the existing table.
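For example, to add the DataFrame's rows to an existing table rather than replacing it (using the same engine as above):

df.to_sql('table_name', con=engine, if_exists='append', index=False)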

v-chojas commented 5 months ago

??? What's the point of replying to a closed 1.5y-old issue?

valiahmad commented 5 months ago

??? What's the point of replying to a closed 1.5y-old issue?

not your business