Closed: nipunraibansal closed this issue 3 years ago.
This is the (more or less) standard recipe for what you have described:
import pandas as pd
import sqlalchemy as sa

df = pd.read_excel(r"path\to\abc.xlsx")

connection_string = (
    'Driver=ODBC Driver 17 for SQL Server;'
    'Server=Server Name;'
    'Database=Database Name;'
    'UID=User ID;'
    'PWD=Password;'
    'Trusted_Connection=no;'
)
connection_url = sa.engine.URL.create(
    "mssql+pyodbc",
    query=dict(odbc_connect=connection_string)
)
engine = sa.create_engine(connection_url, fast_executemany=True)

# Delete the existing data in the SQL table
with engine.begin() as conn:
    conn.exec_driver_sql("DELETE FROM SchemaName.TableName")

# Upload the DataFrame
df.to_sql("TableName", engine, schema="SchemaName", if_exists="append", index=False)
FWIW, I did some testing of my own on a few methods of inserting into SQL Server. I was actually able to get the fastest results by using SQL Server batches and pyodbc Cursor.execute statements. I did not test pandas DataFrame.to_sql() / SQLAlchemy; I wonder how it compares. Will sqlalchemy send batches or is it executing 1 by 1?
Here's my blog on the testing I did: http://jonmorisissqlblog.blogspot.com/2021/05/python-pyodbc-and-batch-inserts-to-sql.html
Will sqlalchemy send batches or is it executing 1 by 1?

With fast_executemany=True, pyodbc will prepare the SQL statement (sp_prepare) and then use a mechanism called a "parameter array" to assemble the parameter values for multiple rows into a memory structure and send them all at once. SQL Profiler shows the activity as an sp_prepare followed by an sp_exec for each row, but that does not necessarily mean that the values are sent over the network separately for each row.
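For illustration, this is roughly what that mechanism looks like at the raw pyodbc level. It is only a minimal sketch: connection_string is the plain ODBC string from the recipe above, and SchemaName.TableName with columns [A], [B], [C] are placeholders.

import pyodbc

conn = pyodbc.connect(connection_string)  # plain ODBC connection string, as defined above
cursor = conn.cursor()
cursor.fast_executemany = True  # False: one round trip per row; True: sp_prepare once, values packed into parameter arrays

# one parameterized INSERT, executed for the whole batch of rows in a single call
cursor.executemany(
    "INSERT INTO SchemaName.TableName ([A], [B], [C]) VALUES (?, ?, ?)",
    df[["A", "B", "C"]].values.tolist(),
)
conn.commit()
cursor.close()
conn.close()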
SQL Server profiler was showing fast_executemany running prepared statement inserts one by one in my testing.
Yes, because that's how prepared statements behave at such a low level. At some point each row has to be inserted. The difference between fast_executemany=True and fast_executemany=False is that fast_executemany=False will result in individual sp_prepexec calls, resending the SQL command text each time, while fast_executemany=True will send the SQL command text once (to be prepared with sp_prepare), send the parameter values in a big binary "blob", and let the SQL Server instance process each row as an sp_exec call.

In case you're interested, I did a quick comparison of to_sql() vs. bcp for a DataFrame with 5 columns x 1_000_000 rows:
from time import perf_counter
import subprocess

t0 = perf_counter()
df.to_sql(table_name, remote_engine, index=False, if_exists="append")
print(f"to_sql(): {(perf_counter() - t0):.0f} seconds")
# to_sql(): 252 seconds

t0 = perf_counter()
csv_file_name = r"C:\__tmp\df_dump.csv"
df.to_csv(csv_file_name, sep="\t", index=False)
uid = "sa"
pwd = "_whatever_"
bcp_command = (
    f"bcp dbo.destination IN {csv_file_name} "
    f"-S 192.168.0.199 -U {uid} -P {pwd} -d test -F 2 -c"
)
subprocess.call(bcp_command, stdout=subprocess.DEVNULL)
print(f"bcp: {(perf_counter() - t0):.0f} seconds")
# bcp: 65 seconds
Note that this test was performed on a remote (LAN-connected) database. Given the relatively small performance boost from fast_executemany=True in your blog post, I suspect that you were testing against a local instance of SQL Server (so no network lag issues).
pyodbc's cursor provides a method, executemany(), which executes the same parameterized query for all rows at once.
First, create a list of values from the DataFrame:
records_to_insert = df.values.tolist()
Check that the database connection is established before running the insert query.
Then create the insert query function as follows (e.g., if your table has two columns, Name and Age):
def insertion_query(dict_db):
    insert_query = """
        INSERT INTO {0}
        (
            [Name], [Age]
        )
        VALUES
        (?, ?)
    """.format(str(dict_db["DB_PythonTransactions_TableName"]))
    return insert_query
After that, cursor.executemany() can be used to insert all the data into the database:
cursor.executemany(insert_query, records_to_insert)
conn.commit()
For reference: YT video: DataFrame to DataBase Using Python
In addition to the "standard" method in the comment above (normal to_sql() with fast_executemany=True), users of SQL Server 2016+ can also use the custom to_sql() "method" here:
https://gist.github.com/gordthompson/1fb0f1c3f5edbf6192e596de8350f205
e.g.,
df.to_sql(
    "table_name",
    engine,
    index=False,
    if_exists="append",
    method=mssql_insert_json,
)
which does not use fast_executemany=True and in some cases can provide even faster performance.
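If you're curious what such a method= callable involves, here is a rough, hypothetical sketch of the idea only; the linked gist is the real implementation and handles column types, NULLs, and batching more carefully. The (table, conn, keys, data_iter) signature is the one pandas documents for custom insertion methods.

import json

def mssql_insert_json_sketch(table, conn, keys, data_iter):
    # Hypothetical sketch -- see the gist above for the real implementation.
    # Serialize all rows into one JSON array and let SQL Server 2016+ shred it
    # back into rows with OPENJSON, so only a single statement is executed.
    target = f"[{table.schema}].[{table.name}]" if table.schema else f"[{table.name}]"
    json_payload = json.dumps([dict(zip(keys, row)) for row in data_iter], default=str)
    col_list = ", ".join(f"[{k}]" for k in keys)
    # column types are simplified to nvarchar(max) here; the gist maps real types
    with_clause = ", ".join(f"[{k}] nvarchar(max) '$.\"{k}\"'" for k in keys)
    sql = (
        f"INSERT INTO {target} ({col_list}) "
        f"SELECT {col_list} FROM OPENJSON(?) WITH ({with_clause})"
    )
    conn.exec_driver_sql(sql, (json_payload,))

Because the whole batch travels as a single JSON parameter, the number of rows no longer drives the number of driver round trips, which is where the speedup over row-wise parameter binding can come from.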
import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame({
    'column1': [1, 2, 3],
    'column2': ['A', 'B', 'C']
})

server = 'your_server'
database = 'your_database'
username = 'your_username'
password = 'your_password'
connection_string = f'mssql+pyodbc://{username}:{password}@{server}/{database}?driver=ODBC+Driver+17+for+SQL+Server'
engine = create_engine(connection_string)
df.to_sql('table_name', con=engine, if_exists='replace', index=False)
The if_exists parameter can be set to:
'fail': Raise a ValueError if the table already exists.
'replace': Drop the table before inserting new values.
'append': Insert new values into the existing table.
??? What's the point of replying to a closed 1.5y-old issue?
not your business
Issue
I have been trying to insert data from a DataFrame in Python into a table already created in SQL Server. The DataFrame has 90K rows and I wanted the best possible way to quickly insert the data into the table. I only have read, write, and delete permissions on the server and I cannot create any tables on the server.
Below is the code which inserts the data, but it is very slow. Please advise.
Code used is as below:
import pandas as pd
import xlsxwriter
import pyodbc

df = pd.read_excel(r"Url path\abc.xlsx")
conn = pyodbc.connect('Driver={ODBC Driver 11 for SQL Server};'
                      'SERVER=Server Name;'
                      'Database=Database Name;'
                      'UID=User ID;'
                      'PWD=Password;'
                      'Trusted_Connection=no;')
cursor = conn.cursor()
Deleting existing data in the SQL table:
cursor.execute("DELETE FROM datbase.schema.TableName") conn.commit()
Inserting data into the SQL table:
for index, row in df.iterrows():
    cursor.execute("INSERT INTO TableName ([A],[B],[C]) values (?,?,?)",
                   row['A'], row['B'], row['C'])
    conn.commit()
cursor.close()
conn.close()