kevn81 / pyodbc

Automatically exported from code.google.com/p/pyodbc
MIT No Attribution
0 stars 0 forks source link

slow handling of executemany #250

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Please refer to this stackoverflow link (I am the asker):

http://stackoverflow.com/a/9932361

Apparently ceODBC doesn't have this issue, though I haven't tested it yet.

Original issue reported on code.google.com by sok...@gmail.com on 29 Mar 2012 at 11:32

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Tried both ceODBC and mxODBC and both were also painfully slow.  Ended up going 
with an adodb connection with help from http://www.ecp.cc/pyado.html

        comConn = win32com.client.Dispatch(r'ADODB.Connection')
    DSN = 'PROVIDER=Microsoft.Jet.OLEDB.4.0;DATA SOURCE=%s%s' %(dbDIR,dbOut)
    comConn.Open(DSN)

    rs = win32com.client.Dispatch(r'ADODB.Recordset')
    rs.Open('[' + tblName +']', comConn, 1, 3)

    for f in values:
        rs.AddNew(LVI_LST, f)

Original comment by jaycgree...@gmail.com on 5 Apr 2012 at 9:02

GoogleCodeExporter commented 9 years ago
The behavior I encountered last time was as described by ghoerz in the 
aforementioned stackoverflow link, i.e.

    for each set of parameters
        prepare and execute the insert statement
        unprepare the statement
        commit

However, now that I try again, I couldn't reproduce that behavior anymore with 
any version of pyodbc from 2.1.8 to 3.0.6. Instead, it is now:

    for each set of parameters
        prepare and execute the insert statement
    unprepare the statement
    commit

This gives good enough performance.

Does anyone have the environment that can reproduce the slow behavior described 
by ghoerz? If not, I think we can close the issue. I am sorry that I wasn't 
diligent enough to note it down when reporting the issue.

Original comment by sok...@gmail.com on 31 Dec 2012 at 6:07

GoogleCodeExporter commented 9 years ago
Verified resolved with pyodbc-2.1.9.win32-py2.7

Original comment by bryan.ea...@gmail.com on 2 Jan 2013 at 8:44

Attachments:

GoogleCodeExporter commented 9 years ago
I'm still finding a very large difference in the performance of the 
'executemany' command between the pyodbc module and the ceODBC module.  In a 
nutshell, ceODBC is 35 times faster doing bulk inserts over pyodbc.

In my case, I'm inserting 250,000 records into a SQL Server 2008 R2 database 
(on its own server), using Python 2.7 on a Windows 7 (64-bit) PC, using the 
'{SQL Server Native Client 10.0}' driver.  I'm batching the records up into 
sets of 10,000 records before running 'executemany'.  This takes 12.5 minutes 
with pyodbc, but just 21 seconds with ceODBC.  A huge difference.

For reference, with this test, pyodbc is about 30% slower than the pymssql 
module.

Can you think of a reason why the pyodbc module should be so slow?  Is there 
something I should be doing in my Python code?

Original comment by erski...@dnc.org on 5 Aug 2013 at 1:36

GoogleCodeExporter commented 9 years ago
I don't think this issue is resolved. 

I'm using the most recent pyodbc-3.0.7-py2.7 and have found the same 
phenomenon. The same python code for pyodbc takes minutes, but for ceODBC takes 
only seconds. Version is ceODBC-2.0.1-py2.7. Critical code is

    conn = ceODBC.connect(conninfo, autocommit = True)
    cur = conn.cursor()
    cur.executemany(UPSERT_SQL, DATA)
    conn.close()

The executemany implementation of ceODBC must be many factors more efficient. 
If someone tells me how to do the profiling, I'm happy to help. The database I 
use is SAP HANA and for convenience, I use a 32-bit ODBC driver.

--Richard

Original comment by rlindne...@gmail.com on 29 Oct 2013 at 12:33

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
I'm also having trouble with pyODBC performance.

I tested  1000 small rows into Netezza.

ceODBC executemany: 0.3 seconds
pyodbc executemany: 180-195 seconds
pyodbc execute: 78-130 seconds

Python 2.7 32-bit, pyodbc 3.0.7, ceodbc 2.0.1, Windows 7 64-bit, Netezza 
Release 7.0 (P-1) [Build 26407]

Original comment by ahz...@gmail.com on 18 Sep 2014 at 4:19

GoogleCodeExporter commented 9 years ago
If need to update 300.000 columns it never finishes. This bug really needs to 
be fixed.

Original comment by Atoc....@gmail.com on 30 Nov 2014 at 5:07

GoogleCodeExporter commented 9 years ago
AFAICS, Cursor_executemany is binding a single parameter and calling execute 
(which calls SQLExecute).

In ceODBC, it binds all parameters and then calls SQLExecute.

Am I reading this correctly?

Original comment by javier.r...@gmail.com on 15 Jan 2015 at 1:46

GoogleCodeExporter commented 9 years ago
I tested by running an insert statement for multiple rows of data using 
executemany.  In the SQL Server Profiler of the run, it showed every row being 
inserted using the sp_prepexec stored procedure.  This is causing implicit 
unprepares and explicit prepares of the query over and over again.  What I 
believe it should do is call sp_prepexec only for the first row (which prepares 
the query and executes the first row), and then call sp_execute for all 
subsequent rows in the batch (which will use the existing prepared query).

At the end of the batch, executemany does call an explicit sp_unprepare, which 
is fine, but I believe these implicit unprepares and explicit prepares are what 
is killing performance.  (You can see these in Profiler if you turn on the 
"Prepare SQL" and "Unprepare SQL" events in the "TSQL" events heading.)

Original comment by rich...@gmail.com on 10 Feb 2015 at 6:40

Attachments:

GoogleCodeExporter commented 9 years ago
It looks to me like one major problem is the function PrepareAndBind() (and how 
it gets used), which should be two separate functions in order to be able to 
prepare once and bind many times. This might also address the comment that says 
"REVIEW: Why don't we always prepare?" which is in a code section that would 
seem to have a fair amount of room for consolidation and simplification. so to 
fix:
- separate PrepareAndBind into 2 func
- consolidate the code that calls prepare() and bind() functions, and don't 
have separate cases for withparams or without params until you get to the bind 
call
- only call prepare once
- maybe consolidate the code for executemany() vs execute once etc

or maybe just ditch pyodbc and get ceodbc instead...

Original comment by mewa...@gmail.com on 27 Mar 2015 at 8:06

GoogleCodeExporter commented 9 years ago
Looks like there is also a malloc and free call for every Bind, where it may be 
better (and faster and cleaner) to have auto mem in the func that calling the 
initial prep, re-using that mem for each bind (no explicit free required at 
end).

Original comment by mewa...@gmail.com on 27 Mar 2015 at 8:37

GoogleCodeExporter commented 9 years ago
The problem is that pyodbc does not use "parameterized arrays" (whereas ceodbc 
does).  For a description of parameterized arrays, see the section "Using 
Arrays of Parameters" on this webpage:
https://www.progress.com/products/datadirect-connect/odbc-drivers/odbc-developer
-center/odbc-tutorials/designing-performance-optimized-odbc-applications/select-
functions-that-optimize-performance

So, for example, if you call pyodbc's executemany() function with an INSERT 
statement and a list of 10,000 records (each record consisting of another list 
of data fields), pyodbc will make 10,000 calls to the database where each call 
includes just one record.  This is very slow.  On the other hand, ceodbc will 
make just one call to the database with all 10,000 records in the payload.  
This is much, much quicker.

Having said that, not all databases accept parameterized arrays.  Microsoft SQL 
Server does:
https://msdn.microsoft.com/en-us/library/ms713841(v=vs.85).aspx
I don't know about other databases.

I suspect that the demand for parameterized arrays will increase now that 
sqlalchemy has added bulk INSERT operations to version 1.0:
http://docs.sqlalchemy.org/en/latest/orm/persistence_techniques.html#bulk-operat
ions
I'm sure there will soon be a lot of people out there using sqlalchemy 
wondering why the bulk operations are no quicker than non-bulk operations 
(especially with SQL Server).

Hopefully, Michael Kleehammer (or somebody else) will have some time to make 
this improvement.  It's really needed.

Original comment by toastie...@gmail.com on 28 Mar 2015 at 3:45