Closed opme closed 1 year ago
I am able to reproduce this issue with the latest master branch, i.e.,
pip install git+https://github.com/mkleehammer/pyodbc
Thanks for the report and the great MCVE!
A workaround would be to use this
https://gist.github.com/gordthompson/1fb0f1c3f5edbf6192e596de8350f205
along with
df.to_sql(table, engine, index=False, if_exists="append", method=mssql_insert_json)
I just tested it by tweaking the MCVE and it does not leak memory. Note that this method does not need fast_executemany=True, so that setting is irrelevant. (That is, it is not required; if specified it doesn't help, but it also doesn't hurt.)
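For context, to_sql's method= parameter accepts a callable that pandas invokes with (table, conn, keys, data_iter). The sketch below shows the general shape of the JSON-based approach, not the gist's actual code; the OPENJSON statement and helper names here are illustrative assumptions.

```python
import json

def rows_to_json(keys, data_iter):
    # Serialize the whole batch as one JSON array of objects; a single
    # string parameter replaces the per-row parameter arrays that
    # executemany()/fast_executemany would otherwise bind.
    return json.dumps([dict(zip(keys, row)) for row in data_iter], default=str)

def mssql_insert_json(table, conn, keys, data_iter):
    """Sketch of a pandas to_sql method= callable.

    The SQL below is illustrative; the real gist derives the
    OPENJSON ... WITH (...) column list from the table metadata.
    """
    payload = rows_to_json(keys, data_iter)
    cols = ", ".join(keys)
    sql = (f"INSERT INTO {table.name} ({cols}) "
           f"SELECT {cols} FROM OPENJSON(?) WITH ( /* column defs */ )")
    conn.exec_driver_sql(sql, (payload,))
```

Because the server receives one statement with one parameter per batch, the fast_executemany parameter-array machinery is bypassed entirely, which is why the setting becomes irrelevant.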
Thanks for the reply. I checked out the workaround gist and did a quick performance check: I was able to insert 10k rows in 10 seconds with fast_executemany, while the workaround gist took 15 seconds, about the same as turbodbc. fast_executemany is still the performance king and very desirable to those making many inserts.
Note: It matters how the dataframe is created before the call to to_sql whether there is a leak or not. If the dataframe is created with csv.DictReader() or pd.read_sql() the leak occurs. Creating the dataframe from pd.read_csv() is not leaking.
Could you post an ODBC trace to compare the two?
I compared the ODBC trace logs for the two methods of creating the DataFrame, namely "leaky"
with open(filename, "r") as f:
    batch = f.read()
reader = csv.DictReader(StringIO(batch), delimiter=";", quoting=csv.QUOTE_NONE)
rows = list(reader)
df = pd.DataFrame(rows)
and "not leaky"
df = pd.read_csv(filename)
and they were identical. df.info() reports identical results for the DataFrames created by the two methods.
I also found that if I moved the DataFrame creation out of the loop (and only created it once) then the leaking stopped.
With DataFrame creation inside the loop, if I commented out the .to_sql() call then there was no leak in either case. So it seems to be something about that particular way of creating the DataFrame (using csv.DictReader) interacting with .to_sql() that triggers the leak.
> I also found that if I moved the DataFrame creation out of the loop (and only created it once) then the leaking stopped.
That suggests it might not be an issue inside pyODBC itself; I'm not familiar with the external libraries in use here but if the way pyODBC is being called from them can be reproduced in a script which only uses pyODBC, that would either show its innocence or provide a good repro of the issue.
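A pyodbc-only repro along those lines might look like the sketch below. The DSN, table name, and row shape are assumptions; the point is simply to exercise fast_executemany against an nvarchar(max) column in a loop while watching process memory.

```python
def make_rows(n, width=200):
    # Rows carrying a long string, targeting an nvarchar(max) column.
    return [(i, "x" * width) for i in range(n)]

def run_repro(conn_str="DSN=mssql;UID=user;PWD=pass"):  # assumed DSN
    import pyodbc  # imported here so the helper above stays importable
    cnxn = pyodbc.connect(conn_str)
    crsr = cnxn.cursor()
    crsr.fast_executemany = True
    for _ in range(100):  # watch the process RSS across iterations
        crsr.executemany(
            "INSERT INTO leak_test (id, payload) VALUES (?, ?)",
            make_rows(10_000),
        )
        cnxn.commit()
```

If a script like this leaks on its own, pyodbc is implicated; if not, the interaction with the calling libraries is.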
The Py_INCREF(cell) here looks suspicious, since we are storing encoded, not cell, in pParam:
// DAE
DAEParam *pParam = (DAEParam*)*outbuf;
Py_INCREF(cell);
pParam->cell = encoded.Detach();
This looks like a copy-paste from here, where cell is passed into pParam:
Py_INCREF(cell);
DAEParam *pParam = (DAEParam*)*outbuf;
pParam->cell = cell;
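The effect of such an unmatched Py_INCREF can be demonstrated from Python with ctypes (CPython-specific; note that sys.getrefcount itself adds one temporary reference to its argument):

```python
import ctypes
import sys

obj = object()
before = sys.getrefcount(obj)
# Simulate the stray Py_INCREF(cell): bump the refcount with no matching
# Py_DECREF. Nothing will ever release this reference.
ctypes.pythonapi.Py_IncRef(ctypes.py_object(obj))
after = sys.getrefcount(obj)
assert after == before + 1  # refcount never returns to zero: the object leaks
```

In the code above, cell's refcount is incremented but no corresponding decref ever runs for it on this path, so every row value passed through here is kept alive forever.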
Good find. Have you tried removing it?
Commenting out the Py_INCREF(cell); line, i.e.,
@@ -344,11 +344,11 @@ static int PyToCType(Cursor *cur, unsigned char **outbuf, PyObject *cell, ParamI
len = PyBytes_GET_SIZE(encoded);
if (!pi->ColumnSize)
{
// DAE
DAEParam *pParam = (DAEParam*)*outbuf;
- Py_INCREF(cell);
+ // Py_INCREF(cell);
pParam->cell = encoded.Detach();
pParam->maxlen = cur->cnxn->GetMaxLength(pi->ValueType);
*outbuf += sizeof(DAEParam);
ind = cur->cnxn->need_long_data_len ? SQL_LEN_DATA_AT_EXEC((SQLLEN)len) : SQL_DATA_AT_EXEC;
}
does not completely stop the leak, but it does slow it down considerably.
Before patch:
iteration 0: rss: 81.0 MiB, vms: 275.7 MiB
iteration 1: rss: 84.0 MiB, vms: 278.5 MiB
iteration 2: rss: 85.7 MiB, vms: 280.4 MiB
iteration 3: rss: 87.5 MiB, vms: 282.2 MiB
iteration 4: rss: 89.1 MiB, vms: 283.7 MiB
iteration 5: rss: 90.7 MiB, vms: 285.4 MiB
iteration 6: rss: 91.7 MiB, vms: 286.6 MiB
iteration 7: rss: 93.0 MiB, vms: 287.8 MiB
iteration 8: rss: 94.0 MiB, vms: 288.8 MiB
iteration 9: rss: 95.1 MiB, vms: 289.8 MiB
iteration 10: rss: 96.1 MiB, vms: 291.1 MiB
iteration 11: rss: 97.4 MiB, vms: 292.1 MiB
iteration 12: rss: 98.4 MiB, vms: 293.1 MiB
iteration 13: rss: 99.5 MiB, vms: 294.3 MiB
iteration 14: rss: 100.5 MiB, vms: 295.3 MiB
iteration 15: rss: 101.8 MiB, vms: 296.6 MiB
iteration 16: rss: 102.8 MiB, vms: 297.6 MiB
iteration 17: rss: 103.8 MiB, vms: 298.6 MiB
iteration 18: rss: 104.9 MiB, vms: 299.8 MiB
iteration 19: rss: 106.2 MiB, vms: 300.8 MiB
…
iteration 80: rss: 173.0 MiB, vms: 367.8 MiB
iteration 81: rss: 174.3 MiB, vms: 369.0 MiB
iteration 82: rss: 175.3 MiB, vms: 370.0 MiB
iteration 83: rss: 176.3 MiB, vms: 371.0 MiB
iteration 84: rss: 177.4 MiB, vms: 372.3 MiB
iteration 85: rss: 178.9 MiB, vms: 373.6 MiB
iteration 86: rss: 179.9 MiB, vms: 374.9 MiB
iteration 87: rss: 181.0 MiB, vms: 375.9 MiB
iteration 88: rss: 182.3 MiB, vms: 376.9 MiB
iteration 89: rss: 183.3 MiB, vms: 378.1 MiB
iteration 90: rss: 184.3 MiB, vms: 379.1 MiB
iteration 91: rss: 185.6 MiB, vms: 380.1 MiB
iteration 92: rss: 186.7 MiB, vms: 381.4 MiB
iteration 93: rss: 187.8 MiB, vms: 382.4 MiB
iteration 94: rss: 188.8 MiB, vms: 383.6 MiB
iteration 95: rss: 189.8 MiB, vms: 384.6 MiB
iteration 96: rss: 191.1 MiB, vms: 385.6 MiB
iteration 97: rss: 192.1 MiB, vms: 386.9 MiB
iteration 98: rss: 193.2 MiB, vms: 387.9 MiB
iteration 99: rss: 194.4 MiB, vms: 388.9 MiB
After patch:
iteration 0: rss: 80.9 MiB, vms: 275.7 MiB
iteration 1: rss: 81.8 MiB, vms: 276.5 MiB
iteration 2: rss: 83.2 MiB, vms: 277.9 MiB
iteration 3: rss: 84.7 MiB, vms: 279.4 MiB
iteration 4: rss: 85.1 MiB, vms: 279.9 MiB
iteration 5: rss: 85.1 MiB, vms: 279.9 MiB
iteration 6: rss: 85.3 MiB, vms: 280.1 MiB
iteration 7: rss: 85.2 MiB, vms: 280.1 MiB
iteration 8: rss: 85.2 MiB, vms: 280.1 MiB
iteration 9: rss: 85.2 MiB, vms: 280.1 MiB
iteration 10: rss: 85.2 MiB, vms: 280.1 MiB
iteration 11: rss: 85.2 MiB, vms: 280.1 MiB
iteration 12: rss: 85.2 MiB, vms: 280.1 MiB
iteration 13: rss: 85.2 MiB, vms: 280.1 MiB
iteration 14: rss: 85.5 MiB, vms: 280.1 MiB
iteration 15: rss: 85.4 MiB, vms: 280.3 MiB
iteration 16: rss: 85.4 MiB, vms: 280.5 MiB
iteration 17: rss: 85.4 MiB, vms: 280.5 MiB
iteration 18: rss: 85.4 MiB, vms: 280.5 MiB
iteration 19: rss: 85.7 MiB, vms: 280.5 MiB
…
iteration 80: rss: 86.3 MiB, vms: 281.2 MiB
iteration 81: rss: 86.3 MiB, vms: 281.2 MiB
iteration 82: rss: 86.3 MiB, vms: 281.2 MiB
iteration 83: rss: 86.3 MiB, vms: 281.2 MiB
iteration 84: rss: 86.3 MiB, vms: 281.2 MiB
iteration 85: rss: 86.3 MiB, vms: 281.2 MiB
iteration 86: rss: 86.3 MiB, vms: 281.2 MiB
iteration 87: rss: 86.3 MiB, vms: 281.2 MiB
iteration 88: rss: 86.3 MiB, vms: 281.2 MiB
iteration 89: rss: 86.3 MiB, vms: 281.2 MiB
iteration 90: rss: 86.3 MiB, vms: 281.2 MiB
iteration 91: rss: 86.3 MiB, vms: 281.2 MiB
iteration 92: rss: 86.3 MiB, vms: 281.4 MiB
iteration 93: rss: 86.3 MiB, vms: 281.4 MiB
iteration 94: rss: 86.6 MiB, vms: 281.4 MiB
iteration 95: rss: 86.6 MiB, vms: 281.4 MiB
iteration 96: rss: 86.6 MiB, vms: 281.4 MiB
iteration 97: rss: 86.6 MiB, vms: 281.4 MiB
iteration 98: rss: 86.6 MiB, vms: 281.4 MiB
iteration 99: rss: 86.6 MiB, vms: 281.4 MiB
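The per-iteration rss/vms figures above can be collected with a small helper. The exact measurement script was not posted in the thread; this sketch assumes the third-party psutil package and a caller-supplied workload callable.

```python
import os

def fmt_mem(iteration, rss_bytes, vms_bytes):
    # Match the "iteration N: rss: X MiB, vms: Y MiB" lines above.
    mib = 1024 * 1024
    return (f"iteration {iteration}: rss: {rss_bytes / mib:.1f} MiB, "
            f"vms: {vms_bytes / mib:.1f} MiB")

def report_loop(workload, iterations=100):
    import psutil  # assumed third-party dependency
    proc = psutil.Process(os.getpid())
    for i in range(iterations):
        workload()  # e.g. build the DataFrame and call df.to_sql(...)
        info = proc.memory_info()
        print(fmt_mem(i, info.rss, info.vms))
```

A steadily climbing rss across iterations, as in the "Before patch" log, is the signature of the leak.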
Environment
Issue
When using nvarchar(max) with fast_executemany, a memory leak is observed. There was a similar leak in the past (854 and pull 832) that was fixed, but this looks to be a different issue.
Turning off fast_executemany, or using turbodbc, does not exhibit the issue. I also tried other column types, and no leak was observed.
Note: It matters how the dataframe is created before the call to to_sql whether there is a leak or not. If the dataframe is created with csv.DictReader() or pd.read_sql() the leak occurs. Creating the dataframe from pd.read_csv() is not leaking.
Code to reproduce.
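The original reproduction script is not included here; based on the details in the thread, it had roughly the shape sketched below. The file name, connection URL, and table name are assumptions.

```python
import csv
from io import StringIO

def rows_from_dictreader(text):
    # The "leaky" creation path: csv.DictReader -> list of dicts.
    reader = csv.DictReader(StringIO(text), delimiter=";",
                            quoting=csv.QUOTE_NONE)
    return [row for row in reader]

def run_repro(filename="data.csv"):  # assumed input file
    import pandas as pd
    import sqlalchemy as sa
    engine = sa.create_engine(
        "mssql+pyodbc://user:pass@mydsn",  # assumed connection URL
        fast_executemany=True,
    )
    with open(filename) as f:
        text = f.read()
    for _ in range(100):
        # DataFrame created inside the loop from DictReader rows -> leaks
        df = pd.DataFrame(rows_from_dictreader(text))
        df.to_sql("leak_test", engine, index=False, if_exists="append")
```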
Table creation ddl (code is creating this)
Results of execution: