ddavis226 closed this issue 2 years ago.
If I pass the zip generator object to cursor.executemany rather than the values list, then the memory issue goes away, but it runs slowly, as if it's no longer in fast_executemany mode, which is what I suspect is happening.
Correct. See #500
If I turn off fast_executemany, then I don't have any memory issues, but I do have speed issues, so I definitely need to make it work with fast_executemany.
As a workaround you might want to experiment with the JSON trick described here.
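The JSON workaround referenced above generally works along these lines (a sketch only; the table and column names are illustrative, and the final execute call is commented out because it needs a live pyodbc connection):

```python
import json
from decimal import Decimal

# Illustrative rows; Decimal values are serialized as strings to preserve precision.
rows = [
    {"id": 1, "amount": Decimal("3.14")},
    {"id": 2, "amount": Decimal("2.72")},
]
payload = json.dumps(rows, default=str)  # the whole batch becomes ONE string parameter

# Server-side shredding with OPENJSON (SQL Server 2016+ / Azure SQL DB);
# because only a single parameter is bound, fast_executemany is not involved at all.
sql = (
    "INSERT INTO some_table (id, amount) "
    "SELECT id, amount FROM OPENJSON(?) "
    "WITH (id int '$.id', amount decimal(18, 4) '$.amount')"
)
# crsr.execute(sql, payload)  # would run against an open pyodbc cursor
```

The trade-off is that the server parses the JSON, but the client-side per-row parameter binding that triggers the issue is avoided entirely.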
@ddavis226 : are you loading to a heap, clustered index or columnstore? Is there already data in the table or is the table empty?
Is the clustered index unique? Why do you use batches? How do you measure memory in Azure SQL DB? In Azure SQL DB one does not have access to the OS. SQL Server uses all the memory you give it: if the server is idle and has just restarted, it does not consume much memory, but once you start to use the instance it will consume memory and not release it. SQL Server also uses memory for many things, e.g. to cache data and for workspace memory (e.g. for sorts).
What problems are you facing other than not liking the memory number? My guess is that you think you have a problem, but actually everything is fine.
Also why do you use autocommit=False?
Thank you for your comments. The memory that I am measuring is on the VM that is doing the data loading itself. The data is streamed in batches from a Parquet file, and each batch of records is then sent to Azure SQL via a pyodbc connection. I turn off autocommit because I do not want each row committed individually; I commit at the end of each batch of 20k rows.
The target table is a heap that is empty but of course since I am doing it in batches, it is only empty the first time.
I use batches because this is a very large file and there is no reason to load the entire file into memory on the server. Batching (chunking the workload) is a common method for dealing with very large files, for scalability reasons.
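The batching pattern described above can be sketched with a plain generator (pure Python; in the real pipeline, the Parquet reader's batch iterator would play the role of the source):

```python
from itertools import islice

def chunked(iterable, size):
    """Yield lists of at most `size` items, so only one batch is in memory at a time."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Each batch would be passed to cursor.executemany and then committed.
batches = list(chunked(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```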
I do believe I have a real problem but I do not think I have a problem with SQL Server.
I hope this clears up any confusion. Of course when I explained it, it was clear to me what I was talking about ;-)
Thank you. I will take a look at the workaround.
I would like to know if the fast_executemany path is holding onto Decimal objects in some way, though. If so, then I would think this would be an issue others are running into as well. Also, maybe this is something that was introduced in a recent version, and I could drop back a few versions?
As you'll notice in my writeup, I can load a LOT of string and int data with no issues at all. Only when I introduce decimals do I see memory not getting released.
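Given the observation that plain floats do not trigger the growth, one possible interim workaround (a sketch, with an obvious precision caveat: float cannot represent decimal(28, 9) values exactly) is to convert Decimal cells before handing rows to executemany:

```python
from decimal import Decimal

def decimals_to_float(row):
    """Replace Decimal cells with floats before executemany.
    Caution: floats cannot represent decimal(28, 9) values exactly."""
    return tuple(float(v) if isinstance(v, Decimal) else v for v in row)

# Hypothetical two-column rows (decimal, string) for illustration.
data = [(Decimal("3.14"), "abc"), (Decimal("2.50"), "def")]
converted = [decimals_to_float(r) for r in data]
print(converted)  # [(3.14, 'abc'), (2.5, 'def')]
```

Whether the precision loss is acceptable depends entirely on the data, so this is only a stopgap, not a fix.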
@ddavis226: I see everything from my perspective, which is SQL Server, so I thought you were talking about memory problems on the SQL Server side. Loading in batches can also improve writing to the transaction log, and with columnstore it is also important to load in batches (but with more than 20k rows per batch).
Repro code for this issue:
from decimal import Decimal
import psutil as psutil
import pyodbc
connection_string = (
"DRIVER=ODBC Driver 17 for SQL Server;"
"SERVER=192.168.0.199;"
"UID=scott;PWD=tiger^5HHH;"
"DATABASE=test;"
)
cnxn = pyodbc.connect(connection_string)
crsr = cnxn.cursor()
crsr.fast_executemany = True
table_name = "gh1026"
num_columns = 127
crsr.execute(f"DROP TABLE IF EXISTS {table_name}")
column_names = [f"col{i:03}" for i in range(num_columns)]
column_type = "decimal(18, 4)"
column_definitions = ", ".join(f"{x} {column_type}" for x in column_names)
sql = (
f"CREATE TABLE {table_name} (id int identity primary key, "
f"{column_definitions}"
")"
)
crsr.execute(sql)
sql = f"INSERT INTO {table_name} ({', '.join(column_names)}) VALUES ({', '.join('?' * num_columns)})"
value = Decimal("3.14")
batch_size = 2000
data = [tuple([value for c in range(num_columns)]) for r in range(batch_size)]
process = psutil.Process()
def printmem(msg):
    mb = process.memory_info().vms / 1048576
    print(f"{msg}: {mb:0.1f} MiB")
printmem("before loop")
for loop in range(3):
    crsr.executemany(sql, data)
    cnxn.commit()
    printmem(f"after {loop + 1} iteration{'s' if loop else ''}")
console output:
before loop: 12.5 MiB
after 1 iteration: 31.6 MiB
after 2 iterations: 50.6 MiB
after 3 iterations: 70.7 MiB
However, with value = 3.14 instead of value = Decimal("3.14"), we see:
before loop: 12.5 MiB
after 1 iteration: 12.5 MiB
after 2 iterations: 12.5 MiB
after 3 iterations: 12.5 MiB
Additional information: The above code was run using pyodbc 4.0.32 (current release). It also reproduces the issue under the current master branch, and also with the modifications from PR #703 from @Mizaro, so it looks like it really is a new issue.
https://github.com/mkleehammer/pyodbc/blob/master/src/params.cpp#L554 This might be the leaked object. There's no Py_XDECREF for it.
@v-chojas - Thanks for the pointer. I added a Py_XDECREF for it, but the results remain unchanged:
@@ -564,10 +564,11 @@ static int PyToCType(Cursor *cur, unsigned char **outbuf, PyObject *cell, ParamI
PyObject *scaledDecimal = PyObject_CallObject((PyObject*)cell->ob_type, args);
PyObject *digitLong = PyNumber_Long(scaledDecimal);
Py_XDECREF(args);
Py_XDECREF(scaledDecimal);
+ Py_XDECREF(newDigits);
Py_XDECREF(cellParts);
pNum->precision = (SQLCHAR)pi->ColumnSize;
pNum->scale = (SQLCHAR)pi->DecimalDigits;
@v-chojas - My apologies. PyCharm didn't see the change until I did
pip install -e /home/gord/git/pyodbc/
Now (on Linux) I get
before loop: 39.8 MiB
after 1 iteration: 40.2 MiB
after 2 iterations: 46.8 MiB
after 3 iterations: 46.8 MiB
after 4 iterations: 46.8 MiB
after 5 iterations: 46.8 MiB
after 6 iterations: 46.8 MiB
after 7 iterations: 46.8 MiB
after 8 iterations: 46.8 MiB
Good analysis @gordthompson , and nice find @v-chojas .
@ddavis226 - If you would like to try out the patch you can
pip install git+https://github.com/gordthompson/pyodbc@decimal_leak
I would be oh-so-grateful!! Doing so now.
This is working as expected now! Thank you so much for the incredibly quick turnaround!
I am assuming I close the issue now?
Hello,
I have the same issue, I assume this fix will be released in the next version?
Is there any way I can use this fix before then? Or is there any workaround?
Thanks.
@jdudleyie - You can install pyodbc from the latest master branch
pip install git+https://github.com/mkleehammer/pyodbc
Thanks @gordthompson. I assume I could do this for testing that it resolves the issue, but I should not do this in production as there are other commits since the last release so it’s probably not safe?
If I wanted to use this fix in production is there a way for me to get or create a build of the latest current released version plus only this new fix? Thanks.
there are other commits since the last release so it’s probably not safe
I don't know why you would assume that. Those changes have been vetted and accepted so they will be included in the next official release anyway (whenever that is).
Thank you, I didn’t realise, that’s good to know. (If each of those commits is vetted then I am wondering why a new release is not automatically created for each merge to master so they are easily available to consumers.)
I suppose I am used to running a specific version of a library in production and consciously choosing when to upgrade. If I changed my Docker image / production deployment process temporarily to pip install pyodbc from master so as to include this fix, then at any time a new commit could come in that affects something in production, and it might be hard to track down. Whereas if I had just upgraded package versions, at least I would know that was something I had changed.
Is there a way to pip install a specific commit on the pyodbc master branch?
Apologies for all the questions, just trying to assess the safest way to get this change into my production environment in order to resolve the memory leak and I’m not that familiar with these options that don’t involve bumping a package version.
You can install from a specific commit ID instead of the latest commit on master.
Running the below worked for me and appears to resolve the memory leak, both on Windows and also in Linux, thanks for the help.
pip install git+https://github.com/mkleehammer/pyodbc@e67ad89a5ed8066b7f1272b96688da0a5c24c1c5
Environment
I am attempting to load a very wide table that consists of 99 columns that include 44 decimal(28,9), 34 uniqueidentifiers, 12 ints, 4 bits, 5 dates, etc. See schema below.
My Python application is running on a VM with 8 GB of memory and no load other than my application. The application reads from a very large Parquet file in a streaming fashion and sends batches to Azure SQL via pyodbc's executemany method.
I am using fast_executemany = True and autocommit = False. I am currently testing with batches of 20,000 rows.
I expect to be able to call cursor.executemany, passing a list of 20,000 tuples in a loop and be able to successfully load nearly 10 million rows with my application memory not getting consumed in a linear fashion. What I am observing is if I include decimals in my list that I send to executemany(), then it appears that memory is not getting released from executemany() such that with each iteration I can observe the memory usage consistently growing.
Immediately out of the gate, my server's available memory dropped quickly (linearly) until all 8 GB were consumed. So I began investigating, including removing columns from the list until I could observe a pattern. Here are some things that I notice about memory:
If I take all of the decimals out of the list, leaving the other 55 columns, then the entire batch runs successfully with memory staying below 0.5 GB!
If I add 4 decimals to the list of 55, then by the 192nd iteration of the loop, memory consumption is already hitting 3 GB. I stopped it there and didn't let it continue to the 500th iteration because it didn't seem worth it.
If I JUST load Id (uniqueidentifier) and a single decimal column, then memory grows such that after the 10 million rows have been inserted, 1.7 GB of memory has been consumed over the 12 minutes. Given that I am using a for loop to load in batches, the values should be released as they go out of scope, so why does memory usage grow consistently the entire time?
If I remove just the cursor.executemany call from the loop, memory is not an issue, so I am positive the problem is isolated to the cursor.
If I turn off fast_executemany, then I don't have any memory issues, but I do have speed issues, so I definitely need to make it work with fast_executemany.
If I pass the zip generator object to cursor.executemany rather than the values list, then the memory issue goes away, but it runs slowly, as if it's no longer in fast_executemany mode, which is what I suspect is happening.
It seems that decimals are key to the memory issue that I am seeing. Why does including a decimal consume so much memory without it getting released?
The simplest way to reproduce this is to do the third test above with a list of 20,000 tuples of (string, decimal) where the string is a GUID. Observe how memory grows in a linear fashion.
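That minimal (GUID string, Decimal) batch can be generated like this (a sketch; repeatedly passing this list to cursor.executemany with fast_executemany enabled is what exhibits the linear growth):

```python
import uuid
from decimal import Decimal

batch_size = 20_000
value = Decimal("3.14")
# 20,000 (string GUID, Decimal) tuples, as described above. Looping
# cursor.executemany over batches like this is the minimal reproduction.
data = [(str(uuid.uuid4()), value) for _ in range(batch_size)]
print(len(data))  # 20000
```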
I appreciate any assistance you can provide! Thank you for all you do!
Here is my target table:
Target Table: [Id] [uniqueidentifier] NOT NULL, [SourceCompany] [uniqueidentifier] NULL, [Company] [uniqueidentifier] NULL, [Account] [uniqueidentifier] NULL, [Organization] [uniqueidentifier] NULL, [BudgetCategory] [uniqueidentifier] NULL, [LaborExpense] [int] NULL, [Project] [uniqueidentifier] NULL, [Task] [uniqueidentifier] NULL, [PostingDate] [uniqueidentifier] NULL, [PostingDateDate] [date] NULL, [PostingPeriod] [uniqueidentifier] NULL, [Year] [int] NULL, [Period] [int] NULL, [YearPeriod] [int] NULL, [InputDocumentType] [int] NULL, [InputDocumentSource] [int] NULL, [InputDocumentSourceId] [uniqueidentifier] NULL, [InputDocumentSuffix] [uniqueidentifier] NULL, [DocumentNbr] nvarchar NULL, [DocumentDate] [date] NULL, [TransactionDate] [date] NULL, [DocumentId] [uniqueidentifier] NULL, [LineItemType] [int] NULL, [LineItemId] [uniqueidentifier] NULL, [ProcessTime] datetime2 NULL, [EvcType] [int] NULL, [EmployeeVendorClient] [uniqueidentifier] NULL, [Employee] [uniqueidentifier] NULL, [Vendor] [uniqueidentifier] NULL, [Client] [uniqueidentifier] NULL, [Equipment] [uniqueidentifier] NULL, [Unit] [uniqueidentifier] NULL, [ExpenseReportCategory] [uniqueidentifier] NULL, [DocumentCurrency] [uniqueidentifier] NULL, [TransactionCurrency] [uniqueidentifier] NULL, [CostDC] [decimal](28, 9) NULL, [CostTC] [decimal](28, 9) NULL, [CostCC] [decimal](28, 9) NULL, [CostPC] [decimal](28, 9) NULL, [CostBC] [decimal](28, 9) NULL, [EffectiveBillCostPC] [decimal](28, 9) NULL, [EffectiveBillCostCC] [decimal](28, 9) NULL, [EffectiveBillCostBC] [decimal](28, 9) NULL, [EffortPC] [decimal](28, 9) NULL, [EffortCC] [decimal](28, 9) NULL, [EffortBC] [decimal](28, 9) NULL, [WriteOffCostDC] [decimal](28, 9) NULL, [WriteOffCostTC] [decimal](28, 9) NULL, [WriteOffCostCC] [decimal](28, 9) NULL, [WriteOffCostPC] [decimal](28, 9) NULL, [WriteOffCostBC] [decimal](28, 9) NULL, [WriteOffEffortCC] [decimal](28, 9) NULL, [WriteOffEffortPC] [decimal](28, 9) NULL, [WriteOffEffortBC] [decimal](28, 9) NULL, 
[WriteOffQty] [decimal](28, 9) NULL, [CostBasis] [int] NULL, [EffortBasis] [int] NULL, [Quantity] [decimal](28, 9) NULL, [CostRateOverride] [bit] NULL, [EffectiveCostRateDC] [decimal](28, 9) NULL, [EffectiveCostRateTC] [decimal](28, 9) NULL, [EffectiveCostRatePC] [decimal](28, 9) NULL, [EffectiveCostRateCC] [decimal](28, 9) NULL, [EffectiveCostRateBC] [decimal](28, 9) NULL, [EffectiveBillCostRatePC] [decimal](28, 9) NULL, [EffectiveBillCostRateCC] [decimal](28, 9) NULL, [EffectiveBillCostRateBC] [decimal](28, 9) NULL, [RegularCostRateDC] [decimal](28, 9) NULL, [RegularCostRateTC] [decimal](28, 9) NULL, [RegularCostRatePC] [decimal](28, 9) NULL, [RegularCostRateCC] [decimal](28, 9) NULL, [RegularCostRateBC] [decimal](28, 9) NULL, [RegularBillCostRatePC] [decimal](28, 9) NULL, [RegularBillCostRateCC] [decimal](28, 9) NULL, [RegularBillCostRateBC] [decimal](28, 9) NULL, [EffortRatePC] [decimal](28, 9) NULL, [EffortRateCC] [decimal](28, 9) NULL, [EffortRateBC] [decimal](28, 9) NULL, [EffortMultiplier] [decimal](28, 9) NULL, [PremiumEffortMultiplier] [decimal](28, 9) NULL, [FeeType] [uniqueidentifier] NULL, [RevenueRule] [int] NULL, [ActivityType] [uniqueidentifier] NULL, [LaborBillClass] [uniqueidentifier] NULL, [HoursType] [uniqueidentifier] NULL, [ProjectDescription] [uniqueidentifier] NULL, [ExternalReferenceNbr] nvarchar NULL, [ExternalReferenceDate] [date] NULL, [Overtime] [bit] NULL, [HomeCompany] [uniqueidentifier] NULL, [HomeOrg] [uniqueidentifier] NULL, [Billable] [bit] NULL, [BillItemId] [uniqueidentifier] NULL, [OriginalProjectDetailId] [uniqueidentifier] NULL, [Location] [uniqueidentifier] NULL, [ItemId] [uniqueidentifier] NULL, [ExcludeFromStaffUtilization] [bit] NULL, [BatchId] [int] NULL