mkleehammer / pyodbc

Python ODBC bridge
https://github.com/mkleehammer/pyodbc/wiki

Inserting decimal datatypes using fastexecutemany=True causes memory issues #1026

Closed: ddavis226 closed this issue 2 years ago

ddavis226 commented 2 years ago

Environment

I am attempting to load a very wide table of 99 columns, including 44 decimal(28,9) columns, 34 uniqueidentifiers, 12 ints, 4 bits, 5 dates, etc. See the schema below.

My Python application is running on a VM with 8 GB of memory and no load other than my application. It reads from a very large parquet file in a streaming fashion and sends batches to Azure SQL via pyodbc's executemany method.

I am using fast_executemany = True and autocommit = False. I am currently testing with batches of 20,000 rows.

I expect to be able to call cursor.executemany in a loop, passing a list of 20,000 tuples each time, and successfully load nearly 10 million rows without my application's memory growing linearly. What I observe instead is that if the tuples include decimals, memory does not appear to be released by executemany(), so with each iteration I can watch memory usage consistently growing.

...
cursor.fast_executemany = True
for record_batch in reader.iter_batches(batch_size=20000):
    values = list(zip(*map(lambda col: col.to_pylist(), record_batch.columns)))
    cursor.executemany(insert_stmt, values)
    cursor.commit()
...
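
For context, a more complete, runnable sketch of that loading pattern; the connection details, parquet file name, and target table name below are placeholders rather than values from the report:

import pyarrow.parquet as pq
import pyodbc

# placeholder connection details
connection_string = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=mydb;UID=myuser;PWD=mypassword;"
)
cnxn = pyodbc.connect(connection_string, autocommit=False)
cursor = cnxn.cursor()
cursor.fast_executemany = True

parquet_file = pq.ParquetFile("wide_table.parquet")      # placeholder file name
col_names = parquet_file.schema_arrow.names              # assumes parquet columns match the table
insert_stmt = (
    f"INSERT INTO dbo.TargetTable ({', '.join(col_names)}) "
    f"VALUES ({', '.join('?' for _ in col_names)})"
)

# stream the file in 20,000-row batches instead of loading it all into memory
for record_batch in parquet_file.iter_batches(batch_size=20000):
    # transpose the columnar batch into a list of row tuples for executemany
    values = list(zip(*(col.to_pylist() for col in record_batch.columns)))
    cursor.executemany(insert_stmt, values)
    cnxn.commit()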

Immediately out of the gate, my server's available memory dropped quickly (and linearly) until all 8 GB were consumed. So I began investigating, including removing columns from the list until I could observe a pattern. Here is what I noticed about memory:

It seems that decimals are key to the memory issue that I am seeing. Why does including a decimal consume so much memory without it getting released?

The simplest way to reproduce this is to run the third test above with a list of 20,000 tuples of (string, decimal), where the string is a GUID, and observe how memory grows linearly.
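
A minimal sketch of that two-column case; the connection string and table name are placeholders:

import uuid
from decimal import Decimal

import pyodbc

connection_string = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;UID=myuser;PWD=mypassword;"
)
cnxn = pyodbc.connect(connection_string)
crsr = cnxn.cursor()
crsr.fast_executemany = True

crsr.execute("CREATE TABLE dbo.DecimalRepro (guid_col uniqueidentifier, dec_col decimal(28, 9))")
cnxn.commit()

rows = [(str(uuid.uuid4()), Decimal("123.456789")) for _ in range(20000)]
for _ in range(10):   # process memory grows a little more on every pass
    crsr.executemany("INSERT INTO dbo.DecimalRepro (guid_col, dec_col) VALUES (?, ?)", rows)
    cnxn.commit()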

I appreciate any assistance you can provide! Thank you for all you do!

Here is my target table:

Target Table: [Id] [uniqueidentifier] NOT NULL, [SourceCompany] [uniqueidentifier] NULL, [Company] [uniqueidentifier] NULL, [Account] [uniqueidentifier] NULL, [Organization] [uniqueidentifier] NULL, [BudgetCategory] [uniqueidentifier] NULL, [LaborExpense] [int] NULL, [Project] [uniqueidentifier] NULL, [Task] [uniqueidentifier] NULL, [PostingDate] [uniqueidentifier] NULL, [PostingDateDate] [date] NULL, [PostingPeriod] [uniqueidentifier] NULL, [Year] [int] NULL, [Period] [int] NULL, [YearPeriod] [int] NULL, [InputDocumentType] [int] NULL, [InputDocumentSource] [int] NULL, [InputDocumentSourceId] [uniqueidentifier] NULL, [InputDocumentSuffix] [uniqueidentifier] NULL, [DocumentNbr] nvarchar NULL, [DocumentDate] [date] NULL, [TransactionDate] [date] NULL, [DocumentId] [uniqueidentifier] NULL, [LineItemType] [int] NULL, [LineItemId] [uniqueidentifier] NULL, [ProcessTime] datetime2 NULL, [EvcType] [int] NULL, [EmployeeVendorClient] [uniqueidentifier] NULL, [Employee] [uniqueidentifier] NULL, [Vendor] [uniqueidentifier] NULL, [Client] [uniqueidentifier] NULL, [Equipment] [uniqueidentifier] NULL, [Unit] [uniqueidentifier] NULL, [ExpenseReportCategory] [uniqueidentifier] NULL, [DocumentCurrency] [uniqueidentifier] NULL, [TransactionCurrency] [uniqueidentifier] NULL, [CostDC] [decimal](28, 9) NULL, [CostTC] [decimal](28, 9) NULL, [CostCC] [decimal](28, 9) NULL, [CostPC] [decimal](28, 9) NULL, [CostBC] [decimal](28, 9) NULL, [EffectiveBillCostPC] [decimal](28, 9) NULL, [EffectiveBillCostCC] [decimal](28, 9) NULL, [EffectiveBillCostBC] [decimal](28, 9) NULL, [EffortPC] [decimal](28, 9) NULL, [EffortCC] [decimal](28, 9) NULL, [EffortBC] [decimal](28, 9) NULL, [WriteOffCostDC] [decimal](28, 9) NULL, [WriteOffCostTC] [decimal](28, 9) NULL, [WriteOffCostCC] [decimal](28, 9) NULL, [WriteOffCostPC] [decimal](28, 9) NULL, [WriteOffCostBC] [decimal](28, 9) NULL, [WriteOffEffortCC] [decimal](28, 9) NULL, [WriteOffEffortPC] [decimal](28, 9) NULL, [WriteOffEffortBC] [decimal](28, 9) NULL, [WriteOffQty] [decimal](28, 9) NULL, [CostBasis] [int] NULL, [EffortBasis] [int] NULL, [Quantity] [decimal](28, 9) NULL, [CostRateOverride] [bit] NULL, [EffectiveCostRateDC] [decimal](28, 9) NULL, [EffectiveCostRateTC] [decimal](28, 9) NULL, [EffectiveCostRatePC] [decimal](28, 9) NULL, [EffectiveCostRateCC] [decimal](28, 9) NULL, [EffectiveCostRateBC] [decimal](28, 9) NULL, [EffectiveBillCostRatePC] [decimal](28, 9) NULL, [EffectiveBillCostRateCC] [decimal](28, 9) NULL, [EffectiveBillCostRateBC] [decimal](28, 9) NULL, [RegularCostRateDC] [decimal](28, 9) NULL, [RegularCostRateTC] [decimal](28, 9) NULL, [RegularCostRatePC] [decimal](28, 9) NULL, [RegularCostRateCC] [decimal](28, 9) NULL, [RegularCostRateBC] [decimal](28, 9) NULL, [RegularBillCostRatePC] [decimal](28, 9) NULL, [RegularBillCostRateCC] [decimal](28, 9) NULL, [RegularBillCostRateBC] [decimal](28, 9) NULL, [EffortRatePC] [decimal](28, 9) NULL, [EffortRateCC] [decimal](28, 9) NULL, [EffortRateBC] [decimal](28, 9) NULL, [EffortMultiplier] [decimal](28, 9) NULL, [PremiumEffortMultiplier] [decimal](28, 9) NULL, [FeeType] [uniqueidentifier] NULL, [RevenueRule] [int] NULL, [ActivityType] [uniqueidentifier] NULL, [LaborBillClass] [uniqueidentifier] NULL, [HoursType] [uniqueidentifier] NULL, [ProjectDescription] [uniqueidentifier] NULL, [ExternalReferenceNbr] nvarchar NULL, [ExternalReferenceDate] [date] NULL, [Overtime] [bit] NULL, [HomeCompany] [uniqueidentifier] NULL, [HomeOrg] [uniqueidentifier] NULL, [Billable] [bit] NULL, [BillItemId] [uniqueidentifier] NULL, 
[OriginalProjectDetailId] [uniqueidentifier] NULL, [Location] [uniqueidentifier] NULL, [ItemId] [uniqueidentifier] NULL, [ExcludeFromStaffUtilization] [bit] NULL, [BatchId] [int] NULL

gordthompson commented 2 years ago

If I pass the zip generator object to cursor.executemany rather than the values list, the memory issue goes away, but it runs slowly, as if it is no longer in fast_executemany mode, which is what I suspect is happening.

Correct. See #500

If I turn off fast_executemany, I don't have any memory issues, but I do have speed issues, so I definitely need to make it work with fast_executemany.

As a workaround you might want to experiment with the JSON trick described here.
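
For anyone following along, the general shape of that kind of workaround is to send each batch as a single JSON string and let SQL Server shred it with OPENJSON, so fast_executemany is not involved at all. A rough sketch with a hypothetical two-column table (the post linked above has the real details):

import json
import uuid
from decimal import Decimal

import pyodbc

connection_string = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;UID=myuser;PWD=mypassword;"
)
cnxn = pyodbc.connect(connection_string)
crsr = cnxn.cursor()

# hypothetical target table
crsr.execute("CREATE TABLE dbo.JsonDemo (guid_col uniqueidentifier, dec_col decimal(28, 9))")
cnxn.commit()

# serialize decimals as strings so json.dumps accepts them
rows = [{"guid_col": str(uuid.uuid4()), "dec_col": str(Decimal("123.456789"))}
        for _ in range(20000)]

# one string parameter per batch; OPENJSON shreds it server-side
sql = """
INSERT INTO dbo.JsonDemo (guid_col, dec_col)
SELECT guid_col, dec_col
FROM OPENJSON(?) WITH (
    guid_col uniqueidentifier '$.guid_col',
    dec_col  decimal(28, 9)   '$.dec_col'
)
"""
crsr.execute(sql, json.dumps(rows))
cnxn.commit()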

davidbaniadam commented 2 years ago

@ddavis226 : are you loading to a heap, clustered index or columnstore? Is there already data in the table or is the table empty?

Is the clustered index unique? Why do you use batches? How do you measure memory in Azure SQL DB, given that in Azure SQL DB you do not have access to the OS? SQL Server uses all the memory you give it: if the server is idle and was just restarted it does not consume much, but once you start using the instance it consumes memory and does not release it. SQL Server also uses memory for many things, e.g. caching data and workspace memory (for sorts).

What problems are you facing, other than not liking the memory numbers? My guess is that you think you have a problem, but everything is actually fine.

Also why do you use autocommit=False?

ddavis226 commented 2 years ago

@ddavis226 : are you loading to a heap, clustered index or columnstore? Is there already data in the table or is the table empty?

Is the clustered index unique? Why do you use batches? How do you measure memory in Azure SQL DB, given that in Azure SQL DB you do not have access to the OS? SQL Server uses all the memory you give it: if the server is idle and was just restarted it does not consume much, but once you start using the instance it consumes memory and does not release it. SQL Server also uses memory for many things, e.g. caching data and workspace memory (for sorts).

What problems are you facing, other than not liking the memory numbers? My guess is that you think you have a problem, but everything is actually fine.

Also why do you use autocommit=False?

Thank you for your comments. The memory I am measuring is on the VM that is doing the data loading, not on Azure SQL. The data is streamed in batches from a parquet file, and each batch of records is then sent to Azure SQL via the pyodbc connection. I turn off autocommit because I do not want each row committed individually; I commit at the end of each batch of 20k rows.

The target table is a heap that starts out empty, but of course, since I am loading in batches, it is only empty for the first batch.

I do batches because this is a very large file and there is no reason to load the entire file into memory on the server. Batching (chunking the workload) is a common way to deal with very large files for scalability reasons.

I do believe I have a real problem but I do not think I have a problem with SQL Server.

I hope this clears up any confusion. Of course when I explained it, it was clear to me what I was talking about ;-)

ddavis226 commented 2 years ago

If I pass the zip generator object to cursor.executemany rather than the values list, the memory issue goes away, but it runs slowly, as if it is no longer in fast_executemany mode, which is what I suspect is happening.

Correct. See #500

If I turn off fast_executemany, I don't have any memory issues, but I do have speed issues, so I definitely need to make it work with fast_executemany.

As a workaround you might want to experiment with the JSON trick described here.

Thank you. I will take a look at the workaround.
I would like to know whether the fast_executemany path is holding onto decimals in some way, though. If so, I would expect others to be running into this as well. Also, maybe this was introduced in a recent version and I could drop back a few versions?
As you'll notice in my writeup, I can load a LOT of string and int data with no issues at all. Only when I introduce decimals do I see memory not getting released.

davidbaniadam commented 2 years ago

@ddavis226: I see everything from my perspective, which is SQL Server, so I thought you were talking about memory problems on the SQL Server side. Loading in batches can also improve transaction log behavior, and with columnstore it is also important to load in batches (though with more than 20k rows per batch).

gordthompson commented 2 years ago

Repro code for this issue:

from decimal import Decimal

import psutil
import pyodbc

connection_string = (
    "DRIVER=ODBC Driver 17 for SQL Server;"
    "SERVER=192.168.0.199;"
    "UID=scott;PWD=tiger^5HHH;"
    "DATABASE=test;"
)
cnxn = pyodbc.connect(connection_string)
crsr = cnxn.cursor()
crsr.fast_executemany = True

table_name = "gh1026"
num_columns = 127

crsr.execute(f"DROP TABLE IF EXISTS {table_name}")
column_names = [f"col{i:03}" for i in range(num_columns)]
column_type = "decimal(18, 4)"
column_definitions = ", ".join(f"{x} {column_type}" for x in column_names)
sql = (
    f"CREATE TABLE {table_name} (id int identity primary key, "
    f"{column_definitions}"
    ")"
)
crsr.execute(sql)
sql = f"INSERT INTO {table_name} ({', '.join(column_names)}) VALUES ({', '.join('?' * num_columns)})"

value = Decimal("3.14")
batch_size = 2000
data = [tuple([value for c in range(num_columns)]) for r in range(batch_size)]

process = psutil.Process()

def printmem(msg):
    mb = process.memory_info().vms / 1048576
    print(f"{msg}: {mb:0.1f} MiB")

printmem("before loop")
for loop in range(3):
    crsr.executemany(sql, data)
    cnxn.commit()
    printmem(f"after {loop + 1} iteration{'s' if loop else ''}")

console output:

before loop: 12.5 MiB
after 1 iteration: 31.6 MiB
after 2 iterations: 50.6 MiB
after 3 iterations: 70.7 MiB

However, with value = 3.14 instead of value = Decimal("3.14") we see:

before loop: 12.5 MiB
after 1 iteration: 12.5 MiB
after 2 iterations: 12.5 MiB
after 3 iterations: 12.5 MiB
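
Since the float case does not leak, one crude stopgap while waiting for a fix would be to downcast Decimal cells to float before calling executemany, accepting that float cannot represent decimal(28, 9) values exactly. A small sketch:

from decimal import Decimal

def downcast_decimals(row):
    # replace Decimal cells with float; precision loss is possible
    return tuple(float(v) if isinstance(v, Decimal) else v for v in row)

data = [(Decimal("3.14"),) * 127 for _ in range(2000)]   # same shape as the repro batch
data = [downcast_decimals(row) for row in data]          # float values do not trigger the growth shown above
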
gordthompson commented 2 years ago

Additional information: The above code was run using pyodbc 4.0.32 (the current release). The issue also reproduces under the current master branch and with the modifications from PR #703 from @Mizaro, so it looks like it really is a new issue.

v-chojas commented 2 years ago

https://github.com/mkleehammer/pyodbc/blob/master/src/params.cpp#L554 This might be the leaked object. There's no Py_XDECREF for it.

gordthompson commented 2 years ago

@v-chojas - Thanks for the pointer. I added a Py_XDECREF for it, but the results remain unchanged:

@@ -564,10 +564,11 @@ static int PyToCType(Cursor *cur, unsigned char **outbuf, PyObject *cell, ParamI
         PyObject *scaledDecimal = PyObject_CallObject((PyObject*)cell->ob_type, args);
         PyObject *digitLong = PyNumber_Long(scaledDecimal);

         Py_XDECREF(args);
         Py_XDECREF(scaledDecimal);
+        Py_XDECREF(newDigits);
         Py_XDECREF(cellParts);

         pNum->precision = (SQLCHAR)pi->ColumnSize;
         pNum->scale = (SQLCHAR)pi->DecimalDigits;
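
For readers unfamiliar with CPython extensions: an object is freed only when its reference count drops to zero, so a C-level reference that is acquired but never released (the missing Py_XDECREF above) keeps that object alive for the life of the process. A rough Python-level illustration of the mechanism (an analogy only, not the pyodbc internals):

import sys
from decimal import Decimal

digits = Decimal("3.14").as_tuple().digits   # a small tuple, similar in spirit to the
                                             # per-cell digit tuple built during conversion
held = [digits]                              # stand-in for a reference that is never released
print(sys.getrefcount(digits))               # every unreleased reference raises this count,
                                             # so the object is never garbage collected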
gordthompson commented 2 years ago

@v-chojas - My apologies. PyCharm didn't see the change until I did

pip install -e /home/gord/git/pyodbc/

Now (on Linux) I get

before loop: 39.8 MiB
after 1 iteration: 40.2 MiB
after 2 iterations: 46.8 MiB
after 3 iterations: 46.8 MiB
after 4 iterations: 46.8 MiB
after 5 iterations: 46.8 MiB
after 6 iterations: 46.8 MiB
after 7 iterations: 46.8 MiB
after 8 iterations: 46.8 MiB
keitherskine commented 2 years ago

Good analysis @gordthompson , and nice find @v-chojas .

gordthompson commented 2 years ago

@ddavis226 - If you would like to try out the patch you can

pip install git+https://github.com/gordthompson/pyodbc@decimal_leak
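
After installing, a quick way to double-check which pyodbc build the environment actually picked up:

import pyodbc
print(pyodbc.version)   # confirm the import works and see which version string is reported
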
ddavis226 commented 2 years ago

I would be oh-so-grateful!! Doing so now.

ddavis226 commented 2 years ago

This is working as expected now! Thank you so much for the incredibly quick turnaround!
I am assuming I close the issue now?

jdudleyie commented 2 years ago

Hello,

I have the same issue, I assume this fix will be released in the next version?

Is there any way I can use this fix before then? Or is there any workaround?

Thanks.

gordthompson commented 2 years ago

@jdudleyie - You can install pyodbc from the latest master branch

pip install git+https://github.com/mkleehammer/pyodbc
jdudleyie commented 2 years ago

Thanks @gordthompson. I assume I could do this to test that it resolves the issue, but that I should not do this in production, since there are other commits since the last release, so it's probably not safe?

If I wanted to use this fix in production is there a way for me to get or create a build of the latest current released version plus only this new fix? Thanks.

gordthompson commented 2 years ago

there are other commits since the last release so it’s probably not safe

I don't know why you would assume that. Those changes have been vetted and accepted so they will be included in the next official release anyway (whenever that is).

jdudleyie commented 2 years ago

Thank you, I didn't realise that; good to know. (If each of those commits is vetted, I wonder why a new release is not automatically created for each merge to master so they are easily available to consumers.)

I suppose I am used to pinning a specific version of a library in production and consciously choosing when to upgrade. If I temporarily changed my docker image/production deployment process to pip install pyodbc from master so as to include this fix, then at any time a new commit could come in that affects something in production, and it could be harder to track down. Whereas if I had just upgraded package versions, at least I would know that was something I had changed.

Is there a way to pip install a specific commit on the pyodbc master branch?

Apologies for all the questions; I'm just trying to assess the safest way to get this change into my production environment to resolve the memory leak, and I'm not that familiar with options that don't involve bumping a package version.

v-chojas commented 2 years ago

You can install from a specific commit ID, instead of only the latest of master.
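
For a reproducible production install, the commit can also be pinned in requirements.txt with a direct reference (the SHA below is a placeholder):

# requirements.txt: pin pyodbc to one exact commit rather than a moving branch
pyodbc @ git+https://github.com/mkleehammer/pyodbc@<commit-sha>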

jdudleyie commented 2 years ago

Running the command below worked for me and appears to resolve the memory leak, both on Windows and on Linux. Thanks for the help.

pip install git+https://github.com/mkleehammer/pyodbc@e67ad89a5ed8066b7f1272b96688da0a5c24c1c5