mikependon / RepoDB

A hybrid ORM library for .NET.
Apache License 2.0

Enhancement: Refactor Batch Operations SQL Statements #1143

Open mikependon opened 1 year ago

mikependon commented 1 year ago

Describe the enhancement

In response to #1142, it is important for the library to limit its memory usage when caching the packed-statements per entity/model for the batch operations. This ensures that more memory is left for the application to use for other purposes.

The rationale

What? In the issue mentioned above, a user of the PGSQL extended library reported a memory leak. They have an entity model that corresponds to a table with 112 columns. During a call to the InsertAll operation with a batchSize of 500, the library's memory usage suddenly spiked to 3GB.

Why? Given this parameter, RepoDB might cache up to a maximum of 500 big packed-statements (one per row count) for such a big table. And that is only for a single entity and a single operation, not counting other entities and the other batch operations (i.e., MergeAll, UpdateAll).

In the current version of the library, when you call the InsertAll operation, it caches a packed-statement keyed by the number of rows passed to the operation. (This is also true for MergeAll and UpdateAll.)

To elaborate, if a user passes 3 entities to the operation, the statement below will be created.

INSERT INTO [Table] (Col1, ... Col112) VALUES (@Col1_1, ..., @Col112_1); SELECT SCOPE_IDENTITY();
...
INSERT INTO [Table] (Col1, ... Col112) VALUES (@Col1_3, ..., @Col112_3); SELECT SCOPE_IDENTITY();

Such a statement will be cached in the host memory for the 3-row operation. It will be reused when the same entity model (or table) is passed again in the future, but only for exactly 3 rows.

Then, if a user passes 5 entities to the operation, the statement below will also be cached in the host memory.

INSERT INTO [Table] (Col1, ... Col112) VALUES (@Col1_1, ..., @Col112_1); SELECT SCOPE_IDENTITY();
...
...
...
INSERT INTO [Table] (Col1, ... Col112) VALUES (@Col1_5, ..., @Col112_5); SELECT SCOPE_IDENTITY();

If the user sets the batchSize to 500, up to 500 packed-statements of the form above may be saved in the host memory, resulting in much bigger memory requirements/consumption.

The more rows passed, the bigger the statement that gets cached.
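
To put an illustrative (and only rough) number on that, assuming column and parameter names of around 10 characters each:

Per row (112 columns): roughly 2.5 KB of SQL text
Packed statement for N rows: roughly N x 2.5 KB (a 500-row batch is about 1.2 MB)
Worst case, one cached statement per row count from 1 to 500: (1 + 2 + ... + 500) x 2.5 KB, or roughly 300 MB of characters
As UTF-16 .NET strings (2 bytes per character): roughly 600 MB, for InsertAll on this single table alone

On top of that, as noted above, MergeAll and UpdateAll cache their own, even larger, packed-statements for the same table.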

Conclusion

As per the report and the screenshots provided during the investigation, up to 3GB of memory was utilized for a single entity. This is alarming, as it would require the application to request a high resource limit whenever it needs to use the batch capabilities of the library.

Though this is not a bug per se, it is something that requires attention and a revisit to optimize the memory usage.

mikependon commented 1 year ago

Referencing: #380

cajuncoding commented 1 year ago

Yeah 3GB is unacceptably high…but it also feels odd that an otherwise simple string cache would be so large in memory.

Have you tried a POC to read the byte (array) size of the cached string, just to confirm the actual string size approaches that? While not apples to apples, it can help grasp the actual size of the cached strings, because 3GB would hold more than an entire encyclopedia, which used to fit (with images) on a CD-ROM…

mikependon commented 1 year ago

Have you tried a POC to read the byte (array) size of the cached string, just to confirm the actual string size approaches that

Yes, we will do this.

In addition, as of this writing, we are working on and investigating which approach is best. Since we cannot eliminate the required packed-statement caching when executing the batch operations, we should instead focus on the composition of the packed-statements.

Our initial resolution would be (ideally) to create the following SQL for every Row-N insertion in a batch operation.

INSERT INTO [TableName] (Col1, ..., Col112)
VALUES
(@p1, ..., @p112),
(@p113, ..., @p224),
(@p225, ..., @p336);

However, renaming @Col1_1 to @p1 is a MAJOR change and might challenge the library's architecture. The approach below can still be met, though.

INSERT INTO [TableName] (Col1, ..., Col112)
OUTPUT INSERTED.[PK/IK]
VALUES
(@Col1_1, ..., @Col112_1),
(@Col1_2, ..., @Col112_2),
(@Col1_3, ..., @Col112_3);

Based on our analysis, it would eliminate more than 75% of the size of the packed-statement.

mikependon commented 1 year ago

See these 2 packed statements. The new one is only around 50% of the size of the old one. We have only eliminated around 55% of the size because we are only using a small table in our Regression Test. The bigger the table and the Row-N, the more bytes will be eliminated.

The challenge for us here is: how do we ensure that the RETURNING [Id] rows are in the proper order? This is the issue you reported in the past, which we solved using the extra __RepoDb_OrderColumn_10 column.

InsertAll-New-Packed-Statement.txt InsertAll-Old-Packed-Statement.txt

mikependon commented 1 year ago

Hey @cajuncoding, there is a little bit of complexity, at least for SQL Server, when introducing the INSERT BULK approach in the packed-statement like the one below.

INSERT INTO [Table] ([Column1])
OUTPUT [INSERTED].[Id]
VALUES
(@Column1_0),
(@Column1_1),
(@Column1_2);

There is no way for me to order the OUTPUT [INSERTED].[Id] based on the extra parameter named @__RepoDb_OrderColumn_X.

Of course, unless I hack it like below.

WITH CTE AS
(
    SELECT @Column1_0 AS [Column1], @__RepoDb_OrderColumn_0 AS [__RepoDb_OrderColumn]
    UNION SELECT @Column1_1, @__RepoDb_OrderColumn_1
    UNION SELECT @Column1_2, @__RepoDb_OrderColumn_2
)
INSERT INTO [Table] ([Column1])
OUTPUT [INSERTED].[Id]
SELECT [Column1] FROM CTE ORDER BY [__RepoDb_OrderColumn];

Since you were the one who requested this capability, would you be able to validate whether the 1st statement above really contaminates the ordering of the insertion? I have tried it many times, and it seems to me it does not affect the order. In short, the result is the same as with the 2nd statement and with the old approach.
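
For what it is worth, a minimal repro along these lines could be used to eyeball whether the OUTPUT rows come back in the same order as the VALUES rows (the temp table and values are purely illustrative, and any observed stability would still not be a documented guarantee):

-- create a throwaway table with an identity key
CREATE TABLE #InsertAllOrderCheck
(
    [Id] INT IDENTITY(1, 1) PRIMARY KEY,
    [Column1] NVARCHAR(128)
);

-- multi-row insert; compare the order of the OUTPUT rows against the VALUES order
INSERT INTO #InsertAllOrderCheck ([Column1])
OUTPUT [INSERTED].[Id], [INSERTED].[Column1]
VALUES
(N'Row-0'),
(N'Row-1'),
(N'Row-2');

DROP TABLE #InsertAllOrderCheck;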

Would you be of help?

cajuncoding commented 1 year ago

@mikependon, I can't remember exactly how I was validating it at the time, but it was definitely an issue… but, as you alluded to, that was related to using BulkInsert() (based on SqlBulkCopy) as the higher-performing alternative to InsertAll().

However, the issue still stands, because SQL Server makes no guarantee at all about the returned item order unless you tell it exactly how to order (via an ORDER BY clause).

A little Google-fu yielded this excellent Q&A on StackOverflow with a great response… it may be that the SQL Server implementation will need to use a MERGE statement vs. a simple INSERT, but it's just a tweak on the script, because all values can still be packed and passed using the same VALUES (a, b, c), (d, e, f)… syntax as an input into the MERGE, and potentially a #OutputTempTable.

https://dba.stackexchange.com/a/155737/168595
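
For reference, a rough sketch of what that MERGE-based shape could look like (table, column, and parameter names are illustrative, reusing the @__RepoDb_OrderColumn_X convention from the earlier comments):

MERGE INTO [Table] WITH (HOLDLOCK) AS T
USING
(
    VALUES
    (@Column1_0, @__RepoDb_OrderColumn_0),
    (@Column1_1, @__RepoDb_OrderColumn_1),
    (@Column1_2, @__RepoDb_OrderColumn_2)
) AS S ([Column1], [__RepoDb_OrderColumn])
ON 1 = 0
WHEN NOT MATCHED THEN
    INSERT ([Column1]) VALUES (S.[Column1])
OUTPUT S.[__RepoDb_OrderColumn], [INSERTED].[Id];

The advantage is that the OUTPUT of a MERGE may reference the source columns, so every returned identity can be paired with its @__RepoDb_OrderColumn_X value regardless of the physical insert order.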

Thoughts on this?

mikependon commented 1 year ago

@cajuncoding yeah, I read the link you shared and it is such a good comment from the community to not rely on the output's default order matching the input. So, it is still recommended to add those extra steps, which I will also do. Thanks mate 🚀

mikependon commented 1 year ago

There seems to be a challenge in PGSQL: a query that works in SQL Server is failing in PGSQL.

(screenshot attached)

Reference: https://twitter.com/mike_pendon/status/1638654274237497346
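
For context, PostgreSQL has no OUTPUT clause; generated keys come back through a RETURNING clause at the end of the statement, so a minimal sketch of the PGSQL flavor of the multi-row insert would look like the following (names are illustrative, not the exact failing query):

INSERT INTO "Table" ("Column1")
VALUES
(@Column1_0),
(@Column1_1),
(@Column1_2)
RETURNING "Id";

The same ordering question discussed above would still need to be answered for the RETURNING rows.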