linq2db / linq2db.EntityFrameworkCore

Bring power of Linq To DB to Entity Framework Core projects
MIT License
462 stars 38 forks source link

In a repeatitive bulkcopy scenario how to handle duplicate data? #215

Closed aloksharma1 closed 2 years ago

aloksharma1 commented 2 years ago

Hi, i am using bulk copy to process a bunch of data from different tables to copy in a single table, problem is whenever bulkcopy background task reinitiates it duplicates the data (which i previously thought would go for merge).

i am concerned about performance, so i need to know insert/merge/update vs bulk copy which route should i go? judging that rows will increase over time (maybe upto ~50k).

thanks for the help

sdanyliv commented 2 years ago

From first sign, I would suggest to use the following scenario:

  1. BulkCopy records into temporary table
  2. Merge data to destination.
using var db = ctx.CreateLinqToDBConnection(); 
using var temp = db.CreateTempTable(items);

db.GetTable<SomeDest>()
   .Merge()
   .Using(temp)
   .On((t, s) => t.Id == s.Id)
   .InsertWhenNotMatched()
   .Merge();

It is sample syntax and I hope you will find needed overloads by yourself.

Shane32 commented 2 years ago

I bulk copy to a temporary table and then merge into the target table. Then the temporary table is deleted. Pretty much exactly what @sdanyliv posted above. (The using will delete the temporary table.)

Bulk+merge is also useful if you need to preserve the ids of the existing data while merging on another field. For example if you have a list of part numbers and prices, but no part ids, you can import the data to a temporary table, then merge on the part number updating the price, and if no match is found it will create a new row with the part number and price. This preserves the old row and the old row's id, which is especially important if you have other tables that relate to that data.

aloksharma1 commented 2 years ago

yes i have used this method before, but is their any difference between this & BulkCopyAsync under the hood, thats what i want to know.

Shane32 commented 2 years ago

As for performance, creating the temporary table, bulk inserting rows and dropping the table incurs no significant overhead to databases, as generally relational databases are optimized well for those operations. Those operations also do not affect locks on any other pending operations. The merge is where you need to be careful. So long as it is merging on an indexed field, such as the primary key in the above sample, you should be fine. But of course the database will still need to read every row in the target table, and will place locks on updates appropriately. I usually wrap these operations in a transaction but here also you will need to decide what is appropriate for your use case.

Shane32 commented 2 years ago

yes i have used this method before, but is their any difference between this & BulkCopyAsync under the hood, thats what i want to know.

No; CreateTempTable(items) simply creates a temp table and then calls BulkCopyAsync to populate it.

aloksharma1 commented 2 years ago

As for performance, creating the temporary table, bulk inserting rows and dropping the table incurs no significant overhead to databases, as generally relational databases are optimized well for those operations. Those operations also do not affect locks on any other pending operations. The merge is where you need to be careful. So long as it is merging on an indexed field, such as the primary key in the above sample, you should be fine. But of course the database will still need to read every row in the target table, and will place locks on updates appropriately. I usually wrap these operations in a transaction but here also you will need to decide what is appropriate for your use case.

All my operations are transaction bound, so no worries on that part.

Shane32 commented 2 years ago

I'm pretty sure there's an asynchronous overload of CreateTempTable, probably CreateTempTableAsync. I can't recall for sure. Of course I would recommend use of the asynchronous methods to create the temp table and populate it.

aloksharma1 commented 2 years ago

yes i have used this method before, but is their any difference between this & BulkCopyAsync under the hood, thats what i want to know.

No; CreateTempTable(items) simply creates a temp table and then calls BulkCopyAsync to populate it. thanks