roji opened this issue 8 years ago
Agree, this would be great. Here is a link which describes how it can be done with the ADO.NET provider for SQL Server. Some sample code taken from that page:
```csharp
using (SqlBulkCopy copy = new SqlBulkCopy(conn))
{
    copy.DestinationTableName = "Quotes";
    DataTable table = new DataTable("Quotes");
    table.Columns.Add("Symbol", typeof(string));
    table.Columns.Add("QuoteDate", typeof(DateTime));
    table.Columns.Add("Price", typeof(decimal));
    for (int i = 0; i < nRecords; ++i)
    {
        table.Rows.Add("MSFT", new DateTime(2000, 1, 1).AddDays(i), 12.3m);
    }
    copy.WriteToServer(table);
}
```
If we could get an API where you basically swap `SqlBulkCopy` for `NpgsqlBulkCopy`, that would be awesome. (Of course, anyone who has worked much with `DataTable`s knows that they are awkward and painful, but perhaps for this kind of use case it makes at least some sense...)
@perlun, I think there's a bit of confusion here. Npgsql already has an efficient bulk copy API which provides the same benefits as SqlBulkCopy.
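For reference, a minimal sketch of what using that API looks like at the ADO.NET level (`connString` is assumed; the table and column names are illustrative):

```csharp
using System;
using Npgsql;
using NpgsqlTypes;

await using var conn = new NpgsqlConnection(connString);
await conn.OpenAsync();

using (var importer = conn.BeginBinaryImport(
           "COPY quotes (symbol, quote_date, price) FROM STDIN (FORMAT BINARY)"))
{
    for (var i = 0; i < 1000; i++)
    {
        importer.StartRow();
        importer.Write("MSFT", NpgsqlDbType.Text);
        importer.Write(new DateTime(2000, 1, 1).AddDays(i), NpgsqlDbType.Date);
        importer.Write(12.3m, NpgsqlDbType.Numeric);
    }

    importer.Complete(); // commits the import; disposing without Complete aborts it
}
```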
This issue is about using that bulk copy API from Npgsql's Entity Framework Core provider. EF Core already supports batched updates: if you perform several updates and then call `SaveChanges()`, they will all be done in a single roundtrip (rather than a roundtrip per update). It's technically possible to optimize this further, by checking whether a large group of these updates happen to be inserts and using bulk copy in that case.
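For illustration, plain EF Core usage already gets this batching; a sketch with a hypothetical context and entity:

```csharp
// QuotesContext and Quote are hypothetical. The point: EF Core sends all these inserts
// as batched commands on SaveChanges, rather than one round-trip per row.
await using var ctx = new QuotesContext();

for (var i = 0; i < 1000; i++)
    ctx.Quotes.Add(new Quote
    {
        Symbol = "MSFT",
        QuoteDate = new DateTime(2000, 1, 1).AddDays(i),
        Price = 12.3m
    });

await ctx.SaveChangesAsync(); // batched inserts, not 1000 round-trips
```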
This is far from a trivial change, and we need to benchmark first to see if the performance benefits (i.e. bulk copy vs. batched inserts) justify it.
As part of a POC I did some performance optimizations around this library (mainly ported the code from the SQL Server provider): https://gist.github.com/zlepper/7f55ab76547d81eb6eb403ad4feab06b

You will probably want someone who knows this much better than me to actually look it over and do it, but I hope it can help with optimizing inserts :)

From personal experience: it doesn't speed up PostgreSQL a whole lot (the difference I saw in local testing could be attributed to chance). However, in the case of CockroachDB I saw up to 15x throughput, simply because it can batch transactions much better, and it will probably matter even more for distributed clusters (I was just testing with a 3-node Docker cluster on the same machine).
@zlepper can you provide some context on exactly what your POC is doing? The provider already batches multiple updates/inserts into a single DbCommand, which means that everything happens in a single round-trip.
The issue would be to implement insert specifically via Npgsql's binary copy API, which is a super-optimized, PostgreSQL-specific way to get data into the database.
@roji Yes, I'm basically trying to insert a couple of billion rows into the database. So the optimization I did converts this (with just a lot more rows per save):
```sql
insert into foo(a, b)
values ('a', 'b');
insert into foo(a, b)
values ('c', 'd');
insert into foo(a, b)
values ('e', 'f');
```
into
```sql
insert into foo(a, b)
values ('a', 'b'),
       ('c', 'd'),
       ('e', 'f');
```
(Of course it's still parameterized, I just didn't want to type that out, since it doesn't matter for the specific changes I made.)
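For completeness, the parameterized form would look roughly like this (a sketch; `conn` is assumed to be an open NpgsqlConnection and the parameter names are illustrative):

```csharp
// Same three-row insert as above, but with one parameter per value.
await using var cmd = new NpgsqlCommand(
    "insert into foo(a, b) values (@p0, @p1), (@p2, @p3), (@p4, @p5)", conn);
cmd.Parameters.AddWithValue("p0", "a");
cmd.Parameters.AddWithValue("p1", "b");
cmd.Parameters.AddWithValue("p2", "c");
cmd.Parameters.AddWithValue("p3", "d");
cmd.Parameters.AddWithValue("p4", "e");
cmd.Parameters.AddWithValue("p5", "f");
await cmd.ExecuteNonQueryAsync();
```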
So it's still just one roundtrip, but each statement becomes a transaction (if I have understood things correctly; I'm not that sharp on database things).
My purpose in inserting the data doesn't actually have anything to do with the inserts themselves; I just need the data there, to test how different databases handle it when a lot of data already exists.
Do also watch out for binary copy. In my experience it doesn't work in CockroachDB (at least the thing `Z.EntityFrameworkExtensions` does doesn't work).

For my specific case I ended up using the actual import APIs, since they are by far the fastest, but that's a whole other thing.
@zlepper When using the former (without your change), have you tried turning on automatic statement preparation? This has the potential of making the inserts run much faster than your modified version.
Regardless, if you really are inserting billions of rows, consider bypassing EF and coding directly against the Npgsql COPY API - you'll definitely get the best perf possible. You can still use EF Core for all other access to the database.
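For anyone following along, automatic preparation is enabled through connection string parameters; a sketch (host/credentials and the exact values are illustrative, not recommendations):

```csharp
// With "Max Auto Prepare" > 0, Npgsql automatically prepares statements that get executed
// repeatedly; "Auto Prepare Min Usages" controls how many executions trigger preparation.
var connString =
    "Host=localhost;Username=test;Password=test;Database=test;" +
    "Max Auto Prepare=20;Auto Prepare Min Usages=2";
await using var conn = new NpgsqlConnection(connString);
await conn.OpenAsync();
```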
> @zlepper When using the former (without your change), have you tried turning on automatic statement preparation? This has the potential of making the inserts run much faster than your modified version.

Would this then help even more with the bulk version also? But no, I was not aware that was a thing.
> Regardless, if you really are inserting billions of rows, consider bypassing EF and coding directly against the Npgsql COPY API - you'll definitely get the best perf possible. You can still use EF Core for all other access to the database.

I mean, that is basically what I ended up doing, just using `IMPORT` against an ASP.NET endpoint that dynamically generates CSV data. Currently I'm getting about 250,000 rows per second with this, so I'm pretty satisfied. It does, however, mean that no validation is being run whatsoever, which I imagine `COPY` will actually do.
I have no doubt your suggestion will help with PostgreSQL, however my main problem was actually the performance in CockroachDB, where the bottleneck comes from consistency (which requires communication between nodes), and there the problem is insert statements with just one pair of values.
Also, very interestingly, I cannot find the `COPY` statement in the CockroachDB documentation; however, it is apparently supported according to this PR: https://github.com/cockroachdb/cockroach/pull/8756.
> Would this then help even more with the bulk version also? But no, I was not aware that was a thing.
It might, if you execute the same command (with the same number of parameters) multiple times.
Yeah, if you're using Cockroach then everything definitely needs to be checked there specifically... I have no idea exactly what they support.
> > Would this then help even more with the bulk version also? But no, I was not aware that was a thing.
>
> It might, if you execute the same command (with the same number of parameters) multiple times.
Considering that EF is currently chunking our inserts at 1000 rows per query, then yes, we would definitely fit under that :D
But either way, if you can use what I sent, feel free. If not, feel free not to :) Just thought it might help some of the general investigation work here :)
Note: https://github.com/dotnet/efcore/issues/9118 is planned for EF Core 7.0, which would unlock this.
There are various methods we could use (these are the methods benchmarked below):

- Batched inserts (one statement per row, batched into a single command; the current behavior)
- Insert with multiple rows in a single VALUES clause
- Insert with arrays (one array parameter per column, unnested server-side)
- COPY
@YohDeadfall did some great benchmarking in https://github.com/npgsql/npgsql/issues/2779#issuecomment-573439342. tl;dr insert with arrays works really well up until somewhere beyond 100 rows, at which point COPY overtakes it.
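To make "insert with arrays" concrete, here's a sketch (table and parameter names illustrative, `conn` assumed to be an open NpgsqlConnection):

```csharp
// Three rows inserted via two array parameters; unnest() zips the arrays into rows
// server-side, so the SQL and parameter count stay fixed regardless of row count.
await using var cmd = new NpgsqlCommand(
    "INSERT INTO foo (a, b) SELECT * FROM unnest(@a, @b)", conn);
cmd.Parameters.AddWithValue("a", new[] { "a", "c", "e" });
cmd.Parameters.AddWithValue("b", new[] { "b", "d", "f" });
await cmd.ExecuteNonQueryAsync();
```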
One reason why COPY may be slower for fewer rows is that it currently requires multiple roundtrips, i.e. BeginBinaryImport and the terminating Complete (this also means that whether COPY is worth doing depends on the latency to PostgreSQL). We could think about changing the ADO.NET COPY import API to be single-roundtrip, i.e. make BeginBinaryImport do nothing (a bit like BeginTransaction), and only flush and wait on Complete. We may want this to be opt-in, not sure. I'm also not sure whether COPY always works and has no limitations, so an opt-out is a good idea (probably a minimum row threshold for switching to COPY, which can be set to 0 to disable).
Note also https://github.com/dotnet/efcore/issues/27333, which is about a special EF Core API for bulk importing. So it may make sense to implement that API with COPY, and to implement SaveChanges with insert with arrays. Though if we do implement single-roundtrip COPY, COPY may be a better option all the time; experiment.
Note that even if we have single-roundtrip-COPY at the ADO layer, it's still not going to be possible to batch that with other, non-COPY commands at the EF level. We'd need some way to e.g. integrate a COPY operation inside an NpgsqlBatch, which would mean representing a COPY operation via NpgsqlBatchCommand. That would be a whole new API which pulls rows from the user (more similar to SqlBulkCopy) rather than today's API, where the user pushes rows into the database. Or allow the user to embed a lambda in the special NpgsqlBatchCommand, where they push the rows with today's API.
It's an interesting idea, but a bit far-fetched. Probably just go with array insert for now.
One more argument against COPY here is that if the user really is importing a massive number of rows - enough for the difference between array insert and COPY to be significant - they also want to avoid the overhead associated with change tracking in EF Core. So they want to use a dedicated import API as in https://github.com/dotnet/efcore/issues/27333, which bypasses all that.
Updated benchmark with the various candidate methods, based on @YohDeadfall's benchmarks in https://github.com/npgsql/npgsql/issues/2779#issuecomment-573439342:
Method | NumRows | Mean | Error | StdDev | Median | Ratio | RatioSD |
---|---|---|---|---|---|---|---|
Batched_inserts | 1 | 1.060 ms | 0.0214 ms | 0.0621 ms | 1.072 ms | 1.00 | 0.00 |
Insert_with_multiple_rows | 1 | 1.085 ms | 0.0181 ms | 0.0169 ms | 1.082 ms | 1.13 | 0.17 |
Insert_with_multiple_rows_sorted | 1 | 1.099 ms | 0.0197 ms | 0.0175 ms | 1.098 ms | 1.12 | 0.15 |
Insert_with_arrays | 1 | 1.088 ms | 0.0217 ms | 0.0633 ms | 1.100 ms | 1.03 | 0.05 |
Insert_with_arrays_sorted | 1 | 1.130 ms | 0.0180 ms | 0.0159 ms | 1.127 ms | 1.15 | 0.16 |
Copy | 1 | 1.188 ms | 0.0223 ms | 0.0209 ms | 1.189 ms | 1.23 | 0.18 |
Batched_inserts | 2 | 1.109 ms | 0.0220 ms | 0.0629 ms | 1.120 ms | 1.00 | 0.00 |
Insert_with_multiple_rows | 2 | 1.092 ms | 0.0217 ms | 0.0193 ms | 1.093 ms | 1.08 | 0.17 |
Insert_with_multiple_rows_sorted | 2 | 1.121 ms | 0.0222 ms | 0.0645 ms | 1.133 ms | 1.01 | 0.10 |
Insert_with_arrays | 2 | 1.129 ms | 0.0163 ms | 0.0144 ms | 1.130 ms | 1.11 | 0.17 |
Insert_with_arrays_sorted | 2 | 1.160 ms | 0.0230 ms | 0.0192 ms | 1.162 ms | 1.15 | 0.18 |
Copy | 2 | 1.223 ms | 0.0243 ms | 0.0639 ms | 1.234 ms | 1.11 | 0.03 |
Batched_inserts | 5 | 1.411 ms | 0.0205 ms | 0.0192 ms | 1.406 ms | 1.00 | 0.00 |
Insert_with_multiple_rows | 5 | 1.160 ms | 0.0311 ms | 0.0886 ms | 1.181 ms | 0.76 | 0.10 |
Insert_with_multiple_rows_sorted | 5 | 1.288 ms | 0.0243 ms | 0.0238 ms | 1.277 ms | 0.91 | 0.02 |
Insert_with_arrays | 5 | 1.068 ms | 0.0224 ms | 0.0659 ms | 1.082 ms | 0.74 | 0.07 |
Insert_with_arrays_sorted | 5 | 1.108 ms | 0.0213 ms | 0.0270 ms | 1.117 ms | 0.78 | 0.03 |
Copy | 5 | 1.093 ms | 0.0218 ms | 0.0555 ms | 1.098 ms | 0.79 | 0.07 |
Batched_inserts | 10 | 1.406 ms | 0.0177 ms | 0.0165 ms | 1.404 ms | 1.00 | 0.00 |
Insert_with_multiple_rows | 10 | 1.098 ms | 0.0202 ms | 0.0189 ms | 1.098 ms | 0.78 | 0.01 |
Insert_with_multiple_rows_sorted | 10 | 1.110 ms | 0.0248 ms | 0.0727 ms | 1.128 ms | 0.78 | 0.08 |
Insert_with_arrays | 10 | 1.112 ms | 0.0150 ms | 0.0141 ms | 1.110 ms | 0.79 | 0.01 |
Insert_with_arrays_sorted | 10 | 1.140 ms | 0.0213 ms | 0.0199 ms | 1.147 ms | 0.81 | 0.02 |
Copy | 10 | 1.139 ms | 0.0226 ms | 0.0612 ms | 1.143 ms | 0.76 | 0.08 |
Batched_inserts | 100 | 2.405 ms | 0.0441 ms | 0.0391 ms | 2.409 ms | 1.00 | 0.00 |
Insert_with_multiple_rows | 100 | 1.373 ms | 0.0097 ms | 0.0086 ms | 1.374 ms | 0.57 | 0.01 |
Insert_with_multiple_rows_sorted | 100 | 1.446 ms | 0.0275 ms | 0.0270 ms | 1.451 ms | 0.60 | 0.02 |
Insert_with_arrays | 100 | 1.333 ms | 0.0256 ms | 0.0350 ms | 1.324 ms | 0.56 | 0.02 |
Insert_with_arrays_sorted | 100 | 1.407 ms | 0.0143 ms | 0.0111 ms | 1.405 ms | 0.59 | 0.01 |
Copy | 100 | 1.343 ms | 0.0124 ms | 0.0116 ms | 1.341 ms | 0.56 | 0.01 |
Batched_inserts | 500 | 6.781 ms | 0.1345 ms | 0.2425 ms | 6.753 ms | 1.00 | 0.00 |
Insert_with_multiple_rows | 500 | 2.205 ms | 0.0437 ms | 0.0832 ms | 2.188 ms | 0.33 | 0.02 |
Insert_with_multiple_rows_sorted | 500 | 2.333 ms | 0.0460 ms | 0.0909 ms | 2.351 ms | 0.34 | 0.02 |
Insert_with_arrays | 500 | 2.002 ms | 0.0241 ms | 0.0225 ms | 1.995 ms | 0.29 | 0.01 |
Insert_with_arrays_sorted | 500 | 2.187 ms | 0.0431 ms | 0.1008 ms | 2.209 ms | 0.32 | 0.02 |
Copy | 500 | 2.154 ms | 0.0429 ms | 0.0401 ms | 2.153 ms | 0.31 | 0.01 |
Batched_inserts | 1000 | 12.433 ms | 0.2481 ms | 0.5651 ms | 12.372 ms | 1.00 | 0.00 |
Insert_with_multiple_rows | 1000 | 2.981 ms | 0.0581 ms | 0.0795 ms | 2.993 ms | 0.24 | 0.01 |
Insert_with_multiple_rows_sorted | 1000 | 3.360 ms | 0.0672 ms | 0.1159 ms | 3.373 ms | 0.27 | 0.01 |
Insert_with_arrays | 1000 | 2.683 ms | 0.0519 ms | 0.0728 ms | 2.684 ms | 0.22 | 0.01 |
Insert_with_arrays_sorted | 1000 | 3.088 ms | 0.0603 ms | 0.0619 ms | 3.084 ms | 0.26 | 0.01 |
Copy | 1000 | 3.457 ms | 0.0686 ms | 0.1166 ms | 3.492 ms | 0.28 | 0.01 |
Batched_inserts | 10000 | 102.829 ms | 2.0541 ms | 3.0108 ms | 102.639 ms | 1.00 | 0.00 |
Insert_with_multiple_rows | 10000 | 21.879 ms | 0.7110 ms | 2.0628 ms | 21.330 ms | 0.21 | 0.02 |
Insert_with_multiple_rows_sorted | 10000 | 25.269 ms | 0.8346 ms | 2.3404 ms | 25.181 ms | 0.23 | 0.02 |
Insert_with_arrays | 10000 | 18.460 ms | 1.2062 ms | 3.5185 ms | 17.058 ms | 0.22 | 0.03 |
Insert_with_arrays_sorted | 10000 | 18.913 ms | 0.4442 ms | 1.2601 ms | 18.841 ms | 0.18 | 0.01 |
Copy | 10000 | 7.258 ms | 0.1450 ms | 0.2724 ms | 7.280 ms | 0.07 | 0.00 |
tl;dr we should do Insert_with_arrays starting from 2-5 rows (benchmark to get the exact threshold).
Notes:
/cc @AndriySvyryd
An important problem with the above optimizations is that they don't guarantee the ordering of database-generated values returned via the RETURNING clause; this means we can't match up the returned IDs to their corresponding entity instances client-side.

The solution here would probably be to use MERGE in the same way as SQL Server; MERGE is coming in PostgreSQL 15.
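To make the problem concrete, an illustrative sketch (hypothetical `Entity` type, `conn` assumed open) of why positional matching against RETURNING is unsafe with the array-insert form:

```csharp
// INSERT ... SELECT ... RETURNING gives no guarantee that rows come back in input order,
// so matching returned ids to entities by position can pair them up incorrectly.
var entities = new[] { new Entity { A = "a", B = "b" }, new Entity { A = "c", B = "d" } };

await using var cmd = new NpgsqlCommand(
    "INSERT INTO foo (a, b) SELECT * FROM unnest(@a, @b) RETURNING id", conn);
cmd.Parameters.AddWithValue("a", entities.Select(e => e.A).ToArray());
cmd.Parameters.AddWithValue("b", entities.Select(e => e.B).ToArray());

await using var reader = await cmd.ExecuteReaderAsync();
var i = 0;
while (await reader.ReadAsync())
    entities[i++].Id = reader.GetInt32(0); // unsafe positional match
```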
This is great stuff! It's always been a pain point with DB-seeding scripts that utilize business logic for seeding: you end up generating your graph in 20-30 seconds, then waiting 5 minutes for the EF Core inserts 😅
@douglasg14b in general, EF's seeding feature isn't suitable for use with many rows for various reasons (e.g. seeding is present in the migration snapshots on disk, which would become huge). So I don't think this is very relevant for seeding.
@roji I'm not talking about using EF's seeding features. Just logic that generates data & state, and many entities, in C#, and with DbContext.
Note that for inserting a large number of rows (bulk insert/import), https://github.com/dotnet/efcore/issues/27333 is in general a better approach (expose Npgsql's binary COPY via a standard EF API).
The SQL Server provider has some sort of bulk insert optimization; look into it. At the most extreme we can do a COPY operation, although that would probably be going a bit overboard.