npgsql / efcore.pg

Entity Framework Core provider for PostgreSQL

Optimize multiple insertions #113

Open roji opened 8 years ago

roji commented 8 years ago

The SQL Server provider has some sort of bulk insert optimization; look into it. At the most extreme we can do a COPY operation, although that would probably be going a bit overboard.

perlun commented 7 years ago

Agree, this would be great. Here is a link which describes how it can be done with the ADO.NET provider for SQL Server. Some sample code taken from that page:

    // Sample from the linked article; assumes conn is an open SqlConnection and
    // nRecords is the number of rows to insert (requires System.Data and System.Data.SqlClient).
    using (SqlBulkCopy copy = new SqlBulkCopy(conn))
    {
        copy.DestinationTableName = "Quotes";
        DataTable table = new DataTable("Quotes");
        table.Columns.Add("Symbol", typeof(string));
        table.Columns.Add("QuoteDate", typeof(DateTime));
        table.Columns.Add("Price", typeof(decimal));

        for (int i = 0; i < nRecords; ++i)
        {
            table.Rows.Add("MSFT", new DateTime(2000, 1, 1).AddDays(i), 12.3m);
        }

        copy.WriteToServer(table);
    }

If we could get an API where you basically swap SqlBulkCopy for an NpgsqlBulkCopy, that would be awesome. (Of course, anyone who has worked with DataTables much knows that they are awkward and painful, but perhaps for this kind of use case it makes at least some sense...)

roji commented 7 years ago

@perlun, I think there's a bit of confusion here. Npgsql already has an efficient bulk copy API which provides the same benefits as SqlBulkCopy.

This issue is about using that bulk copy API from Npgsql's Entity Framework Core provider. EF Core already supports batched updates: if you perform several updates and then call SaveChanges(), these will all be sent in a single roundtrip (rather than a roundtrip per update). It's technically possible to further optimize this by checking whether a large group of these updates happen to be inserts, and using bulk copy in that case.

This is far from a trivial change, and we need to benchmark first to see if the performance benefits (i.e. bulk copy vs. batched inserts) justify it.
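
For reference, here is roughly what using Npgsql's bulk copy (binary COPY) API directly looks like; the quotes table and its columns below are just for illustration:

```c#
using System;
using Npgsql;
using NpgsqlTypes;

// Sketch only: "quotes" and its columns are made-up names for illustration.
await using var conn = new NpgsqlConnection("Host=localhost;Username=test;Password=test;Database=test");
await conn.OpenAsync();

using var importer = conn.BeginBinaryImport(
    "COPY quotes (symbol, quote_date, price) FROM STDIN (FORMAT BINARY)");

for (var i = 0; i < 1000; i++)
{
    await importer.StartRowAsync();
    await importer.WriteAsync("MSFT");
    await importer.WriteAsync(new DateTime(2000, 1, 1).AddDays(i), NpgsqlDbType.Date);
    await importer.WriteAsync(12.3m);
}

await importer.CompleteAsync();
```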

zlepper commented 4 years ago

As part of a POC I did some performance optimizations around this library (mainly ported the code from the SQL Server provider): https://gist.github.com/zlepper/7f55ab76547d81eb6eb403ad4feab06b

You will probably want someone who knows this much better than me to actually look it over and do it, but I hope it can help with optimizing inserts :)

From personal experience: it doesn't speed up PostgreSQL a whole lot (the difference I saw in local testing could be attributed to chance). However, in the case of CockroachDB I saw up to 15x throughput, simply because it can batch transactions much better, and it will probably matter even more for distributed clusters (I was just testing with a 3-node Docker cluster on the same machine).

roji commented 4 years ago

@zlepper can you provide some context on exactly what your POC is doing? The provider already batches multiple updates/inserts into a single DbCommand, which means that everything happens in a single round-trip.

The issue would be to implement insert specifically via Npgsql's binary copy API, which is a super-optimized, PostgreSQL-specific way to get data into the database.

zlepper commented 4 years ago

@roji Yes, I'm basically trying to insert a couple of billion rows into the database. So the optimization I did converts (with just a lot more rows per save):

insert into foo(a, b) 
values ('a', 'b');
insert into foo(a, b)
values ('c', 'd');
insert into foo(a, b)
values ('e', 'f');

into

insert into foo(a,b)
values ('a', 'b'),
('c', 'd'),
('e', 'f');

(Of course it's still parameterized, I just didn't want to type that out, since it doesn't matter for the specific changes I made.)
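
For illustration (this is not the actual POC code, which is in the gist above), the parameterized multi-row form can be built roughly like this with plain ADO.NET; the rows parameter is made up to match the example:

```c#
using System.Text;
using System.Threading.Tasks;
using Npgsql;

static class MultiRowInsert
{
    // Sketch only: builds a single parameterized multi-row INSERT for foo (a, b).
    public static async Task InsertAsync(NpgsqlConnection connection, (string A, string B)[] rows)
    {
        await using var cmd = connection.CreateCommand();
        var sql = new StringBuilder("INSERT INTO foo (a, b) VALUES ");

        for (var i = 0; i < rows.Length; i++)
        {
            sql.Append(i == 0 ? "" : ", ").Append($"(@a{i}, @b{i})");
            cmd.Parameters.AddWithValue("a" + i, rows[i].A);
            cmd.Parameters.AddWithValue("b" + i, rows[i].B);
        }

        cmd.CommandText = sql.ToString();
        await cmd.ExecuteNonQueryAsync();
    }
}
```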

So it's still just one roundtrip, but each statement becomes a transaction (if I have understood things correctly; I'm not that sharp on database things).

My purpose in inserting the data doesn't actually have anything to do with inserting it; I just need the data there, to test how different databases handle it when there is already a lot of data.

Do also watch out for the binary copy: in my experience it doesn't work in CockroachDB (at least the thing Z.EntityFrameworkExtensions does doesn't work).

For my specific case I ended up using the actual import APIs, since they are by far the fastest, but that's a whole other thing.

roji commented 4 years ago

@zlepper When using the former (without your change), have you tried turning on automatic statement preparation? This has the potential of making the inserts run much faster than your modified version.
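
For reference, automatic preparation is enabled through the connection string; the values below are just examples:

```c#
using Npgsql;

// Automatic statement preparation is opt-in, configured via the connection string.
// Commands whose SQL is executed repeatedly on the same connection then get
// server-prepared automatically, without calling Prepare() explicitly.
var connString =
    "Host=localhost;Username=test;Password=test;Database=test;" +
    "Max Auto Prepare=20;Auto Prepare Min Usages=2";

await using var conn = new NpgsqlConnection(connString);
await conn.OpenAsync();
```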

Regardless, if you really are inserting billions of rows, consider bypassing EF and coding directly against the Npgsql COPY API - you'll definitely get the best perf possible. You can still use EF Core for all other access to the database.

zlepper commented 4 years ago

> @zlepper When using the former (without your change), have you tried turning on automatic statement preparation? This has the potential of making the inserts run much faster than your modified version.

Would this then help even more with the bulk version also? But no, I was not aware that was a thing.

> Regardless, if you really are inserting billions of rows, consider bypassing EF and coding directly against the Npgsql COPY API - you'll definitely get the best perf possible. You can still use EF Core for all other access to the database.

I mean, that is basically what I ended up doing, just using IMPORT against an ASP.NET endpoint that dynamically generates CSV data. Currently I'm getting about 250,000 rows per second with this, so I'm pretty satisfied. It does however mean that no validation is being run whatsoever, which I imagine COPY would actually do.

I have no doubt your suggestion will help with PostgreSQL; however, my main problem was actually the performance in CockroachDB, where the bottleneck comes from the consistency guarantees (which require communication between nodes), and there the problem is insert statements with just one pair of values.

Also, very interestingly, I cannot find the COPY statement in the CockroachDB documentation, but it is apparently supported according to this PR: https://github.com/cockroachdb/cockroach/pull/8756.

roji commented 4 years ago

> Would this then help even more with the bulk version also? But no, I was not aware that was a thing.

It might, if you execute the same command (with the same number of parameters) multiple times.

Yeah, if you're using Cockroach then everything definitely needs to be checked there specifically... I have no idea exactly what they support.

zlepper commented 4 years ago

> Would this then help even more with the bulk version also? But no, I was not aware that was a thing.

> It might, if you execute the same command (with the same number of parameters) multiple times.

Considering that EF is currently chunking our insert sizes at 1000 rows per query, then yes, we would definitely fit under that :D

But either way, if you can use what I sent, feel free. If not, feel free not to :) Just thought it might help some of the general investigation work here :)

roji commented 2 years ago

Note: https://github.com/dotnet/efcore/issues/9118 is planned for EF Core 7.0, which would unlock this.

There are various methods we could use:

@YohDeadfall did some great benchmarking in https://github.com/npgsql/npgsql/issues/2779#issuecomment-573439342. tl;dr insert with arrays works really well up until somewhere beyond 100 rows, at which point COPY overtakes it.
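
For reference, "insert with arrays" means sending one array parameter per column and unnesting them into rows server-side, roughly like this (table and columns are illustrative):

```c#
using System.Linq;
using Npgsql;

// Sketch of the "insert with arrays" technique: one array parameter per column,
// unnested server-side into rows. Table and columns are illustrative.
await using var conn = new NpgsqlConnection("Host=localhost;Username=test;Password=test;Database=test");
await conn.OpenAsync();

await using var cmd = new NpgsqlCommand(
    "INSERT INTO foo (id, data) SELECT * FROM unnest(@ids, @datas)", conn);
cmd.Parameters.AddWithValue("ids", Enumerable.Range(0, 1000).ToArray());
cmd.Parameters.AddWithValue("datas", Enumerable.Range(0, 1000).ToArray());
await cmd.ExecuteNonQueryAsync();
```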

One reason why COPY may be slower for fewer rows is that it currently requires multiple roundtrips, i.e. BeginBinaryImport and the terminating Complete (this also means that whether COPY is worth doing depends on the latency to PostgreSQL). We could think about changing the ADO.NET COPY import API to be single-roundtrip, i.e. make BeginBinaryImport do nothing (a bit like BeginTransaction), and only flush and wait on Complete. We may want this to be opt-in, not sure. I'm also not sure whether COPY always works and has no limitations, so an opt-out is a good idea (probably a minimum row threshold for switching to COPY, which can be set to 0 to disable).

Note also https://github.com/dotnet/efcore/issues/27333, which is about a special EF Core API for bulk importing. So it may make sense to implement that API with COPY, and to implement SaveChanges with insert with arrays. Though if we do implement single-roundtrip COPY, COPY may be a better option all the time - experiment.

roji commented 2 years ago

Note that even if we have single-roundtrip-COPY at the ADO layer, it's still not going to be possible to batch that with other, non-COPY commands at the EF level. We'd need some way to e.g. integrate a COPY operation inside an NpgsqlBatch, which would mean representing a COPY operation via NpgsqlBatchCommand. That would be a whole new API which pulls rows from the user (more similar to SqlBulkCopy) rather than today's API, where the user pushes rows into the database. Or allow the user to embed a lambda in the special NpgsqlBatchCommand, where they push the rows with today's API.

It's an interesting idea, but a bit far-fetched. Probably just go with array insert for now.
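
Purely to illustrate the idea, a sketch of what such an API could look like; none of this exists today, and NpgsqlCopyBatchCommand and its callback are made up:

```c#
// Purely hypothetical sketch: NpgsqlCopyBatchCommand and its row-pushing callback
// do not exist. The idea is to represent a COPY operation as one command in an
// NpgsqlBatch, next to regular commands, with the rows pushed by a user lambda.
var batch = new NpgsqlBatch(conn);
batch.BatchCommands.Add(new NpgsqlBatchCommand("UPDATE foo SET data = data + 1 WHERE id = 1"));
batch.BatchCommands.Add(new NpgsqlCopyBatchCommand(          // hypothetical type
    "COPY foo (data) FROM STDIN (FORMAT BINARY)",
    async importer =>                                        // hypothetical callback using today's push API
    {
        foreach (var value in new[] { 1, 2, 3 })
        {
            await importer.StartRowAsync();
            await importer.WriteAsync(value);
        }
    }));
await batch.ExecuteNonQueryAsync();
```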

roji commented 2 years ago

One more argument against COPY here, is that if the user really is importing a massive amount of rows - enough for the difference between array insert and COPY to be significant - they also want to avoid the overhead associated with change tracking in EF Core. So they want to use a dedicated import API as in https://github.com/dotnet/efcore/issues/27333, which bypasses all that.

roji commented 2 years ago

Updated benchmark with the various candidate methods, based on @YohDeadfall's benchmarks in https://github.com/npgsql/npgsql/issues/2779#issuecomment-573439342:

| Method | NumRows | Mean | Error | StdDev | Median | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|
| Batched_inserts | 1 | 1.060 ms | 0.0214 ms | 0.0621 ms | 1.072 ms | 1.00 | 0.00 |
| Insert_with_multiple_rows | 1 | 1.085 ms | 0.0181 ms | 0.0169 ms | 1.082 ms | 1.13 | 0.17 |
| Insert_with_multiple_rows_sorted | 1 | 1.099 ms | 0.0197 ms | 0.0175 ms | 1.098 ms | 1.12 | 0.15 |
| Insert_with_arrays | 1 | 1.088 ms | 0.0217 ms | 0.0633 ms | 1.100 ms | 1.03 | 0.05 |
| Insert_with_arrays_sorted | 1 | 1.130 ms | 0.0180 ms | 0.0159 ms | 1.127 ms | 1.15 | 0.16 |
| Copy | 1 | 1.188 ms | 0.0223 ms | 0.0209 ms | 1.189 ms | 1.23 | 0.18 |
| Batched_inserts | 2 | 1.109 ms | 0.0220 ms | 0.0629 ms | 1.120 ms | 1.00 | 0.00 |
| Insert_with_multiple_rows | 2 | 1.092 ms | 0.0217 ms | 0.0193 ms | 1.093 ms | 1.08 | 0.17 |
| Insert_with_multiple_rows_sorted | 2 | 1.121 ms | 0.0222 ms | 0.0645 ms | 1.133 ms | 1.01 | 0.10 |
| Insert_with_arrays | 2 | 1.129 ms | 0.0163 ms | 0.0144 ms | 1.130 ms | 1.11 | 0.17 |
| Insert_with_arrays_sorted | 2 | 1.160 ms | 0.0230 ms | 0.0192 ms | 1.162 ms | 1.15 | 0.18 |
| Copy | 2 | 1.223 ms | 0.0243 ms | 0.0639 ms | 1.234 ms | 1.11 | 0.03 |
| Batched_inserts | 5 | 1.411 ms | 0.0205 ms | 0.0192 ms | 1.406 ms | 1.00 | 0.00 |
| Insert_with_multiple_rows | 5 | 1.160 ms | 0.0311 ms | 0.0886 ms | 1.181 ms | 0.76 | 0.10 |
| Insert_with_multiple_rows_sorted | 5 | 1.288 ms | 0.0243 ms | 0.0238 ms | 1.277 ms | 0.91 | 0.02 |
| Insert_with_arrays | 5 | 1.068 ms | 0.0224 ms | 0.0659 ms | 1.082 ms | 0.74 | 0.07 |
| Insert_with_arrays_sorted | 5 | 1.108 ms | 0.0213 ms | 0.0270 ms | 1.117 ms | 0.78 | 0.03 |
| Copy | 5 | 1.093 ms | 0.0218 ms | 0.0555 ms | 1.098 ms | 0.79 | 0.07 |
| Batched_inserts | 10 | 1.406 ms | 0.0177 ms | 0.0165 ms | 1.404 ms | 1.00 | 0.00 |
| Insert_with_multiple_rows | 10 | 1.098 ms | 0.0202 ms | 0.0189 ms | 1.098 ms | 0.78 | 0.01 |
| Insert_with_multiple_rows_sorted | 10 | 1.110 ms | 0.0248 ms | 0.0727 ms | 1.128 ms | 0.78 | 0.08 |
| Insert_with_arrays | 10 | 1.112 ms | 0.0150 ms | 0.0141 ms | 1.110 ms | 0.79 | 0.01 |
| Insert_with_arrays_sorted | 10 | 1.140 ms | 0.0213 ms | 0.0199 ms | 1.147 ms | 0.81 | 0.02 |
| Copy | 10 | 1.139 ms | 0.0226 ms | 0.0612 ms | 1.143 ms | 0.76 | 0.08 |
| Batched_inserts | 100 | 2.405 ms | 0.0441 ms | 0.0391 ms | 2.409 ms | 1.00 | 0.00 |
| Insert_with_multiple_rows | 100 | 1.373 ms | 0.0097 ms | 0.0086 ms | 1.374 ms | 0.57 | 0.01 |
| Insert_with_multiple_rows_sorted | 100 | 1.446 ms | 0.0275 ms | 0.0270 ms | 1.451 ms | 0.60 | 0.02 |
| Insert_with_arrays | 100 | 1.333 ms | 0.0256 ms | 0.0350 ms | 1.324 ms | 0.56 | 0.02 |
| Insert_with_arrays_sorted | 100 | 1.407 ms | 0.0143 ms | 0.0111 ms | 1.405 ms | 0.59 | 0.01 |
| Copy | 100 | 1.343 ms | 0.0124 ms | 0.0116 ms | 1.341 ms | 0.56 | 0.01 |
| Batched_inserts | 500 | 6.781 ms | 0.1345 ms | 0.2425 ms | 6.753 ms | 1.00 | 0.00 |
| Insert_with_multiple_rows | 500 | 2.205 ms | 0.0437 ms | 0.0832 ms | 2.188 ms | 0.33 | 0.02 |
| Insert_with_multiple_rows_sorted | 500 | 2.333 ms | 0.0460 ms | 0.0909 ms | 2.351 ms | 0.34 | 0.02 |
| Insert_with_arrays | 500 | 2.002 ms | 0.0241 ms | 0.0225 ms | 1.995 ms | 0.29 | 0.01 |
| Insert_with_arrays_sorted | 500 | 2.187 ms | 0.0431 ms | 0.1008 ms | 2.209 ms | 0.32 | 0.02 |
| Copy | 500 | 2.154 ms | 0.0429 ms | 0.0401 ms | 2.153 ms | 0.31 | 0.01 |
| Batched_inserts | 1000 | 12.433 ms | 0.2481 ms | 0.5651 ms | 12.372 ms | 1.00 | 0.00 |
| Insert_with_multiple_rows | 1000 | 2.981 ms | 0.0581 ms | 0.0795 ms | 2.993 ms | 0.24 | 0.01 |
| Insert_with_multiple_rows_sorted | 1000 | 3.360 ms | 0.0672 ms | 0.1159 ms | 3.373 ms | 0.27 | 0.01 |
| Insert_with_arrays | 1000 | 2.683 ms | 0.0519 ms | 0.0728 ms | 2.684 ms | 0.22 | 0.01 |
| Insert_with_arrays_sorted | 1000 | 3.088 ms | 0.0603 ms | 0.0619 ms | 3.084 ms | 0.26 | 0.01 |
| Copy | 1000 | 3.457 ms | 0.0686 ms | 0.1166 ms | 3.492 ms | 0.28 | 0.01 |
| Batched_inserts | 10000 | 102.829 ms | 2.0541 ms | 3.0108 ms | 102.639 ms | 1.00 | 0.00 |
| Insert_with_multiple_rows | 10000 | 21.879 ms | 0.7110 ms | 2.0628 ms | 21.330 ms | 0.21 | 0.02 |
| Insert_with_multiple_rows_sorted | 10000 | 25.269 ms | 0.8346 ms | 2.3404 ms | 25.181 ms | 0.23 | 0.02 |
| Insert_with_arrays | 10000 | 18.460 ms | 1.2062 ms | 3.5185 ms | 17.058 ms | 0.22 | 0.03 |
| Insert_with_arrays_sorted | 10000 | 18.913 ms | 0.4442 ms | 1.2601 ms | 18.841 ms | 0.18 | 0.01 |
| Copy | 10000 | 7.258 ms | 0.1450 ms | 0.2724 ms | 7.280 ms | 0.07 | 0.00 |

tl;dr we should do Insert_with_arrays starting from 2-5 rows (benchmark to get the exact threshold).

Notes:

/cc @AndriySvyryd

Benchmark code:

```c#
using System;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Npgsql;

BenchmarkRunner.Run<Benchmark>();

public class Benchmark
{
    [Params(1, 2, 5, 10, 100, 500, 1000, 10000)]
    public int NumRows { get; set; }

    private NpgsqlConnection _connection;
    private NpgsqlCommand _command;

    private async Task Setup()
    {
        _connection = new NpgsqlConnection("Host=localhost;Username=test;Password=test;Max Auto Prepare = 10");
        await _connection.OpenAsync();

        await using var command = _connection.CreateCommand();
        command.CommandText = "DROP TABLE IF EXISTS foo; CREATE TABLE foo (id INT, data INT)";
        await command.ExecuteNonQueryAsync();
    }

    [GlobalSetup(Target = nameof(Batched_inserts))]
    public async Task Setup_Batched_inserts()
    {
        await Setup();

        _command = _connection.CreateCommand();
        _command.CommandText = new StringBuilder()
            .AppendJoin(" ", Enumerable.Range(0, NumRows).Select(i => $"INSERT INTO foo (data) VALUES (@p{i});"))
            .ToString();

        for (var i = 0; i < NumRows; i++)
        {
            var param = _command.CreateParameter();
            param.ParameterName = "p" + i;
            param.Value = i;
            _command.Parameters.Add(param);
        }
    }

    [GlobalSetup(Target = nameof(Insert_with_multiple_rows))]
    public async Task Setup_Insert_with_multiple_rows()
    {
        await Setup();

        _command = _connection.CreateCommand();
        var stringBuilder = new StringBuilder("INSERT INTO foo (data) VALUES");

        for (var i = 0; i < NumRows; i++)
        {
            stringBuilder
                .Append(i == 0 ? " " : ", ")
                .Append($"(@p{i})");

            var param = _command.CreateParameter();
            param.ParameterName = "p" + i;
            param.Value = i;
            _command.Parameters.Add(param);
        }

        _command.CommandText = stringBuilder.ToString();
    }

    [GlobalSetup(Target = nameof(Insert_with_multiple_rows_sorted))]
    public async Task Setup_Insert_with_multiple_rows_sorted()
    {
        await Setup_Insert_with_multiple_rows();

        _command.CommandText = @$"WITH bar AS ( {_command.CommandText} RETURNING id ) SELECT * FROM bar ORDER BY id";
    }

    [GlobalSetup(Target = nameof(Insert_with_arrays))]
    public async Task Setup_Insert_with_arrays()
    {
        await Setup();

        _command = _connection.CreateCommand();
        _command.CommandText = "INSERT INTO foo (data) SELECT * FROM unnest(@i)";

        var param = _command.CreateParameter();
        param.ParameterName = "i";
        param.Value = Enumerable.Range(0, NumRows).ToArray();
        _command.Parameters.Add(param);
    }

    [GlobalSetup(Target = nameof(Insert_with_arrays_sorted))]
    public async Task Setup_Insert_with_arrays_sorted()
    {
        await Setup_Insert_with_arrays();

        _command.CommandText = @$"WITH bar AS ( {_command.CommandText} RETURNING id ) SELECT * FROM bar ORDER BY id";
    }

    [GlobalSetup(Target = nameof(Copy))]
    public Task Setup_Copy() => Setup();

    [Benchmark(Baseline = true)]
    public async Task Batched_inserts() => await _command.ExecuteNonQueryAsync();

    [Benchmark]
    public async Task Insert_with_multiple_rows() => await _command.ExecuteNonQueryAsync();

    [Benchmark]
    public async Task Insert_with_multiple_rows_sorted() => await _command.ExecuteNonQueryAsync();

    [Benchmark]
    public async Task Insert_with_arrays() => await _command.ExecuteNonQueryAsync();

    [Benchmark]
    public async Task Insert_with_arrays_sorted() => await _command.ExecuteNonQueryAsync();

    [Benchmark]
    public async Task Copy()
    {
        using var importer = _connection.BeginBinaryImport("COPY foo (data) FROM STDIN (FORMAT binary)");

        for (var i = 0; i < NumRows; i++)
        {
            await importer.StartRowAsync();
            await importer.WriteAsync(i);
        }

        await importer.CompleteAsync();
    }
}
```
roji commented 2 years ago

An important problem with the above optimizations is that they don't guarantee the ordering of database-generated values returned via the RETURNING clause; this means we can't match up the returned IDs to their corresponding entity instances client-side.

The solution here would probably be to use MERGE in the same way as SQL Server - MERGE is coming to PostgreSQL 15.

douglasg14b commented 2 years ago

This is great stuff! It's always been a pain point with DB-seeding scripts that use business logic for seeding: you end up generating your graph in 20-30 seconds and then waiting 5 minutes for the EF Core inserts 😅

roji commented 2 years ago

@douglasg14b in general, EF's seeding feature isn't suitable for use with many rows for various reasons (e.g. seeding is present in the migration snapshots on disk, which would become huge). So I don't think this is very relevant for seeding.

douglasg14b commented 2 years ago

@roji I'm not talking about using EF's seeding features, just logic that generates data & state, and many entities, in C# with a DbContext.

roji commented 11 months ago

Note that for inserting a large number of rows (bulk insert/import), https://github.com/dotnet/efcore/issues/27333 is in general a better approach (expose Npgsql's binary COPY via a standard EF API).