taocpp / taopq

C++ client library for PostgreSQL
Boost Software License 1.0

Batch Insert values #39

Closed skaae closed 3 years ago

skaae commented 3 years ago

Is it possible to make a statement that will insert N values into a table?

I would like to be able to do something like

https://github.com/taocpp/taopq/blob/e1e5cc016b8f7f1bb0d4e61f8f79a6bc448e5cad/src/test/pq/traits.cpp#L56

but with multiple values at once?

For two rows the sql statement would look like

INSERT INTO tao_parameter_test VALUES 
                               ( $1, $2, $3, $4 ),
                               ( $5, $6, $7, $8 );

I know I can type it manually, but for performance reasons I would like to insert ~500 rows at once, which would be a lot of typing :)
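Not as a taopq feature, just as plain C++: the placeholder list itself can be generated. A minimal sketch (`make_placeholders` is a hypothetical helper, not part of taopq):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of taopq): build the placeholder list for a
// batch INSERT of n_rows rows with n_cols columns each, e.g.
// make_placeholders( 2, 4 ) yields "( $1, $2, $3, $4 ), ( $5, $6, $7, $8 )"
std::string make_placeholders( const std::size_t n_rows, const std::size_t n_cols )
{
    std::string result;
    std::size_t param = 1;
    for( std::size_t r = 0; r < n_rows; ++r ) {
        result += ( r == 0 ) ? "( " : ", ( ";
        for( std::size_t c = 0; c < n_cols; ++c ) {
            if( c != 0 ) {
                result += ", ";
            }
            result += "$" + std::to_string( param++ );
        }
        result += " )";
    }
    return result;
}
```

The resulting string could then be appended to "INSERT INTO tao_parameter_test VALUES " and the statement prepared once for a fixed batch size.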

Thanks for putting up the library. I find it very easy to use.

best Søren

d-frey commented 3 years ago

If you want to insert larger amounts of data efficiently, you should have a look at the table_writer. It is not as convenient to use as the rest, but it is vastly more efficient than using "normal" SQL statements. Check out the example at https://github.com/taocpp/taopq/blob/master/src/test/pq/table_writer.cpp

skaae commented 3 years ago

Thanks Daniel. After reading https://blog.timescale.com/blog/13-tips-to-improve-postgresql-insert-performance/ (section "7. Insert rows in batches"), I ended up doing:

    // CREATE TABLE test_table ( t TEXT NOT NULL )
    // input_rows: std::vector<std::string> of input text, defined above

    std::vector<std::string> rows;
    for (const auto& text_row : input_rows) {
        char* escaped = PQescapeLiteral(conn->underlying_raw_ptr(), text_row.data(), text_row.size());
        rows.push_back(fmt::format("( {} )", escaped));
        PQfreemem(escaped);  // PQescapeLiteral allocates, the caller must free
    }
    const auto tr = conn->transaction();
    tr->execute(fmt::format("INSERT INTO test_table VALUES {};", fmt::join(rows, ",")));
    tr->commit();

I haven't checked whether it's slower or faster than using table_writer, but at least it gives a huge speedup compared to single-row insertions.

d-frey commented 3 years ago

I'd still recommend that you benchmark this against the table_writer and the underlying concept from PostgreSQL, which is also mentioned in the link you posted: COPY instead of INSERT. Admittedly our wrapper is not very convenient and you have to do most of the legwork. But we recently migrated a large-scale production database from a different DB system to PostgreSQL, including various on-the-fly data conversions, and COPY was very helpful in achieving a very short downtime of our service.

emmenlau commented 3 years ago

I would also be very curious about these results. Given the high performance benefit, it would be nice to promote the table_writer more in taopq. We just found it by chance and never got around to testing it, but AFAICS it should bring a very relevant performance benefit for batch inserts. Perhaps that would justify the implementation of a high-level "batch insert" method?

d-frey commented 3 years ago

I've renamed the table_writer's insert method to insert_raw and I've added a new insert method which takes a bunch of parameters that should represent a single row and converts them internally, then adds them to the buffer. See the modified example for the table_writer.
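To illustrate the distinction, a sketch only (table name, values, and surrounding setup are made up; see the linked example for the real code):

```cpp
// assuming an open taopq connection `conn` and a table example( a INTEGER, b TEXT )
const auto tr = conn->transaction();
tao::pq::table_writer tw( tr, "COPY example ( a, b ) FROM STDIN" );
tw.insert( 42, "hello" );        // new: parameters are converted internally
tw.insert_raw( "43\tworld\n" );  // old behaviour: a pre-formatted COPY text line
tw.commit();
tr->commit();
```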

I've timed the example at ~400.000 rows/s on my machine with the example's simple three-column, no-index table.

The implementation is currently incomplete, as I am not doing any escaping. I will add that when I have time.

emmenlau commented 3 years ago

Awesome!!!

skaae commented 3 years ago

Thanks Daniel. I'm at ~50K rows/s on my old laptop, but the table has more columns, and one of them contains XML messages that can be relatively large. I will try the new table writer.

FYI, I found PQescapeStringConn to be faster than PQescapeLiteral because I can reuse the output buffer. For the table writer, do you plan to have taopq handle escaping? I spent quite some time figuring out how to escape SQL statements and how to handle NaN values.

d-frey commented 3 years ago

The challenge with escaping for COPY is that COPY accepts three different formats (binary, text, CSV) and lets you configure many of the special symbols used. That is also why libpq does not come with a suitable escape function. The existing functions like PQescapeLiteral, etc. are built for different purposes and won't work with COPY.

So yeah, I hope to be able to support escaping but it will probably be limited to the text format with default symbols for escape, field separator, etc.
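For the text format with default symbols, the rules are simple enough to sketch. This is my own illustration, not taopq's actual implementation: backslash, tab, newline and carriage return are backslash-escaped, and NULL is spelled \N:

```cpp
#include <optional>
#include <string>

// Sketch of escaping a single field for PostgreSQL's COPY text format with
// default settings (not taopq's implementation). A missing value (NULL)
// becomes \N; backslash, tab, newline and carriage return are escaped.
std::string copy_escape( const std::optional< std::string >& field )
{
    if( !field ) {
        return "\\N";
    }
    std::string result;
    for( const char ch : *field ) {
        switch( ch ) {
            case '\\': result += "\\\\"; break;
            case '\t': result += "\\t"; break;
            case '\n': result += "\\n"; break;
            case '\r': result += "\\r"; break;
            default: result += ch;
        }
    }
    return result;
}
```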

emmenlau commented 3 years ago

Sorry for the very blunt question, but do I understand correctly that using COPY without proper escaping may open a door for SQL injection? I'm not knowledgeable with COPY so I may be on the completely wrong track here. Not having proper escaping just "sounds" dangerous... is it?

d-frey commented 3 years ago

No, I don't think so. No escaping just means data corruption with COPY. The reason is simple: You are sending the COPY SQL statement and the data separately. So when you screw up the data, it will still only be interpreted as data, but never as additional SQL statements. This is different from when people send a string that contains both the SQL statement as well as the data in a mixed fashion. It is also the reason why I designed taoPQ in a way that you should always use placeholders in your SQL statement and send the data via additional parameters.
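In taopq that looks like the following (a sketch; table as in the test linked above, values made up):

```cpp
// the statement and the parameters travel separately; $1..$4 are
// placeholders, so the values can never be interpreted as SQL
tr->execute( "INSERT INTO tao_parameter_test VALUES ( $1, $2, $3, $4 )",
             666, 1701, "foo", "bar" );
```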

emmenlau commented 3 years ago

Thanks a lot @d-frey , then everything here sounds very good and promising! We have many use cases in which the data consists of purely numeric tables, so escaping is most likely not necessary (or should be sufficiently simple). Thanks for the great work!

d-frey commented 3 years ago

I've added some non-configurable escaping, it should work with the defaults. It is still somewhat inefficient, I might improve that when I have more time.

skaae commented 3 years ago

I tried the table writer and it works great. I think I found two minor issues:

1. I need to add #include <tao/pq/table_writer.hpp>, otherwise I get an incomplete type error for the table writer.
2. tw.insert doesn't work with std::string_view.

I can try to submit a PR which fixes the latter when I have time.

edit: I tested COPY via the table writer vs. INSERT with multiple rows. When using a remote DB, COPY seems to be 5-10x faster.

d-frey commented 3 years ago

I've added the missing include to <tao/pq.hpp>.

COPY being 5-10x faster: Yeah, it makes quite the difference 😎

For the std::string_view support: This was left out on purpose, as it might be inefficient without the user realizing it. This is due to the fact that the text-based API from libpq requires a null-terminated const char* and that means creating a copy of the std::string_view internally. So for now, you will have to explicitly add std::string(...) in your call and thus you are aware of the copy. OTOH, I see that this is also quite inconvenient. Opinions? Also pinging @ColinH

d-frey commented 3 years ago

OK, this might sound weird: I implemented traits for std::string_view in the generic internal traits class. Plus I enhanced the traits to support the table writer better. As a result:

For now, I closed this issue as I think we made pretty good progress, but feel free to re-open if you feel that this is not sufficient or open a new issue if appropriate.

FWIW, the test that previously inserted >400.000 rows/s on my laptop now achieves >1.000.000 rows/s with the recent optimizations.