taocpp / taopq

C++ client library for PostgreSQL
Boost Software License 1.0

Support for binary storage of numerical values? #18

Closed emmenlau closed 3 years ago

emmenlau commented 5 years ago

I'm under the impression that taopq does not currently support binary storage of numerical values, is that correct? This lesser-known mode of libpq allows storing and retrieving numerical values directly as a series of bytes (albeit in network byte order). So for example, an int2 can be stored as the two bytes (high byte, low byte) of the short value. The main benefits are:

Could this be a possible extension for taopq?

d-frey commented 5 years ago

It requires rewriting the parameter traits (with a different interface) and rewriting transaction::execute and all internals. Ultimately, it seems that it should offer multiple benefits - besides efficiency and accuracy for numerical values it also allows support for bytea values. I'll give it a shot, but don't hold your breath :)

emmenlau commented 5 years ago

Yes, it's quite a change, I agree. A tricky part was designing a data structure that holds the swapped bytes and then creating char pointers from there. But your traits are quite well written and may be a much better basis than what I tried. So I keep my fingers crossed!

d-frey commented 5 years ago

The tricky part IMO is handling traits that generate more than a single value (and even allowing that recursively) and do most of the magic at compile-time. But I already have some ideas... :)

d-frey commented 5 years ago

I created a branch to play with some ideas. Currently the new framework only works for parameters sent from the client to the server, and I could use it with some traits. Some pros and cons I found so far:

emmenlau commented 5 years ago

Dear @d-frey, this is great progress already! And I can confirm/reproduce most of your findings. The biggest issue for us was the lack of unsigned data types, but since this is a limitation of the underlying framework, I think there is not much that can be done? Or could the PostgreSQL types be used to interpret/cast transparently? For storing char types we use bytea, which is also not ideal but works ok.

emmenlau commented 5 years ago

Dear @d-frey, did you make further progress in this direction? What is your stance on storing/retrieving binary values?

d-frey commented 5 years ago

Not much, I'm afraid. I talked to Colin about the design and it all seems feasible, I made some small progress but I was busy with a lot of other things. I can't make any promises on the time-frame, I'll have to see when it fits in.

emmenlau commented 5 years ago

Thanks again for your work!

Here is a small update I did on benchmarking binary vs. text inserts. The numbers usually come out like this on most platforms (macOS, Windows, Linux). But take them with a few grains of salt, especially because the code for variadic argument parsing is mine, uses a plain old loop, and is not optimized. But text-based inserts use the same loop as binary inserts, so the numbers should be comparable to some extent.

I ran the same benchmark 4 times to see how much the operating system starts caching over time. The first execution is certainly the slowest, and text-based insert seems to have higher variance over time. But binary insert is almost always faster for me.

Runtime of text-based insert: 9382.44ms.
Runtime of binary insert: 8409.69ms.
Runtime of text-based insert: 9325.38ms.
Runtime of binary insert: 8575.57ms.
Runtime of text-based insert: 8833.74ms.
Runtime of binary insert: 8492.02ms.
Runtime of text-based insert: 8625.63ms.
Runtime of binary insert: 8491.92ms.
emmenlau commented 4 years ago

Dear @d-frey, I've just checked the branch https://github.com/taocpp/taopq/tree/experimental_binary_parameter again and it looks like it's already merged into master, is that possible? Does taopq already support binary parameters? That would be awesome!

d-frey commented 4 years ago

I'm getting there, but there is still work to be done. For now, the default will stay text-based traits. You can manually choose the binary traits by passing them to the execute calls, e.g.

conn->execute< parameter_binary_traits >( ... );
d-frey commented 3 years ago

I am currently considering merging the text and binary traits into a single parameter traits class, mostly (but not exclusively) using the text format. Only non-zero-terminated strings (std::string_view) and binary data (BYTEA) will be transferred in binary format when applicable. And I'm not even sure whether I should keep that.

Supporting both formats and being able to select the traits as a user makes the library way more difficult to implement, and as more code is now in the headers, it also makes it slower to compile for the users. Plus for the users it adds complexity to the interface, to error messages, and raises questions about which traits class is better/faster. Due to recent optimizations, the binary format for numbers doesn't show any significant performance improvement in my benchmarks anymore. All supported compilers also improved the correctness of double conversions, so those are no longer less accurate when using text format.

One more point I don't like: The current limitations of the binary interface lead to differences in the supported data types for each traits class.

To sum it up: I think I will work on removing the binary traits unless someone can convince me otherwise.

emmenlau commented 3 years ago

Dear @d-frey, thanks a lot for the explanation! I can understand your sentiments, i.e. if supporting the binary traits is excessive effort. But let me try to give a motivation for the binary traits.

In my humble opinion, I see not as much value in the text format as in the binary format. The conversion to and from text is even today not a trivial task, and looking at projects like fmt, Grisu or Ryu, their runtime achievements are sometimes bought with bugs or strange corner cases. While it's certainly awesome that conversion speed was significantly improved in the past years, there is still a huge amount of complexity in this task. And despite the performance improvements, precision for floating-point values is currently limited to whatever hard-coded conversion precision the libraries use, for example the snprintf( m_buffer, sizeof( m_buffer ), "%.17g", v ) call in taopq. While this may be possible to change, it's not necessarily a trivial task. E.g. going from "%.17g" to "%.17f" in the above call can create string representations of more than 300 characters for more extreme double values (tested on Ubuntu 20.04 with clang 12). Knowledgeable users may not make such mistakes, but this just shows that the conversion is far from trivial, with many possible pitfalls.

All in all, this may not sound so bad to someone used to the text interface. But when you put the interfaces into question, I think the binary interface shines quite brightly. AFAIK there is no conversion involved whatsoever. Numerical values keep their exact bits, and a hugely complex code path is removed. There is no need for users to accept an implicit loss of precision, and no longer an implicit rule that the database may not preserve identical values. Instead, the DB does what (naive) users would just expect: give you back exactly what you put in, bit by bit.

d-frey commented 3 years ago

@emmenlau You don't have to convince me, I would really prefer to use binary. The problem is the lack of support from PostgreSQL:

PostgreSQL should support unsigned integer types. Honestly, I am at a total loss as to why they don't do this. I've read some arguments, but I don't agree that this is adding "too much complexity". Those are fundamental data types that simply have to be supported.

COPY TO should support a proper binary format. Currently, the documentation contains this gem:

To determine the appropriate binary format for the actual tuple data you should consult the PostgreSQL source, in particular the send and recv functions for each column's data type (typically these functions are found in the src/backend/utils/adt/ directory of the source distribution).

ARRAYs should be supported in libpq's API, not via a weird text format. This format seems independent of other text formats, there is no support for escaping/unescaping/parsing in libpq, and it requires all element types to have a text format representation. Even if I use binary for integers, I still need a text format for integers to support ARRAYs of integers. Ideally, there would be a structured way in libpq that allows sending the elements in the exact binary or text format as in the other cases to keep it consistent.

BYTEA is supported partially, as you can send it in binary format. But again, you need a text representation for ARRAYs and the table writer's text format (which I currently have to use due to the binary format being unusable). Receiving BYTEA is also inefficient, plus the decoder doesn't play nicely with C++ containers, as it always does its own allocation and can't write to a given memory area.

While I still think PostgreSQL is one of the best DBs out there, it is not without problems.

I am still considering a hybrid approach, but it will not be reflected in the API of taoPQ anymore. If possible, I will keep some binary formats in the background, but this should be transparent to the users.

emmenlau commented 3 years ago

Ok I very much see your point! :-(

But still a few questions:

PostgreSQL should support unsigned integer types. Honestly, I am at a total loss as to why they don't do this. I've read some arguments, but I don't agree that this is adding "too much complexity". Those are fundamental data types that simply have to be supported.

I fully agree! But a (naive) question if I may: Is there a difference in support for unsigned integer types between binary and text interfaces? I thought they are unsupported in both, so the conversion to string is not really different from a cast to signed (potentially with an overflow check)?

BYTEA is supported partially, as you can send it in binary format. But again, you need a text representation for ARRAYs and the table writer's text format (which I currently have to use due to the binary format being unusable). Receiving BYTEA is also inefficient, plus the decoder doesn't play nicely with C++ containers, as it always does its own allocation and can't write to a given memory area.

This sounds just bad, but also as if it may be fixed upstream if we put in a good word for it? I agree that BYTEA should just work, and at least making the decoder play nicely with C++ containers should be fairly straightforward upstream (and also non-intrusive) unless I'm completely mistaken.

I'm not sure if that helps at all. So all I can say is that it would be great to keep as much binary available as possible, at the least possible maintenance cost for you! And as always, thanks for the awesome work!

d-frey commented 3 years ago

PostgreSQL should support unsigned integer types. Honestly, I am at a total loss as to why they don't do this. I've read some arguments, but I don't agree that this is adding "too much complexity". Those are fundamental data types that simply have to be supported.

I fully agree! But a (naive) question if I may: Is there a difference in support for unsigned integer types between binary and text interfaces? I thought they are unsupported in both, so the conversion to string is not really different from a cast to signed (potentially with an overflow check)?

The text format doesn't care much about the size, it can send 100 digits, no problem. Binary can't do that. Of course I send a 16-bit unsigned integer as a 32-bit signed integer, as that preserves the range and thereby the semantics. The problem is the 64-bit unsigned integer types (unsigned long long, ...). There is no larger binary integer type, and sending them as 64-bit signed integers changes the semantics, so that is a no-go. It means that I can support all integers as binary up to 64-bit signed integers. 64-bit unsigned integers (the most common type in my projects!) require the text format.

BYTEA is supported partially, as you can send it in binary format. But again, you need a text representation for ARRAYs and the table writer's text format (which I currently have to use due to the binary format being unusable). Receiving BYTEA is also inefficient, plus the decoder doesn't play nicely with C++ containers, as it always does its own allocation and can't write to a given memory area.

This sounds just bad, but also as if it may be fixed upstream if we put in a good word for it? I agree that BYTEA should just work, and at least making the decoder play nicely with C++ containers should be fairly straightforward upstream (and also non-intrusive) unless I'm completely mistaken.

My experience with the PostgreSQL developers doesn't give me much hope :( I was looking into this, and I might just manually convert those instead of using libpq's API for decoding them. Given that the hex format is at least properly described, it is possible.

That should solve this problem, together with a little hack that I already added to taoPQ (https://github.com/taocpp/taopq/blob/main/src/lib/pq/internal/resize_uninitialized.cpp), which is totally not PostgreSQL's fault, but the C++ standard's fault.

I'm not sure if that helps at all. So all I can say is that it would be great to keep as much binary available as possible, at the least possible maintenance cost for you! And as always, thanks for the awesome work!

I agree and I try to keep as much of the binary interface as possible. I don't even mind how much work it is for me (within limits, of course), but the real issue is the complexity in the interface towards our users. I want to keep it simple, so binary vs. text format should be an implementation detail. This is what I'm working on right now, but it might be a few days before I have something reasonable that is ready to be pushed.

emmenlau commented 3 years ago

PostgreSQL should support unsigned integer types. Honestly, I am at a total loss as to why they don't do this. I've read some arguments, but I don't agree that this is adding "too much complexity". Those are fundamental data types that simply have to be supported.

I fully agree! But a (naive) question if I may: Is there a difference in support for unsigned integer types between binary and text interfaces? I thought they are unsupported in both, so the conversion to string is not really different from a cast to signed (potentially with an overflow check)?

The text format doesn't care much about the size, it can send 100 digits, no problem. Binary can't do that. Of course I send a 16-bit unsigned integer as a 32-bit signed integer, as that preserves the range and thereby the semantics. The problem is the 64-bit unsigned integer types (unsigned long long, ...). There is no larger binary integer type, and sending them as 64-bit signed integers changes the semantics, so that is a no-go. It means that I can support all integers as binary up to 64-bit signed integers. 64-bit unsigned integers (the most common type in my projects!) require the text format.

This is actually quite interesting, but I'm not sure I fully understand. Just to confirm: does that mean a column of type bigint can store values exceeding the signed range, as long as you send them as text? Or do you use a different type than bigint?

d-frey commented 3 years ago

This is actually quite interesting, but I'm not sure I fully understand. Just to confirm: does that mean a column of type bigint can store values exceeding the signed range, as long as you send them as text? Or do you use a different type than bigint?

Not BIGINT but NUMERIC:

template1=# create table dummy ( v numeric );
CREATE TABLE
template1=# insert into dummy values ( '99999999999999999999999999999999999999999999999999999999999' );
INSERT 0 1
template1=# select v + 1 from dummy;
                           ?column?                           
--------------------------------------------------------------
 100000000000000000000000000000000000000000000000000000000000
(1 row)

template1=# 
emmenlau commented 3 years ago

Ok this does make sense, thanks for the clarification.

d-frey commented 3 years ago

After I made some really good progress, I discovered a major problem. I knew about it before, but I guess I never fully realized the implications: When using the binary format with prepared statements, the prepared statement itself needs to contain the type of the data I will send. The execution of a prepared statement does not use the specified type at all; the API simply doesn't take the "types" structure. That is a big impact on users, and I am not willing to put that much burden on them.

That said, I will probably switch to text format now. I'm sad about this, as this does have a small performance impact, but there is no other way forward given that PostgreSQL obviously does not care about binary formats at all. :(

d-frey commented 3 years ago

And one more thing I learned from my experiments: Transferring floating-point values in binary mode is less precise when combined with columns of type NUMERIC. The text format does not have that problem. Seems weird, but it once more shows that PostgreSQL really prefers the text format over the binary format.

d-frey commented 3 years ago

I pushed the changes; taopq now uses the text format everywhere except for std::string_view, which uses Oid 25 (TEXTOID), and binary data, which uses Oid 17 (BYTEAOID). One only needs to explicitly mark positional parameters in prepared statements that will receive binary data with the type, e.g. $1::BYTEA.