Closed emmenlau closed 3 years ago
It requires rewriting the parameter traits (with a different interface) and rewriting `transaction::execute` and all internals. Ultimately, it seems that it should offer multiple benefits: besides efficiency and accuracy for numerical values, it also allows support for `bytea` values. I'll give it a shot, but don't hold your breath :)
Yes, it's quite a change, I agree. A tricky part was to get a data structure that holds the swapped bytes, and then create char pointers from there. But your traits are quite well written and may be a much better basis than what I tried. So I keep my fingers crossed!
The tricky part IMO is handling traits that generate more than a single value (and even allowing that recursively) and do most of the magic at compile-time. But I already have some ideas... :)
I created a branch to play with some ideas. Currently the new framework works for parameters from the client towards the server only and I could use it with some traits. Some pros and cons I found so far:

- It can support binary data (`bytea`) easily. Not implemented yet, but I'll try it out soon.
- PostgreSQL supports `int` directly but not `unsigned int`. Boooo!
- Parameters may need an explicit cast like `$1::integer`. This might lead to more security, but it's also kinda invasive towards the user.
- `signed char`, `unsigned char`, ...
- For `float` and `double` there are benefits wrt the precision. Even edge-cases work easily. No `long double`, though.
- `long` will map to PostgreSQL's `integer` or `bigint` depending on how large the type is on your system. LP64 != LLP64. https://en.wikipedia.org/wiki/64-bit_computing

Dear @d-frey this is great progress already! And I can acknowledge/reproduce most of your findings. The biggest issue for us was the lack of unsigned data types, but since this is a limitation of the underlying framework I think there is not much that can be done? Or could the PostgreSQL types be used to interpret / cast transparently?
For storing `char` types we use `bytea`, which is also not ideal but works ok.
Dear @d-frey, did you make further progress in this direction? What is your stance on storing/retrieving binary values?
Not much, I'm afraid. I talked to Colin about the design and it all seems feasible, I made some small progress but I was busy with a lot of other things. I can't make any promises on the time-frame, I'll have to see when it fits in.
Thanks again for your work!
Here is a small update I did on benchmarking binary vs text inserts. The numbers usually come out like this on most platforms (macOS, Windows, Linux). But take them with a few grains of salt, especially because the code for variadic argument parsing is mine, uses a plain old loop and is not optimized. But text-based inserts use the same loop as binary inserts, so the numbers should be comparable to some extent.
I ran the same benchmark 4 times to see how much the operating system starts caching over time. The first execution is certainly the slowest, and text-based insert seems to have higher variance over time. But binary insert is almost always faster for me.
```
Runtime of text-based insert: 9382.44ms.
Runtime of binary insert: 8409.69ms.
Runtime of text-based insert: 9325.38ms.
Runtime of binary insert: 8575.57ms.
Runtime of text-based insert: 8833.74ms.
Runtime of binary insert: 8492.02ms.
Runtime of text-based insert: 8625.63ms.
Runtime of binary insert: 8491.92ms.
```
Dear @d-frey I've just checked the branch https://github.com/taocpp/taopq/tree/experimental_binary_parameter again and it looks like it's already merged into master, is that possible? Does taopq already support binary parameters? That would be awesome!
I'm getting there, but there is still work to be done. For now, the default will stay text-based traits. You can manually choose the binary traits by providing it to the `execute` calls, e.g.

```cpp
conn->execute< parameter_binary_traits >( ... );
```

I am currently considering merging the text and binary traits into a single parameter traits class, mostly (but not exclusively) using the text format. Only non-zero-terminated strings (`std::string_view`) and binary data (`BYTEA`) will be transferred in binary format when applicable. And I'm not even sure whether I should keep that.
Supporting both formats and letting users select the traits makes the library considerably more difficult to implement, and as more code now lives in the headers, it also makes compilation slower for users. It also adds complexity to the interface and to error messages, and raises the question of which traits class is better/faster. Due to recent optimizations, the binary format for numbers no longer shows any significant performance improvement in my benchmarks. All supported compilers have also improved the correctness of double conversions, so text format is no longer less accurate for those.
One more point I don't like: The current limitations of the binary interface lead to differences in the supported data types for each traits class.
To sum it up: I think I will work on removing the binary traits unless someone can convince me otherwise.
Dear @d-frey, thanks a lot for the explanation! I can understand your sentiments, i.e. if supporting the binary traits is excessive effort. But let me try to give a motivation for the binary traits.
In my humble opinion, I see not so much value in the text format as in the binary format. The conversion to and from text is even nowadays not a trivial task, and looking at projects like `fmt`, `grisu` or `ryu`, their achievements in runtime are sometimes bought with bugs or strange corner cases. While it's certainly awesome that conversion speed was significantly improved in the past years, there is still a huge amount of complexity in this task. And despite the performance improvements, precision for floating point values is currently limited to whatever hard-coded conversion precision is used in the libraries, for example the `snprintf( m_buffer, "%.17g", v );` in taopq. While this may be possible to change, it's not necessarily a trivial task. I.e. going from `"%.17g"` to `"%.17f"` in the above call can create string representations of more than 300 characters for more extreme double values (tested on Ubuntu 20.04 with clang 12). Knowledgeable users may not make such mistakes, but this is just to show that the conversion is far from trivial, with many possible pitfalls.
All in all, this may not sound so bad for someone used to the text interface. But when you put the interfaces into question, I think the binary interface shines quite brightly. AFAIK there is no conversion involved whatsoever. Numerical values keep their exact bits, and a hugely complex code path is removed. There is no need for users to accept an implicit loss of precision, and no longer an implicit rule that the database may not preserve identical values. Instead, the DB does what (naive) users would just expect: give you back exactly what you put in, bit by bit.
@emmenlau You don't have to convince me, I would really prefer to use binary. The problem is the lack of support from PostgreSQL:
PostgreSQL should support unsigned integer types. Honestly, I am at a total loss as to why they don't do this. I've read some arguments, but I don't agree that this is adding "too much complexity". Those are fundamental data types that simply have to be supported.
COPY TO should support a proper binary format. Currently, the documentation contains this gem:
> To determine the appropriate binary format for the actual tuple data you should consult the PostgreSQL source, in particular the send and recv functions for each column's data type (typically these functions are found in the src/backend/utils/adt/ directory of the source distribution).
ARRAYs should be supported in libpq's API, not via a weird text format. This format seems independent of other text formats, there is no support for escaping/unescaping/parsing in libpq, and it requires all element types to have a text format representation. Even if I use binary for integers, I still need a text format for integers to support ARRAYs of integers. Ideally, there would be a structured way in libpq that allows sending the elements in the exact binary or text format as in the other cases to keep it consistent.
BYTEA is supported partially, as you can send it in binary format. But again, you need a text representation for ARRAYs and the table writer's text format (which I currently have to use due to the binary format being unusable). Receiving BYTEA is also inefficient, plus the decoder doesn't play nicely with C++ containers, as it always does its own allocation and can't write to a given memory area.
While I still think PostgreSQL is one of the best DBs out there, it is not without problems.
I am still considering a hybrid approach, but it will not be reflected in the API of taoPQ anymore. If possible, I will keep some binary formats in the background, but this should be transparent to the users.
Ok I very much see your point! :-(
But still a few questions:
> PostgreSQL should support unsigned integer types. Honestly, I am at a total loss as to why they don't do this. I've read some arguments, but I don't agree that this is adding "too much complexity". Those are fundamental data types that simply have to be supported.

I fully agree! But a (naive) question if I may: Is there a difference in support for unsigned integer types between binary and text interfaces? I thought they are unsupported in both, so the conversion to string is not really different from a cast to signed (potentially with an overflow check)?
> BYTEA is supported partially, as you can send it in binary format. But again, you need a text representation for ARRAYs and the table writer's text format (which I currently have to use due to the binary format being unusable). Receiving BYTEA is also inefficient, plus the decoder doesn't play nicely with C++ containers, as it always does its own allocation and can't write to a given memory area.
This sounds just bad, but also as if it may be fixed upstream if we put in a good word for it? I agree that BYTEA should just work, and at least making the decoder play nicely with C++ containers should be fairly straightforward upstream (and also non-intrusive) unless I'm completely mistaken.
I'm not sure if that helps at all. So all I can say is that it would be great to keep as much binary available as possible, at the least possible maintenance cost for you! And as always, thanks for the awesome work!
> PostgreSQL should support unsigned integer types. Honestly, I am at a total loss as to why they don't do this. I've read some arguments, but I don't agree that this is adding "too much complexity". Those are fundamental data types that simply have to be supported.

> I fully agree! But a (naive) question if I may: Is there a difference in support for unsigned integer types between binary and text interfaces? I thought they are unsupported in both, so the conversion to string is not really different from a cast to signed (potentially with an overflow check)?
The text format doesn't care much about the size, it can send 100 digits, no problem. Binary can't do that. Of course I send a 16-bit unsigned integer as a 32-bit signed integer as that preserves the range and thereby the semantics. The problem is the 64-bit unsigned integer types (`unsigned long long`, ...). There is no larger binary integer type and sending them as 64-bit signed integers changes the semantics, so that is a no-go. It means that I can support all integers as binary up to 64-bit signed integers. 64-bit unsigned integers (the most common type in my projects!) require the text format.
> BYTEA is supported partially, as you can send it in binary format. But again, you need a text representation for ARRAYs and the table writer's text format (which I currently have to use due to the binary format being unusable). Receiving BYTEA is also inefficient, plus the decoder doesn't play nicely with C++ containers, as it always does its own allocation and can't write to a given memory area.

> This sounds just bad, but also as if it may be fixed upstream if we put in a good word for it? I agree that BYTEA should just work, and at least making the decoder play nicely with C++ containers should be fairly straightforward upstream (and also non-intrusive) unless I'm completely mistaken.
My experience with the PostgreSQL developers doesn't give me much hope :( I was looking into this and I might just manually convert those instead of using libpq's API for decoding them. Given that the hex format is at least properly described, it is possible.
That should solve this problem, together with a little hack that I already added to taoPQ https://github.com/taocpp/taopq/blob/main/src/lib/pq/internal/resize_uninitialized.cpp and that is totally not PostgreSQL's fault, but the C++ standard's fault.
> I'm not sure if that helps at all. So all I can say is that it would be great to keep as much binary available as possible, at the least possible maintenance cost for you! And as always, thanks for the awesome work!
I agree and I try to keep as much of the binary interface as possible. I don't even mind how much work it is for me (within limits, of course), but the real issue is the complexity in the interface towards our users. I want to keep it simple, so binary vs. text format should be an implementation detail. This is what I'm working on right now, but it might be a few days before I have something reasonable that is ready to be pushed.
> PostgreSQL should support unsigned integer types. Honestly, I am at a total loss as to why they don't do this. I've read some arguments, but I don't agree that this is adding "too much complexity". Those are fundamental data types that simply have to be supported.

> I fully agree! But a (naive) question if I may: Is there a difference in support for unsigned integer types between binary and text interfaces? I thought they are unsupported in both, so the conversion to string is not really different from a cast to signed (potentially with an overflow check)?

> The text format doesn't care much about the size, it can send 100 digits, no problem. Binary can't do that. Of course I send a 16-bit unsigned integer as a 32-bit signed integer as that preserves the range and thereby the semantics. The problem is the 64-bit unsigned integer types (`unsigned long long`, ...). There is no larger binary integer type and sending them as 64-bit signed integers changes the semantics, so that is a no-go. It means that I can support all integers as binary up to 64-bit signed integers. 64-bit unsigned integers (the most common type in my projects!) require the text format.
This is actually quite interesting, but I'm not sure I fully understand. Just to confirm, does that mean a column of type `bigint` can store values exceeding the signed range, as long as you send it as text? Or do you use a different type than `bigint`?
> This is actually quite interesting, but I'm not sure I fully understand. Just to confirm, does that mean a column of type `bigint` can store values exceeding the signed range, as long as you send it as text? Or do you use a different type than `bigint`?
Not `BIGINT` but `NUMERIC`:
```
template1=# create table dummy ( v numeric );
CREATE TABLE
template1=# insert into dummy values ( '99999999999999999999999999999999999999999999999999999999999' );
INSERT 0 1
template1=# select v + 1 from dummy;
                           ?column?
--------------------------------------------------------------
 100000000000000000000000000000000000000000000000000000000000
(1 row)

template1=#
```
Ok this does make sense, thanks for the clarification.
After I made some really good progress, I discovered a major problem. I knew about it before, but I guess I never fully realized the implications: When using binary format with prepared statements, the type of the data I will send must already be specified when the statement is prepared. The execution of a prepared statement does not allow specifying types at all; the API simply doesn't take the "types" structure. That is a big impact on users and I am not willing to put that much burden on them.
That said, I will probably switch to text format now. I'm sad about this, as this does have a small performance impact, but there is no other way forward given that PostgreSQL obviously does not care about binary formats at all. :(
And one more thing I learned from my experiments: Transferring floating-point values in binary mode is less precise when combined with columns of type `NUMERIC`. The text format does not have that problem. Seems weird, but once more shows that PostgreSQL really prefers text format over binary format.
I pushed the changes, now using text format everywhere except for `std::string_view` which uses Oid 25 (`TEXTOID`) and binary data which uses Oid 17 (`BYTEAOID`). One only needs to mark positional parameters in prepared statements that will receive binary data with the type explicitly, e.g. `$1::BYTEA`.
I'm under the impression that taopq does not currently support binary storage of numerical values, is that correct? This less commonly known mode of libpq allows storing and retrieving numerical values directly as a series of bytes (albeit in network byte order). So for example, an `int2` can be stored as the two bytes (high byte, low byte) of the `short` value. The main benefits are: [...]

Could this be a possible extension for taopq?