jtv / libpqxx

The official C++ client API for PostgreSQL.
http://pqxx.org/pqxx/
BSD 3-Clause "New" or "Revised" License
959 stars 227 forks source link

Failure on streaming a list of empty strings #816

Closed vahancho closed 1 month ago

vahancho commented 1 month ago

I observe an interesting behavior. Let's assume we have a table with a single column that represents an array of strings. A simple SQL query to create such a table would look like:

CREATE TABLE IF NOT EXISTS test
(
    list TEXT[]
);

Let's now stream data to this table using pqxx::stream_to class.

[..]
// Create a transactional object.
pqxx::work work(dbConnection);

using Items = std::vector<std::string>;
using DataType = std::variant<Items>;

Items items{ "", "", "", "", "", "", "" };
DataType dt = items;

auto streamToTable = pqxx::stream_to::table(work, {"test"});
streamToTable.write_row(dt);
streamToTable.complete();
work.commit();
[..]

It's works good if the number of empty strings in the list is less than four. If the number of empty strings is 4, 5 or 7 it fails with an exception:

Failure during '[END COPY]': ERROR:  invalid byte sequence for encoding "UTF8": 0x00
CONTEXT:  COPY test, line 1

With six strings in the list it fails with segmentation fault.

I use libpqxx 7.9.0 (build from sources) on Oracle Linux 8.

jtv commented 1 month ago

That's a nasty one! Thanks for the report. It's not clear to me why you pass the field as a std::variant, but I can reproduce the problem without that. Debugging.

jtv commented 1 month ago

Found it! It wasn't even the streaming that was the problem. It was the code that represents an array as a string. Its buffer budget calculation was just a little too tight, leading the conversion to allocate slightly too little memory. This wasn't normally an issue, but empty strings happen to be an extreme case in budgeting conversions.

vahancho commented 1 month ago

Yeah, I noticed that it tries to convert a vector of strings to a single string with comma separated elements and inserts at some point (depending on the number of strings) a null character in the middle.

I use std:: variant because I don't know the field value upfront. Sometimes it can be empty too.

jtv commented 1 month ago

The overloading gets a bit confusing and I kind of regret it now. I may do some deprecation go get a stricter separation between "a container or tuple is a row" and "a container or tuple is a value."

@vahancho does the fix address all your use-cases?

vahancho commented 1 month ago

@jtv I didn't test it yet, but the case of using more than three empty strings in the container was the only one I had issues with. Thank you for the comprehensive analysis and quick fix.

jtv commented 1 month ago

Excellent. Thanks @vahancho.