jtv / libpqxx

The official C++ client API for PostgreSQL.
http://pqxx.org/libpqxx/
BSD 3-Clause "New" or "Revised" License
1k stars 238 forks source link

Possible to combine "on clonflict do nothing" and "stream_to"? #881

Open chao-mu opened 1 month ago

chao-mu commented 1 month ago

Hello,

I am inserting a massive (hundreds of millions) of rows. Ideally, I want to use the streaming api, however there will likely be conflicts from past iterations. I had an implementation where I would process the workload in chunks and check ahead of time whether something existed before inserting, but that is rife with issues around concurrency and seems far too complex.

Is there a way to get stream_to::table to ignore duplicates?

Thank you, Me.

jtv commented 1 month ago

I doubt ON CONFLICT would work with streams, but I can't think of a better way either so it's worth a try.

chao-mu commented 1 month ago

Oh my goodness, thank you so much for responding! I'm in deep water with a job that currently will take over 200 hours to finish each time it runs if I don't optimize the DB side! I'm processing 90+ million chess games and the moves of each of those games. Needless to say, that's a LOT of inserts.

How would I supply that?

This is how I do it elsewhere:

  const pqxx::table_path games_table_path = {"games"};
  auto games_table = pqxx::stream_to::table(tx, games_table_path, {"column1", "column2", ... }

  for (auto const &game : games) {
    auto row = std::make_tuple(...)
   games_table << row;
}

Also, currently my non stream version of the real work of the process (saving the individual chess boards per move) prepares a statement then with a pqxx::work and runs exec_prepared separately for each row I need to insert. Is there a way to send them all at once?

chao-mu commented 1 month ago

I feel like there's probably a better way than what I'm doing.

How I'm preparing the query

void prepare_insert_boards(pqxx::connection &conn) {
  conn.prepare(
      "insert_boards",
      "insert into boards (hash, white_bishops, white_rooks, white_queens, "
      "white_knights, white_king, white_pawn, black_bishops, black_rooks, "
      "black_queens, black_knights, black_king, black_pawn, white_to_move, "
      "castle_rights, enpassant_sq) values ($1, $2, $3, $4, $5, $6, $7, $8, "
      "$9, $10, $11, $12, $13, $14, $15, $16) on conflict (hash) do nothing");
}

How I'm inserting all the boards in a batch of games

void insert_boards(pqxx::work &tx, const std::vector<Game> &games) {
  std::map<std::string, bool> seen;
  // Add missing boards
  for (const auto &game : games) {
    for (const auto &continuation : game.continuations) {
      const auto &board = continuation.board;

      if (seen[board.hash]) {
        continue;
      }

      tx.exec_prepared("insert_boards", board.hash,
                       to_bits(board.white_bishops), to_bits(board.white_rooks),
                       to_bits(board.white_queens),
                       to_bits(board.white_knights), to_bits(board.white_king),
                       to_bits(board.white_pawn), to_bits(board.black_bishops),
                       to_bits(board.black_rooks), to_bits(board.black_queens),
                       to_bits(board.black_knights), to_bits(board.black_king),
                       to_bits(board.black_pawn), board.white_to_move,
                       board.castle_rights, board.enpassant_sq);

      seen[board.hash] = true;
    }
  }
}

And here's the to_bits code. (The columns used for it are bit(64) in postgresql


std::string to_bits(std::uint64_t value) {
  std::bitset<64> bits(value);

  return bits.to_string();
}
jtv commented 4 weeks ago

A prepared statement is probably not saving you much time, if anything.

Instead, you could try composing a big INSERT statement ad hoc, which inserts multiple rows at once.

jtv commented 4 weeks ago

There are also some tricks that only work under very specific circumstances, such as temporarily disabling constraints or temporarily making the table unlogged. But there are risks to those tricks so I'd start away from them as long as possible.

chao-mu commented 4 weeks ago

Is there a way, in a where clause, to express "in ($1, $2, $3 ... $n)" without building that part of the query manually? For example, in some frameworks you can say "in (?)" and pass an array for that value.

Also, I found my above code failed with an error regarding copy missing column data (for some reason it truncated after the first column I wrote, despite it working fine for a different query) until I rewrote it to use write_values instead of <<.

jtv commented 3 weeks ago

Is there a way, in a where clause, to express "in ($1, $2, $3 ... $n)" without building that part of the query manually? For example, in some frameworks you can say "in (?)" and pass an array for that value.

Have you tried just putting the values in a std::vector and passing that as a single parameter?

Also, I found my above code failed with an error regarding copy missing column data (for some reason it truncated after the first column I wrote, despite it working fine for a different query) until I rewrote it to use write_values instead of <<.

The operator<<() way of inserting is kind of on its way to deprecation. Streams in general are to me a bit of a failed experiment from the 1990s. It sounds to me like there was some kind of confusion with types.

jtv commented 3 weeks ago

(Some of the same questions came up in #879.)

Dich0tomy commented 3 weeks ago

I'm going to chime in about the code as well. You seem to be using a bitset to convert from a uint64 to a string byte representation and using an std::map to check for seen pieces.

One simple optimization that would already help a bit would be changing the std::map to std::unordered_map, going further you could even use a library that provides a faster implementation for a lookup table.

If it's possible I'd also check if it's possible to make board.hash be a natural number, in which case you could use a contiguous range for marking seen boards which should also improve the runtime.

Maybe it's also possible to eliminate duplications before inserting so that you don't have to do that as well?

As for converting the the representation of whatever these white/black_pieces things are to bits, why do that? Simply saving the number as-is and then reading it will also yield the bits underneath and you can convert back to a string representation when (and if) you actually need it. I'm also not entirely sure if using a bitset for that is optimal. It's been a while since I run some benchmarks with it, but std::bitset is not generally praised as an optimal solution for converting to binary representations.

In the end, what I'm proposing may not yield significant results, so I'd benchmark anyway if you care that much about speed, but I wanted to leave an insight, since it's an important matter.