replicase / pgcapture

A scalable Netflix DBLog implementation for PostgreSQL
Apache License 2.0

Fix corner cases in pgcapture.current_query() #63

Closed rjuju closed 2 months ago

rjuju commented 2 months ago

That function wasn't correctly handling multi-statement queries or CREATE EXTENSION commands that internally perform DDL.
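As a rough illustration of the idea behind the fix (a Python sketch, not the actual C implementation): PostgreSQL hands utility statements a PlannedStmt carrying stmt_location/stmt_len offsets into the full query string, so the text of a single statement inside a multi-statement query can be sliced out instead of reporting the whole string:

```python
def extract_statement(query_string: str, stmt_location: int, stmt_len: int) -> str:
    """Mimic how a ProcessUtility hook can recover the text of one statement
    from a multi-statement query, using the offsets PostgreSQL records in
    PlannedStmt.stmt_location / stmt_len."""
    if stmt_location < 0:
        # -1 means "location unknown": fall back to the whole query string
        return query_string.strip()
    if stmt_len == 0:
        # 0 means "to the end of the string"
        return query_string[stmt_location:].strip()
    return query_string[stmt_location:stmt_location + stmt_len].strip()

multi = "CREATE TABLE multi(id integer); INSERT INTO multi SELECT 1;"
print(extract_statement(multi, 0, 30))   # prints: CREATE TABLE multi(id integer)
print(extract_statement(multi, 31, 0))   # prints: INSERT INTO multi SELECT 1;
```

The offsets (0, 30) and (31, 0) here are written by hand for the example; in the server they are filled in by the parser.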

I first added some minimal regression tests to show the current behavior with those corner cases, and added the fix in the next commit. Note that I didn't test those changes with pglogical. I don't know much about this tool and I don't know whether it can impact this approach. If it does, you can however ensure that pgcapture is executed first by changing the order of the modules in the relevant *_preload_libraries parameter.

I also modified the Makefile to allow overriding the pg_config tool. At least the RPM packagers rely on that, so it would make their job easier if they want to package this tool.
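For reference, the usual PGXS pattern for making pg_config overridable looks like this (a sketch; the exact Makefile in the repository may differ, and the path in the comment is hypothetical):

```make
# Default to the pg_config found on PATH, but let packagers override it,
# e.g.: make PG_CONFIG=/usr/pgsql-16/bin/pg_config
PG_CONFIG ?= pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
```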

I manually tested the changes on pg 12 to pg 16. I didn't add a github workflow to run the tests on all supported versions as there shouldn't be much activity in that repository, but I can do it if you want.

rjuju commented 2 months ago

With the reported failures I realized that there is at least partial testing for the extension (although it doesn't run the new regression tests), and that I forgot to add a new header file in the last commit. That's now fixed.

The circleci tests still fail with pg 12 in TestPulsarReaderSource, but I don't know anything about pulsar (nor do I write Go), so I have no idea what the problem is. I'm also not familiar with circleci, so I'm assuming that the failures for the rest of the pg versions are just due to pg 12 failing.

rueian commented 2 months ago

Hi @rjuju,

Wow, huge thanks! We previously didn't know how to parse the input queries accurately, so we just logged the queries and replayed them as is at downstream replicas. Thank you for guiding us here, especially the use of ProcessUtility_hook. We will figure out how to cooperate with the new parser.

cc @KennyChenFight, @benjamin99

rjuju commented 2 months ago

Thanks a lot!

rueian commented 2 months ago

The circleci tests still fail with pg 12 in TestPulsarReaderSource, but I don't know anything about pulsar (nor do I write Go), so I have no idea what the problem is.

It was flaky. Now all tests pass.

rjuju commented 2 months ago

Oh ok, good news then :)

Thanks for restarting the tests!

rjuju commented 2 months ago

By the way, what is the project policy for postgres version compatibility checks? Pg 11 is now EOL, pg 12 soon will be too, but pg 17 should be released within a month or two (it's currently in beta 3).

KennyChenFight commented 2 months ago

By the way, what is the project policy for postgres version compatibility checks? Pg 11 is now EOL, pg 12 soon will be too, but pg 17 should be released within a month or two (it's currently in beta 3).

Well, I think pgcapture should follow PostgreSQL's release cycle. The awkward part is that our company is still using PostgreSQL 11, and currently pgcapture still occasionally drops messages, which I am still investigating (most likely an issue with Pulsar).

rjuju commented 2 months ago

Got it. For extensions I usually like to provide as much backward compatibility as possible, as for various reasons it's not always possible to upgrade to a new version in a timely manner, and the extension shouldn't force you to upgrade. It's usually the testing part that can be problematic, as it can increase costs (or just be too much for free plans). I tested on current postgres HEAD (which is the pg 18 devel branch) and the extension still compiles and passes the regression tests. I can work on a PR to add pg 17 to the CI once 17.0 is released, if you want.

If you think there's any chance that the message drop can come from the postgres side let me know I will be happy to help digging into it.

KennyChenFight commented 2 months ago

Got it. For extensions I usually like to provide as much backward compatibility as possible, as for various reasons it's not always possible to upgrade to a new version in a timely manner, and the extension shouldn't force you to upgrade. It's usually the testing part that can be problematic, as it can increase costs (or just be too much for free plans). I tested on current postgres HEAD (which is the pg 18 devel branch) and the extension still compiles and passes the regression tests. I can work on a PR to add pg 17 to the CI once 17.0 is released, if you want.

If you think there's any chance that the message drop can come from the postgres side let me know I will be happy to help digging into it.

Sure! Thanks a lot! Feel free to submit a PR to support PostgreSQL 17!

rjuju commented 2 months ago

That's very strange: the extracted query string should be available during the whole DDL execution. I will try to reproduce the problem locally and report back.

If you have any more details about the scenario, like whether it's the first DDL ever, the first in a transaction, the first in a backend..., or any other relevant detail, please let me know.

KennyChenFight commented 2 months ago

That's very strange: the extracted query string should be available during the whole DDL execution. I will try to reproduce the problem locally and report back.

If you have any more details about the scenario, like whether it's the first DDL ever, the first in a transaction, the first in a backend..., or any other relevant detail, please let me know.

Just run a single DDL command and you will get the null query.

But after that, it doesn't happen again.

rjuju commented 2 months ago

Do you have some more details on how exactly to run something? Simply running ./dockerbuild.sh test doesn't really work (it fails on the "postgres" address not being available; FWIW I'm using podman for containers, with suitable aliases to pretend I'm using docker). docker-compose run --rm test-deps fails with "keyerror: 'pg_'".

I tried PG_VERSION=11 ./dockerbuild.sh build but it also eventually failed, with "docker-compose: error: unrecognized arguments: --force-rm", after building replicase/postgres:11-logical.

I'm assuming that the problem is caused by having pglogical?

KennyChenFight commented 2 months ago

Do you have some more details on how exactly to run something? Simply running ./dockerbuild.sh test doesn't really work (it fails on the "postgres" address not being available; FWIW I'm using podman for containers, with suitable aliases to pretend I'm using docker). docker-compose run --rm test-deps fails with "keyerror: 'pg_'".

I tried PG_VERSION=11 ./dockerbuild.sh build but it also eventually failed, with "docker-compose: error: unrecognized arguments: --force-rm", after building replicase/postgres:11-logical.

I'm assuming that the problem is caused by having pglogical?

Hi, there is a branch rebased on yours. Run these commands to start the postgres container:

./dockerbuild.sh test
psql -p 5432 -U postgres -h localhost

and follow these instructions:

postgres=# CREATE EXTENSION IF NOT EXISTS pgcapture;
CREATE EXTENSION
postgres=# SELECT query, unnest(tags) FROM pgcapture.ddl_logs ORDER BY id;
 query | unnest 
-------+--------
(0 rows)

postgres=# \c
psql (16rc1, server 14.13)
You are now connected to database "postgres" as user "postgres".
postgres=# create table tbl1();
CREATE TABLE
postgres=# SELECT query, unnest(tags) FROM pgcapture.ddl_logs ORDER BY id;
 query | unnest 
-------+--------
(0 rows)

postgres=# create table tbl2();
CREATE TABLE
postgres=# SELECT query, unnest(tags) FROM pgcapture.ddl_logs ORDER BY id;
        query        |    unnest    
---------------------+--------------
 create table tbl2() | CREATE TABLE
(1 row)

You can see that the first DDL command after establishing a connection does not trigger the hook, but it triggers normally afterwards.

I also tried the version without pglogical; the problem still remains.

rjuju commented 2 months ago

For some reason it still wasn't working; I managed to make it work using

PG_VERSION=16 ./dockerbuild.sh test

Not sure what's going on, but it seems irrelevant.

As far as I can see I can reliably reproduce the issue like this:

cat /tmp/repro.sql

DROP TABLE IF EXISTS tt;
TRUNCATE pgcapture.ddl_logs;
TABLE pgcapture.ddl_logs;
\c
CREATE TABLE tt();

SELECT query FROM pgcapture.ddl_logs;

and

$ psql -p 5432 -U postgres -Xf /tmp/repro.sql
DROP TABLE
TRUNCATE TABLE
 id | query | tags | activity 
----+-------+------+----------
(0 rows)

psql (15.7@8e16f81f3d, server 16.4)
WARNING: psql major version 15, server major version 16.
         Some psql features might not work.
You are now connected to database "postgres" as user "postgres".
CREATE TABLE
 query 
-------

(1 row)

Now, given the server configuration it doesn't seem surprising: the pgcapture extension is never explicitly loaded. I'm assuming that something somewhere eventually calls one of the C functions exposed in SQL, which has the side effect of loading the extension. If I instead load it explicitly:

$ cat /tmp/repro_fix.sql
DROP TABLE IF EXISTS tt;
TRUNCATE pgcapture.ddl_logs;
TABLE pgcapture.ddl_logs;
\c

\echo Loading the extension explicitly
LOAD 'pgcapture';
CREATE TABLE tt();

SELECT query FROM pgcapture.ddl_logs;

it works as expected:

$ psql -p 5432 -U postgres -Xf /tmp/repro_fix.sql
DROP TABLE
TRUNCATE TABLE
 id | query | tags | activity 
----+-------+------+----------
(0 rows)

psql (15.7@8e16f81f3d, server 16.4)
WARNING: psql major version 15, server major version 16.
         Some psql features might not work.
You are now connected to database "postgres" as user "postgres".
Loading the extension explicitly
LOAD
CREATE TABLE
       query       
-------------------
 CREATE TABLE tt()
(1 row)

Relying on some (for now unknown) indirect way to load the pgcapture extension is probably a pre-existing hazard, and this patch just changed the timing enough that it no longer works: this patch relies on the extension being loaded before a DDL is executed, while the old code didn't, at least as long as you use a pglogical version that doesn't set debug_query_string to NULL at the wrong moment. Either way, it doesn't seem like a good idea to rely on such an indirect load.

It should be easy to fix by documenting the need to put pgcapture in session_preload_libraries (which is better than shared_preload_libraries in this case, as you can ship and use a new pgcapture version without having to restart the postgres instance; just kill any existing connections).

For instance:

$ psql -p 5432 -U postgres -Xc "ALTER SYSTEM SET session_preload_libraries = 'pgcapture'"
ALTER SYSTEM

$ psql -p 5432 -U postgres -Xc "SELECT pg_reload_conf()"
 pg_reload_conf 
----------------
 t
(1 row)

$ psql -p 5432 -U postgres -Xf /tmp/repro.sql
DROP TABLE
TRUNCATE TABLE
 id | query | tags | activity 
----+-------+------+----------
(0 rows)

psql (15.7@8e16f81f3d, server 16.4)
WARNING: psql major version 15, server major version 16.
         Some psql features might not work.
You are now connected to database "postgres" as user "postgres".
CREATE TABLE
       query       
-------------------
 CREATE TABLE tt()
(1 row)

The old "repro" that used to be broken now works.

KennyChenFight commented 2 months ago

@rjuju Thank you for guiding us! I think you could submit a commit to add the pgcapture extension here. Other than that, I think there is no problem with this extension. @rueian @benjamin99

rjuju commented 2 months ago

Ah thanks, I didn't spot that configuration file before. I added an extra session_preload_libraries line there.

KennyChenFight commented 2 months ago

@rjuju

~~I found an awkward issue: Since we can now separate DDL and DML scenarios, our previous implementation of pulsar2pg would fail because it receives the DML event before the DDL event, causing an error (because the corresponding table does not exist yet). For example: CREATE TABLE multi(id integer); INSERT INTO multi SELECT 1;~~

Sorry, please ignore my mistake. 🙏 I think that with an example like CREATE TABLE multi(id integer); INSERT INTO multi SELECT 1;, the downstream pulsar2pg implementation will fail because the multi table does not have a primary key, rather than because it receives the insert event before the DDL event.

So I think the current extension implementation should have no issues.

rjuju commented 2 months ago

Thanks a lot for merging!

I haven't been able to test the whole solution yet, but will keep working on that. In any case, if this change causes any issue please let me know; I will be happy to look into it and provide a fix.