Just tested with 2.46, same situation
Is this a known issue or a configuration problem?
No, the incident you referred to was caused by overzealous retries never giving up on a hard error.
Are there any recommended configurations or workarounds to prevent zombie processes without switching to synchronous archiving?
No, since this is not a known problem.
I just ran your repro in a clean test environment, then ran ps aux | grep pgbackrest. First time:
postgres 872 0.0 0.1 73844 11632 ? S 15:42 0:00 pgbackrest --exec-id=870-9dc0709b --log-level-console=off --log-level-stderr=off --stanza=demo archive-push:async /var/lib/postgresql/12/demo/pg_wal
postgres 873 0.0 0.0 11744 5772 ? S 15:42 0:00 ssh -o LogLevel=error -o Compression=no -o PasswordAuthentication=no pgbackrest@repository pgbackrest --archive-async --exec-id=870-9dc0709b --log-level-console=off --log-level-file=off --log-level-stderr=error --no-log-timestamp --process=0 --remote-type=repo --repo=1 --stanza=demo archive-push:remote
postgres 874 93.0 0.3 78112 26052 ? R 15:42 0:19 pgbackrest --exec-id=870-9dc0709b --log-level-console=off --log-level-file=off --log-level-stderr=error --process=1 --remote-type=repo --stanza=demo archive-push:local
postgres 875 93.2 0.3 78112 26188 ? R 15:42 0:19 pgbackrest --exec-id=870-9dc0709b --log-level-console=off --log-level-file=off --log-level-stderr=error --process=2 --remote-type=repo --stanza=demo archive-push:local
postgres 876 4.4 0.0 12148 6416 ? S 15:42 0:00 ssh -o LogLevel=error -o Compression=no -o PasswordAuthentication=no pgbackrest@repository pgbackrest --archive-async --exec-id=870-9dc0709b --log-level-console=off --log-level-file=off --log-level-stderr=error --no-log-timestamp --process=1 --remote-type=repo --repo=1 --stanza=demo archive-push:remote
postgres 877 4.4 0.0 12304 6468 ? S 15:42 0:00 ssh -o LogLevel=error -o Compression=no -o PasswordAuthentication=no pgbackrest@repository pgbackrest --archive-async --exec-id=870-9dc0709b --log-level-console=off --log-level-file=off --log-level-stderr=error --no-log-timestamp --process=2 --remote-type=repo --repo=1 --stanza=demo archive-push:remote
postgres 1012 0.0 0.0 2060 476 ? S 15:43 0:00 sh -c pgbackrest --stanza=demo archive-push pg_wal/00000001000000000000008D
postgres 1013 0.0 0.1 57408 10472 ? S 15:43 0:00 pgbackrest --stanza=demo archive-push pg_wal/00000001000000000000008D
postgres 1015 0.0 0.0 5972 620 pts/1 S+ 15:43 0:00 grep pgbackrest
Second time:
postgres 1084 0.0 0.0 5972 612 pts/1 S+ 15:43 0:00 grep pgbackrest
That looks pretty much like I would expect. But of course this test does not include running with Stolon, so perhaps there is something going on there.
I see this in the logs so sure looks like it is exiting normally:
2023-06-20 16:31:49.179 P00 DEBUG: main::main: => 0
Sorry for the late response, I wanted to share some findings that could be beneficial for those who might encounter a similar situation in the future.
For anyone running Stolon in a containerized environment, it's crucial to ensure that Stolon is not running as the init process (PID 1) in your container. The init process has specific responsibilities, most importantly reaping the exit status of orphaned child processes, that typical application processes are not designed to handle.
Stolon is not designed to act as an init process, so exited pgBackRest children are never reaped and linger as zombies. Instead, I recommend running a minimal init such as tini as PID 1 and letting it launch the Stolon process (see the sketch below).
For a more detailed explanation: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/
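As a rough illustration, here is a minimal entrypoint sketch, assuming tini is installed at /usr/bin/tini (the stolon-keeper flags are copied from the ps output below; everything else is illustrative):

#!/bin/sh
# Run tini as PID 1: it forwards signals to stolon-keeper and reaps
# orphaned children (such as exited pgbackrest workers) so they do
# not accumulate as zombies.
exec /usr/bin/tini -- stolon-keeper --data-dir /stolon-data --log-level warn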
Thanks for the response and for reporting this. No doubt it will be useful information for others.
Please provide the following information when submitting an issue (feature requests or general comments can skip this):
pgBackRest version: pgBackRest 2.39
PostgreSQL version: PostgreSQL 13.7 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0, 64-bit
Operating system/version: Ubuntu 22.04.2 LTS; deployment: Kubernetes using Stolon
Did you install pgBackRest from source or from a package? Built from source (pgBackRest 2.39)
Please attach the following as applicable:
pgbackrest.conf file(s):
[backup]
pg1-path=/stolon-data/postgres
postgresql.conf settings:
effective_io_concurrency = '200'
timezone = 'Etc/UTC'
checkpoint_timeout = '15min'
datestyle = 'iso, mdy'
checkpoint_completion_target = '0.9'
listen_addresses = '10.244.33.10'
ssl = 'on'
max_wal_size = '30GB'
ssl_cert_file = '/certs/tls.crt'
log_timezone = 'Etc/UTC'
min_wal_size = '80MB'
ssl_key_file = '/certs/tls.key'
max_replication_slots = '20'
default_text_search_config = 'pg_catalog.english'
ssl_ca_file = '/certs/ca.crt'
dynamic_shared_memory_type = 'posix'
port = '5432'
synchronous_standby_names = ''
archive_mode = 'on'
shared_buffers = '2048MB'
wal_level = 'logical'
hot_standby = 'on'
wal_keep_size = '128'
shared_preload_libraries = 'decoderbufs,wal2json'
max_connections = '1000'
archive_command = 'PGPASSWORD=$PGADMINPWD pgbackrest --stanza=backup archive-push %p'
unix_socket_directories = '/tmp'
max_wal_senders = '47'
ls -a /var/log/pgbackrest/
.  ..
CREATE TABLE my_large_table (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    age INT,
    address VARCHAR(255),
    phone_number VARCHAR(20),
    email VARCHAR(255),
    created_at TIMESTAMP
);

-- Insert total_batches batches of batch_size random rows to generate WAL.
CREATE OR REPLACE FUNCTION insert_large_data() RETURNS VOID AS $$
DECLARE
    batch_size INT := 1000;
    total_batches INT := 10000;
    i INT := 0;
BEGIN
    WHILE i < total_batches LOOP
        INSERT INTO my_large_table (name, age, address, phone_number, email, created_at)
        SELECT
            md5(random()::text) AS name,
            floor(random() * 100)::INT AS age,
            md5(random()::text) AS address,
            floor(random() * 10000000000)::BIGINT::VARCHAR AS phone_number,
            md5(random()::text) || '@example.com' AS email,
            NOW() AS created_at
        FROM generate_series(1, batch_size);
        i := i + 1;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT insert_large_data();
DROP FUNCTION insert_large_data();
stolon@postgres-keeper-0:/$ ps aux | grep pgbackrest
stolon 8463 0.0 0.0 0 0 ? Z 16:08 0:00 [pgbackrest]
stolon 8475 0.0 0.0 0 0 ? Z 16:08 0:00 [pgbackrest]
stolon 8514 0.0 0.0 0 0 ? Z 16:08 0:00 [pgbackrest]
stolon 8562 0.0 0.0 0 0 ? Z 16:09 0:00 [pgbackrest]
stolon 8623 0.0 0.0 0 0 ? Z 16:09 0:00 [pgbackrest]
stolon 8710 0.0 0.0 0 0 ? Z 16:09 0:00 [pgbackrest]
stolon 8789 0.0 0.0 0 0 ? Z 16:09 0:00 [pgbackrest]
stolon 8861 0.0 0.0 0 0 ? Z 16:09 0:00 [pgbackrest]
stolon 8956 0.0 0.0 0 0 ? Z 16:10 0:00 [pgbackrest]
stolon 9028 0.0 0.0 0 0 ? Z 16:10 0:00 [pgbackrest]
stolon 9117 0.0 0.0 0 0 ? Z 16:10 0:00 [pgbackrest]
stolon 9190 0.0 0.0 0 0 ? Z 16:10 0:00 [pgbackrest]
stolon 9281 0.0 0.0 0 0 ? Z 16:10 0:00 [pgbackrest]
stolon 9379 0.0 0.0 0 0 ? Z 16:11 0:00 [pgbackrest]
stolon 9457 0.0 0.0 0 0 ? Z 16:11 0:00 [pgbackrest]
stolon 9533 0.0 0.0 0 0 ? Z 16:11 0:00 [pgbackrest]
stolon 9599 0.0 0.0 0 0 ? Z 16:11 0:00 [pgbackrest]
stolon 10315 0.0 0.0 6612 2312 pts/2 S+ 16:14 0:00 grep --color=auto pgbackrest
stolon@postgres-keeper-0:/proc/8463$ ps 1
  PID TTY      STAT   TIME COMMAND
    1 ?        Ssl    0:50 stolon-keeper --data-dir /stolon-data --log-level warn
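For anyone checking their own containers, a quick way to confirm the same diagnosis (a sketch, run inside the affected container):

# Show what is running as PID 1; if it is not an init-style reaper
# such as tini, exited children reparented to it remain zombies.
ps -o pid,comm -p 1
# Count accumulated zombie processes (process state "Z").
ps axo stat | grep -c '^Z'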