pgbackrest / pgbackrest

Reliable PostgreSQL Backup & Restore
https://pgbackrest.org

Experiencing a high number of zombie processes with archive-push in async mode #2098

Closed: valterartur closed this issue 1 year ago

valterartur commented 1 year ago

Please provide the following information when submitting an issue (feature requests or general comments can skip this):

  1. pgBackRest version: pgBackRest 2.39

  2. PostgreSQL version: PostgreSQL 13.7 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0, 64-bit

  3. Operating system/version - if you have more than one server (for example, a database server, a repository host server, one or more standbys), please specify each: Ubuntu 22.04.2 LTS, deployed on Kubernetes using Stolon

  4. Did you install pgBackRest from source or from a package? Built from source: pgBackRest 2.39

  5. Please attach the following as applicable:

    • pgbackrest.conf file(s)
      
      [global]
      repo1-path=/primary
      repo1-type=s3
      repo1-s3-endpoint=minio:9000
      repo1-s3-bucket=pg-archive
      repo1-s3-verify-tls=y
      repo1-storage-ca-file=/certs/ca.crt
      repo1-s3-region=us-east-1
      repo1-s3-uri-style=path
      repo1-retention-full=1
      repo1-bundle=y
      repo1-bundle-limit=10MiB
      repo1-bundle-size=30MiB
      start-fast=y
      stop-auto=y
      io-timeout=900
      process-max=4
      log-level-console=info
      log-level-file=debug
      log-path=/tmp
      delta=y
      archive-async=y
      spool-path=/stolon-data/pgbackrest
      archive-push-queue-max = 100GiB
      exclude=postgresql.auto.conf

      [backup]
      pg1-path=/stolon-data/postgres

    - `postgresql.conf` settings applicable to pgBackRest (`archive_command`, `archive_mode`, `listen_addresses`, `max_wal_senders`, `wal_level`, `port`)

effective_io_concurrency = '200'
timezone = 'Etc/UTC'
checkpoint_timeout = '15min'
datestyle = 'iso, mdy'
checkpoint_completion_target = '0.9'
listen_addresses = '10.244.33.10'
ssl = 'on'
max_wal_size = '30GB'
ssl_cert_file = '/certs/tls.crt'
log_timezone = 'Etc/UTC'
min_wal_size = '80MB'
ssl_key_file = '/certs/tls.key'
max_replication_slots = '20'
default_text_search_config = 'pg_catalog.english'
ssl_ca_file = '/certs/ca.crt'
dynamic_shared_memory_type = 'posix'
port = '5432'
synchronous_standby_names = ''
archive_mode = 'on'
shared_buffers = '2048MB'
wal_level = 'logical'
hot_standby = 'on'
wal_keep_size = '128'
shared_preload_libraries = 'decoderbufs,wal2json'
max_connections = '1000'
archive_command = 'PGPASSWORD=$PGADMINPWD pgbackrest --stanza=backup archive-push %p'
unix_socket_directories = '/tmp'
max_wal_senders = '47'

    - errors in the postgresql log file before or during the time you experienced the issue
    - log file in `/var/log/pgbackrest` for the commands run (e.g. `/var/log/pgbackrest/mystanza_backup.log`)

ls -a /var/log/pgbackrest/
.  ..

We have also configured debug logging to a file, but it shows that all operations completed successfully. Attaching the debug log:
[backup-archive-push-async.log](https://github.com/pgbackrest/pgbackrest/files/11803510/backup-archive-push-async.log)

  6. Describe the issue:
We are experiencing an issue where running archive-push in async mode causes a large number of zombie processes to be generated. Our database is highly dynamic with extensive insertions and deletions. Over time, the accumulation of zombie processes makes PostgreSQL extremely slow and almost inoperable. Restarting the instance temporarily resolves the issue.

This issue bears similarities to [#1116](https://github.com/pgbackrest/pgbackrest/issues/1116), but we're encountering it in a newer version (2.39).

We are able to reproduce this issue by executing this simple script.

CREATE TABLE my_large_table (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    age INT,
    address VARCHAR(255),
    phone_number VARCHAR(20),
    email VARCHAR(255),
    created_at TIMESTAMP
);

CREATE OR REPLACE FUNCTION insert_large_data() RETURNS VOID AS $$
DECLARE
    batch_size INT := 1000;
    total_batches INT := 10000;
    i INT := 0;
BEGIN
    WHILE i < total_batches LOOP
        INSERT INTO my_large_table (name, age, address, phone_number, email, created_at)
        SELECT
            md5(random()::text) AS name,
            floor(random() * 100)::INT AS age,
            md5(random()::text) AS address,
            floor(random() * 10000000000)::BIGINT::VARCHAR AS phone_number,
            md5(random()::text) || '@example.com' AS email,
            NOW() AS created_at
        FROM generate_series(1, batch_size);

        i := i + 1;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT insert_large_data();
DROP FUNCTION insert_large_data();

When executed, this script reliably reproduces the issue with the accumulation of zombie processes.

stolon@postgres-keeper-0:/$ ps aux | grep pgbackrest
stolon      8463  0.0  0.0      0     0 ?        Z    16:08   0:00 [pgbackrest]
stolon      8475  0.0  0.0      0     0 ?        Z    16:08   0:00 [pgbackrest]
stolon      8514  0.0  0.0      0     0 ?        Z    16:08   0:00 [pgbackrest]
stolon      8562  0.0  0.0      0     0 ?        Z    16:09   0:00 [pgbackrest]
stolon      8623  0.0  0.0      0     0 ?        Z    16:09   0:00 [pgbackrest]
stolon      8710  0.0  0.0      0     0 ?        Z    16:09   0:00 [pgbackrest]
stolon      8789  0.0  0.0      0     0 ?        Z    16:09   0:00 [pgbackrest]
stolon      8861  0.0  0.0      0     0 ?        Z    16:09   0:00 [pgbackrest]
stolon      8956  0.0  0.0      0     0 ?        Z    16:10   0:00 [pgbackrest]
stolon      9028  0.0  0.0      0     0 ?        Z    16:10   0:00 [pgbackrest]
stolon      9117  0.0  0.0      0     0 ?        Z    16:10   0:00 [pgbackrest]
stolon      9190  0.0  0.0      0     0 ?        Z    16:10   0:00 [pgbackrest]
stolon      9281  0.0  0.0      0     0 ?        Z    16:10   0:00 [pgbackrest]
stolon      9379  0.0  0.0      0     0 ?        Z    16:11   0:00 [pgbackrest]
stolon      9457  0.0  0.0      0     0 ?        Z    16:11   0:00 [pgbackrest]
stolon      9533  0.0  0.0      0     0 ?        Z    16:11   0:00 [pgbackrest]
stolon      9599  0.0  0.0      0     0 ?        Z    16:11   0:00 [pgbackrest]
stolon     10315  0.0  0.0   6612  2312 pts/2    S+   16:14   0:00 grep --color=auto pgbackrest

These zombie processes have stolon-keeper (PID 1) as their parent process.

stolon@postgres-keeper-0:/proc/8463$ ps 1
  PID TTY      STAT   TIME COMMAND
    1 ?        Ssl    0:50 stolon-keeper --data-dir /stolon-data --log-level warn
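
For reference, the zombie state and parentage can also be read directly from /proc; the check below is a sketch and its output is illustrative, consistent with the ps listing above.

stolon@postgres-keeper-0:/proc/8463$ grep -E 'State|PPid' status
State:  Z (zombie)
PPid:   1

A zombie entry like this remains in the process table until its parent calls wait() on it.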


Switching to synchronous archiving eliminates the zombie processes, but we are concerned about potential performance impacts.
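
For completeness, the fallback we tested is just turning the async option off in pgbackrest.conf; a minimal sketch (the rest of the configuration stays as shown above):

[global]
# With archive-async=n (the default), archive_command pushes each WAL segment
# synchronously and the spool-path is not used for archiving.
archive-async=n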

Questions:
Is this a known issue or a configuration problem?
Are there any recommended configurations or workarounds to prevent zombie processes without switching to synchronous archiving?
valterartur commented 1 year ago

Just tested with 2.46, same situation

dwsteele commented 1 year ago

> Is this a known issue or a configuration problem?

No, the incident you referred to was caused by overzealous retries never giving up on a hard error.

> Are there any recommended configurations or workarounds to prevent zombie processes without switching to synchronous archiving?

No, since this is not a known problem.

I just ran your repro in a clean test environment, then ran `ps aux | grep pgbackrest`. First time:

postgres   872  0.0  0.1  73844 11632 ?        S    15:42   0:00 pgbackrest --exec-id=870-9dc0709b --log-level-console=off --log-level-stderr=off --stanza=demo archive-push:async /var/lib/postgresql/12/demo/pg_wal
postgres   873  0.0  0.0  11744  5772 ?        S    15:42   0:00 ssh -o LogLevel=error -o Compression=no -o PasswordAuthentication=no pgbackrest@repository pgbackrest --archive-async --exec-id=870-9dc0709b --log-level-console=off --log-level-file=off --log-level-stderr=error --no-log-timestamp --process=0 --remote-type=repo --repo=1 --stanza=demo archive-push:remote
postgres   874 93.0  0.3  78112 26052 ?        R    15:42   0:19 pgbackrest --exec-id=870-9dc0709b --log-level-console=off --log-level-file=off --log-level-stderr=error --process=1 --remote-type=repo --stanza=demo archive-push:local
postgres   875 93.2  0.3  78112 26188 ?        R    15:42   0:19 pgbackrest --exec-id=870-9dc0709b --log-level-console=off --log-level-file=off --log-level-stderr=error --process=2 --remote-type=repo --stanza=demo archive-push:local
postgres   876  4.4  0.0  12148  6416 ?        S    15:42   0:00 ssh -o LogLevel=error -o Compression=no -o PasswordAuthentication=no pgbackrest@repository pgbackrest --archive-async --exec-id=870-9dc0709b --log-level-console=off --log-level-file=off --log-level-stderr=error --no-log-timestamp --process=1 --remote-type=repo --repo=1 --stanza=demo archive-push:remote
postgres   877  4.4  0.0  12304  6468 ?        S    15:42   0:00 ssh -o LogLevel=error -o Compression=no -o PasswordAuthentication=no pgbackrest@repository pgbackrest --archive-async --exec-id=870-9dc0709b --log-level-console=off --log-level-file=off --log-level-stderr=error --no-log-timestamp --process=2 --remote-type=repo --repo=1 --stanza=demo archive-push:remote
postgres  1012  0.0  0.0   2060   476 ?        S    15:43   0:00 sh -c pgbackrest --stanza=demo archive-push pg_wal/00000001000000000000008D
postgres  1013  0.0  0.1  57408 10472 ?        S    15:43   0:00 pgbackrest --stanza=demo archive-push pg_wal/00000001000000000000008D
postgres  1015  0.0  0.0   5972   620 pts/1    S+   15:43   0:00 grep pgbackrest

Second time:

postgres  1084  0.0  0.0   5972   612 pts/1    S+   15:43   0:00 grep pgbackrest

That looks pretty much like I would expect. But of course this test does not include running with Stolon, so perhaps there is something going on there.

I see this in the logs, so it sure looks like it is exiting normally:

2023-06-20 16:31:49.179 P00  DEBUG:     main::main: => 0
valterartur commented 1 year ago

Sorry for the late response; I wanted to share some findings that could be beneficial for those who might encounter a similar situation in the future.

For anyone using Stolon in a containerized environment, it's crucial to ensure that Stolon is not running as the init process (PID 1) in your container. The init process has specific responsibilities, such as reaping orphaned child processes, that typical application processes are not designed to handle.

Stolon, in particular, isn't designed to manage child processes the way an init process must. Instead, I recommend running your Stolon process under a utility like tini.

For a more detailed explanation, see: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/
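
As a concrete illustration, this is roughly what the fix looks like at the image level; the base image, package installation, and paths below are an illustrative sketch, not our actual manifests:

# Dockerfile sketch: run stolon-keeper under tini so that PID 1 reaps
# exited children (adapt the image and paths to your own setup).
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y tini
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["stolon-keeper", "--data-dir", "/stolon-data", "--log-level", "warn"]

On Kubernetes, setting shareProcessNamespace: true in the pod spec is another way to get a reaping PID 1, since the pause container then becomes PID 1 for the pod and reaps zombies from all of its containers.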

dwsteele commented 1 year ago

> Sorry for the late response; I wanted to share some findings that could be beneficial for those who might encounter a similar situation in the future.

Thanks for the response and for reporting this. No doubt it will be useful information for others.