hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability

ERROR history file "00000202.history" contains 1024 lines, pg_autoctl only supports up to 1023 lines #991

Closed dbamu closed 1 year ago

dbamu commented 1 year ago

I ran failover tests repeatedly.

Test versions: pgf (pg_auto_failover) 2.0, PostgreSQL 13.10

Step 1. Generate load with pgbench on the primary and the secondary.
Step 2. Run the "pg_autoctl perform failover" command periodically.

Failover succeeded repeatedly and then stopped working. The log is:


#### check current state of the formation
$ pg_autoctl show state --formation test
         Name |  Node |          Host:Port |         TLI: LSN |   Connection |      Reported State |      Assigned State
 -------------+-------+--------------------+------------------+--------------+---------------------+--------------------
dev-pgf200003 |     3 | dev-pgf200003:5432 | 514: E2/710000D8 |   read-write |        wait_primary |        wait_primary
dev-pgf200002 |    21 | dev-pgf200002:5432 |           1: 0/0 |       none ! |        wait_standby |          catchingup

#### after dropping the node, run the "pg_autoctl create postgres" command on the secondary
$ pg_autoctl create postgres \
 --pgctl $CmdPath \
 --pgdata $PGDATA \
 --pghost `hostname` \
 --name `hostname` \
 --pgport 5432 \
 --hostname `hostname` \
 --formation test --skip-pg-hba --no-ssl --maximum-backup-rate 1024M --monitor postgres://autoctl_node@dev-pgf200001:5432/pg_auto_failover

10:32:59 130213 WARN  PG_REGRESS_SOCK_DIR is set to "$path", and our setup is using "dev-pgf200002"
10:32:59 130213 INFO  Continuing from a previous `pg_autoctl create` failed attempt
10:32:59 130213 INFO  PostgreSQL state at registration time was: PGDATA does not exist
10:32:59 130213 INFO  FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
10:32:59 130213 INFO  Initialising PostgreSQL as a hot standby
10:32:59 130213 WARN  PG_REGRESS_SOCK_DIR is set to "$path", and our setup is using "dev-pgf200003"
10:32:59 130213 ERROR history file "00000202.history" contains 1024 lines, pg_autoctl only supports up to 1023 lines
10:32:59 130213 ERROR Failed to connect to the primary with a replication connection string. See above for details
10:32:59 130213 ERROR Failed to initialize standby server, see above for details
10:32:59 130213 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.
10:33:00 130203 ERROR pg_autoctl service node-init exited with exit status 12
10:33:00 130203 FATAL pg_autoctl service node-init has already been restarted 5 times in the last 1 seconds, stopping now
10:33:00 130205 INFO  Postgres controller service received signal SIGTERM, terminating
10:33:00 130203 FATAL Something went wrong in sub-process supervision, stopping now. See above for details.
10:33:00 130203 INFO  Stop pg_autoctl

#### check current state of the formation
$ pg_autoctl show state --formation test
         Name |  Node |          Host:Port |         TLI: LSN |   Connection |      Reported State |      Assigned State
 -------------+-------+--------------------+------------------+--------------+---------------------+--------------------
dev-pgf200003 |     3 | dev-pgf200003:5432 | 514: E2/710000D8 |   read-write |        wait_primary |        wait_primary
dev-pgf200002 |    21 | dev-pgf200002:5432 |           1: 0/0 |       none ! |        wait_standby |          catchingup

#### check timeline history file on primary
$ cat 00000202.history | tail -10

509     D0/3714D748     no recovery target specified

510     D0/B09A8F40     no recovery target specified

511     D0/FC976330     no recovery target specified

512     D1/A50307D8     no recovery target specified

513     D1/F89E3FB8     no recovery target specified

$ cat 00000202.history | wc -l
1025

#### remove empty lines
$ sed -i '/^$/d' 00000202.history

#### retry the "pg_autoctl create postgres" command on the secondary
$ pg_autoctl drop node

$ pg_autoctl create postgres \
 --pgctl $CmdPath \
 --pgdata $PGDATA \
 --pghost `hostname` \
 --name `hostname` \
 --pgport 5432 \
 --hostname `hostname` \
 --formation test --skip-pg-hba --no-ssl --maximum-backup-rate 1024M --monitor postgres://autoctl_node@dev-pgf200001:5432/pg_auto_failover

$ nohup pg_autoctl run >> /home1/postgres/db/pglog/pg_autoctl.log 2>&1 &

$ pg_autoctl show state --formation test
         Name |  Node |          Host:Port |         TLI: LSN |   Connection |      Reported State |      Assigned State
 -------------+-------+--------------------+------------------+--------------+---------------------+--------------------
dev-pgf200003 |     3 | dev-pgf200003:5432 | 514: E2/73000110 |   read-write |             primary |             primary
dev-pgf200002 |    21 | dev-pgf200002:5432 | 514: E2/73000110 |    read-only |           secondary |           secondary
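The line-count check and the blank-line workaround from the log above can be sketched as two small shell helpers. This is only an illustration: `history_line_stats` and `strip_blank_lines` are hypothetical names, not pg_autoctl commands, and the sketch writes a cleaned copy rather than editing the file in place.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a timeline history file.
# They are not part of pg_autoctl or PostgreSQL.

# Print three numbers for the file in $1: "total blank non_blank".
history_line_stats() {
    awk '
        /^[[:space:]]*$/ { blank++; next }   # empty or whitespace-only line
        { entries++ }                        # anything else counts as content
        END { printf "%d %d %d\n", NR, blank + 0, entries + 0 }
    ' "$1"
}

# Write a copy of $1 without empty lines to $2; the original is left untouched.
strip_blank_lines() {
    sed '/^$/d' "$1" > "$2"
}
```

For example, `history_line_stats 00000202.history` shows how many of the lines are blank before deciding whether to run `strip_blank_lines 00000202.history 00000202.history.clean`.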

Looking at the source code, the maximum number of lines in the .history file is set to 1024.

#define PG_AUTOCTL_MAX_TIMELINES 1024

https://github.com/hapostgres/pg_auto_failover/blob/10c62c247b34ca6515f3bbf17008a4a31a2eb16b/src/bin/pg_autoctl/pgsql.h#L196-L210

I would like to know why you set PG_AUTOCTL_MAX_TIMELINES to 1024.

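Entries recorded in the timelineID.history file are never deleted, so the file grows with every failover. The file in my test has an empty line after each entry, so it hit the line limit after only 513 timeline entries. As a result, failover could be performed only up to 513 times.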

If there is no particular reason for setting PG_AUTOCTL_MAX_TIMELINES to 1024, could you change it to a much larger value (e.g., 1048576, which is 2^20)?
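To make the arithmetic concrete, here is a rough sketch that estimates how many more timeline entries fit in a history file before it reaches a given line limit. The function name `remaining_timeline_entries` is hypothetical. The default limit of 1024 is taken from the macro above, and the assumption that future entries will use the same number of lines as the existing ones is mine, so treat the result as an estimate only.

```shell
#!/bin/sh
# Hypothetical estimate; not part of pg_autoctl.
# $1 = history file, $2 = line limit (assumed default: 1024)
remaining_timeline_entries() {
    limit=${2:-1024}
    awk -v limit="$limit" '
        !/^[[:space:]]*$/ { entries++ }      # count non-blank lines as entries
        END {
            if (entries == 0) { print limit; exit }
            per_entry = NR / entries         # average lines used per entry
            left = int((limit - NR) / per_entry)
            if (left < 0) left = 0
            print left
        }
    ' "$1"
}
```

Running `remaining_timeline_entries 00000202.history` before and after removing the empty lines shows how much headroom the workaround buys.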

hancci commented 1 year ago

I sincerely hope that this problem will be fixed.