reactive-tech / kubegres

Kubegres is a Kubernetes operator that deploys one or more clusters of PostgreSql instances and manages database replication, failover and backups.
https://www.kubegres.io
Apache License 2.0

Kubegres pods are restarting again and again and creating new replicas #125

Open richamishra006 opened 2 years ago

richamishra006 commented 2 years ago

Hi Team, I have deployed kubegres with three replicas; the pods are named something like postgresql-32-0, postgresql-34-0, postgresql-35-0. The pods keep restarting and creating new replicas, and I am unable to find what is causing this. I am adding the logs here:

2022-07-25 02:01:14.159 GMT [1] LOG:  starting PostgreSQL 13.2 (Debian 13.2-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2022-07-25 02:01:14.190 GMT [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2022-07-25 02:01:14.190 GMT [1] LOG:  listening on IPv6 address "::", port 5432
2022-07-25 02:01:14.296 GMT [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-07-25 02:01:14.504 GMT [31] LOG:  database system was interrupted; last known up at 2022-07-25 01:59:40 GMT
2022-07-25 02:01:15.089 GMT [31] LOG:  entering standby mode
2022-07-25 02:01:15.218 GMT [31] LOG:  redo starts at 5/E4000028
2022-07-25 02:01:17.128 GMT [31] LOG:  consistent recovery state reached at 5/E46CD688
2022-07-25 02:01:17.138 GMT [1] LOG:  database system is ready to accept read only connections
2022-07-25 02:01:17.464 GMT [41] LOG:  started streaming WAL from primary at 5/E5000000 on timeline 21
2022-07-25 04:01:48.575 GMT [13264] ERROR:  canceling statement due to conflict with recovery
2022-07-25 04:01:48.575 GMT [13264] DETAIL:  User query might have needed to see row versions that must be removed.
2022-07-25 04:01:48.575 GMT [13264] STATEMENT:  COPY public.reversion_version (id, object_id, format, serialized_data, object_repr, content_type_id, revision_id, db) TO stdout;
2022-07-25 05:08:07.106 GMT [41] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2022-07-25 05:08:07.107 GMT [31] LOG:  invalid resource manager ID 32 at 6/754D1B0
2022-07-25 05:08:07.237 GMT [20697] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 05:08:12.121 GMT [20717] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 05:08:17.130 GMT [20727] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 05:08:22.177 GMT [20742] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 05:08:27.155 GMT [20749] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 05:08:32.146 GMT [20764] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 05:08:37.151 GMT [20765] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 05:08:42.163 GMT [20766] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 05:08:47.205 GMT [20767] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 05:08:52.172 GMT [20768] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 05:08:57.170 GMT [20769] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 05:09:02.187 GMT [20770] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 05:09:07.183 GMT [20771] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 05:09:12.193 GMT [20772] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 05:09:17.203 GMT [20812] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 05:09:22.213 GMT [20813] LOG:  fetching timeline history file for timeline 22 from primary server
2022-07-25 05:09:22.251 GMT [20813] LOG:  started streaming WAL from primary at 6/7000000 on timeline 21
2022-07-25 05:09:22.418 GMT [20813] LOG:  replication terminated by primary server
2022-07-25 05:09:22.418 GMT [20813] DETAIL:  End of WAL reached on timeline 21 at 6/754D1B0.
2022-07-25 05:09:22.427 GMT [31] LOG:  new target timeline is 22
2022-07-25 05:09:22.534 GMT [20813] LOG:  restarted WAL streaming at 6/7000000 on timeline 22
2022-07-25 06:00:36.857 GMT [24859] LOG:  duration: 7022.961 ms  statement: COPY public.campaign (id, created, modified, filters, suid, title, priority, status, actions, action_type, substitutes, groups, type, active, smart, monitoring, weekdays, start_at, end_at, best_time, activation, timezone, activates_at, expires_at, finished, feed, app_id, creator_id, segment_id, rematch_repeat, rematch_duration, target, metric, champion, split_id, product, feed_deletion_conditions, tag_filters, goal_id, operating_system, goals, feed_repeat_conditions, last_editor_id, trigger_conditions) TO stdout;
2022-07-25 06:02:21.215 GMT [24859] LOG:  duration: 99503.629 ms  statement: COPY public.reversion_version (id, object_id, format, serialized_data, object_repr, content_type_id, revision_id, db) TO stdout;
2022-07-25 06:12:07.603 GMT [20813] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2022-07-25 06:12:07.604 GMT [31] LOG:  record with incorrect prev-link 636B71A1/910 at 6/136DC6A8
2022-07-25 06:12:07.778 GMT [26358] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 06:12:12.678 GMT [26375] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 06:12:17.681 GMT [26384] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 06:12:22.640 GMT [26406] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 06:12:27.761 GMT [26414] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 06:12:34.846 GMT [26426] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 06:12:38.093 GMT [26433] FATAL:  could not connect to the primary server: could not translate host name "pointzi-postgresql" to address: Name or service not known
2022-07-25 06:12:39.203 GMT [1] LOG:  received fast shutdown request
2022-07-25 06:12:39.305 GMT [1] LOG:  aborting any active transactions
2022-07-25 06:12:39.932 GMT [32] LOG:  shutting down
2022-07-25 06:12:40.281 GMT [1] LOG:  database system is shut down

Please help me with this; any suggestions would be highly appreciated.
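The repeated "could not translate host name" failures above suggest the primary Service pointzi-postgresql briefly disappears while kubegres fails over to a new timeline. A few commands that could narrow this down (the Service name is taken from the logs above; the label selector and operator deployment name are assumptions, so adjust them for your cluster):

```shell
# Check whether the primary Service exists and has live endpoints;
# "pointzi-postgresql" is the host name failing to resolve in the logs.
kubectl get svc pointzi-postgresql -o wide
kubectl get endpoints pointzi-postgresql

# Watch the pods: a new name appearing (e.g. postgresql-36-0) means
# kubegres replaced a replica rather than restarting the same pod.
# The label selector here is an assumption; check your pod labels first.
kubectl get pods --show-labels
kubectl get pods -l app=pointzi-postgresql -w

# The operator's log usually states why it recreated a pod
# (deployment/namespace names assume a default kubegres install).
kubectl logs -n kubegres-system deployment/kubegres-controller-manager

# For a pod that restarted, inspect its previous container log and events.
kubectl logs postgresql-32-0 --previous
kubectl describe pod postgresql-32-0
```

The "canceling statement due to conflict with recovery" errors are a separate, known hot-standby behaviour (long read queries on a replica cancelled by replayed vacuum cleanup) and are unlikely to be what restarts the pods.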

teebu commented 2 years ago

How did you solve this? We're also seeing "could not connect to the primary server".