goharbor / harbor-helm

The helm chart to deploy Harbor
Apache License 2.0
1.17k stars 760 forks source link

harbor-db Always in recovery mode #1804

Open Tebby0753 opened 1 month ago

Tebby0753 commented 1 month ago

Hello: I used helm to deploy harbor, and it has been running for about more than a year, but recently harbor-db often prints: the database system is in recovery mode. I see that the monitoring resources are normal. The following is the printed log:

2024-08-05 10:07:43.816 UTC [2480] FATAL: the database system is in recovery mode 2024-08-05 10:07:53.812 UTC [2490] FATAL: the database system is in recovery mode 2024-08-05 10:08:02.040 UTC [1635] LOG: database system was not properly shut down; automatic recovery in progress 2024-08-05 10:08:03.465 UTC [1635] LOG: invalid record length at 1/41B81728: wanted 24, got 0 2024-08-05 10:08:03.465 UTC [1635] LOG: redo is not required 2024-08-05 10:08:03.825 UTC [2501] FATAL: the database system is in recovery mode 2024-08-05 10:08:07.401 UTC [1] LOG: database system is ready to accept connections 2024-08-05 10:08:17.371 UTC [1] LOG: server process (PID 2514) exited with exit code 141 2024-08-05 10:08:17.371 UTC [1] LOG: terminating any other active server processes 2024-08-05 10:08:17.371 UTC [2517] WARNING: terminating connection because of crash of another server process 2024-08-05 10:08:17.371 UTC [2517] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2024-08-05 10:08:17.371 UTC [2517] HINT: In a moment you should be able to reconnect to the database and repeat your command. 2024-08-05 10:08:17.372 UTC [2505] WARNING: terminating connection because of crash of another server process 2024-08-05 10:08:17.372 UTC [2505] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2024-08-05 10:08:17.372 UTC [2505] HINT: In a moment you should be able to reconnect to the database and repeat your command. 2024-08-05 10:08:18.251 UTC [1] LOG: all server processes terminated; reinitializing 2024-08-05 10:08:18.548 UTC [2518] LOG: database system was interrupted; last known up at 2024-08-05 10:08:07 UTC 2024-08-05 10:08:23.812 UTC [2528] FATAL: the database system is in recovery mode 2024-08-05 10:08:33.811 UTC [2538] FATAL: the database system is in recovery mode 2024-08-05 10:08:43.813 UTC [2548] FATAL: the database system is in recovery mode 2024-08-05 10:08:44.191 UTC [2558] FATAL: the database system is in recovery mode 2024-08-05 10:08:53.810 UTC [2568] FATAL: the database system is in recovery mode 2024-08-05 10:09:03.813 UTC [2578] FATAL: the database system is in recovery mode 2024-08-05 10:09:13.816 UTC [2587] FATAL: the database system is in recovery mode 2024-08-05 10:09:23.817 UTC [2597] FATAL: the database system is in recovery mode 2024-08-05 10:09:33.813 UTC [2607] FATAL: the database system is in recovery mode 2024-08-05 10:09:43.810 UTC [2617] FATAL: the database system is in recovery mode 2024-08-05 10:09:50.197 UTC [2627] FATAL: the database system is in recovery mode 2024-08-05 10:09:53.812 UTC [2637] FATAL: the database system is in recovery mode 2024-08-05 10:10:03.816 UTC [2647] FATAL: the database system is in recovery mode 2024-08-05 10:10:13.809 UTC [2656] FATAL: the database system is in recovery mode 2024-08-05 10:10:23.816 UTC [2666] FATAL: the database system is in recovery mode

The log will keep looping... i want to know what happened,Thanks!

Tebby0753 commented 1 month ago

harbor-db resource monitoring: WX20240805-182237

zyyw commented 1 month ago

According to this resources:

This error of:

2024-08-05 10:07:53.812 UTC [2490] FATAL: the database system is in recovery mode
2024-08-05 10:08:02.040 UTC [1635] LOG: database system was not properly shut down; automatic recovery in progress

2024-08-05 10:08:17.371 UTC [2517] WARNING: terminating connection because of crash of another server process
2024-08-05 10:08:17.371 UTC [2517] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2024-08-05 10:08:17.371 UTC [2517] HINT: In a moment you should be able to reconnect to the database and repeat your command.

may have something to do with the disk space utilization and memory allocation in your environment.

abinet commented 5 days ago

We are experiencing same problem now. Harbor version 2.10, internal PostgreSQL database. Following pattern is observable multiple times per day:

2024-09-09 16:59:39.372 UTC [1] LOG:  server process (PID 1129079) exited with exit code 141ESC[0K
2024-09-09 16:59:39.372 UTC [1] LOG:  terminating any other active server processesESC[0K
2024-09-09 16:59:39.410 UTC [1] LOG:  all server processes terminated; reinitializingESC[0K
2024-09-09 16:59:39.473 UTC [1129087] LOG:  database system was interrupted; last known up at 2024-09-09 16:02:59 UTCESC[0K
2024-09-09 16:59:41.136 UTC [1129087] LOG:  database system was not properly shut down; automatic recovery in progressESC[0K
2024-09-09 16:59:41.159 UTC [1129087] LOG:  redo starts at 1/36EC9F70ESC[0K
2024-09-09 16:59:41.160 UTC [1129087] LOG:  invalid record length at 1/36ECA070: wanted 24, got 0ESC[0K
2024-09-09 16:59:41.160 UTC [1129087] LOG:  redo done at 1/36ECA038 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 sESC[0K
2024-09-09 16:59:41.225 UTC [1] LOG:  database system is ready to accept connectionsESC[0K
2024-09-09 16:59:47.355 UTC [1129115] LOG:  PID 1128695 in cancel request did not match any processESC[0K
2024-09-09 16:59:49.162 UTC [1129116] LOG:  PID 1128696 in cancel request did not match any processESC[0K
2024-09-09 16:59:49.163 UTC [1129117] LOG:  PID 1128718 in cancel request did not match any processESC[0K
2024-09-09 16:59:58.588 UTC [1129139] LOG:  PID 1129004 in cancel request did not match any processESC[0K
2024-09-09 17:00:07.196 UTC [1129164] LOG:  PID 1128841 in cancel request did not match any processESC[0K

Extending memory resources for database pod does not help. I am not able to see any anomalies in Harbor Grafana Dashboards (although there is no postgres specific dashboard provided). There are neither OOM killer logs nor OOM Kubernetes events.

Here I found similar issues https://github.com/goharbor/harbor/issues/20272, https://github.com/goharbor/harbor-helm/issues/1184 but they are closed already without solution being provided.

General PostgreSQL suggestion in such cases is to investigate core dumps. I am not sure if Harbor Database Pod has been configured to write core dumps, nor allows such a configuration.

Any suggestions and ideas how to troubleshoot this are very appreciated.