percona / pmm

Percona Monitoring and Management: an open source database monitoring, observability and management tool
https://www.percona.com/software/database-tools/percona-monitoring-and-management
GNU Affero General Public License v3.0

PMM Server 2.36.0 cannot restart successfully: pg fails. #1986

Open cdmikechen opened 1 year ago

cdmikechen commented 1 year ago

Description

I installed pxc-operator and pmm-server using helm chart 1.12.1. When PMM was first deployed, it started correctly, but when the pod restarted, I found that the pg service kept failing.
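
For reference, the installs were roughly like this (the repo URL and release names below are illustrative, not exact):

helm repo add percona https://percona.github.io/percona-helm-charts/
helm repo update
helm install pxc-operator percona/pxc-operator -n pxc --create-namespace
helm install pmm percona/pmm -n pmm --create-namespace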

2023-04-11 10:18:28,393 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:18:28,546 INFO exited: qan-api2 (exit status 1; not expected)
2023-04-11 10:18:29,261 INFO success: pmm-update-perform-init entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,261 INFO success: clickhouse entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,261 INFO success: grafana entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,261 INFO success: nginx entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,292 INFO success: victoriametrics entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,308 INFO success: vmalert entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,308 INFO success: alertmanager entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,308 INFO success: vmproxy entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,308 INFO success: pmm-managed entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,308 INFO success: pmm-agent entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,422 INFO spawned: 'postgresql' with pid 153
2023-04-11 10:18:29,450 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:18:29,568 INFO spawned: 'qan-api2' with pid 155
2023-04-11 10:18:30,561 INFO success: qan-api2 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:31,570 INFO spawned: 'postgresql' with pid 185
2023-04-11 10:18:31,942 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:18:35,111 INFO spawned: 'postgresql' with pid 231
2023-04-11 10:18:35,344 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:18:39,669 INFO spawned: 'postgresql' with pid 260
2023-04-11 10:18:39,833 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:18:44,966 INFO spawned: 'postgresql' with pid 344
2023-04-11 10:18:45,090 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:18:46,956 INFO exited: pmm-update-perform-init (exit status 0; expected)
2023-04-11 10:18:52,051 INFO spawned: 'postgresql' with pid 396
2023-04-11 10:18:52,090 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:18:59,145 INFO spawned: 'postgresql' with pid 397
2023-04-11 10:18:59,183 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:19:07,246 INFO spawned: 'postgresql' with pid 399
2023-04-11 10:19:07,269 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:19:16,478 INFO spawned: 'postgresql' with pid 402
2023-04-11 10:19:16,497 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:19:26,712 INFO spawned: 'postgresql' with pid 404
2023-04-11 10:19:26,734 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:19:27,713 INFO gave up: postgresql entered FATAL state, too many start retries too quickly

I checked the pg logs in /srv/logs and found that the pg directory permissions are not correct.

2023-04-11 10:18:52.087 UTC [396] FATAL:  data directory "/srv/postgres14" has invalid permissions
2023-04-11 10:18:52.087 UTC [396] DETAIL:  Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
2023-04-11 10:18:59.179 UTC [397] FATAL:  data directory "/srv/postgres14" has invalid permissions
2023-04-11 10:18:59.179 UTC [397] DETAIL:  Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
2023-04-11 10:19:07.267 UTC [399] FATAL:  data directory "/srv/postgres14" has invalid permissions
2023-04-11 10:19:07.267 UTC [399] DETAIL:  Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
2023-04-11 10:19:16.495 UTC [402] FATAL:  data directory "/srv/postgres14" has invalid permissions
2023-04-11 10:19:16.495 UTC [402] DETAIL:  Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
2023-04-11 10:19:26.731 UTC [404] FATAL:  data directory "/srv/postgres14" has invalid permissions
2023-04-11 10:19:26.731 UTC [404] DETAIL:  Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).

I used the following commands to change the pg directory permissions and start pg. pg started after the first change, but after I restarted the pod, the directory permissions were changed again by some unknown script or program, which caused the exception above to reappear.

# restrict the data directory to the postgres user, as pg requires
chmod 700 -R /srv/postgres14
# start PostgreSQL as the postgres user against that data directory
su postgres -c "/usr/pgsql-14/bin/pg_ctl start -D /srv/postgres14"
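
A quick way to confirm the mode and ownership after the change (nothing PMM-specific, just plain coreutils):

stat -c '%a %U:%G' /srv/postgres14    # should report 700 and owner postgres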

Expected Results

The postgres directory permissions should not change; correct permissions are a mandatory requirement for pg startup.

Actual Results

The pg directory permissions are changed after every pod restart, so pg cannot start.

Version

pmm-server and pmm-client 2.36, OKD 4.11

Steps to reproduce

No response

Relevant logs

I checked the /srv permissions and found the following:


drwxrwsr-x. 13 root     pmm   4096 Apr  6 03:30 .
dr-xr-xr-x.  1 root     root    62 Apr 12 08:28 ..
drwxrwsr-x.  3 root     pmm   4096 Apr  6 03:29 alerting
drwxrwsr-x.  4 pmm      pmm   4096 Apr  6 03:29 alertmanager
drwxrwsr-x.  2 root     pmm   4096 Apr  6 03:30 backup
drwxrwsr-x. 13 root     pmm   4096 Apr 12 08:28 clickhouse
drwxrwsr-x.  6 grafana  pmm   4096 Apr 12 08:28 grafana
drwxrwsr-x.  2 pmm      pmm   4096 Apr 12 08:23 logs
drwxrws---.  2 root     pmm  16384 Apr  6 03:29 lost+found
drwxrwsr-x.  2 root     pmm   4096 Apr  6 03:29 nginx
-rw-rw-r--.  1 root     pmm      7 Apr  6 03:29 pmm-distribution
drwxrws---. 20 postgres pmm   4096 Apr 12 00:00 postgres14
drwxrwsr-x.  3 pmm      pmm   4096 Apr  6 03:29 prometheus
drwxrwsr-x.  3 pmm      pmm   4096 Apr  6 03:29 victoriametrics


cdmikechen commented 1 year ago

I also tried changing the pg directory permissions and renaming the directory, and found that the permissions were still changed after restarting the pod. Does a script or program force the folder permissions to be updated?

Before restart:

drwxrwsr-x.  3 root     pmm  4096 Apr  6 03:29 alerting
drwxrwsr-x.  4 pmm      pmm  4096 Apr  6 03:29 alertmanager
drwxrwsr-x.  2 root     pmm  4096 Apr  6 03:30 backup
drwxrwsr-x. 13 root     pmm  4096 Apr 12 08:57 clickhouse
drwxrwsr-x.  6 grafana  pmm  4096 Apr 12 08:57 grafana
drwxrwsr-x.  2 pmm      pmm  4096 Apr 12 08:23 logs
drwxrws---.  2 root     pmm 16384 Apr  6 03:29 lost+found
drwxrwsr-x.  2 root     pmm  4096 Apr  6 03:29 nginx
-rw-rw-r--.  1 root     pmm     7 Apr  6 03:29 pmm-distribution
drwx--S---. 20 postgres pmm  4096 Apr 12 00:00 postgres14-bak
drwxrwsr-x.  3 pmm      pmm  4096 Apr  6 03:29 prometheus
drwxrwsr-x.  3 pmm      pmm  4096 Apr  6 03:29 victoriametrics

After restart pod:

drwxrwsr-x.  3 root     pmm  4096 Apr  6 03:29 alerting
drwxrwsr-x.  4 pmm      pmm  4096 Apr  6 03:29 alertmanager
drwxrwsr-x.  2 root     pmm  4096 Apr  6 03:30 backup
drwxrwsr-x. 13 root     pmm  4096 Apr 12 09:05 clickhouse
drwxrwsr-x.  6 grafana  pmm  4096 Apr 12 09:05 grafana
drwxrwsr-x.  2 pmm      pmm  4096 Apr 12 08:23 logs
drwxrws---.  2 root     pmm 16384 Apr  6 03:29 lost+found
drwxrwsr-x.  2 root     pmm  4096 Apr  6 03:29 nginx
-rw-rw-r--.  1 root     pmm     7 Apr  6 03:29 pmm-distribution
drwxrws---. 20 postgres pmm  4096 Apr 12 00:00 postgres14-bak
drwxrwsr-x.  3 pmm      pmm  4096 Apr  6 03:29 prometheus
drwxrwsr-x.  3 pmm      pmm  4096 Apr  6 03:29 victoriametrics

denisok commented 1 year ago

Hi @cdmikechen, what version of the helm chart (pmm chart version) and which repo do you use for PMM?

There are a couple of things that could change those permissions: an init container, the storage provisioner, or some update procedure.
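
A quick way to see what the pod actually runs with (the pod name below is only a placeholder):

kubectl get pod pmm-0 -o jsonpath='{.spec.securityContext}'
kubectl get pod pmm-0 -o jsonpath='{.spec.initContainers[*].name}'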

As you said, you use OKD; we don't officially support OpenShift yet, as PMM requires root in the container.

Why was the pod restarted? Did you run some update procedure?

Thanks, Denys

cdmikechen commented 1 year ago

@denisok I killed the pod because I wanted to test whether pmm-server would work after a restart. I have solved this issue now: the problem occurred because I had added an fsGroup to the container's security context. After removing it, pmm-server starts normally.
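
Concretely, what I removed was essentially this key in the chart values (the group id is only an illustrative value; the presence of fsGroup, not the specific id, is what triggered the recursive permission change on /srv):

podSecurityContext:
  fsGroup: 1001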

However, there is another problem: pmm-client fails several times after every percona pod restart, and the pod only works after a few error restarts. I don't understand the reason for this.

denisok commented 1 year ago

@cdmikechen

What version of the helm chart (pmm chart version) and which repo do you use for PMM?

What do the logs and events show for that pod and all the containers in it?

cdmikechen commented 1 year ago

@denisok The helm chart version is 1.2.1. Here are the pmm-client logs:

INFO[2023-04-21T17:37:15.410+08:00] Run setup: true Sidecar mode: true            component=entrypoint
INFO[2023-04-21T17:37:15.410+08:00] Starting pmm-agent for liveness probe...      component=entrypoint
INFO[2023-04-21T17:37:15.410+08:00] Starting 'pmm-admin setup'...                 component=entrypoint
INFO[2023-04-21T17:37:15.552+08:00] Loading configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml.  component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/node_exporter  component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/mysqld_exporter  component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/mongodb_exporter  component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/postgres_exporter  component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/proxysql_exporter  component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/rds_exporter  component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/azure_exporter  component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/vmagent  component=main
INFO[2023-04-21T17:37:15.553+08:00] Runner capacity set to 32.                    component=runner
INFO[2023-04-21T17:37:15.553+08:00] Loading configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml.  component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/node_exporter  component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/mysqld_exporter  component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/mongodb_exporter  component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/postgres_exporter  component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/proxysql_exporter  component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/rds_exporter  component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/azure_exporter  component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/vmagent  component=main
INFO[2023-04-21T17:37:15.554+08:00] Window check connection time is 1.00 hour(s) 
INFO[2023-04-21T17:37:15.554+08:00] Starting...                                   component=client
ERRO[2023-04-21T17:37:15.554+08:00] Agent ID is not provided, halting.            component=client
INFO[2023-04-21T17:37:15.554+08:00] Starting local API server on http://0.0.0.0:7777/ ...  component=local-server/JSON
INFO[2023-04-21T17:37:15.556+08:00] Started.                                      component=local-server/JSON
INFO[2023-04-21T17:37:15.559+08:00] Loading configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml.  component=setup
INFO[2023-04-21T17:37:15.559+08:00] Using /usr/local/percona/pmm2/exporters/node_exporter  component=setup
INFO[2023-04-21T17:37:15.559+08:00] Using /usr/local/percona/pmm2/exporters/mysqld_exporter  component=setup
INFO[2023-04-21T17:37:15.559+08:00] Using /usr/local/percona/pmm2/exporters/mongodb_exporter  component=setup
INFO[2023-04-21T17:37:15.559+08:00] Using /usr/local/percona/pmm2/exporters/postgres_exporter  component=setup
INFO[2023-04-21T17:37:15.559+08:00] Using /usr/local/percona/pmm2/exporters/proxysql_exporter  component=setup
INFO[2023-04-21T17:37:15.559+08:00] Using /usr/local/percona/pmm2/exporters/rds_exporter  component=setup
INFO[2023-04-21T17:37:15.559+08:00] Using /usr/local/percona/pmm2/exporters/azure_exporter  component=setup
INFO[2023-04-21T17:37:15.559+08:00] Using /usr/local/percona/pmm2/exporters/vmagent  component=setup
Checking local pmm-agent status...
pmm-agent is running.
Registering pmm-agent on PMM Server...
Registered.
Configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml updated.
Reloading pmm-agent configuration...
INFO[2023-04-21T17:37:15.887+08:00] Loading configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml.  component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Using /usr/local/percona/pmm2/exporters/node_exporter  component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Using /usr/local/percona/pmm2/exporters/mysqld_exporter  component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Using /usr/local/percona/pmm2/exporters/mongodb_exporter  component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Using /usr/local/percona/pmm2/exporters/postgres_exporter  component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Using /usr/local/percona/pmm2/exporters/proxysql_exporter  component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Using /usr/local/percona/pmm2/exporters/rds_exporter  component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Using /usr/local/percona/pmm2/exporters/azure_exporter  component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Using /usr/local/percona/pmm2/exporters/vmagent  component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Stopped.                                      component=local-server/JSON
INFO[2023-04-21T17:37:15.890+08:00] Done.                                         component=local-server
INFO[2023-04-21T17:37:15.890+08:00] Done.                                         component=supervisor
INFO[2023-04-21T17:37:15.890+08:00] Done.                                         component=main
Checking local pmm-agent status...
pmm-agent is not running.
INFO[2023-04-21T17:37:20.901+08:00] 'pmm-admin setup' exited with 0               component=entrypoint
INFO[2023-04-21T17:37:20.901+08:00] Stopping pmm-agent...                         component=entrypoint
FATA[2023-04-21T17:37:20.901+08:00] Failed to kill pmm-agent: os: process already finished  component=entrypoint

chadr123 commented 1 year ago

Hi. I think the pmm-client failure is very similar to this issue that I created: https://jira.percona.com/browse/PMM-11893

davidmnoriega commented 1 year ago

I ran into the same issue with pmm-server using helm chart version 1.2.5 and pmm-server 2.39.0. I did not set any security context in the helm chart values, and the deployed StatefulSet had them empty.

I then learned that our k8s cluster applies a default security context at both the pod and container level; here is the pod security context:

securityContext:
  fsGroup: 1
  seccompProfile:
    type: RuntimeDefault
  supplementalGroups:
    - 1

After a restart, this is what /srv permissions would look like:

[root@ads-pmm-stage-0-0 opt] # ls -alh /srv
total 72K
drwxrwsr-x. 13 root     bin  4.0K Aug 22 04:47 .
dr-xr-xr-x.  1 root     root 4.0K Aug 22 04:54 ..
drwxrwsr-x.  3 root     bin  4.0K Aug 22 04:47 alerting
drwxrwsr-x.  4 pmm      bin  4.0K Aug 22 04:47 alertmanager
drwxrwsr-x.  2 root     bin  4.0K Aug 22 04:47 backup
drwxrwsr-x. 13 root     bin  4.0K Aug 22 04:54 clickhouse
drwxrwsr-x.  6 grafana  bin  4.0K Aug 22 04:54 grafana
drwxrwsr-x.  2 pmm      bin  4.0K Aug 22 04:46 logs
drwxrws---.  2 root     bin   16K Aug 22 04:46 lost+found
drwxrwsr-x.  2 root     bin  4.0K Aug 22 04:46 nginx
-rw-rw-r--.  1 root     bin     7 Aug 22 04:46 pmm-distribution
drwxrws---. 20 postgres bin  4.0K Aug 22 04:52 postgres14
drwxrwsr-x.  3 pmm      bin  4.0K Aug 22 04:46 prometheus
drwxrwsr-x.  3 pmm      bin  4.0K Aug 22 04:46 victoriametrics

After some trial and error, I found that this helm chart value allowed pmm to survive restarts:

podSecurityContext:
  fsGroupChangePolicy: OnRootMismatch

The effective pod security context:

securityContext:
  fsGroup: 1
  fsGroupChangePolicy: OnRootMismatch
  seccompProfile:
    type: RuntimeDefault
  supplementalGroups:
  - 1
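
My understanding is that with fsGroup set, the kubelet re-groups the volume and adds group/setgid bits recursively on every mount, which is what turns /srv/postgres14 into drwxrws--- and trips PostgreSQL's 0700/0750 check; OnRootMismatch tells the kubelet to skip that recursive pass when the volume root already matches. For reference, the same value can also be set without a values file, roughly like this (release and chart names are assumptions):

helm upgrade pmm percona/pmm --reuse-values \
  --set podSecurityContext.fsGroupChangePolicy=OnRootMismatch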

Starting fresh, this is what /srv looked like on first boot:

[root@ads-pmm-stage-0-0 opt] # ls -alh /srv
total 72K
drwxrwsr-x. 13 root     bin      4.0K Aug 22 19:48 .
dr-xr-xr-x.  1 root     root     4.0K Aug 22 19:47 ..
drwxr-sr-x.  3 root     bin      4.0K Aug 22 19:47 alerting
drwxrwxr-x.  4 pmm      pmm      4.0K Aug 22 19:47 alertmanager
drwxr-sr-x.  2 root     bin      4.0K Aug 22 19:48 backup
drwxr-sr-x. 13 root     bin      4.0K Aug 22 19:47 clickhouse
drwxr-sr-x.  6 grafana  render   4.0K Aug 22 19:48 grafana
drwxr-sr-x.  2 pmm      pmm      4.0K Aug 22 19:47 logs
drwxrws---.  2 root     bin       16K Aug 22 19:47 lost+found
drwxr-sr-x.  2 root     bin      4.0K Aug 22 19:47 nginx
-rw-r--r--.  1 root     bin         7 Aug 22 19:47 pmm-distribution
drwx------. 20 postgres postgres 4.0K Aug 22 19:47 postgres14
drwxr-sr-x.  3 pmm      pmm      4.0K Aug 22 19:47 prometheus
drwxrwxr-x.  3 pmm      pmm      4.0K Aug 22 19:47 victoriametrics

and after a reboot:

[root@ads-pmm-stage-0-0 opt] # ls -alh /srv
total 72K
drwxrwsr-x. 13 root     bin      4.0K Aug 22 19:48 .
dr-xr-xr-x.  1 root     root     4.0K Aug 22 19:53 ..
drwxr-sr-x.  3 root     bin      4.0K Aug 22 19:47 alerting
drwxrwxr-x.  4 pmm      pmm      4.0K Aug 22 19:47 alertmanager
drwxr-sr-x.  2 root     bin      4.0K Aug 22 19:48 backup
drwxr-sr-x. 13 root     bin      4.0K Aug 22 19:54 clickhouse
drwxr-sr-x.  6 grafana  render   4.0K Aug 22 19:53 grafana
drwxr-sr-x.  2 pmm      pmm      4.0K Aug 22 19:47 logs
drwxrws---.  2 root     bin       16K Aug 22 19:47 lost+found
drwxr-sr-x.  2 root     bin      4.0K Aug 22 19:47 nginx
-rw-r--r--.  1 root     bin         7 Aug 22 19:47 pmm-distribution
drwx------. 20 postgres postgres 4.0K Aug 22 19:53 postgres14
drwxr-sr-x.  3 pmm      pmm      4.0K Aug 22 19:47 prometheus
drwxrwxr-x.  3 pmm      pmm      4.0K Aug 22 19:47 victoriametrics

I hope there are plans to support running without root.