spantaleev / matrix-docker-ansible-deploy

🐳 Matrix (An open network for secure, decentralized communication) server setup using Ansible and Docker
GNU Affero General Public License v3.0

review service start order #1517

Closed HarHarLinks closed 5 months ago

HarHarLinks commented 2 years ago

this is kind of pedantic, as it generally works out alright, but reviewing the service start order might speed up upgrade restarts.

when running the playbook, some services fail to start a couple of times before eventually working. the most obvious one is nginx: it needs every container mentioned in its config to already exist, e.g.:

2022/01/08 17:11:01 [emerg] 1#1: host not found in upstream "matrix-prometheus-postgres-exporter" in /etc/nginx/conf.d/matrix-grafana.conf:63

Other services depend on synapse and might start it earlier than the playbook order implies.

Services should obtain more After=, Requires=, etc. configuration, conditionally based on what is enabled and what isn't.

Is there a tool that can graphically render service relations from a bunch of .service files? Would be helpful.
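
One possible answer to the question above: `systemd-analyze dot` prints the dependency graph of loaded units in Graphviz format and accepts unit-name patterns. A minimal sketch as a one-off Ansible play, assuming Graphviz's `dot` binary is installed on the host. The play itself and the output path are illustrative and not part of the playbook:

```yaml
# Hypothetical helper play, not part of the playbook.
- hosts: matrix_servers
  become: true
  tasks:
    - name: Render dependencies of all matrix-* units as an SVG graph
      # systemd-analyze dot writes Graphviz source to stdout;
      # the dot binary (an assumption: Graphviz must be installed) renders it.
      ansible.builtin.shell: >-
        systemd-analyze dot 'matrix-*' | dot -Tsvg > /tmp/matrix-units.svg
      changed_when: false
```

The resulting graph only reflects units that systemd has loaded, so it is best generated after the services have been installed.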

spantaleev commented 2 years ago

We specify some wanted services in matrix_nginx_proxy_systemd_wanted_services_list (which go to Wants= in the matrix-nginx-proxy.service file) in group_vars/matrix_servers: https://github.com/spantaleev/matrix-docker-ansible-deploy/blob/4e4fb98a65474fd058c61a838db6ac312a09e7df/group_vars/matrix_servers#L1462-L1471

We might define additional wanted services there.


We have matrix_nginx_proxy_systemd_required_services_list as well, but we avoid hardcoding such hard dependencies. If matrix-nginx-proxy lists some random service (say matrix-client-element) as required, then restarting matrix-client-element for whatever reason (a manual restart, or it dying and getting restarted) has the negative side effect of bringing down matrix-nginx-proxy as well.

If some random service (say matrix-dimension) fails to start for whatever reason (misconfiguration, container image bug, etc.), we also don't want that to prevent matrix-nginx-proxy from starting.

We don't want random services failing to start or being restarted to bring down everything else.

This is why we only use a "wanted services" list.
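
The difference between the two lists can be sketched as follows. The values are illustrative only (in particular, docker.service as a required service is an assumption), and the comments describe standard systemd semantics for Wants= and Requires= rather than the exact contents of the playbook's unit template:

```yaml
# Illustrative values only, not the playbook's actual defaults.

# Soft dependencies (Wants=): starting matrix-nginx-proxy also pulls these in,
# but matrix-nginx-proxy still starts if one of them fails, and it is not
# stopped when one of them is stopped or restarted.
matrix_nginx_proxy_systemd_wanted_services_list:
  - matrix-synapse.service
  - matrix-client-element.service

# Hard dependencies (Requires=): explicitly stopping or restarting a listed
# unit also stops or restarts matrix-nginx-proxy, so this list is kept minimal.
matrix_nginx_proxy_systemd_required_services_list:
  - docker.service  # assumption: shown only as an example of a hard dependency
```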


nginx insisting on resolving the DNS names for all defined upstreams when it starts is kind of bad, but.. from what I remember, there's no way around it, except for using their "nginx plus" offering.

spantaleev commented 2 years ago

Adding this to the matrix_nginx_proxy_systemd_wanted_services_list list in group_vars/matrix_servers may solve this particular problem:

matrix_nginx_proxy_systemd_wanted_services_list: |
  {{
    ['matrix-' + matrix_homeserver_implementation + '.service']
    +
    (['matrix-corporal.service'] if matrix_corporal_enabled else [])
    +
    (['matrix-ma1sd.service'] if matrix_ma1sd_enabled else [])
    +
    (['matrix-client-element.service'] if matrix_client_element_enabled else [])
+    +
+    (['matrix-prometheus-postgres-exporter.service'] if matrix_prometheus_postgres_exporter_enabled else [])
  }}

We could extend this list with various other services. PRs are welcome ;)

HarHarLinks commented 2 years ago

the ok status for some services when stopping and starting seems to imply existing relations that start/stop other services before the playbook does. however, this is harmless unless it leads to an ordering violation elsewhere.

TASK [matrix-common-after : Ensure Matrix services are stopped] ******
changed: [matrix.matrix_domain] => (item=matrix-mailer.service)
changed: [matrix.matrix_domain] => (item=matrix-postgres.service)
changed: [matrix.matrix_domain] => (item=matrix-redis)
ok: [matrix.matrix_domain] => (item=matrix-appservice-webhooks.service)
ok: [matrix.matrix_domain] => (item=matrix-mautrix-signal.service)
changed: [matrix.matrix_domain] => (item=matrix-mautrix-signal-daemon.service)
ok: [matrix.matrix_domain] => (item=matrix-mx-puppet-discord.service)
ok: [matrix.matrix_domain] => (item=matrix-mx-puppet-steam.service)
ok: [matrix.matrix_domain] => (item=matrix-mx-puppet-slack.service)
ok: [matrix.matrix_domain] => (item=matrix-synapse.service)
changed: [matrix.matrix_domain] => (item=matrix-synapse-worker-generic_worker-18111.service)
changed: [matrix.matrix_domain] => (item=matrix-synapse-worker-federation_sender-0.service)
changed: [matrix.matrix_domain] => (item=matrix-synapse-worker-pusher-0.service)
changed: [matrix.matrix_domain] => (item=matrix-synapse-worker-appservice-0.service)
changed: [matrix.matrix_domain] => (item=matrix-synapse-worker-media_repository-18551.service)
changed: [matrix.matrix_domain] => (item=matrix-synapse-worker-frontend_proxy-18771.service)
changed: [matrix.matrix_domain] => (item=matrix-synapse-admin.service)
changed: [matrix.matrix_domain] => (item=matrix-prometheus-node-exporter.service)
ok: [matrix.matrix_domain] => (item=matrix-registration.service)
changed: [matrix.matrix_domain] => (item=matrix-jitsi-web.service)
changed: [matrix.matrix_domain] => (item=matrix-jitsi-prosody.service)
ok: [matrix.matrix_domain] => (item=matrix-jitsi-jicofo.service)
ok: [matrix.matrix_domain] => (item=matrix-jitsi-jvb.service)
ok: [matrix.matrix_domain] => (item=matrix-dimension.service)
ok: [matrix.matrix_domain] => (item=matrix-etherpad.service)
changed: [matrix.matrix_domain] => (item=matrix-nginx-proxy.service)
changed: [matrix.matrix_domain] => (item=matrix-ssl-lets-encrypt-certificates-renew.timer)
changed: [matrix.matrix_domain] => (item=matrix-ssl-nginx-proxy-reload.timer)
changed: [matrix.matrix_domain] => (item=matrix-coturn.service)
changed: [matrix.matrix_domain] => (item=matrix-coturn-reload.timer)
ok: [matrix.matrix_domain] => (item=matrix-prometheus-postgres-exporter.service)

TASK [matrix-common-after : Ensure Matrix services are started] ******
changed: [matrix.matrix_domain] => (item=matrix-mailer.service)
changed: [matrix.matrix_domain] => (item=matrix-postgres.service)
changed: [matrix.matrix_domain] => (item=matrix-redis)
changed: [matrix.matrix_domain] => (item=matrix-appservice-webhooks.service)
changed: [matrix.matrix_domain] => (item=matrix-mautrix-signal.service)
ok: [matrix.matrix_domain] => (item=matrix-mautrix-signal-daemon.service)
changed: [matrix.matrix_domain] => (item=matrix-mx-puppet-discord.service)
changed: [matrix.matrix_domain] => (item=matrix-mx-puppet-steam.service)
changed: [matrix.matrix_domain] => (item=matrix-mx-puppet-slack.service)
ok: [matrix.matrix_domain] => (item=matrix-synapse.service)
ok: [matrix.matrix_domain] => (item=matrix-synapse-worker-generic_worker-18111.service)
ok: [matrix.matrix_domain] => (item=matrix-synapse-worker-federation_sender-0.service)
ok: [matrix.matrix_domain] => (item=matrix-synapse-worker-pusher-0.service)
ok: [matrix.matrix_domain] => (item=matrix-synapse-worker-appservice-0.service)
ok: [matrix.matrix_domain] => (item=matrix-synapse-worker-media_repository-18551.service)
ok: [matrix.matrix_domain] => (item=matrix-synapse-worker-frontend_proxy-18771.service)
changed: [matrix.matrix_domain] => (item=matrix-synapse-admin.service)
changed: [matrix.matrix_domain] => (item=matrix-prometheus-node-exporter.service)
changed: [matrix.matrix_domain] => (item=matrix-registration.service)
changed: [matrix.matrix_domain] => (item=matrix-jitsi-web.service)
changed: [matrix.matrix_domain] => (item=matrix-jitsi-prosody.service)
changed: [matrix.matrix_domain] => (item=matrix-jitsi-jicofo.service)
changed: [matrix.matrix_domain] => (item=matrix-jitsi-jvb.service)
changed: [matrix.matrix_domain] => (item=matrix-dimension.service)
changed: [matrix.matrix_domain] => (item=matrix-etherpad.service)
ok: [matrix.matrix_domain] => (item=matrix-nginx-proxy.service)
changed: [matrix.matrix_domain] => (item=matrix-ssl-lets-encrypt-certificates-renew.timer)
changed: [matrix.matrix_domain] => (item=matrix-ssl-nginx-proxy-reload.timer)
ok: [matrix.matrix_domain] => (item=matrix-coturn.service)
changed: [matrix.matrix_domain] => (item=matrix-coturn-reload.timer)
changed: [matrix.matrix_domain] => (item=matrix-prometheus-postgres-exporter.service)

matrix-appservice-webhooks and matrix-dimension both require matrix-nginx-proxy to be reachable before they should start.

spantaleev commented 2 years ago

matrix-nginx-proxy is listed as a dependency in matrix_appservice_webhooks_systemd_required_services_list in group_vars/matrix_servers.

Likewise for matrix_dimension_systemd_required_services_list.

The fact that the playbook tries to start matrix-appservice-webhooks and matrix-dimension before matrix-nginx-proxy may be suboptimal, but is ultimately not a problem. systemd .service files define dependencies correctly, so starting any one of these services will provoke matrix-nginx-proxy to get started. Starting matrix-nginx-proxy then becomes a no-op (you can see the ok mark there, instead of changed).

Moreover, these dependencies are important for when services are started/restarted by other means (system reboot, manual systemd service restart, service failure, etc.). --tags=start is not the only way to (re-)start services. Having correct dependencies in the systemd service files is more important than what --tags=start does.


That said, we may reorder roles in setup.yml to improve the --tags=start situation. It probably needs to be done with care though, because certain roles (some bridges, at least) inject configuration into matrix-nginx-proxy during runtime.

Example: https://github.com/spantaleev/matrix-docker-ansible-deploy/blob/5a8b17c1df268e7fc206ed72bc8c7a7d56626c1b/roles/matrix-bridge-mautrix-telegram/tasks/init.yml#L25-L60


Similarly, some services also inject stuff into matrix-synapse variables.

Example: https://github.com/spantaleev/matrix-docker-ansible-deploy/blob/5a8b17c1df268e7fc206ed72bc8c7a7d56626c1b/roles/matrix-bridge-mautrix-telegram/tasks/init.yml#L12-L23


So.. re-ordering roles is probably not ideal.

Alternatively, each role can inject itself into matrix_systemd_services_list with not just a service name, but also some priority. We can then sort them and stop/start them in a smarter way. This complicates things though, and for little benefit.
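
A rough sketch of what such a priority-based list could look like. The `priority` key, the numeric values, and the dict-shaped list entries are hypothetical and do not exist in the playbook. The sorting relies on Jinja2's built-in `sort` filter:

```yaml
# Sketch only: the "priority" key and the values below are hypothetical.
- hosts: matrix_servers
  become: true
  vars:
    # Each role would append a dict instead of a bare service name.
    matrix_systemd_services_list:
      - {name: matrix-nginx-proxy.service, priority: 300}
      - {name: matrix-postgres.service, priority: 100}
      - {name: matrix-synapse.service, priority: 200}
  tasks:
    - name: Ensure Matrix services are stopped (highest priority number first)
      ansible.builtin.service:
        name: "{{ item.name }}"
        state: stopped
      loop: "{{ matrix_systemd_services_list | sort(attribute='priority', reverse=true) }}"

    - name: Ensure Matrix services are started (lowest priority number first)
      ansible.builtin.service:
        name: "{{ item.name }}"
        state: started
      loop: "{{ matrix_systemd_services_list | sort(attribute='priority') }}"
```

Because the list is sorted at start/stop time, the order in which roles append their entries (and therefore the role order in setup.yml) would no longer matter for this purpose.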

Still, if you're up for redoing all roles in such a way, PRs are welcome ;)

HarHarLinks commented 2 years ago

> because certain roles (some bridges, at least) inject configuration into matrix-nginx-proxy during runtime.
>
> Example: https://github.com/spantaleev/matrix-docker-ansible-deploy/blob/5a8b17c1df268e7fc206ed72bc8c7a7d56626c1b/roles/matrix-bridge-mautrix-telegram/tasks/init.yml#L25-L60

I'm sorry, either I don't understand what you're saying, or you're mixing up Ansible runtime and service runtime. That code runs during the Ansible run and templates the config and service files. However, starting one service does not seem to modify the configuration of other services, or at least I don't see how.

> Still, if you're up for redoing all roles in such a way, PRs are welcome ;)

I know, I know... I don't have that kind of time on my hands currently, but it seems like the correct thing to do, although low priority, as we discussed.

To outline a concrete issue: currently nginx depends on prometheus-postgres-exporter, as we have seen, and other services depend on nginx. Since the exporter seems to be the last thing the playbook starts, while nginx starts much earlier via the playbook (and probably even earlier as a dependency), nginx will keep failing since it doesn't declare Wants= on the exporter. As a result, all containers that in turn depend on nginx might end up in a restart loop, such as matrix-appservice-webhooks, matrix-dimension, and probably more appservices. While it works out in my case after waiting a couple of minutes, this can be avoided if done cleanly.

That's what I wanted to note down above in case someone is going to tackle this issue at some point.

spantaleev commented 2 years ago

I've updated the wanted services list for matrix-nginx-proxy and matrix-grafana in 0fb881deb578, which hopefully improves the situation.

I'm still not sure why your error says:

2022/01/08 17:11:01 [emerg] 1#1: host not found in upstream "matrix-prometheus-postgres-exporter" in /etc/nginx/conf.d/matrix-grafana.conf:63

I don't see why matrix-nginx-proxy's matrix-grafana.conf file would point to matrix-prometheus-postgres-exporter. Looking at the template (roles/matrix-nginx-proxy/templates/nginx/conf.d/matrix-grafana.conf.j2), it should only be pointing to matrix-grafana.

HarHarLinks commented 2 years ago

indeed you're right! I'm using external metrics, and thus have added

matrix_nginx_proxy_proxy_grafana_additional_server_configuration_blocks:
  - 'location /node-exporter/ {
  resolver 127.0.0.11 valid=5s;
  proxy_pass http://matrix-prometheus-node-exporter:9100/;
  auth_basic "protected";
  auth_basic_user_file /nginx-data/matrix-synapse-metrics-htpasswd;
  }'
  - 'location /postgres-exporter/ {
  resolver 127.0.0.11 valid=5s;
  proxy_pass http://matrix-prometheus-postgres-exporter:9187/;
  auth_basic "protected";
  auth_basic_user_file /nginx-data/matrix-synapse-metrics-htpasswd;
  }'

I realize now that my particular error comes from nonstandard/custom config, but at the same time I suppose it could make sense to integrate this into the playbook.
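
For a custom block like this, one possible way to avoid the startup failure is worth sketching. nginx resolves a literal hostname in proxy_pass when it loads its configuration, but when proxy_pass contains a variable, the name is looked up at request time through the configured resolver. The variant below is untested against this setup. The `$backend` variable name is arbitrary, and the rewrite is added because a proxy_pass target built from a variable no longer strips the location prefix automatically:

```yaml
# Sketch only, untested with this playbook: defers DNS resolution to request time.
matrix_nginx_proxy_proxy_grafana_additional_server_configuration_blocks:
  - |
    location /postgres-exporter/ {
      resolver 127.0.0.11 valid=5s;
      # A variable in proxy_pass makes nginx resolve the name per request
      # instead of at startup.
      set $backend "matrix-prometheus-postgres-exporter:9187";
      # Strip the location prefix manually before proxying.
      rewrite ^/postgres-exporter/(.*)$ /$1 break;
      proxy_pass http://$backend;
      auth_basic "protected";
      auth_basic_user_file /nginx-data/matrix-synapse-metrics-htpasswd;
    }
```

With this shape nginx can start even while the exporter container does not exist yet. Requests to that location would simply fail until the name becomes resolvable.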

spantaleev commented 2 years ago

You can improve your situation by redefining matrix_nginx_proxy_systemd_wanted_services_list.

Unfortunately, you can't easily add stuff to the list. We should probably introduce an additional wanted services variable (e.g. matrix_nginx_proxy_systemd_additional_wanted_services_list), which gets merged with the other one.
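
A sketch of how such a variable might work. Neither `matrix_nginx_proxy_systemd_additional_wanted_services_list` nor the combined variable below exists at the time of this comment. Both names are hypothetical:

```yaml
# User configuration (vars.yml): additions only, defaults stay untouched.
# Hypothetical variable, as proposed above.
matrix_nginx_proxy_systemd_additional_wanted_services_list:
  - matrix-prometheus-node-exporter.service
  - matrix-prometheus-postgres-exporter.service

# Role side: hypothetical combined list that the unit template would loop
# over instead of matrix_nginx_proxy_systemd_wanted_services_list alone.
matrix_nginx_proxy_systemd_wanted_services_list_combined: "{{ (matrix_nginx_proxy_systemd_wanted_services_list + matrix_nginx_proxy_systemd_additional_wanted_services_list) | unique }}"
```

The role's defaults would also need to define the additional list as empty (`[]`), so that the concatenation works when the user sets nothing.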

You can similarly use matrix_nginx_proxy_systemd_required_services_list, if necessary, but it suffers from the same problem -- you'd need to completely redefine the variable.

skepticalwaves commented 2 years ago

Another one with ordering issues, https://github.com/spantaleev/matrix-docker-ansible-deploy/issues/1253

skepticalwaves commented 2 years ago

Another service start order issue, etherpad has permission issues in the current ordering, which requires its service be restarted before any pads will load.

HarHarLinks commented 2 years ago

> Another service start order issue, etherpad has permission issues in the current ordering, which requires its service be restarted before any pads will load.

I can't confirm this, in fact my last restarts when upgrading synapse to 1.51 and 1.52 have been 100% without timeouts. I use etherpad. Can you elaborate?

skepticalwaves commented 2 years ago

> Can you elaborate?

On a --tags=setup-all,start run, everything starts up without any obvious errors on the systemd side. However, attempting to access an embedded etherpad within a channel gets an etherpad permission error. After restarting the matrix-etherpad service (and refreshing Element to clear the attempted etherpad load), the embedded etherpad works.

davidmehren commented 2 years ago

I have similar issues with a worker-setup: matrix-common-after stops all services, then starts them again. nginx tries to start, but cannot find one of the workers, which has not managed to start yet ([emerg] 1#1: host not found in upstream "matrix-synapse-worker-generic_worker-18111:18111" in /etc/nginx/conf.d/matrix-synapse.conf:9). It then takes 30 seconds for systemd to restart nginx, so the playbook does not detect nginx running and fails. I fixed the issue by adding an override.conf for the nginx unit that sets RestartSec=5.
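
The workaround described above could be applied with a few Ansible tasks along these lines. This is a sketch, assuming the unit is named matrix-nginx-proxy.service and that drop-ins live in the standard /etc/systemd/system/<unit>.d/ directory. The play is illustrative and not part of the playbook:

```yaml
# Hypothetical helper play: installs a systemd drop-in that shortens the restart delay.
- hosts: matrix_servers
  become: true
  tasks:
    - name: Ensure the drop-in directory for matrix-nginx-proxy exists
      ansible.builtin.file:
        path: /etc/systemd/system/matrix-nginx-proxy.service.d
        state: directory
        mode: "0755"

    - name: Install override.conf with a shorter restart delay
      ansible.builtin.copy:
        dest: /etc/systemd/system/matrix-nginx-proxy.service.d/override.conf
        mode: "0644"
        content: |
          [Service]
          RestartSec=5

    - name: Reload systemd so the drop-in takes effect
      ansible.builtin.systemd:
        daemon_reload: true
```

A drop-in is a separate file, so it should survive the playbook re-templating the main unit file.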

skepticalwaves commented 2 years ago

I'm still getting etherpad permission denied issues on --tags=setup-all,start execution. I need to run systemctl restart matrix-etherpad.service to get it working.