Do you get these errors when you do `--tags=start`?

From what I remember, we are dynamically populating the list of services that need to be started as the playbook executes. If workers are disabled, there should never be a `matrix-synapse-worker-*` systemd service in the "services that should be started" list, regardless of whether such a systemd `.service` exists on the host or not.

Or is this some error during worker cleanup, not during `--tags=start`?
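For illustration only (the variable and service names below are assumptions, not the playbook's actual code), conditional registration along these lines is what would keep disabled workers out of the "services to start" list:

```yaml
# Hypothetical sketch: a role only registers its services for starting when the
# corresponding feature is enabled, so disabled workers never enter the start list.
- name: Add a Synapse worker service to the list of services to start
  ansible.builtin.set_fact:
    matrix_systemd_services_list: "{{ matrix_systemd_services_list + ['matrix-synapse-worker-appservice-0.service'] }}"
  when: matrix_synapse_workers_enabled | bool
```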
The error is part of the workers cleanup process (the "Ensure any worker services are stopped" task), so it happens during `--tags=setup-all`, not during `start`.
Thanks for reporting this! While working on the Dendrite support branch (https://github.com/spantaleev/matrix-docker-ansible-deploy/pull/818), I've encountered this same problem (`matrix_synapse_enabled: false`, and it tries to uninstall Synapse along with all old workers, etc.)
Seems like running a bare `systemctl` doesn't output these failed units for me on CentOS 7.9.
The Ansible `service_facts` built-in module, which collects the unit files, actually runs `systemctl list-units --no-pager --type service --all`: https://github.com/ansible/ansible/blob/bc753c0518fd87c38fd3304f860fe55e00276303/lib/ansible/modules/service_facts.py#L247
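As a quick way to see what the module reports, a minimal sketch (not taken from the playbook) that gathers service facts and prints any Synapse worker entries, including the `not-found` ghosts, might look like this:

```yaml
# Gather systemd unit information into ansible_facts.services
- name: Populate service facts
  ansible.builtin.service_facts:

# Print every known Synapse worker unit together with its status
- name: Show Synapse worker units known to systemd
  ansible.builtin.debug:
    msg: "{{ item.key }}: {{ item.value.status }}"
  loop: "{{ ansible_facts.services | dict2items | selectattr('key', 'search', 'matrix-synapse-worker') | list }}"
```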
I see a bunch of (not-found, inactive, dead) services when I do `systemctl list-units --no-pager --type service --all | grep synapse`.
Interestingly, neither `systemctl reset-failed` (to reset all) nor `systemctl reset-failed SERVICE_NAME` changes anything with regard to what I see in `systemctl list-units --no-pager --type service --all | grep synapse`.
Thankfully, `ansible_facts.services` contains key/value entries like this:
```yaml
matrix-synapse-worker-appservice-0.service:
  name: matrix-synapse-worker-appservice-0.service
  source: systemd
  state: stopped
  status: not-found
```
By filtering out services whose status is `not-found` (keeping only those where `status != 'not-found'`), we can work around it, which is what I've done in 4625b34acca1.
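The filtering might look roughly like this (a sketch under the assumption that the stop task iterates over `ansible_facts.services`; the actual task in the commit may differ):

```yaml
# Stop only worker units that systemd actually knows as real units,
# skipping "ghost" entries whose status is not-found.
- name: Ensure any worker services are stopped
  ansible.builtin.service:
    name: "{{ item.key }}"
    state: stopped
  loop: "{{ ansible_facts.services | dict2items }}"
  when:
    - item.key.startswith('matrix-synapse-worker')
    - item.value.status != 'not-found'
```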
Let's see how it goes with this fix. If anyone has a better idea, we can revisit this.
I just want to mention I ran into the same problem, even with the fix applied a few months ago. What helped me is running `systemctl reset-failed SERVICE_NAME`, which completely removed the service entry (I'm running Debian, not CentOS).
Maybe other people discovering this thread will find it useful.
Hello,

When you enable Synapse workers and then disable them, the workers' unit info still exists in systemd even after the playbook removes them, and you can see a `not-found failed` state if you list services (just run `systemctl` without parameters to get the full list). That's not a problem by itself (if you google this behavior, you'll find answers like "that's ok"), but when you run the playbook again, it will fail with the following errors (keep in mind that the units were already removed and those services are just "ghosts" without any actual service behind them):

To fix that issue manually, you can run `systemctl reset-failed`, but I wonder how it can be automated. My first idea was to add the following task right under the "Ensure any worker services are stopped" task in `roles/matrix-synapse/tasks/synapse/workers/setup_uninstall.yml`:

But it will not work on the first run (because the units will not be marked as `not-found failed` at that moment), so it would actually have to go before "Ensure any worker services are stopped" to fix the issue, but that would look weird. Sorry, I don't have a better idea how to implement it, so here is the solution (the code above) - I hope you will find a correct place to add it.
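For reference, a hypothetical reconstruction of the kind of task described above (the reporter's original snippet is not shown here, and the task name is made up) could be:

```yaml
# Clear failed/"ghost" unit state so the later stop task doesn't trip over
# units that no longer have a .service file behind them.
- name: Reset failed systemd units left over from removed Synapse workers
  ansible.builtin.command: systemctl reset-failed
  changed_when: false
```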