postalserver / postal

📮 A fully featured open source mail delivery platform for incoming & outgoing e-mail
https://postalserver.io
MIT License

When worker crashes (i.e. because rabbitmq is down), it stays down and does not recover automatically #2638

Closed 007hacky007 closed 1 year ago

007hacky007 commented 1 year ago

Describe the bug

When the postal worker crashes, it does not recover automatically, i.e. it does not try to restart. It just stays down, so the mail queue does not get processed.

To Reproduce

  1. postal start
  2. kill rabbitmq
  3. postal worker crashes
  4. See postal logs

Expected behaviour

Individual postal containers would automatically restart on failure.

Logs

postal-worker-1  | [9] [2023-10-05T22:05:47.245] INFO -- : Worker running with 4 threads
postal-worker-1    | E, [2023-10-05T22:05:47.259286 #9] ERROR -- #<Bunny::Session:0x55f4d04a4910 postal@127.0.0.1:5672, vhost=postal, addresses=[127.0.0.1:5672]>: Got an exception when receiving data: Connection reset by peer (Errno::ECONNRESET)
postal-worker-1    | /usr/local/lib/ruby/2.6.0/socket.rb:452:in `__read_nonblock': Connection reset by peer (Errno::ECONNRESET)
postal-worker-1    |    from /usr/local/lib/ruby/2.6.0/socket.rb:452:in `read_nonblock'
postal-worker-1    |    from /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/cruby/socket.rb:58:in `block in read_fully'
postal-worker-1    |    from /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/cruby/socket.rb:57:in `loop'
postal-worker-1    |    from /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/cruby/socket.rb:57:in `read_fully'
postal-worker-1    |    from /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/transport.rb:239:in `read_fully'
postal-worker-1    |    from /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/transport.rb:261:in `read_next_frame'
postal-worker-1    |    from /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/session.rb:1166:in `init_connection'
postal-worker-1    |    from /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/session.rb:319:in `start'
postal-worker-1    |    from /opt/postal/app/lib/postal/rabbit_mq.rb:26:in `create_connection'
postal-worker-1    |    from /opt/postal/app/lib/postal/rabbit_mq.rb:31:in `create_channel'
postal-worker-1    |    from /opt/postal/app/lib/postal/worker.rb:195:in `job_channel'
postal-worker-1    |    from /opt/postal/app/lib/postal/worker.rb:17:in `work'
postal-cron-1    | [9] [2023-10-05T22:05:45.455] INFO -- : Starting clock for 4 events: [ every-1-minutes every-hour every-hour every-day ]
postal-cron-1      | [9] [2023-10-05T22:05:45.458] INFO -- : Triggering 'every-1-minutes'
postal-cron-1      | [9] [2023-10-05T22:05:45.476] ERROR -- : Empty response received from the server.
postal-worker-1    |    from script/worker.rb:3:in `<main>'

Environment details

willpower232 commented 1 year ago

can you have a look at #2136 and apply https://github.com/postalserver/install/pull/3 to your files to see if that resolves your problem as well?

007hacky007 commented 1 year ago

Thank you @willpower232 for your response. I've tried adding restart: unless-stopped, and it seems to restart the container a few times before giving up and leaving the container down. Anyway, I believe such a restart policy should definitely be in place and https://github.com/postalserver/install/pull/3 should be merged. It does not resolve my issue, but it definitely makes sense to automatically restart the container on a non-zero exit code.

To further explain my setup.

I'm starting postal after boot with a systemd service:

postal :: /etc/systemd/system » cat postal.service
[Unit]
Description=Postal Service
Requires=docker.service
After=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/postal start
ExecStop=/usr/bin/postal stop
; TimeoutStartSec is hotfix to let rabbitmq container start before trying to start postal 
TimeoutStartSec=120

[Install]
WantedBy=multi-user.target

The issue is that postal starts before the rabbitmq container is ready, so the postal worker just crashes because rabbitmq's port 5672 is not yet ready.

postal-worker-1    | [9] [2023-10-06T08:21:12.974] INFO -- : Worker running with 4 threads
postal-worker-1    | E, [2023-10-06T08:21:12.976458 #9] ERROR -- #<Bunny::Session:0x55f777465b58 postal@127.0.0.1:5672, vhost=postal, addresses=[127.0.0.1:5672]>: Got an exception when receiving data: Connection reset by peer (Errno::ECONNRESET)
postal-worker-1    | /usr/local/lib/ruby/2.6.0/socket.rb:452:in `__read_nonblock': Connection reset by peer (Errno::ECONNRESET)
postal-worker-1    |    from /usr/local/lib/ruby/2.6.0/socket.rb:452:in `read_nonblock'
postal-worker-1    |    from /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/cruby/socket.rb:58:in `block in read_fully'
postal-worker-1    |    from /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/cruby/socket.rb:57:in `loop'
postal-worker-1    |    from /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/cruby/socket.rb:57:in `read_fully'
postal-worker-1    |    from /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/transport.rb:239:in `read_fully'
postal-worker-1    |    from /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/transport.rb:261:in `read_next_frame'
postal-worker-1    |    from /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/session.rb:1166:in `init_connection'
postal-worker-1    |    from /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/session.rb:319:in `start'
postal-worker-1    |    from /opt/postal/app/lib/postal/rabbit_mq.rb:26:in `create_connection'
postal-worker-1    |    from /opt/postal/app/lib/postal/rabbit_mq.rb:31:in `create_channel'
postal-worker-1    |    from /opt/postal/app/lib/postal/worker.rb:195:in `job_channel'
postal-worker-1    |    from /opt/postal/app/lib/postal/worker.rb:17:in `work'
postal-worker-1    |    from script/worker.rb:3:in `<main>'

My hotfix for now is to delay postal start with systemd's TimeoutStartSec parameter.

007hacky007 commented 1 year ago

For the record - I've also tried adding

depends_on:
  - postal-rabbitmq

to postal's docker-compose file, but that did not do the trick either. The problem is that the rabbitmq container is up but not yet ready, so the dependency is fulfilled while rabbitmq's port is not available yet. The postal worker therefore crashes a few times until docker stops restarting it, and when rabbitmq is finally ready, postal-worker stays down.

I believe the best solution would be to handle bunny's connection exception directly in /opt/postal/app/lib/postal/rabbit_mq.rb:26, so that the worker would retry the connection process without crashing altogether.
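For illustration only, the retry idea could look roughly like the sketch below. It is not Postal's code: `with_connection_retry` and its parameters are hypothetical names, and it only assumes that the failing call raises `Errno::ECONNRESET` (as in the traces above) or `Errno::ECONNREFUSED`. Bunny may raise other exception classes as well, which would need to be added to `errors`.

```ruby
# Hypothetical helper (not part of Postal): run a block and retry it with
# exponential backoff when one of the listed connection errors is raised.
def with_connection_retry(attempts: 5, base_delay: 0.5,
                          errors: [Errno::ECONNRESET, Errno::ECONNREFUSED],
                          sleeper: ->(seconds) { sleep(seconds) })
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *errors => e
    # Give up and re-raise the original error once the attempts are used up.
    raise if tries >= attempts

    delay = base_delay * (2**(tries - 1)) # 0.5s, 1s, 2s, ...
    warn "Connection attempt #{tries} failed (#{e.class}); retrying in #{delay}s"
    sleeper.call(delay)
    retry
  end
end

# Assumed usage: wrap whatever call opens the broker connection, e.g.
#   with_connection_retry { connection.start }
# where `connection` stands in for the Bunny session built in rabbit_mq.rb.
```

With the defaults, the worker would wait up to about 7.5 seconds in total before giving up; a container restart policy would still be needed as a fallback if the broker stays unavailable for longer.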

007hacky007 commented 1 year ago

All right, disregard all my previous messages. I was editing /opt/postal/install/templates/docker-compose.yml instead of /opt/postal/install/docker-compose.yml by mistake. Let this be a lesson for others in the future.

restart: unless-stopped is the solution and https://github.com/postalserver/install/pull/3 should be merged.

willpower232 commented 1 year ago

to be fair, editing both could help you if the update rewrites the file, I can't remember.

thanks for confirming :pray:

bluepuma77 commented 1 year ago

Hi @willpower232, why do you close this if it's still an open bug? From my experience with other open source projects, bugs get closed when they are fixed, when pull requests are merged and new fixed releases are created.

Postalserver has seemed stale for 3 months now, with no ongoing development. This basic improvement wasn't merged, and the project wasn't updated to docker compose, even though docker-compose v1 was deprecated on June 30, 2023.

New users need to go through closed bugs and read comments to get a new stable system up and running - that seems strange to me.

willpower232 commented 1 year ago

This issue and multiple discussions appeared well after the PR's creation, so keeping them open only serves to duplicate noise and distract those who are able to actually commit to the repository when they eventually reappear.