postalserver / postal

📮 A fully featured open source mail delivery platform for incoming & outgoing e-mail
https://postalserver.io
MIT License
14.6k stars 1.03k forks

Messages hang in pending state when database connection fails #3041

Open Ronnnn opened 1 month ago

Ronnnn commented 1 month ago

Describe the bug

In my Postal 3.3.4 installation I have the problem that the database sometimes loses its connection:

Jul 10 08:55:44 post mariadbd[626]: 2024-07-10 6:55:44+0000 482 [Warning] Aborted connection 482 to db: 'unconnected' user: 'postal' host: 'localhost' (Got timeout reading communication packets)
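
As an aside, the server-side timeouts that commonly produce this warning can be inspected like this (generic MariaDB variables, nothing Postal-specific):

-- Generic MariaDB check, unrelated to Postal itself: the timeouts most
-- often behind "Got timeout reading communication packets" warnings.
SHOW VARIABLES LIKE 'wait_timeout';
SHOW VARIABLES LIKE 'net_read_timeout';
SHOW VARIABLES LIKE 'net_write_timeout';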

That needs to be fixed on my server, of course... BUT Postal is not able to handle it gracefully either. The message the worker is trying to send hangs in the pending state after this:

2024-07-10 06:55:45 +0000 DEBUG  acquired task role by renewing it component=worker thread=tasks
2024-07-10 06:55:45 +0000 INFO   running task component=worker thread=tasks task=SendNotificationsScheduledTask
2024-07-10 06:55:45 +0000 INFO   scheduling task to next run at 2024-07-10 06:56:45 UTC component=worker thread=tasks task=SendNotificationsScheduledTask
2024-07-10 06:56:45 +0000 DEBUG  acquired task role by renewing it component=worker thread=tasks
2024-07-10 06:56:45 +0000 INFO   running task component=worker thread=tasks task=SendNotificationsScheduledTask
2024-07-10 06:56:45 +0000 INFO   scheduling task to next run at 2024-07-10 06:57:45 UTC component=worker thread=tasks task=SendNotificationsScheduledTask
2024-07-10 06:57:03 +0000 INFO   starting message unqueue component=worker thread=work0 original_queued_message=1134326
2024-07-10 06:57:03 +0000 INFO   queue latency is 2.470646619796753s component=worker thread=work0 original_queued_message=1134326
2024-07-10 06:57:03 +0000 ERROR  Mysql2::Error::ConnectionError (MySQL server has gone away) component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:151:in `_query' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:151:in `block in query' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:150:in `handle_interrupt' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:150:in `query' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/lib/postal/message_db/database.rb:326:in `query_on_connection' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/lib/postal/message_db/database.rb:318:in `block in query' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/lib/postal/message_db/connection_pool.rb:20:in `use' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/lib/postal/message_db/database.rb:345:in `with_mysql' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/lib/postal/message_db/database.rb:317:in `query' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/lib/postal/message_db/database.rb:177:in `select' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/lib/postal/message_db/message.rb:12:in `find_one' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/lib/postal/message_db/database.rb:47:in `message' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/app/models/concerns/has_message.rb:12:in `message' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/app/lib/message_dequeuer/initial_processor.rb:42:in `check_message_exists' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/app/lib/message_dequeuer/initial_processor.rb:16:in `block (2 levels) in process' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/app/lib/message_dequeuer/base.rb:38:in `catch_stops' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/app/lib/message_dequeuer/initial_processor.rb:14:in `block in process' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /usr/local/bundle/gems/klogger-logger-1.4.0/lib/klogger/group_set.rb:18:in `call_without_id' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /usr/local/bundle/gems/klogger-logger-1.4.0/lib/klogger/logger.rb:72:in `tagged' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/message_dequeuer/initial_processor.rb:11:in `process' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/message_dequeuer.rb:9:in `process' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/jobs/process_queued_messages_job.rb:67:in `block in process_messages' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /usr/local/bundle/gems/activerecord-7.0.8.1/lib/active_record/relation/delegation.rb:88:in `each' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /usr/local/bundle/gems/activerecord-7.0.8.1/lib/active_record/relation/delegation.rb:88:in `each' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/jobs/process_queued_messages_job.rb:65:in `process_messages' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/jobs/process_queued_messages_job.rb:14:in `call' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:164:in `block (4 levels) in work' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /usr/local/lib/ruby/3.2.0/benchmark.rb:311:in `realtime' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:164:in `block (3 levels) in work' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:285:in `capture_errors' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:161:in `block (2 levels) in work' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:160:in `each' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:160:in `block in work' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /usr/local/bundle/gems/activerecord-7.0.8.1/lib/active_record/connection_adapters/abstract/connection_pool.rb:215:in `with_connection' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:159:in `work' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:142:in `block (3 levels) in start_work_thread' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:141:in `loop' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:141:in `block (2 levels) in start_work_thread' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /usr/local/bundle/gems/klogger-logger-1.4.0/lib/klogger/group_set.rb:18:in `call_without_id' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /usr/local/bundle/gems/klogger-logger-1.4.0/lib/klogger/logger.rb:72:in `tagged' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:139:in `block in start_work_thread' component=worker thread=work0
2024-07-10 06:57:06 +0000 INFO   starting message unqueue component=worker thread=work1 original_queued_message=1134327
2024-07-10 06:57:06 +0000 INFO   queue latency is 4.297773838043213s component=worker thread=work1 original_queued_message=1134327
2024-07-10 06:57:06 +0000 INFO   found 16 associated messages to process at the same time batch_key=outgoing-gmail.com component=worker thread=work1 original_queued_message=1134327
2024-07-10 06:57:06 +0000 INFO   parsing message content as it hasn't been parsed before component=worker thread=work1 original_queued_message=1134327 queued_message=1134327
2024-07-10 06:57:06 +0000 INFO   Connected to <CUT> log_id=Y75FMTZH component=worker thread=work1 original_queued_message=1134327 queued_message=1134327
2024-07-10 06:57:06 +0000 INFO   Sending message 1::4766770 to <CUT> log_id=Y75FMTZH component=worker thread=work1 original_queued_message=1134327 queued_message=1134327
2024-07-10 06:57:06 +0000 INFO   Accepted by <CUT> log_id=Y75FMTZH component=worker thread=work1 original_queued_message=1134327 queued_message=1134327

Now the message hangs in the pending state because it is still locked by the failed worker process:

select id,message_id,locked_by,locked_at,retry_after,created_at,updated_at from queued_messages where message_id=4766769;
+---------+------------+--------------------------------------------+----------------------------+-------------+----------------------------+----------------------------+
| id      | message_id | locked_by                                  | locked_at                  | retry_after | created_at                 | updated_at                 |
+---------+------------+--------------------------------------------+----------------------------+-------------+----------------------------+----------------------------+
| 1134326 |    4766769 | host:mail pid:9 thread:12 22c1cddf97171ce2 | 2024-07-10 06:57:03.968458 | NULL        | 2024-07-10 06:57:01.510895 | 2024-07-10 06:57:01.510895 |
+---------+------------+--------------------------------------------+----------------------------+-------------+----------------------------+----------------------------+
1 row in set (0.001 sec)

The only way to get it through now is to set the locked_at and locked_by fields to NULL manually.
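
For reference, a minimal version of that manual workaround, using the id from the row above, looks like this:

-- Manual unlock for the stuck message shown above.
UPDATE queued_messages
SET locked_by = NULL, locked_at = NULL
WHERE id = 1134326;

After that, the message is picked up again on the next worker run.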

Expected behaviour

The message should be unlocked properly by the worker so that it can be picked up by another process once the database connection is back.

Environment details

willpower232 commented 1 month ago

I think this has been a problem forever and I am not sure there is an easy solution, although it is probably easier to solve now that the queue is in the database instead of RabbitMQ.

Many people find they need to run multiple workers to consume their queue in a timely manner, so process identification is tricky: the workers are not able to communicate with each other, and I do not think Docker would be able to tell each worker about every other worker.

One solution could be a configurable lock-retry setting, so that if locked_at is more than, say, an hour old, the message is picked up again.
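
Just to make the idea concrete, a rough sketch of what picking up stale locks could look like at the SQL level (the one-hour interval standing in for the proposed config setting; this is not actual Postal code):

-- Sketch only: treat locks older than the configurable threshold as
-- expired, so another worker can pick the message up again.
SELECT id, message_id
FROM queued_messages
WHERE locked_by IS NULL
   OR locked_at < NOW() - INTERVAL 1 HOUR;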

Ronnnn commented 1 month ago

Thank you again for the quick responses, really appreciate it. Might it be an idea to give the database layer some kind of retry mechanism?

https://github.com/postalserver/postal/blob/da90e75036c27482699921613d838f4058a100e7/lib/postal/message_db/database.rb

Disclaimer: untested code not written by a human ;p

def query(query)
  # Suggested sketch: retry the query once if the MySQL connection has gone away.
  retries = 0
  begin
    with_mysql do |mysql|
      query_on_connection(mysql, query)
    end
  rescue Mysql2::Error::ConnectionError => e
    if retries < 1 # Limit the number of retries to 1
      retries += 1
      logger.warn "Connection error on query, retrying... (Attempt #{retries + 1})"
      retry
    else
      logger.error "Repeated connection errors on query, giving up. Error: #{e.message}"
      raise
    end
  end
end

notz commented 1 month ago

Seems to be a duplicate of #3023