mosquito-cr / mosquito

A background task runner for crystal applications supporting periodic (CRON) and manually queued jobs
MIT License

Job crashing with Missing hash key: "limit" (KeyError) #83

Closed · wout closed this issue 2 years ago

wout commented 2 years ago

Please include some details:

Crystal version: 1.2.2
Mosquito Shard version: latest

One of my periodic jobs is crashing occasionally with the following error:

Unhandled exception: Missing hash key: "limit" (KeyError)
from /usr/share/crystal/src/hash.cr:1905:15 in 'run_next_task'
from lib/mosquito/src/mosquito/runner.cr:132:9 in 'run'
from lib/mosquito/src/mosquito/runner.cr:21:9 in 'start'
from src/app_worker.cr:11:1 in '__crystal_main'
from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
from /var/apps/dendrorithms.com/current/app_worker in '_start'
from ???

Most of the time, systemd is able to restart the service, but sometimes it isn't. That's something I need to tweak, but I'm curious why it's crashing in the first place. Does it ring a bell with you?

robacarp commented 2 years ago

@wout are you running the 1.0-rc1 or 0.11?

wout commented 2 years ago

From shard.lock on the deployed app:

  mosquito:
    git: https://github.com/mosquito-cr/mosquito.git
    version: 0.11.2
robacarp commented 2 years ago

Yep, that checks out. Just confirming.

The hash it's trying to populate from redis is related to the v0 implementation of rate limiting. The stack trace is a bit garbled -- run_next_task is a method in mosquito's runner class, not Hash -- but the bug is actually way over in the Queue#rate_limited? method.

What's most likely happening is that the version-zero config hash is getting cleaned up and a later job can't find it. The easy fix is to move to 1.0-rc1, but I understand that might not be what you want to do. There may well be more bugs in 1.0 than you're finding on 0.11 -- but that's why it's still in RC.

I'm not sure what to do about the bug you're snagging on. This bug is part of why in v1.0 I pulled the rate limiting code out and re-wrote it as a module.

I think the bug is probably best addressed by running a Hash#merge with the base config hash when the config object is pulled from redis. Unfortunately the config metadata abstraction is a bit of a spaghetti monster, so that might be complicated to do.
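Roughly, the idea is something like this -- the key names and defaults below are only illustrative, not mosquito's actual config schema:

# Hypothetical defaults for a rate-limited queue (placeholder values).
base_config = {"limit" => "10", "period" => "60"}

# Simulates a config hash read back from redis after a partial cleanup:
# the "limit" key has gone missing.
stored_config = {"period" => "60"}

# stored_config["limit"] would raise: Missing hash key: "limit" (KeyError)
# Merging onto the defaults restores any missing keys before use:
config = base_config.merge(stored_config)
puts config["limit"] # => "10"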

robacarp commented 2 years ago

You said sometimes your systemd unit is unable to restart mosquito. What is causing that to fail? Are you able to share your unit definition?

wout commented 2 years ago

I'm open to trying 1.0-rc1. I've got a Honeybadger check-in on that job, so I'll know within 5 to 10 minutes if it's failing. Currently, it's not super reliable anyway.

The systemd file looks like this:

[Unit]
Description=Bookkeeper
After=nginx.service

[Service]
Type=simple
User=username
Environment="LUCKY_ENV=production"
...
WorkingDirectory=/path/to/app/current
ExecStart=/path/to/app/current/app_worker
Restart=on-failure
StartLimitBurst=2
StartLimitInterval=300

[Install]
WantedBy=multi-user.target

It's been better after adding:

StartLimitBurst=2
StartLimitInterval=300

But it still fails to start again from time to time (or I'm not patient enough to wait 300 seconds). Systemd is not something I configure very often, so I don't know its quirks.

wout commented 2 years ago

Just realised I also found this somewhere in the logs this morning:

Redis: READONLY You can't write against a read only replica.

Not sure if it's related though.

robacarp commented 2 years ago

Is it possible that the job config is being written to a redis pooled connection that is read-only, causing later look-ups to fail? That would be surprising because even retrieving a job from redis is a write operation. I don't think mosquito could function at all with a read-only redis session.
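One quick sanity check, if you can get at the box: ask the redis node the worker is configured against what role it thinks it has (host and port below are placeholders, assuming redis-cli is installed):

redis-cli -h 127.0.0.1 -p 6379 info replication
# role:master -> writable
# role:slave  -> read-only replica, which would line up with the READONLY errors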

Does this stack trace crash the worker, or just throw an exception and allow the worker to continue?

Nothing looks out of place in your systemd configuration to me, but I'm not much more than a systemd hobbyist myself. I do know that the respawn parameters are where I've seen a lot of my own problems. Systemd has some default logic about respawning failing tasks, and I'd wild-guess that's on some sort of back-off.
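For what it's worth, a more forgiving restart policy looks roughly like this -- the directive names are standard systemd, the values are only a starting point, and recent systemd versions expect the StartLimit* settings under [Unit] rather than [Service]:

[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=5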

wout commented 2 years ago

The issue with Redis is not completely clear, especially because it runs smoothly for hours on end and then crashes for no apparent reason. It looks like a common problem; here's at least one example without a solution: https://github.com/laravel/horizon/issues/748.

Yesterday evening it went down again after these errors popped up:

Unhandled exception: READONLY You can't write against a read only replica. (Re>
from /usr/share/crystal/src/string.cr:5086:13 in 'receive'
from lib/redis/src/redis.cr:311:7 in 'command'
from lib/redis/src/redis/command_execution/value_oriented.cr:13:9 in 'run'
from lib/mosquito/src/mosquito/task.cr:85:10 in 'run_next_task'
from lib/mosquito/src/mosquito/runner.cr:132:9 in 'run'
from lib/mosquito/src/mosquito/runner.cr:21:9 in 'start'
from src/app_worker.cr:11:1 in '__crystal_main'
from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
from /path/to/app/current/app_worker in '_start'
from ???
Feb 14 21:44:07 app-name app_worker[1783836]: {"severity":"Info","source":"mosquito","timestamp":"2022-02-14T21:44:07+00:00">
Feb 14 21:44:07 app-name app_worker[1783836]: Unhandled exception: READONLY You can't write against a read only replica. (Re>
from /usr/share/crystal/src/string.cr:5086:13 in 'receive'
from lib/redis/src/redis.cr:311:7 in 'store_hash'
from /usr/share/crystal/src/int.cr:540:7 in 'start'
from src/app_worker.cr:11:1 in '__crystal_main'
from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
from /path/to/app/current/app_worker in '_start'
from ???
Feb 14 21:44:07 app-name app_worker[1783855]: {"severity":"Info","source":"mosquito","timestamp":"2022-02-14T21:44:07+00:00">
Feb 14 21:44:07 app-name app_worker[1783855]: Unhandled exception: READONLY You can't write against a read only replica. (Re>
from /usr/share/crystal/src/string.cr:5086:13 in 'receive'
from lib/redis/src/redis.cr:311:7 in 'store_hash'
from /usr/share/crystal/src/int.cr:540:7 in 'start'
from src/app_worker.cr:11:1 in '__crystal_main'
from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
from /path/to/app/current/app_worker in '_start'
from ???

I've now deployed mosquito 1.0.0-rc1 and switched to mosquito's queueing feature with run_from. That's all running smoothly by the looks of it, but I'll know more tomorrow or in a few days.

One thing I can think of that may have caused the crashes is the half-baked queue scoping solution I had before. The jobs were skipped with a return, like:

def perform
  return unless mosquito_job_role?("bookkeeper")
  ...
end

That meant all workers always called perform on every job, even when it wasn't necessary. Maybe that caused the occasional clash? Not sure...

If this doesn't help I'm going to start the Redis service (Docker Compose) with appendonly yes. But I first want to see if this change eliminates the issue.
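The Compose change I have in mind is roughly this (image tag and volume name are placeholders):

services:
  redis:
    image: redis:6
    command: ["redis-server", "--appendonly", "yes"]
    volumes:
      - redis-data:/data

volumes:
  redis-data: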

UPDATE: The two services crashed again within an hour of deploying. Someone with the same issue switched from a dockerized Redis to a PPA installation; that fixed it for him, so I've now tried that as well.

robacarp commented 2 years ago

Interesting details regarding the redis error. Thanks for sharing. Perhaps that makes the postgres backend a riper feature.

The short-circuit perform you have there isn't ideal: the job would either be marked as a failure or, as written, marked as a success (which is probably worse...).

The ideal solution for what you want is certainly run_from/job_type -- but if you need to short-circuit for some other reason, the new before hook is probably the best answer. That facility exists in mosquito as of 0.11.1, though after hooks haven't been backported from the v1.0-rc.

wout commented 2 years ago

By the looks of it, replacing Redis fixed the issue. Normally I'd wake up with some or all of the services down. So I'm mildly optimistic, but I'll have to keep an eye on it for the next few days.

I hear you about the short-circuit setup. The way it is now, with run_from/job_type, is the way to go.

The Postgres backend is definitely something I'd prefer. Fewer dependencies = better.

robacarp commented 2 years ago

Glad to hear it. The Postgres backend is going to be an experimental feature which I'm planning to implement after 1.0 settles down. I appreciate you testing the 1.0-rc in your environment; it's a wonderful data point to have.