seomoz / qless

Queue / Pipeline Management
MIT License
294 stars 76 forks source link

`Qless::Queue#peek` not returning correct results for recurring jobs #233

Open tpickett66 opened 9 years ago

tpickett66 commented 9 years ago

As of 515cdc8 a queue can get in an unexpected state leading to recurring jobs not being returned properly by #peek when no count argument is supplied. At this point I've traced it down to the Lua scripts by working back through the ruby code, I'll post more details as I find them.

[14] pry(main)> queue.peek
=> nil
[15] pry(main)> queue.peek(1)
=> []
[16] pry(main)> queue.peek(2)
=> [<Qless::Job SyncAccount (f5aa9babcd1e420ea38fe537812db03a-9 / power_school_jobs / waiting)>]
[17] pry(main)> queue.peek(3)
=> [<Qless::Job SyncAccount (f5aa9babcd1e420ea38fe537812db03a-9 / power_school_jobs / waiting)>, <Qless::Job SyncAccount (ed6f4e5207454cf3972459990b8eaad3-15 / power_school_jobs / waiting)>]
[18] pry(main)>
tpickett66 commented 9 years ago

Additional data from invoking the Lua 'directly'

18] pry(main)> client.call('peek', 'power_school_jobs', 1)
=> "{}"
[19] pry(main)> client.call('peek', 'power_school_jobs')
Qless::LuaScriptError: user_script:1105: Peek(): Arg "count" missing or not a number: nil
from /home/deploy/pschool/.bundle/gems/ruby/2.1.0/bundler/gems/qless-515cdc862a82/lib/qless/lua_script.rb:30:in `rescue in call'
[20] pry(main)> client.call('peek', 'power_school_jobs', 20)
=> "[{\"jid\":\"f5aa9babcd1e420ea38fe537812db03a-9\",\"retries\":5,\"data\": ...snip..."
tpickett66 commented 9 years ago

Turns out there was a lingering member in the zset ql:q:<queue_name>-work, I'm not sure how it remained after the job was removed from the queue or completed. I'll be investigating the housecleaning functions to see where/how this might have happened.

tpickett66 commented 9 years ago

So far I'm not seeing anything that would point at a root cause, the only set that seems to have had this pollution is the work set.

tpickett66 commented 9 years ago

I just discovered that the completed set has invalid values too, sadly I haven't taken good enough notes to know if there is any overlap between the extras in the work set and ql:completed. I'm going to leave this issue open while I gather more data, this problem has occurred a couple times in the past.

tpickett66 commented 9 years ago

We had more issues with one of the sorted sets being in an inconsistent state this morning, this time it was the locks set on the same queue but the inconsistent state started occurring right after we received notification of an exception of the type Qless::Job::CantCompleteError.