samsondav / rihanna

Rihanna is a high performance postgres-backed job queue for Elixir
MIT License

lock optimization for very large jobs table. Possible?! #92

Closed florinpatrascu closed 3 years ago

florinpatrascu commented 4 years ago

Hi there - we have a table with more than a million jobs, and the WITH RECURSIVE jobs AS... query used by Rihanna.Job.lock/3 is timing out right away. If we relax the DBConnection timeout, the query takes more than 45 seconds to return. Has anyone here encountered this situation, and if so, how are you handling it? Can that query be optimized? Many thanks!

And here is the query plan, in case it's useful for troubleshooting: https://explain.depesz.com/s/s8vc
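
(For reference, the plan was captured roughly like this; the CTE body below is a stand-in for whatever Rihanna.Job.lock/3 actually sends, which shows up in the Postgres logs:)

```sql
-- Rough sketch of capturing a plan to paste into explain.depesz.com.
-- The CTE body is a stand-in; substitute the actual locking query that
-- Rihanna.Job.lock/3 emits.
EXPLAIN (ANALYZE, BUFFERS)
WITH RECURSIVE jobs AS (
  SELECT id, priority, due_at, enqueued_at
  FROM rihanna_jobs
  ORDER BY priority, due_at ASC NULLS FIRST, enqueued_at, id
  LIMIT 1
)
SELECT * FROM jobs;
```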

tpitale commented 4 years ago

You have an external merge sort spilling 2.5 GB to disk. I'm guessing that is what's slow. I'm not sure why it is doing that, however. Can you share your current jobs table structure (specifically the indexes)?
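
If it's easier than digging through migrations, something like this should list the index definitions, and it's worth checking the sort memory setting too:

```sql
-- List the index definitions currently on the jobs table.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'rihanna_jobs';

-- A 2.5 GB external merge sort means the sort spilled past work_mem to disk.
SHOW work_mem;
```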

tpitale commented 4 years ago

According to this (which is a little old), our use of due_at NULLS FIRST may be ordering in the opposite direction from the way the index is built (NULLS LAST by default): https://www.postgresql.org/docs/9.1/indexes-ordering.html

We might need to change the index. Not sure that will fix your particular problem though.
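
To spell out my guess (illustrative only, not the exact query Rihanna runs): a plain index declaration defaults to ASC NULLS LAST on every column, so an ordering that asks for NULLS FIRST on due_at can't be read straight off such an index and the planner has to sort the rows itself.

```sql
-- A plain declaration such as
--   CREATE INDEX ... ON rihanna_jobs (priority, due_at, enqueued_at, id);
-- is built ASC NULLS LAST on every column. An ordering like this one
SELECT id
FROM rihanna_jobs
ORDER BY priority, due_at ASC NULLS FIRST, enqueued_at, id
LIMIT 1;
-- asks for the opposite NULL placement on due_at, so it cannot be produced
-- by an ordered scan of that index; on a big table the resulting sort
-- spills to disk, which matches the plan above.
```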

florinpatrascu commented 4 years ago

The table structure and the associated index are the ones created by the migration. We’re using the newest Rihanna version.

But you are confirming our findings. My team identified that part of the query as the likely culprit. We created a new index:

```sql
CREATE INDEX rihanna_jobs_locking_index ON rihanna_jobs (priority, due_at ASC NULLS FIRST, enqueued_at, id);
```

and our query duration, for the same data set, dropped below a second!

We’re currently monitoring our system, and I’ll come back with a definitive confirmation that this is indeed the fix, but so far it appears to be, at least in our case :)

tpitale commented 4 years ago

Good to hear. Would love to see the new EXPLAIN after that change, if you have it available.

florinpatrascu commented 4 years ago

sure thing, here it is: https://explain.depesz.com/s/FPZJ

Mind you, to verify this solution we're using a test table with more than 3 times the number of jobs we had when we first discovered the issue: 3,409,126 records (jobs).

I believe it can be optimized even further?!


For brevity, I added the second plan as an optimization to the original one: https://explain.depesz.com/s/s8vc
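
If anyone else wants to reproduce the timings without production data, something along these lines generates a comparable data set (the table name, column subset, and value distribution here are made up for illustration; the real rihanna_jobs table has more columns):

```sql
-- Hypothetical synthetic data set for timing the locking query and index.
CREATE TABLE rihanna_jobs_test (
  id          bigserial PRIMARY KEY,
  priority    integer NOT NULL DEFAULT 50,
  due_at      timestamptz,
  enqueued_at timestamptz NOT NULL DEFAULT now()
);

INSERT INTO rihanna_jobs_test (priority, due_at, enqueued_at)
SELECT
  (random() * 100)::int,                              -- spread of priorities
  CASE WHEN random() < 0.5 THEN NULL
       ELSE now() + random() * interval '1 day' END,  -- roughly half NULL due_at
  now() - random() * interval '30 days'               -- enqueued over the last month
FROM generate_series(1, 3409126);
```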

tpitale commented 4 years ago

That index scan looks much better.

florinpatrascu commented 4 years ago

indeed

samsondav commented 4 years ago

Good find. Sounds like the index is the wrong way round indeed. Anybody care to open a PR?

tpitale commented 4 years ago

I’m on it. What would you like to do for folks already on this version? Just a doc with the usual bits about using CREATE INDEX CONCURRENTLY to avoid downtime?
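
Something along these lines, I'd imagine (the name of the old index to drop is a placeholder for whatever the original migration created):

```sql
-- CONCURRENTLY builds the index without blocking writes to the jobs table,
-- but it cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY rihanna_jobs_locking_index
  ON rihanna_jobs (priority, due_at ASC NULLS FIRST, enqueued_at, id);

-- Placeholder name: drop whatever index the original migration created once
-- the new one is in place.
DROP INDEX CONCURRENTLY IF EXISTS rihanna_jobs_old_locking_index;
```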

samsondav commented 4 years ago

@tpitale Just a doc is fine.