rails / mission_control-jobs

Dashboard and Active Job extensions to operate and troubleshoot background jobs
MIT License
611 stars 71 forks source link

Slow DISTINCT query for finding queue names with Solid Queue #160

Open miharekar opened 2 months ago

miharekar commented 2 months ago

Hey 👋

I've been using SolidQueue and Mission Control - Jobs since pretty much the day they came out on my decently sized app. I've accumulated a lot of jobs, but nothing crazy:

visualizer(prod)> SolidQueue::Job.count
=> 1_498_839

However, the more time goes on, the slower Mission Control got. So I decided to have a look and turns out there's this one query that's really slow:

CleanShot 2024-09-12 at 07 27 27@2x

SELECT DISTINCT "solid_queue_jobs"."queue_name" FROM "solid_queue_jobs";

The explain is pretty straight-forward and also it's clear why this would be slow on a large table.

HashAggregate  (cost=195015.40..195015.44 rows=4 width=8)
  Group Key: queue_name
  ->  Seq Scan on solid_queue_jobs  (cost=0.00..191278.52 rows=1494752 width=8)
JIT:
  Functions: 3
  Options: Inlining false, Optimization false, Expressions true, Deforming true

I couldn't find any #distinct code in this repo, but I believe the issue starts in ActiveJob::QueueAdapters::SolidQueueExt#queues which calls SolidQueue::Queue.all which does the distinct call.

While I believe the culprit is inside Solid Queue, you have an adapter/extension for it, so maybe the fix should be here? Or maybe the SolidQueue::Queue.all itself should handle queue names better since it could be used by other gems/code.

Luckily both gems have the same maintainers, so I leave the decision to you 😄

rosa commented 2 months ago

Hey @miharekar, thanks a lot for opening this issue and for having tested this gem and solid queue since the beginning! 🙏

Ahh, this is a problem in PostgreSQL indeed 😖 I initially designed this with MySQL, and for this DISTINCT query where there's an index prefixed on solid_queue_jobs.queue_name, MySQL uses a technique called Loose Index Scan. Even with over 50M rows in that table (in our app), MySQL runs that query almost instantly, thanks to this technique. I learnt later that PostgreSQL doesn't implement anything similar to this 😭 this why the query is so slow in your case.

Fixing this is tricky because there's no other way to get all queue names in your system. The only possible way is to check any of the execution tables, but these are ephemeral, so you would only see the queue names that have jobs at any given moment, and it's possible in many cases it'd be none.

This doesn't affect Solid Queue's regular functioning because SolidQueue::Queue.all is not used for it, only for Mission Control.

The only idea I have is to perform the DISTINCT in a subset of rows in that table, only if you're not using MySQL 🤔 You'll likely miss some queues but perhaps it's better than the super slow query?

Do you have other ideas?

miharekar commented 2 months ago

Ah, interesting challenge indeed 🤔

Could it return just the queue names from the config?

rosa commented 2 months ago

Unfortunately, that would only work if you have explicit names in your config for all queues, but it's quite common to just configure

queues: *

This is even the default 😬

miharekar commented 2 months ago

Maybe read from config and fallback to distinct? 😅

rosa commented 2 months ago

Yeah, I'll see how it could work... you can't also rely on the config because you can have multiple configurations (we do) or even Mission Control running as a separate app to control multiple apps, so you need to look at the registered workers' metadata 🤔 This would also miss the case where you have a queue and no workers defined for it, you'd need to look at solid_queue_ready_executions in that case. I think it could work, aggregating info from all those sources, and falling back to DISTINCT(queue_name) from solid_queue_jobs. If the adapter is MySQL, I'd go directly with the DISTINCT query, as it's much simpler.

I won't have time to work on this in the next few weeks, though 😬 I think this could live in the adapter here in the mission control repo, yes. If you or anyone wants to take a stab at it, please feel free! Otherwise I'll pick it up when I get back to work on this repo a few weeks from now.

miharekar commented 2 months ago

Thank you so much @rosa!

I've lasted months, so I can wait a few weeks 😄 But if I find time, I'll certainly try to poke at it!

christopher-b commented 1 month ago

I did some digging on this: it appears that PostgreSQL can replicate an efficient DISTINCT query using recursive common table expressions. But it's not pretty. My first draft at implementing this looks like this:

if connection.supports_common_table_expressions?
  recursive_cte_sql = "
    SELECT min(queue_name) AS queue_name FROM solid_queue_jobs
    UNION ALL
    SELECT (SELECT min(queue_name) FROM solid_queue_jobs WHERE queue_name > queue_names.queue_name)
    FROM queue_names WHERE queue_names.queue_name IS NOT NULL
  "
  SolidQueue::Job
    .with_recursive(queue_names: Arel.sql(recursive_cte_sql))
    .select("queue_name")
    .where.not(queue_name: nil)
    .from("queue_names as solid_queue_jobs")
end

It adds big chunk of complexity to what is currently a concise class. @rosa, if you think this is worth pursuing, I can clean this up, do more testing and make a proper PR on the solid_queue repo.