mojolicious / minion

:octopus: Perl high performance job queue
https://metacpan.org/release/Minion
Artistic License 2.0
221 stars 55 forks source link

`remove_after` no longer respects task dependencies #129

Open davidolrik opened 1 month ago

davidolrik commented 1 month ago

After the performance update in v10.27, Minion no longer respects its dependencies as stated in the documentation.

This means that a tree of dependant tasks might be severed if tasks in the "middle" finishes before the top and bottom of the tree.

I guess this is on purpose, but it totally breaks the excellent "flow" use case that Minion used to support.

Would it be possible to add a flag that toggles between the new and the old behaviour?


Expected behavior

This is what was done previously in Pg.pm:

  # Old jobs with no unresolved dependencies and expired jobs
  $db->query(
    "DELETE FROM minion_jobs WHERE id IN (
       SELECT j.id FROM minion_jobs AS j LEFT JOIN minion_jobs AS children
         ON children.state != 'finished' AND ARRAY_LENGTH(children.parents, 1) > 0 AND j.id = ANY(children.parents)
       WHERE j.state = 'finished' AND j.finished <= NOW() - INTERVAL '1 second' * ? AND children.id IS NULL
       UNION ALL
       SELECT id FROM minion_jobs WHERE state = 'inactive' AND expires <= NOW()
    )", $minion->remove_after
  );

Actual behavior

This is the current code in Pg.pm, that deletes all finished jobs

  # Old jobs
  $db->query("DELETE FROM minion_jobs WHERE state = 'finished' AND finished <= NOW() - INTERVAL '1 second' * ?",
    $minion->remove_after);

  # Expired jobs
  $db->query("DELETE FROM minion_jobs WHERE  state = 'inactive' AND expires <= NOW()");
kraih commented 1 month ago

The old behaviour already did not work when there were too many jobs in the queue unfortunately. But i don't quite understand your use case here. We only delete finished jobs, how could that negatively affect dependencies, where missing job and finished are treated the same?

davidolrik commented 1 month ago

I use Minion for flows, and use the task tree to render an overview of how far the process is, like this:

flowchart LR
task_a["Task A"] --> task_b["Task B"]
task_b --> task_c["Task C"]
task_c --> task_d["Task D"]

If any task in the middle finishes before A or D, and are then deleted, and D later fails I end up with no way of getting from D to A, and I have lost the context of who spawned the failing task D.

In essens I only want finished jobs to be deleted when they are not part of an "active" tree. I.e. when the job that have reached the state finished and have no unresolved dependencies.

kraih commented 1 month ago

Job B can't finish before A, that's what dependencies are all about.

kraih commented 1 month ago

Not that it matters much, the old query falls apart with a few million jobs in the queue. We couldn't bring it back even if we wanted.

HEM42 commented 1 month ago

at my company we are running Minion, with a throughput of roughly ~500k tasks a day. And ~1M tasks at most times in table.

for some reason we never did a PR on our fixed back with 10.22. we also noticed problems with large amount of jobs in the queue.

Our solution still works.

WITH interesting_jobs AS (
       SELECT
         parents,
         state
       FROM
         minion_jobs
       WHERE
         cardinality(parents) > 0
         AND state != 'finished' )
     DELETE FROM minion_jobs AS j
       WHERE (finished <= NOW() - INTERVAL '1 second' * 172800
       AND state = 'finished'
       AND NOT EXISTS ( SELECT 1 FROM interesting_jobs
         WHERE parents @> ARRAY[j.id] AND state != 'finished'))
       OR (expires <= NOW() AND state = 'inactive');

its fairly fast as well

explain analyze WITH interesting_jobs AS (
       SELECT
         parents,
         state
       FROM
         minion_jobs
       WHERE
         cardinality(parents) > 0
         AND state != 'finished' )
     SELECT FROM minion_jobs AS j
       WHERE (finished <= NOW() - INTERVAL '1 second' * 172800
       AND state = 'finished'
       AND NOT EXISTS ( SELECT 1 FROM interesting_jobs
         WHERE parents @> ARRAY[j.id] AND state != 'finished'))
       OR (expires <= NOW() AND state = 'inactive');
                                                                                       QUERY PLAN                                                                                       
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on minion_jobs j  (cost=23066.17..346802608.51 rows=67548 width=0) (actual time=463.859..28426.737 rows=11952 loops=1)
   Recheck Cond: (((finished <= (now() - '48:00:00'::interval)) AND (state = 'finished'::minion_state)) OR (state = 'inactive'::minion_state))
   Rows Removed by Index Recheck: 46175
   Filter: (((finished <= (now() - '48:00:00'::interval)) AND (state = 'finished'::minion_state) AND (NOT (SubPlan 1))) OR ((expires <= now()) AND (state = 'inactive'::minion_state)))
   Rows Removed by Filter: 26650
   Heap Blocks: exact=33423
   ->  BitmapOr  (cost=23066.17..23066.17 rows=135096 width=0) (actual time=90.466..90.467 rows=0 loops=1)
         ->  BitmapAnd  (cost=23044.85..23044.85 rows=135096 width=0) (actual time=89.573..89.573 rows=0 loops=1)
               ->  Bitmap Index Scan on minion_jobs_finished  (cost=0.00..2833.77 rows=153245 width=0) (actual time=30.673..30.673 rows=152942 loops=1)
                     Index Cond: (finished <= (now() - '48:00:00'::interval))
               ->  Bitmap Index Scan on minion_jobs_state_priority_id_idx  (cost=0.00..20177.06 rows=879551 width=0) (actual time=56.709..56.709 rows=883147 loops=1)
                     Index Cond: (state = 'finished'::minion_state)
         ->  Bitmap Index Scan on minion_jobs_state_priority_id_idx  (cost=0.00..4.43 rows=1 width=0) (actual time=0.893..0.893 rows=3 loops=1)
               Index Cond: (state = 'inactive'::minion_state)
   SubPlan 1
     ->  Bitmap Heap Scan on minion_jobs  (cost=289.45..6250.75 rows=23 width=0) (actual time=0.731..0.731 rows=1 loops=38599)
           Recheck Cond: ((parents @> ARRAY[j.id]) AND (cardinality(parents) > 0))
           Filter: ((state <> 'finished'::minion_state) AND (state <> 'finished'::minion_state))
           Rows Removed by Filter: 0
           Heap Blocks: exact=26678
           ->  BitmapAnd  (cost=289.45..289.45 rows=1663 width=0) (actual time=0.730..0.730 rows=0 loops=38599)
                 ->  Bitmap Index Scan on minion_jobs_parents_idx  (cost=0.00..41.90 rows=4989 width=0) (actual time=0.001..0.001 rows=1 loops=38599)
                       Index Cond: (parents @> ARRAY[j.id])
                 ->  Bitmap Index Scan on minion_jobs_cardinality_parents  (cost=0.00..247.30 rows=332571 width=0) (actual time=1.046..1.046 rows=28013 loops=26647)
 Planning Time: 0.401 ms
 Execution Time: 28427.121 ms
(26 rows)

we are currently moving to pg16, and found something interesting. The above query, finishes just a tiny bit faster.

BUT the most surprising is that the original code in Minion 10.22, actually runs fast.

running the query on the same data as above:

explain analyze SELECT FROM minion_jobs AS j
     WHERE (finished <= NOW() - INTERVAL '1 second' * 172800 AND NOT EXISTS (
       SELECT 1 FROM minion_jobs WHERE parents @> ARRAY[j.id] AND state != 'finished'
     ) AND state = 'finished') OR (expires <= NOW() AND state = 'inactive');
                                                                                       QUERY PLAN                                                                                       
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on minion_jobs j  (cost=23079.63..44015254.24 rows=67855 width=0) (actual time=106.526..240.670 rows=12766 loops=1)
   Recheck Cond: (((finished <= (now() - '48:00:00'::interval)) AND (state = 'finished'::minion_state)) OR (state = 'inactive'::minion_state))
   Rows Removed by Index Recheck: 46175
   Filter: (((finished <= (now() - '48:00:00'::interval)) AND (NOT (SubPlan 1)) AND (state = 'finished'::minion_state)) OR ((expires <= now()) AND (state = 'inactive'::minion_state)))
   Rows Removed by Filter: 26650
   Heap Blocks: exact=33512
   ->  BitmapOr  (cost=23079.63..23079.63 rows=135711 width=0) (actual time=100.451..100.452 rows=0 loops=1)
         ->  BitmapAnd  (cost=23058.24..23058.24 rows=135711 width=0) (actual time=99.631..99.632 rows=0 loops=1)
               ->  Bitmap Index Scan on minion_jobs_finished  (cost=0.00..2847.00 rows=153943 width=0) (actual time=32.276..32.276 rows=153756 loops=1)
                     Index Cond: (finished <= (now() - '48:00:00'::interval))
               ->  Bitmap Index Scan on minion_jobs_state_priority_id_idx  (cost=0.00..20177.06 rows=879551 width=0) (actual time=64.983..64.983 rows=883147 loops=1)
                     Index Cond: (state = 'finished'::minion_state)
         ->  Bitmap Index Scan on minion_jobs_state_priority_id_idx  (cost=0.00..4.43 rows=1 width=0) (actual time=0.819..0.819 rows=3 loops=1)
               Index Cond: (state = 'inactive'::minion_state)
   SubPlan 1
     ->  Bitmap Heap Scan on minion_jobs  (cost=42.04..16158.52 rows=591 width=0) (actual time=0.001..0.001 rows=1 loops=39416)
           Recheck Cond: (parents @> ARRAY[j.id])
           Filter: (state <> 'finished'::minion_state)
           Rows Removed by Filter: 0
           Heap Blocks: exact=26678
           ->  Bitmap Index Scan on minion_jobs_parents_idx  (cost=0.00..41.90 rows=4989 width=0) (actual time=0.001..0.001 rows=1 loops=39416)
                 Index Cond: (parents @> ARRAY[j.id])
 Planning Time: 0.322 ms
 Execution Time: 241.096 ms
(24 rows)

Ok, while collecting this information, and writting. another solution presented itself

explain analyze SELECT FROM minion_jobs
WHERE id IN (
(
SELECT id FROM minion_jobs WHERE state = 'finished' AND finished <= NOW() - INTERVAL '1 second' * 172800
EXCEPT
SELECT unnest(parents) AS id FROM minion_jobs WHERE state != 'finished'
)
UNION
SELECT id FROM minion_jobs WHERE state = 'inactive' AND expires <= NOW());
                                                                                        QUERY PLAN                                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=256494.17..285054.46 rows=138181 width=0) (actual time=427.742..482.616 rows=15610 loops=1)
   Hash Cond: (minion_jobs.id = "*SELECT* 1".id)
   ->  Index Only Scan using minion_jobs_pkey on minion_jobs  (cost=0.42..25941.71 rows=997714 width=8) (actual time=0.043..35.708 rows=999174 loops=1)
         Heap Fetches: 11
   ->  Hash  (cost=254766.48..254766.48 rows=138181 width=8) (actual time=417.175..417.176 rows=15610 loops=1)
         Buckets: 262144  Batches: 1  Memory Usage: 2658kB
         ->  HashAggregate  (cost=252305.13..254766.48 rows=138181 width=8) (actual time=415.115..415.903 rows=15610 loops=1)
               Group Key: "*SELECT* 1".id
               Planned Partitions: 4  Batches: 1  Memory Usage: 3857kB
               ->  Append  (cost=0.00..244877.90 rows=138181 width=8) (actual time=411.209..413.750 rows=15610 loops=1)
                     ->  Result  (cost=0.00..244178.55 rows=138180 width=8) (actual time=411.208..413.247 rows=15610 loops=1)
                           ->  HashSetOp Except  (cost=0.00..242796.75 rows=138180 width=12) (actual time=411.207..412.678 rows=15610 loops=1)
                                 ->  Append  (cost=0.00..239497.23 rows=1319810 width=12) (actual time=0.210..398.082 rows=69663 loops=1)
                                       ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..111576.08 rows=138180 width=12) (actual time=0.210..224.810 rows=42257 loops=1)
                                             ->  Seq Scan on minion_jobs minion_jobs_1  (cost=0.00..110194.28 rows=138180 width=8) (actual time=0.209..222.820 rows=42257 loops=1)
                                                   Filter: ((state = 'finished'::minion_state) AND (finished <= (now() - '48:00:00'::interval)))
                                                   Rows Removed by Filter: 956917
                                       ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..121322.10 rows=1181630 width=12) (actual time=0.046..170.768 rows=27406 loops=1)
                                             ->  ProjectSet  (cost=0.00..109505.80 rows=1181630 width=8) (actual time=0.046..169.658 rows=27406 loops=1)
                                                   ->  Seq Scan on minion_jobs minion_jobs_2  (cost=0.00..102711.43 rows=118163 width=13) (actual time=0.040..157.409 rows=116027 loops=1)
                                                         Filter: (state <> 'finished'::minion_state)
                                                         Rows Removed by Filter: 883147
                     ->  Index Scan using minion_jobs_state_priority_id_idx on minion_jobs minion_jobs_3  (cost=0.42..8.45 rows=1 width=8) (actual time=0.012..0.012 rows=0 loops=1)
                           Index Cond: (state = 'inactive'::minion_state)
                           Filter: (expires <= now())
                           Rows Removed by Filter: 3
 Planning Time: 0.520 ms
 Execution Time: 484.704 ms
(28 rows)

this runs fast on both pg13 and pg16 (above is from pg16)

if interested, I will make a proper PR for the latest solution we have

kraih commented 1 month ago

I'm not opposed to a PR if there really are no performance issues with the solution. But i'm still quite sceptical about the use cases. Do we really need this? If we treat this as a new feature, can we find two unique use cases for it?

kraih commented 1 month ago

For the first use case i will count debugging the whole chain of jobs that led to a failure somewhere in the middle. Do we have a second?

davidolrik commented 1 month ago

I once build a provisioning flow for a telco where we had automated the whole process of moving a cellphone number between providers and becoming a new customer utomated using minion.

The whole process could take anywhere from 1 day to several months to complete depending on notice periods. For this type of flow it would have been catastrophic if the first part of the flow was deleted because it was finished.

Having the whole flow stored allowed us to render it in the "service agent ui", making it quite easy for everyone to see when the different task were performed and how many tasks there are left before the flow is done.

So second use-case would be possibility of "flow overview" (for non-technical people).

This together with the debugging case are my two favourite things about Minion.

christopherraa commented 1 month ago

At $work we are heavily using Minion for such flows, and for us dependencies have been valuable both for debugging and "flow overview".

At one client of ours there could also be pretty catastrophic if parts of a flow for long-running flows were deleted, as the later parts of a flow would compile an end-dataset based on data stored on the individual jobs in a flow. So missing parts of a chain would represent incomplete data. Naturally these datasets could be stored elsewhere, but that has not been required since minion has made this storage of job results so easy and convenient.

tobez commented 1 month ago

One thing to consider: the query in 10.22, the different query in 10.23, and both versions of the query that @HEM42 posted are equivalent to each other if we ignore the performance issues. The code introduced in 10.27 thus breaks POLA, hard.

kraih commented 1 month ago

I suppose that's good enough for two use cases. Now we just need a solution that 100% won't cause performance issues.

rshingleton commented 1 month ago

I was just coming to submit a bug report on this. The "old" routine with the union all causes our entire queue to get in a broken state and locks the entire queue.

rshingleton commented 1 month ago

Just as an aside to the OP's request, I had temporarily fixed this by simply creating a loop to delete with a limit. This allowed full deletion in "chunks". The following was a quick hack to get my queue back in working order, but could probably be refined some:

my $db = $minion->backend->pg->db;
my $limit = 50000;
my $deleteAfterSeconds = 86400;

my $data = getIds();
my $count = 0;
while (defined($data) && scalar(@{$data})) {
    my $records = scalar(@{$data});
    $count += scalar($records);
    say "deleting $records records";
    runRepairJobs($data);
    $data = getIds();
    say "deleted $count records";
    sleep 2;
}

sub runRepairJobs {
    my $results = shift;

    my @deletes;
    for my $item (@{$results}) {
        # say $item->{id};
        push(@deletes, $item->{id});
    }

    if (scalar(@deletes)) {
        my $ids = join(',', @deletes);
        my $delete = "DELETE FROM minion_jobs WHERE id IN ($ids)";
        $db->query($delete)
    }

}

sub getIds {
    my $query = qq{
        SELECT j.id FROM minion_jobs AS j
          LEFT JOIN minion_jobs AS children
             ON children.state != 'finished'
             AND ARRAY_LENGTH(children.parents, 1) > 0
             AND j.id = ANY(children.parents)
          WHERE j.state = 'finished'
             AND j.finished <= NOW() - INTERVAL '1 second' * $deleteAfterSeconds
             AND children.id IS NULL
        UNION ALL
        SELECT id FROM minion_jobs
             WHERE state = 'inactive'
             AND expires <= NOW()
             LIMIT $limit
    };

    my $results = $db->query($query)->hashes->to_array;

    return $results;
}
rshingleton commented 1 month ago

As an aside, and a quick question, why does the remove_after not delete failed jobs?

christopherraa commented 1 month ago

Speaking from our own situasjon at $work: usually failed jobs is something that should be investigated, at least for us and our clients. If a minion-job has failed it is an indication that we have done something wrong or that there is a system-specific anomaly going on. So we both have alerting on failed jobs (through prometheus-metrics) and follow up with investigative steps if the "failed job"-alerts fire. Please note though that we have jobs where the unit of work might indeed fail, but where such failure is considered acceptable. Such results would not cause a job to fail.

rshingleton commented 1 month ago

That's certainly understandable, though I would think critical failures would be handled within a reasonable amount of time and the data at some point would just be cruft hanging around in the table forever. Maybe this discussion, belongs in another place, but I think it should be addressable in the framework somehow.

christopherraa commented 1 month ago

Discussion is always good, and yes perhaps this could be a discussion on IRC, but I don't really mind having it here as it'll then be discoverable for other at a later time.

As to your point about "reasonable amount of time" I absolutely agree. The problem is that what a reasonable time is does vary quite a bit. For some of our workloads a failed minion-job should prompt immediate action, as in "drop whatever you are doing, all hands on deck"-type immediate. But I do know of at least one entity that have minion-jobs that take multiples of hours to complete, sometimes may fail and (due to some business domain-specific weirdness) cannot be retried more than once a week. The latter is a fringe case yes, but there are users who use minion for that kind of thing. For us at $work we cannot let any jobs "disappear", ie being removed by some cleanup process. I am guessing here, but I imagine this is true for others as well.

Your point about "cruft hanging around in the table forever" is something that I also absolutely agree would be bad. Though I would say that if you allowed that to happen then you have bad data hygene. Just accumulating a lot of stuff (failed or successful) is generally something I consider bad practice unless you have a specific use for that data later.

I guess what I am saying is that failed jobs is something that in my mind should be adressed specifically. Exactly how this is done could of course vary from company to company. Some might do like us and have technical personell investigate and either remove, retry immediately or schedule for a delayed retry. Others migth have workloads that are expected to fail, often do, and thus have scheduled tasks that analyze the current failed tasks and purges those that are consider "failed but OK".

All that being said, I am very curious to hear how other people / orgs use minion and how they deal with these kinds of issues.

kraih commented 1 month ago

Are there even multiple use cases for failed jobs that should just vanish that couldn't be handled by an eval in the task itself?

rshingleton commented 4 weeks ago

As I'm on-boarding this in my scenario I'm using it for temporary visibility mainly. I'm using Minion to handle task automation in multiple environments running against about 12,000 servers globally. The busiest environment is ~8000-9000 servers and is taking ~150,000 jobs per hour. The finalized job states and results get stored in a long term database and associated with different applications. Any job failures are handled in an eval/try-catch block and incidents are generated in another system. The sheer number of potential job failures due to one reason or another is too much to monitor individually or at the time of failure.

For my use case I set up a separate clean-up routine that runs periodically in my scheduler that validates states between the Minion queue and my job schedule tables in another database. At the same time I run the following:

 my $seconds = 172800;  # 2 days
 $db->query("DELETE FROM minion_jobs WHERE state = 'failed' AND finished <= NOW() - INTERVAL '1 second' * ?",
        $seconds);

This is the same query as in the repair routine except taking care of failed jobs, it suits my needs. I feel like this should be the default with an option to retain failed records with something like:

$minion = $minion->repair({ retain => 'failed' }};

If an exception hasn't been handled or extracted into a different workflow in the 2 day window is it really a critical problem?

rabbiveesh commented 4 weeks ago

A breaking change to clip failed tasks is completely unacceptable. Maybe an opt-in feature, but if you really don't care about the tasks failing, then why don't you just catch the errors and make the jobs succeed?

kraih commented 4 weeks ago

If an exception hasn't been handled or extracted into a different workflow in the 2 day window is it really a critical problem?

Weekends happen. This is not about critical errors, but errors that require user intervention.

rshingleton commented 4 weeks ago

Understood. I didn't consider it being a breaking change, didn't think about it. In my use case, I report the errors outside the queue for our Operations Center and product "owners" to review and I review/follow the queue. Anything I would miss in the queue failures that is a critical application problem is reported to the application. Most things I report as failures in the queue are (mostly) issues I would need to handle as the Minion owner.

I have considered just marking everything as finished, but I've opted not to because I like to do some reporting off those statistics and it gives me a quick visual of what's happening. If I see a sudden uptick of failures on the queue UI, it's probably indicative of a network issue or some other global issues. For instance, this week I had a slew of DNS issues where some SSH tasks were not able to make connections. I can see the errors spike from the queue and if it's a long term thing they are also (in most cases) reported off somewhere else as well.

I was bringing this up because I have been working on some capacity management in at least one of my queues where at differing levels of items in the minion_jobs table, the performance of the queue begins to degrade. The issue is closed, but I would like to suggest reviewing my last post (Minion becomes slow, admin unusable with processed jobs between 2-3 million records that I made this morning concerning the stats query from the backend which seems to be directly related to how many items are in the table.

I've added functionality to take care of this for me. I don't advocate for a breaking change. Perhaps and option for $minion->repair({ discard => 'all' }} or something similar would be a better solution if one were to be adopted.