reanahub / reana-server

REANA API server
http://reana-server.readthedocs.io/
MIT License

scheduler: avoid "infinite" requeuing #436

Closed: tiborsimko closed this issue 2 years ago

tiborsimko commented 2 years ago

Current behaviour

Observed on REANA 0.8.0.

If some workflow cannot be scheduled for one reason or another, it is requeued for later:

$ kubectl logs reana-server scheduler
2022-02-07 10:03:47,745 | root | MainThread | INFO | Received workflow: {'user': 'uuuu', 'workflow_id_or_name': 'wwww', 'parameters': {'input_parameters': {}, 'operational_options': {}}, 'priority': 98, 'min_job_memory': 12884901888.0}
2022-02-07 10:03:47,943 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-07 10:03:52,950 | root | MainThread | INFO | Received workflow: {'user': 'uuuu', 'workflow_id_or_name': 'wwww', 'parameters': {'input_parameters': {}, 'operational_options': {}}, 'priority': 98, 'min_job_memory': 12884901888.0}
2022-02-07 10:03:53,289 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-07 10:03:58,291 | root | MainThread | INFO | Received workflow: {'user': 'uuuu', 'workflow_id_or_name': 'wwww', 'parameters': {'input_parameters': {}, 'operational_options': {}}, 'priority': 98, 'min_job_memory': 12884901888.0}
2022-02-07 10:03:58,563 | root | MainThread | INFO | REANA not ready to run workflow WWWW.

This can go on for a very long time:

$ kubectl logs reana-server scheduler | grep 'REANA not ready to run workflow WWWW.' | head
2022-02-07 10:03:47,943 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-07 10:03:53,289 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-07 10:03:58,563 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-07 10:04:03,790 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-07 10:04:08,992 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-07 10:04:14,184 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-07 10:04:19,376 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-07 10:04:24,583 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-07 10:04:29,778 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-07 10:04:34,981 | root | MainThread | INFO | REANA not ready to run workflow WWWW.

$ kubectl logs reana-server scheduler | grep 'REANA not ready to run workflow WWWW.' | tail
2022-02-08 09:35:20,398 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-08 09:35:25,623 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-08 09:35:30,981 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-08 09:35:36,365 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-08 09:35:41,662 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-08 09:35:46,888 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-08 09:35:52,188 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-08 09:35:57,434 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-08 09:36:02,691 | root | MainThread | INFO | REANA not ready to run workflow WWWW.
2022-02-08 09:36:07,955 | root | MainThread | INFO | REANA not ready to run workflow WWWW.

$ kubectl logs reana-server scheduler | grep 'REANA not ready to run workflow WWWW.' | wc -l
16066

This can cause scheduling trouble when a workflow has a very high priority.

Expected behaviour

If a workflow cannot be scheduled after more than N attempts, it is better to make it fail, report this back to the user, and "liberate the queue".

We should make this behaviour configurable and expose it to cluster admins via Helm environment variables, similarly to REANA_SCHEDULER_REQUEUE_SLEEP, so that cluster admins can decide how many times the rescheduling of a workflow may fail.

We should configure reasonable defaults, so that the above situation, where a workflow was unsuccessfully rescheduled more than 16000 times, would not happen.

(A good default might be 100 or 200 times, which with the default requeue sleep of 15 seconds would give up to about an hour.)

PRs:

audrium commented 2 years ago

We had a retry count mechanism in the past, which was implemented in this PR set (context can be found here); probably some of the logic could be reused. Unfortunately, I don't remember why we decided to remove it.

tiborsimko commented 2 years ago

Using the MQ delayed plugin and dead letter exchange etc. could perhaps be too complex... We might perhaps just count the number of "not ready to run workflow uuuu" situations inside the app in a global dictionary and rely on that? Basically, I guess we can aim at addressing just this particular requeuing reason, not aiming at a global solution. We don't have so many workflows to start in parallel, so simply holding this in memory and removing once queued successfully might do?
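
For illustration, a minimal sketch of the in-memory counting idea (hypothetical names and default, not the actual reana-server code):

from collections import defaultdict

MAX_NOT_READY_RETRIES = 200  # assumed default, to be made configurable


class RequeueTracker:
    """Count per-workflow "not ready" requeues and give up after a limit."""

    def __init__(self, limit=MAX_NOT_READY_RETRIES):
        self.limit = limit
        self.counts = defaultdict(int)  # workflow_id -> failed scheduling attempts

    def should_requeue(self, workflow_id):
        """Return True if the workflow may be requeued one more time."""
        self.counts[workflow_id] += 1
        return self.counts[workflow_id] <= self.limit

    def forget(self, workflow_id):
        """Drop the counter once the workflow has been scheduled successfully."""
        self.counts.pop(workflow_id, None)

As noted below, such a counter lives only in the scheduler process memory, so it is lost on a pod restart.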

audrium commented 2 years ago

Using the MQ delayed plugin and dead letter exchange etc. could perhaps be too complex... We might perhaps just count the number of "not ready to run workflow uuuu" situations inside the app in a global dictionary and rely on that? Basically, I guess we can aim at addressing just this particular requeuing reason, not aiming at a global solution. We don't have so many workflows to start in parallel, so simply holding this in memory and removing once queued successfully might do?

An alternative approach would be to hold the retry count in the message headers, like it was done here while publishing the message and here while consuming the message. This could even be simpler than keeping track of it in memory?
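
A rough sketch of the header-based variant using kombu (the queue, exchange and header names here are illustrative, not the actual REANA ones; re-publishing would also need to carry over the original message priority, omitted here):

from kombu import Exchange, Queue

exchange = Exchange("workflow-submission", type="direct")
queue = Queue("workflow-submission", exchange, routing_key="workflow-submission")


def requeue_with_retry_count(message, body, producer, max_retries=200):
    """Re-publish the message, carrying the retry count in a header."""
    retry_count = (message.headers or {}).get("x-retry-count", 0) + 1
    if retry_count > max_retries:
        # Give up: acknowledge the message and mark the workflow as failed instead.
        message.ack()
        return
    producer.publish(
        body,
        exchange=exchange,
        routing_key="workflow-submission",
        headers={"x-retry-count": retry_count},
        declare=[queue],
    )
    message.ack()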

VMois commented 2 years ago

We might perhaps just count the number of "not ready to run workflow uuuu" situations inside the app in a global dictionary and rely on that

This is not a good approach. In case of consumer restart, we lose the data. @audrium's approach is better. We could use the header and re-send the message to the queue again.

It would be good to know why the RabbitMQ delay plugin didn't work; it looks like a nice solution.

VMois commented 2 years ago

One issue with the manual re-sending approach is that we end up with a non-obvious mechanism to delay a message: we need to set retry_count to a high number so that, together with the requeue sleep of 15 seconds, we get a message TTL of about an hour or two.

In addition, the sleep blocks the whole queue from processing other messages. In case of an error, the queue is blocked for 15 seconds. This is not a problem with few workflows, but it can be an issue with a high number of them.

Also, there is the question of the scheduling algorithm and the ordering of workflows, raised before here.

I like the RabbitMQ delay plugin approach, so it would be good to know why it was discarded.

tiborsimko commented 2 years ago

This is not a good approach. In case of consumer restart, we lose the data

That is not a big deal. The workflow would simply be requeued for one more hour. The pod restarts very rarely, so it is not very probable that this would happen... That said, acting on a retry count is definitely better, if that solution is robust. I was simply musing about a simple concrete solution to fix the problem at hand without having to rely on those plugins.

VMois commented 2 years ago

I created a demo with simplified retry and priority - https://github.com/VMois/rabbitmq-python-kombu-retry . You can check the scheduling_priority folder, and output.txt for an example of consumer output. With this approach, if a high-priority workflow fails, it will usually "skip" one cycle and repeat again until MAX_RETRIES is reached.

I don't think having something like 100 retries is a viable choice in this approach. In the worst case, when a workflow fails all the time, it can delay other workflows by quite some time. I would propose a number like 5 retries: if the cluster is still busy after 5 tries, the workflow should be considered as failed.

VMois commented 2 years ago

In the current scheduler, it is not possible, without modification, to know what the error was about (workflow limit reached or not enough memory), because reana_ready returns only a True/False value.

question: Not sure how we should respond to users in case of scheduling errors. Any ideas?

In addition, workflows in the queued state cannot transition to failed, so we will need to modify the workflow transitions too.
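
A hypothetical sketch of what returning a reason instead of a bare boolean could look like (illustrative names; the real reana_ready checks are more involved):

from enum import Enum


class NotReadyReason(Enum):
    READY = "ready"
    WORKFLOW_LIMIT_REACHED = "maximum number of running workflows reached"
    NOT_ENOUGH_MEMORY = "not enough memory available in the cluster"


def check_readiness(running_workflows, max_workflows, free_memory, min_job_memory):
    """Return (ready, reason) instead of a plain True/False value."""
    if running_workflows >= max_workflows:
        return False, NotReadyReason.WORKFLOW_LIMIT_REACHED
    if free_memory < min_job_memory:
        return False, NotReadyReason.NOT_ENOUGH_MEMORY
    return True, NotReadyReason.READY

With a reason available, the scheduler could also report something meaningful back to the user when it finally gives up.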

audrium commented 2 years ago

I created a demo with simplified retry and priority - https://github.com/VMois/rabbitmq-python-kombu-retry . You can check the scheduling_priority folder, and output.txt. With this approach, if a high-priority workflow fails, it will usually "skip" one cycle and repeat again until MAX_RETRIES is reached.

I had a brief look into the demo. In my opinion it's not a problem to skip one cycle for a requeued message. I'm curious why it was not skipped here? Also, the demo uses rabbitmq-delayed-message-exchange, but the same functionality could be implemented without it, as it was done in the PRs listed in my previous comment.

I don't think having something like 100 retries is a viable choice in this approach. In the worst case, when a workflow fails all the time, it can delay other workflows by quite some time. I would propose a number like 5 retries: if the cluster is still busy after 5 tries, the workflow should be considered as failed.

With only 5 retries and a delay of 15 seconds, the workflow would fail already after 1 minute and 15 seconds if there are no other workflows in the queue. I think it's a bit too early. If we have a busy cluster running at full capacity, there could be not enough memory for a workflow to start immediately, and in this case it should stay a bit longer in a queue.

Another idea to prevent the worst-case scenario of a never-starting workflow would be to decrease workflow priority while requeuing. In this case a high-priority workflow which is constantly looping would let other workflows start and would not block the queue. WDYT?
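
For illustration, a tiny sketch of the priority decrease on requeue (not actual REANA code; the step size is an assumption):

PRIORITY_DECREASE_PER_RETRY = 5  # assumed step, to be tuned


def requeue_priority(current_priority, retry_count, step=PRIORITY_DECREASE_PER_RETRY):
    """Priority to use when the workflow is put back into the queue."""
    return max(0, current_priority - retry_count * step)


# Example: a workflow submitted with priority 98 (as in the logs above)
# would be requeued with priority 93, 88, 83, ... on successive failures.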

VMois commented 2 years ago

I'm curious why it was not skipped here?

My guess: RabbitMQ needs to re-order messages in the queue due to priority. From time to time, it can re-prioritize messages faster, so the cycle is not skipped.

the demo uses rabbitmq-delayed-message-exchange

The plugin is disabled; you can check the Dockerfile.

VMois commented 2 years ago

I think it's a bit too early. [...] stay a bit longer in a queue.

This is a good point. Right now, the workflow will stay in the queue forever until it is scheduled. We might need to differentiate between real errors and cases where we should simply wait longer for the cluster to be free. For example, certain errors like "not a single node can accommodate your workflow" should return errors immediately. Some other errors like "max number of workflows that can run is reached" should reschedule the workflow forever until it is scheduled or the user stops it (mimicking the current behaviour, I guess).

to decrease workflow priority while requeuing.

Yes, it looks like a better idea. But we will need to think about what impact it might have on users, and we will need to re-test the scheduling algorithm. Adjusting the demo I created should help a bit.

Also, I haven't tested multiple failed workflows at the same time. A rare event, but still. This might introduce an even bigger delay. That is why I liked the delay plugin. It would be really good to know what exactly was wrong with the delay plugin. I heard about some issues with priority, but no details.

VMois commented 2 years ago

"not a single node can accommodate your workflow" should return errors immediately.

This error is not correct. My mistake. reana_ready checks for available memory in nodes.

VMois commented 2 years ago

question(blocking):

If a workflow cannot be scheduled after more than N attempts, it is better to make it fail, report this back to the user, and "liberate the queue".

If the workflow is okay but the cluster is just busy, should we still apply the retry mechanism, or allow it to be requeued indefinitely until the user decides to stop it? @tiborsimko

For example, I can start a workflow, it gets queued, I go to sleep, wake up the next day, and see the results. With the retry mechanism applied, the workflow will fail after, let's say, 4 hours, which will not make the user happy, I guess.

Let's say we set up some big wait time like 24 hours. If the error is due to an incorrect workflow, we will have it stuck in the queue for 24 hours. As a solution for this, we can have different retry counts for "busy cluster" and "problematic workflow" errors. It would be good to get some direction on this, as it is not clear to me.

BTW, currently users wait an infinite amount of time until the cluster is ready, which can be bad and, to some degree, good.

tiborsimko commented 2 years ago

we can have different retry counts for "busy cluster" and "problematic workflow" errors

Very valid considerations. Here's my take on this:

TL;DR For the issue at hand, I think it would be sufficient to requeue at most N times, regardless of the reason, where N is the value fixed by the admin during deployment. (I.e. the middle point above.)

VMois commented 2 years ago

based on what the cluster admin decides to set up during cluster deployment via Helm values.

How should these values look in the Helm chart?

For now, we have the requeue sleep, which equals 15 seconds by default. Because we are not using the delay plugin, our delay is approximated by the number of retries multiplied by the requeue sleep. So if we want a 2-hour delay and the requeue sleep is 15 seconds, we will need around (2 * 3600) / 15 = 480 retries.

Why approximately? Because the total delay time is not constant: it depends on the other workflows in the queue. Example of the current behaviour:

Other workflows present

Workflow 1 failed... retry
Workflow 2 scheduled
Workflow 1 failed... retry 1
Workflow 3 scheduled
Workflow 1 failed... no more retries left
# took 5 * 15 = 75 seconds to fail workflow 1

One workflow only

Workflow 1 failed... retry
Workflow 1 failed... retry 1
Workflow 1 failed... no more retries left
# took 3 * 15 seconds = 45 seconds to fail workflow 1

Details

I see two approaches:

A. Allow specifying retry_count directly in Helm. In this case, it is the responsibility of the admin to calculate the desired delay time and set the requeue sleep and retry count values.

B. Allow specifying the delay time in Helm and calculate the number of retries ourselves, based on the requeue sleep and the delay time provided. It is a bit trickier to implement: for example, what if we set the requeue sleep to zero, as we did for the ATLAS stress tests? (See the sketch below.)

Honestly, not a fan of either approach, but something needs to be chosen.

Which approach is better for now? Is any other approach possible?
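
For approach B, the conversion itself would be simple arithmetic; the awkward corner is precisely the zero requeue sleep mentioned above. A sketch under those assumptions:

import math


def retries_from_delay(total_delay_seconds, requeue_sleep_seconds):
    """Approximate number of retries needed to cover the requested delay.

    The result is only a lower bound: processing other queued workflows
    in between makes the real delay longer (see the examples above).
    """
    if requeue_sleep_seconds <= 0:
        # With a zero requeue sleep, the delay cannot be derived from the
        # retry count at all; an explicit maximum would be needed instead.
        raise ValueError("cannot derive a retry count from a zero requeue sleep")
    return math.ceil(total_delay_seconds / requeue_sleep_seconds)


# Example from the comment above: a 2-hour delay with a 15-second sleep.
assert retries_from_delay(2 * 3600, 15) == 480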

tiborsimko commented 2 years ago

Option A is what I had in mind in the original description; see the suggested value of 100-200 above.

We currently have REANA_SCHEDULER_REQUEUE_SLEEP, so the new Helm value could be called REANA_SCHEDULER_REQUEUE_COUNT for example.
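
On the application side this could then be consumed like the existing sleep setting; a minimal sketch (the variable name and the default of 200 are just the suggestion from this thread, not merged code):

import os

# Mirrors how REANA_SCHEDULER_REQUEUE_SLEEP is consumed; name and default
# follow the suggestion in this issue and may change.
REANA_SCHEDULER_REQUEUE_COUNT = int(
    os.getenv("REANA_SCHEDULER_REQUEUE_COUNT", "200")
)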