Nomad stops rescheduling lost allocations of a batch job

Fuco1 commented 3 months ago

Nomad version

Nomad v1.8.2
BuildDate 2024-07-16T08:50:09Z
Revision 7f0822c1e4f25907d9f60e2d595411950dd1bd28

Operating system and Environment details

Linux nomad-server-00000A 6.2.0-1011-azure #11~22.04.1-Ubuntu SMP Wed Aug 23 19:26:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Issue

I have a job with the following settings on a group:

count = 30000

## If the allocation fails, it's probably a programming error and I
## don't want to restart it forever.
restart {
    attempts = 2
    delay    = "15s"
    interval = "24h"
    mode     = "fail"
}

## Reschedule lost allocations forever.
reschedule {
    delay          = "5s"
    delay_function = "constant"
    unlimited      = true
}

## Replace the existing allocation with a new one on a different
## machine, the old machine will never come back because it was
## deleted by cloud provider to reclaim capacity.
disconnect {
    lost_after = "1m"
    replace    = true
    reconcile  = "keep_replacement"
}

The job runs some mathematical calculations, each task takes about 5-20 minutes to complete. We don't care about the allocations being interrupted as they can simply be run again to produce the same result and therefore we run them on cloud "spot instances", where the cloud provider can delete the machines at any point when they require that capacity for reserved customers. Therefore, many allocations often become lost (not failed).

With the above settings, I'm trying to have Nomad try to reschedule lost allocations forever on the remaining fleet which we also periodically rescale up to the required target number (currently we are aiming for 9400 CPUs with 1 CPU per task).

However, when the amount of lost allocations goes over about 2000, nomad stops rescheduling them and at the end of the day, there are often up to 250 tasks which did not finish.

Are the lost allocations treated somehow in a special way or differently than failed ones? Is there any server / client config I could change? I can provide more info but I don't really know what is relevant.

Reproduction steps

I don't have reproduction steps.

Expected Result

All the tasks finish even after being rescheduled multiple times, since I set the policy to unlimited reschedules.

Actual Result

Some tasks are never completed.

jrasell commented 3 months ago

Hi @Fuco1 and thanks for raising this issue. I've taken a look through the task job specification options and I would expect "lost" allocations to be replaced as you too expect.

One thing that did cross my mind was whether there was a need for the disconnect block? Without this, a spot instance going away would simply result in the allocation being failed and rescheduled. I suspect though this config option is being used to account for temporary network instability or other general intermittent gremlins?

It might be useful to look at the status of a lost allocation and try and track any followup evaluations and other objects as a result. Trying to piece together the series of events of one allocation which is impacted would be very useful.

Fuco1 commented 3 months ago

Thanks @jrasell for the response.

One thing that did cross my mind was whether there was a need for the disconnect block?

I only added this recently, maybe two weeks ago when I upgraded from 1.7 to 1.8 where this was added. It has no other purpose than me thinking it might help :blush: The restart and reschedule blocks were like this also on 1.7.

Without this, a spot instance going away would simply result in the allocation being failed and rescheduled.

I'm not sure about the failed here. The allocations are definitely not marked as failed, at least not in what the nomad API reports. Do you mean internally Nomad would treat them same as failed? I didn't read the code but it would be interesting how the reschedule block is interpreted in the case of the lost allocations (I can imagine the restart block only really makes sense for failed, since you can't "restart" a lost allocation)

We are also having a weird thing come up in these cases, when the number of completed jobs is higher than the number of allocations. Let's say I set count = 10k, then I see 10k allocations created, but then after some time and some of them being lost, there will be 10200 completed. Where are these 200 coming from? I understand that completed + failed + lost >= count, because one allocation can be failed and then completed. But I don't understand how can there be more completed allocations than submitted.

It might be useful to look at the status of a lost allocation and try and track any followup evaluations and other objects as a result. Trying to piece together the series of events of one allocation which is impacted would be very useful.

I just got an idea, is there some internal queue max length that the evaluations will simply get dropped? We are also seeing that the leader server is hitting very close to max RAM at some point during the computation. Maybe some corruption there? When we started couple years ago with smaller workloads we were using 2 GB, now we're using 3 server nodes with 8 GB each.

I will try to configure the server nodes to dump some debug logs (if there is such an option) to elastic and maybe get something out of it.

hashicorp / nomad