hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Restarted node takes a long time to become useful due to re-processing already terminal allocations #20354

Open marvinchin opened 5 months ago

marvinchin commented 5 months ago

Nomad version

1.6.x

Operating system and Environment details

Unix

Issue

When a client is restarted, it spends a long time on startup re-processing allocations that are already known to be terminal and have previously been GC-ed from its state.

As a result, the client node takes a long time to start new allocations assigned to it, because it does not poll the server for new allocations until it has finished processing the initial set it receives. In our environment, we have observed this delay to be on the order of minutes for busy client nodes.

This is similar to https://github.com/hashicorp/nomad/issues/16381, but focuses on the impact on the client. https://github.com/hashicorp/nomad/pull/17074 fixed the server-side issue that caused the modify index of allocations to be bumped on client restart, but the impact on the client still remains.

Relevant code

Looking at the client's logic for filtering out allocations to process: https://github.com/hashicorp/nomad/blob/594fedbfbc4f0e532b65e8a69b28ff9403eb822e/client/client.go#L2374-L2384

It appears that it will pull allocations which are both:

- already in a terminal state, and
- absent from the client's local state (no alloc runner, because they were previously GC-ed).

My understanding is that such allocations need not be pulled, since they are already known to be terminal and there is no local state for them. If this is true, then we can simply update the filtering logic to filter out such allocations, and that should fix the issue.
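
A minimal, self-contained sketch of this idea, using simplified hypothetical types rather than the real client structures. Note that it assumes the client can see an alloc's terminal status in the server response, which, as discussed below, today's NodeClientAllocsResponse does not actually provide, so an RPC change would be needed as well:

package main

import "fmt"

// Hypothetical, simplified stand-in for what the server reports per alloc.
// The Terminal field does not exist in today's NodeClientAllocsResponse;
// it is assumed here for the sake of the sketch.
type allocUpdate struct {
	ModifyIndex uint64
	Terminal    bool
}

// allocsToPull returns the alloc IDs the client would still need to pull,
// skipping allocs that are already terminal and have no local state
// (i.e. no alloc runner, typically because they were GC-ed before a restart).
func allocsToPull(updates map[string]allocUpdate, local map[string]uint64) []string {
	var pull []string
	for id, u := range updates {
		localIndex, known := local[id]
		switch {
		case !known && u.Terminal:
			// Terminal and unknown locally: nothing to do, skip the pull.
		case !known || u.ModifyIndex > localIndex:
			pull = append(pull, id)
		}
	}
	return pull
}

func main() {
	updates := map[string]allocUpdate{
		"alloc-a": {ModifyIndex: 10, Terminal: true},  // previously GC-ed, terminal: skipped
		"alloc-b": {ModifyIndex: 12, Terminal: false}, // new work for this node: pulled
	}
	fmt.Println(allocsToPull(updates, map[string]uint64{}))
}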

Reproduction steps

Same as in https://github.com/hashicorp/nomad/issues/16381, in particular the part where we observe in the client logs that it pulls an allocation even though both of the allocations it receives are already terminal.

Expected Result

Clients should not re-process allocations which are terminal on restart.

Actual Result

Clients re-process terminal allocations on restart, causing delays in starting new allocations.

tgross commented 5 months ago

Hi @marvinchin, nice to hear from you again!

My understanding is that such allocations need not be pulled, since they are already known to be terminal and there is no local state for them. If this is true, then we can simply update the filtering logic to filter out such allocations, and that should fix the issue.

The catch here is that the client doesn't know an allocation is terminal unless it has already seen it, and on startup it can't tell whether it has seen an allocation before if it has no alloc runner for it. And the server doesn't necessarily know the client lacks an alloc runner for those allocations, because the client could have restarted and come back up without missing a heartbeat.

If we zoom out a bit from your code snippet to client.go#L2361-L2385:

for allocID, modifyIndex := range resp.Allocs {
    // Pull the allocation if we don't have an alloc runner for the
    // allocation or if the alloc runner requires an updated allocation.
    //XXX Part of Client alloc index tracking exp
    c.allocLock.RLock()
    currentAR, ok := c.allocs[allocID]
    c.allocLock.RUnlock()

    // Ignore alloc updates for allocs that are invalid because of initialization errors
    c.invalidAllocsLock.Lock()
    _, isInvalid := c.invalidAllocs[allocID]
    c.invalidAllocsLock.Unlock()

    if (!ok || modifyIndex > currentAR.Alloc().AllocModifyIndex) && !isInvalid {
        // Only pull allocs that are required. Filtered
        // allocs might be at a higher index, so ignore
        // it.
        if modifyIndex > pullIndex {
            pullIndex = modifyIndex
        }
        pull = append(pull, allocID)
    } else {
        filtered[allocID] = struct{}{}
    }
}

The resp variable at the top is a NodeClientAllocsResponse, which contains a map of alloc IDs to modify indexes but not the desired status of the allocation. We could probably add a new field to the Node.GetClientAllocs RPC response carrying the list of stopped allocs. Then watchAllocations can skip pulling those if it doesn't have an alloc runner for that ID.
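
As a rough sketch of what that could look like, assuming a hypothetical StoppedAllocs field on the response (it does not exist today), the client-side check might become something along these lines:

package main

import "fmt"

// Hypothetical response shape: the real NodeClientAllocsResponse only exposes
// the Allocs map of alloc ID -> modify index; StoppedAllocs is an assumed new field.
type clientAllocsResponse struct {
	Allocs        map[string]uint64
	StoppedAllocs map[string]struct{}
}

// shouldPull mirrors the decision made in watchAllocations: pull when the alloc
// is unknown locally or has a newer modify index, except that an alloc reported
// as stopped can be skipped outright when there is no local alloc runner for it.
func shouldPull(resp clientAllocsResponse, id string, localIndex uint64, haveRunner bool) bool {
	if _, stopped := resp.StoppedAllocs[id]; stopped && !haveRunner {
		return false
	}
	return !haveRunner || resp.Allocs[id] > localIndex
}

func main() {
	resp := clientAllocsResponse{
		Allocs:        map[string]uint64{"alloc-a": 10, "alloc-b": 12},
		StoppedAllocs: map[string]struct{}{"alloc-a": {}},
	}
	fmt.Println(shouldPull(resp, "alloc-a", 0, false)) // false: stopped and no alloc runner
	fmt.Println(shouldPull(resp, "alloc-b", 0, false)) // true: needs to be pulled
}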

It might even be possible to leave those stopped allocs out of the .Allocs field entirely, but I suspect that we're going to want to keep those to get updated allocations for shutdown... I'd need to go spelunking a bit to check for that.

marvinchin commented 5 months ago

Likewise :)

The resp variable at the top is a NodeClientAllocsResponse which contains a map of alloc IDs to modify indexes, but not the desired status of the allocation.

Ah, I didn't notice that! Thanks for pointing that out.

It might even be possible to leave those stopped allocs out of the .Allocs field entirely, but I suspect that we're going to want to keep those to get updated allocations for shutdown...

Hm, that's an interesting idea. My immediate thought is that we wouldn't be able to distinguish between an allocation that has been GC-ed from the server and one that has reached a terminal client state. But maybe it doesn't matter because they should be handled the same way anyway? I'm not confident in this at all though!

marvinchin commented 4 months ago

Hi @tgross, we might be able to work on a PR for this issue 😄

Have you had the chance to see whether or not leaving out the stopped allocs in the .Allocs field is a workable approach? Otherwise, I'm tempted to go with the more straightforward approach of adding the desired status of the allocation to the response and filtering based on that. Let me know which you'd prefer!

tgross commented 4 months ago

Hi @marvinchin, that'd be great!

Have you had the chance to see whether or not leaving out the stopped allocs in the .Allocs field is a workable approach?

Yeah, we almost certainly don't want to do that. If the alloc is simply missing from the list, the client uses that as a signal that it should not only stop the alloc but also garbage-collect it client-side (because the server doesn't know about it anymore... this typically happens for clients that have been offline for a while and have reconnected). That would cause allocation logs to be cleaned up immediately after an allocation gets rescheduled, and we don't want that.
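
To make the distinction concrete, here is a purely illustrative sketch (not the actual Nomad client code) of why the two signals need to stay separate: an alloc the server still reports, even as stopped, only needs to be stopped locally, while an alloc missing from the response entirely is also garbage-collected client-side, taking its logs with it.

package main

import "fmt"

// Illustrative only: a hypothetical per-alloc decision for allocs the client
// already has local state for, based on what the server reported.
type action string

const (
	keepRunning action = "keep"           // still desired: keep the alloc runner
	stopOnly    action = "stop"           // reported as stopped: stop it but keep local state/logs
	stopAndGC   action = "stop+client-gc" // missing from the response: remove local state, including logs
)

func decide(reportedByServer, reportedStopped bool) action {
	switch {
	case !reportedByServer:
		return stopAndGC
	case reportedStopped:
		return stopOnly
	default:
		return keepRunning
	}
}

func main() {
	fmt.Println(decide(true, true))   // stop
	fmt.Println(decide(false, false)) // stop+client-gc
}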

marvinchin commented 2 months ago

Sorry for the delay, just wanted to mention that I haven't managed to find the time to work on this yet, in case anyone else is interested in picking this up.

In the meantime we've been working around this by reducing the GC interval to avoid accumulating too many allocations in the state.