Closed jorgemarey closed 8 months ago
Hi @jorgemarey
Thanks for the detailed report. It does seem like something went wrong with your target plugin, as the request for status has been hanging for 16299 minutes. Just out of curiosity, which target plugin is this policy using?
Looking at the code again, it may be better to handle ticks in a goroutine so the policy reload can be handled concurrently. Though in this case it may not solve the problem if the plugin status call hangs again.
Hi @lgfa29 ,
The target that I'm using is a plugin that I developed https://github.com/jorgemarey/nomad-nova-autoscaler
Could it make sense to wrap the timeout used here with another one just for the request, to avoid problems with the plugin?
I should add a timeout here to avoid this, but I think the autoscaler should be the one that terminates the connection if it takes too long, to avoid being blocked.
Ahh I see. A timeout would be helpful, though the question of what "too long" means is never easy to answer.
For target status that's probably easier to define, so maybe it could be a plugin config?
Another thing I would be curious about is whether cancelling that doneCTX actually ends the request, or if plugins still have to handle it somehow. If they're not checking for a done context it may not have much effect.
Another thing I would be curious about is whether cancelling that doneCTX actually ends the request, or if plugins still have to handle it somehow. If they're not checking for a done context it may not have much effect.
That method ends up doing the gRPC call to the plugin, so I guess the gRPC connection is the one that will terminate if the context is canceled (but I'm not sure); the target plugin doesn't receive any context.
Maybe we need to change the methods here to pass that context to the plugin implementation, although that would break compatibility. Maybe we could add another interface (TargetWithContext) that defines the same methods with the addition of the context and check whether or not the plugin is implementing it.
For target status that's probably easier to define, so maybe it could be a plugin config?
We could do that and use the config that the method receives to enforce the timeout, but since the timeout is configured in the nomad-autoscaler code, it should be a common configuration, like the dry-run config.
I could try to make a PR for this. What are your thoughts on the best approach?
Maybe we could add another interface (TargetWithContext) that defines the same methods with the addition of the context and check whether or not the plugin is implementing it.
I have a vague impression that we may have tried this at some point but the context was not being properly serialized over the wire, so external plugins didn't actually receive the context.
The possible added complication is that not all API SDKs support cancellable requests (including Nomad), so cancelling the context may not actually stop anything and you end up with multiple scaling actions in parallel.
But anything other than scale seems fine to cancel if it takes too long. We could even use the evaluation_interval as the timeout, or maybe a fraction of it to make sure the next evaluation is triggered right away (say, 75% of evaluation_interval), to avoid yet another config.
Could you try testing if cancelling the gRPC context unblocks the autoscaler?
Hi @lgfa29. I made the following change on the nomad-autoscaler to test the behaviour (wrapping the context with a hardcoded 10-second timeout):
```go
// Status is the gRPC client implementation of the Target.Status interface
// function.
func (p *pluginClient) Status(config map[string]string) (*sdk.TargetStatus, error) {
	fmt.Println("BEGIN", time.Now())
	defer func() {
		fmt.Println("END", time.Now())
	}()

	ctx, cancel := context.WithTimeout(p.doneCTX, 10*time.Second)
	defer cancel()

	statusResp, err := p.client.Status(ctx, &proto.StatusRequest{Config: config})
	if err != nil {
		return nil, err
	}

	return &sdk.TargetStatus{
		Ready: statusResp.Ready,
		Count: statusResp.Count,
		Meta:  statusResp.Meta,
	}, nil
}
```
And a plugin with this behavior (just wait longer than the timeout):
```go
// Status satisfies the Status function on the target.Target interface.
func (t *TargetPlugin) Status(config map[string]string) (*sdk.TargetStatus, error) {
	time.Sleep(120 * time.Second)
	resp := &sdk.TargetStatus{
		Ready: true,
		Count: 1,
		Meta:  make(map[string]string),
	}
	return resp, nil
}
```
The autoscaler shows this:
```
BEGIN 2023-07-30 00:58:33.842233311 +0200 CEST m=+60.222911053
END 2023-07-30 00:58:43.84329303 +0200 CEST m=+70.223970767
2023-07-30T00:58:43.843+0200 [WARN] policy_manager.policy_handler: failed to get target status: policy_id=8cad1831-72d8-c574-4bd7-8124c1cc037b error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
```
So it seems that cancelling the context finishes the RPC call and unblocks the autoscaler.
I like the idea of providing a fraction of the evaluation_interval as timeout for the Status. The thing is that I don't know how to reach that value in the Status call...
The thing is that I don't know how to reach that value in the Status call...
Oh that's a good point.
The plugin interface was designed to decouple plugins from the policy itself, to avoid plugins implicitly relying on arbitrary values from the policy (like we're trying to do here), which could make documentation and debugging a nightmare.
Passing a context to specific operations, as you first suggested, would probably be the best approach, but that would be a bigger effort to make sure APIs are backwards compatible. I think we would need to define a new interface and do some type assertion whenever we call a plugin method.
For now, a simpler option may be to inject the timeout into the config map and then check for it in the gRPC request. This could be done during policy parsing, perhaps here would be the best place?
https://github.com/hashicorp/nomad-autoscaler/blob/37636ebb0428a63cd8490507c46cde24d904d652/policy/policy.go#L32-L49
One possible challenge would be that we don't limit key patterns (we really should though...), so we need to pick something that is unlikely to clash with actual config values, maybe something like _nomad_autoscaler_grpc_timeout?
We could then start validating that config keys starting with _ are invalid (hopefully nobody is currently declaring any config key like this).
Version: v0.3.7 (90ad44d)
We changed a template in Nomad; Nomad said that it signaled the autoscaler, but in the logs I don't see the autoscaler reloading the configuration.
I sent a SIGABRT to see the goroutines:
I think the problem is that it's blocked here:
Due to (I think):
I don't know why the gRPC transport is blocked, but maybe we could add another context per request to the gRPC calls to avoid these issues?
Thanks.