Closed mohsin36 closed 1 year ago
This is a great question! Ideally, Vespa could do all validations at deploy-time - in practice, this is hard as artifacts used in the Vespa Container can be arbitrarily large, with custom code.
Vespa Cloud solves this with https://cloud.vespa.ai/en/automated-deployments - deployments are run through system and staging tests, that will trap such errors. This way, one can safely deploy - the application owners can easily provide tests that validate not only startup but also complex test cases validating the application logic, before deploying to prod systems.
some changes to doc in https://github.com/vespa-engine/documentation/pull/2687
I think we can do a lot more here and afaik we don't have another issue to cover it.
One option that could help us (as a user) is to have an endpoint on config_server which internally polls /state/v1/config for app package version of all nodes/services and compares it with latest active app package version. This will help us to identify which node/services is still running on older package version. We could also do this manually on our side iteratively on all hosts but having it at config server level will be much more efficient.
Agreed.
This is available today (but undocumented) and does exactly what @mohsin36 suggests.
{
"services": [
... per service info omitted
],
"url": "https://localhost:19071/application/v2/tenant/default/application/default/environment/prod/region/default/instance/default/serviceconverge",
"currentGeneration": 2,
"wantedGeneration": 2,
"converged": true
}
Great. Maybe we could document this path for now, it'll be a while until we create a proper public cloud control plane API.
all right, I'll do that shortly
Thanks! We could also use this to create a convergence check command in vespa-cli @mpolden
This is already provided by the CLI. If you use vespa deploy --target local --wait <secs>
, CLI uses the API mentioned by @hmusum to wait for a successful deployment.
If you only want to verify an existing deployment, you can use vespa status --target local --wait <secs>
.
Great, vespa deploy does a lot more than before and I think that solves the problem for cloud deployments.
However, I don't think it allows you to verify an existing deployment since vespa status
returns success also if it's the previous package revision that is active.
However, I don't think it allows you to verify an existing deployment since vespa status returns success also if it's the previous package revision that is active.
Yes, but if you specify --wait N
it will always wait N seconds for the convergence API to report success before checking if the endpoint is up. If convergence API does not report a success within the timeout, an error is returned. The intention is that --wait
always does what you expect, regardless of target type.
$ vespa status --wait 5 --target local; echo $?
Waiting up to 5s for query service to become available ...
Error: service 'query' is unavailable: Get "http://127.0.0.1:19071/application/v2/tenant/default/application/default/environment/prod/region/default/instance/default/serviceconverge": dial tcp 127.0.0.1:19071: connect: connection refused
1
Yes, with a self-hosted deployment and deploying a non-working app over a working one:
zsh$ vespa status
Container (query API) at http://127.0.0.1:8080 is ready
zsh$ vespa status --wait 5
Waiting up to 5s for query service to become available ...
Error: service 'query' is unavailable: services have not converged
With cloud deployments behavior both will succeed in this scenario.
I think the behavior with --wait
on self-hosted is just wrong here - adding a wait time should not change the semantics of the operation, and we should instead add a new operation - vespa convergence? - to check for this, and have it work the same self-hosted and on cloud.
We should also have and document the convergence path (/application/v[2|4]/tenant/default/application/default/environment/prod/region/default/instance/default/serviceconverge) both for self-hosted and cloud.
Ok, instead of adding a new command for every little thing, how about we make status
more useful? Example:
$ vespa status --target local
Deployment [converged|not converged] on generation N
Endpoint at http://127.0.0.1:8080 is ready
$ vespa status --target cloud
Deployment [completed|running job N (console link to run)]
Endpoint at https://foo.vespa-app.cloud is ready
With cloud target it should list all available endpoints, thus fixing #26605.
I think --wait
should keep waiting for convergence+endpoint availability and that this should be fixed for cloud. If you think waiting separately should be possible we could add the options --deployment
and --endpoint
to status
that constrains what to show/wait for.
I really don't agree that adding --wait should fundamentally change what this command is doing, from returning whether some service is available (as is the documented functionality) to whether it has converged on some specific version.
Apart from that, having more output in status is fine, but I think there should be a specific, consistent criterion for returning success vs. failure so that it can also be used for automation.
If we allow for that then I think we need another command to check a different condition?
I really don't agree that adding --wait should fundamentally change what this command is doing, from returning whether some service is available (as is the documented functionality) to whether it has converged on some specific version.
The reason it works like this is because it's the behavior you want for vespa deploy
. Since --wait
is a global flag, status
inherits the same behavior.
Documentation for --wait
says "Number of seconds to wait for a service to become ready", where "ready" includes converged.
If we allow for that then I think we need another command to check a different condition?
We can just return success based on what you're checking:
vespa status
exits with 0 if both deployment has converged and endpoint(s) are upvespa status --deployment
exits with 0 if deployment has convergedvespa status --endpoint
exits with 0 if endpoint(s) are up.In practice I think most people want to use the first one.
I suggest:
vespa status
exists with 0 if the service is ready for use (which does not include convergence).
But the text can say that it has not converged to the latest version when appropriate (I suggest a yellow "Not converged:" followed by more details).
vespa convergence
exits with 0 if the service has converged on the latest version.
Documentation improved in https://github.com/vespa-engine/documentation/pull/2801.
https://github.com/vespa-engine/vespa/issues/27884 will add more improvements, closing this ticket. Thanks for reporting!
Next version of CLI adds the vespa status deployment
command which can be used to verify status of deployment.
Describe the bug Currently application package is only deployed to config_server and it returns successful response. When we deployed a package with a missing model file config server returned success however update crashed on all containers nodes with ONNX runtime error.
To Reproduce Delete contents of a model file to make it 0 bytes and then prepare/activate the package.
Expected behavior Expect to catch the failure with validation API at cluster level. This API should should return success/failure with nodes that failed to update to latest package.
Screenshots If applicable, add screenshots to help explain your problem.
Environment (please complete the following information):
Vespa version 8.137.25
Additional context Add any other context about the problem here.