No way to validate if application package deployment is really successful

mohsin36 commented 1 year ago

Describe the bug Currently application package is only deployed to config_server and it returns successful response. When we deployed a package with a missing model file config server returned success however update crashed on all containers nodes with ONNX runtime error.

To Reproduce Delete contents of a model file to make it 0 bytes and then prepare/activate the package.

Expected behavior Expect to catch the failure with validation API at cluster level. This API should should return success/failure with nodes that failed to update to latest package.

Screenshots If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

OS: "Red Hat Enterprise Linux"
Infrastructure: self-hosted using podman/docker
Versions: "8.6 (Ootpa)"

Vespa version 8.137.25

Additional context Add any other context about the problem here.

kkraune commented 1 year ago

This is a great question! Ideally, Vespa could do all validations at deploy-time - in practice, this is hard as artifacts used in the Vespa Container can be arbitrarily large, with custom code.

Vespa Cloud solves this with https://cloud.vespa.ai/en/automated-deployments - deployments are run through system and staging tests, that will trap such errors. This way, one can safely deploy - the application owners can easily provide tests that validate not only startup but also complex test cases validating the application logic, before deploying to prod systems.

kkraune commented 1 year ago

some changes to doc in https://github.com/vespa-engine/documentation/pull/2687

bratseth commented 1 year ago

I think we can do a lot more here and afaik we don't have another issue to cover it.

mohsin36 commented 1 year ago

One option that could help us (as a user) is to have an endpoint on config_server which internally polls /state/v1/config for app package version of all nodes/services and compares it with latest active app package version. This will help us to identify which node/services is still running on older package version. We could also do this manually on our side iteratively on all hosts but having it at config server level will be much more efficient.

bratseth commented 1 year ago

Agreed.

hmusum commented 1 year ago

This is available today (but undocumented) and does exactly what @mohsin36 suggests.

Example: https://localhost:19071/application/v2/tenant/default/application/default/environment/prod/region/default/instance/default/serviceconverge | jq

{
  "services": [
    ... per service info omitted
  ],
  "url": "https://localhost:19071/application/v2/tenant/default/application/default/environment/prod/region/default/instance/default/serviceconverge",
  "currentGeneration": 2,
  "wantedGeneration": 2,
  "converged": true
}

bratseth commented 1 year ago

Great. Maybe we could document this path for now, it'll be a while until we create a proper public cloud control plane API.

kkraune commented 1 year ago

all right, I'll do that shortly

bratseth commented 1 year ago

Thanks! We could also use this to create a convergence check command in vespa-cli @mpolden

mpolden commented 1 year ago

This is already provided by the CLI. If you use vespa deploy --target local --wait <secs>, CLI uses the API mentioned by @hmusum to wait for a successful deployment.

If you only want to verify an existing deployment, you can use vespa status --target local --wait <secs>.

bratseth commented 1 year ago

Great, vespa deploy does a lot more than before and I think that solves the problem for cloud deployments.

However, I don't think it allows you to verify an existing deployment since vespa status returns success also if it's the previous package revision that is active.

mpolden commented 1 year ago

However, I don't think it allows you to verify an existing deployment since vespa status returns success also if it's the previous package revision that is active.

Yes, but if you specify --wait N it will always wait N seconds for the convergence API to report success before checking if the endpoint is up. If convergence API does not report a success within the timeout, an error is returned. The intention is that --wait always does what you expect, regardless of target type.

$ vespa status --wait 5 --target local; echo $?
Waiting up to 5s for query service to become available ...
Error: service 'query' is unavailable: Get "http://127.0.0.1:19071/application/v2/tenant/default/application/default/environment/prod/region/default/instance/default/serviceconverge": dial tcp 127.0.0.1:19071: connect: connection refused
1

bratseth commented 1 year ago

Yes, with a self-hosted deployment and deploying a non-working app over a working one:

zsh$ vespa status         
Container (query API) at http://127.0.0.1:8080 is ready
zsh$ vespa status --wait 5
Waiting up to 5s for query service to become available ...
Error: service 'query' is unavailable: services have not converged

With cloud deployments behavior both will succeed in this scenario.

I think the behavior with --wait on self-hosted is just wrong here - adding a wait time should not change the semantics of the operation, and we should instead add a new operation - vespa convergence? - to check for this, and have it work the same self-hosted and on cloud.

We should also have and document the convergence path (/application/v[2|4]/tenant/default/application/default/environment/prod/region/default/instance/default/serviceconverge) both for self-hosted and cloud.

mpolden commented 1 year ago

Ok, instead of adding a new command for every little thing, how about we make status more useful? Example:

$ vespa status --target local
Deployment [converged|not converged] on generation N
Endpoint at http://127.0.0.1:8080 is ready

$ vespa status --target cloud
Deployment [completed|running job N (console link to run)]
Endpoint at https://foo.vespa-app.cloud is ready

With cloud target it should list all available endpoints, thus fixing #26605.

I think --wait should keep waiting for convergence+endpoint availability and that this should be fixed for cloud. If you think waiting separately should be possible we could add the options --deployment and --endpoint to status that constrains what to show/wait for.

bratseth commented 1 year ago

I really don't agree that adding --wait should fundamentally change what this command is doing, from returning whether some service is available (as is the documented functionality) to whether it has converged on some specific version.

Apart from that, having more output in status is fine, but I think there should be a specific, consistent criterion for returning success vs. failure so that it can also be used for automation.

If we allow for that then I think we need another command to check a different condition?

mpolden commented 1 year ago

I really don't agree that adding --wait should fundamentally change what this command is doing, from returning whether some service is available (as is the documented functionality) to whether it has converged on some specific version.

The reason it works like this is because it's the behavior you want for vespa deploy. Since --wait is a global flag, status inherits the same behavior.

Documentation for --wait says "Number of seconds to wait for a service to become ready", where "ready" includes converged.

If we allow for that then I think we need another command to check a different condition?

We can just return success based on what you're checking:

vespa status exits with 0 if both deployment has converged and endpoint(s) are up
vespa status --deployment exits with 0 if deployment has converged
vespa status --endpoint exits with 0 if endpoint(s) are up.

In practice I think most people want to use the first one.

bratseth commented 1 year ago

I suggest:

vespa status exists with 0 if the service is ready for use (which does not include convergence). But the text can say that it has not converged to the latest version when appropriate (I suggest a yellow "Not converged:" followed by more details).

vespa convergence exits with 0 if the service has converged on the latest version.

kkraune commented 1 year ago

Documentation improved in https://github.com/vespa-engine/documentation/pull/2801.

https://github.com/vespa-engine/vespa/issues/27884 will add more improvements, closing this ticket. Thanks for reporting!

mpolden commented 1 year ago

Next version of CLI adds the vespa status deployment command which can be used to verify status of deployment.

vespa-engine / vespa

No way to validate if application package deployment is really successful #27208