vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

configserver not working after pod was restarted #24333

Closed shubh9194 closed 1 year ago

shubh9194 commented 1 year ago

Describe the bug We have a Vespa setup in Kubernetes with one config node and multiple container and content nodes. Our configserver pod was restarted, after which we started seeing the following error message when trying to access the cluster controller:

    upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111

In the logs, we see this error:

    configproxy configproxy.com.yahoo.config.subscription.impl.JRTConfigRequester Request callback failed: APPLICATION_NOT_LOADED. Connection spec: tcp/vespa-1016-pvc-new-0.vespa-1016-pvc-new-internal.d190259.svc.cluster.local:19070, error message: Failed request (No application exists) from Connection { Socket[addr=/11.3.12.78,port=37864,localport=19070] }

Any container that was restarted after the config node restart gave the same error when running a search query.

Also, a container that was not restarted reports a different generation:

    curl -s http://localhost:8080/state/v1/config
    { "config" : { "generation" : 11, "container" : { "generation" : 11 } } }

The configserver, after its pod was restarted, reports generation 0 instead of the generation 11 still used by the pods that were not restarted:

    curl -s http://localhost:19071/state/v1/config
    { "config" : { "generation" : 0, "container" : { "generation" : 0 } } }
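The generation mismatch described above can be checked mechanically. The following is a minimal sketch, assuming the two response bodies shown in this report; the helper names `config_generation` and `generations_agree` are hypothetical and not part of any Vespa tooling.

```python
import json


def config_generation(state_json):
    """Return the top-level config generation from a /state/v1/config body."""
    return json.loads(state_json)["config"]["generation"]


def generations_agree(bodies):
    """True if every response body reports the same config generation."""
    return len({config_generation(body) for body in bodies.values()}) == 1


# Response bodies copied from the two curl calls in this report.
container_body = '{ "config" : { "generation" : 11, "container" : { "generation" : 11 } } }'
configserver_body = '{ "config" : { "generation" : 0, "container" : { "generation" : 0 } } }'

responses = {"container": container_body, "configserver": configserver_body}
for name, body in responses.items():
    print(name, config_generation(body))
print("in sync:", generations_agree(responses))
```

In a live cluster the bodies would come from the `curl` calls above rather than from string literals.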

To Reproduce Steps to reproduce the behavior:

  1. Set up the cluster in Kubernetes
  2. Restart the pod running the config server
  3. See error

Expected behavior The cluster should have returned to its previous state after the pod was restarted.


Environment (please complete the following information):

Vespa version 7.589


jobergum commented 1 year ago

You have lost the storage of the configuration server: the freshly started configuration server came up on empty storage, and no application has been deployed to it. In short, that is what this means:

APPLICATION_NOT_LOADED.

The sentinel asks the configuration server, which answers that there is no application.
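If the storage really is empty, the application has to be deployed again. Here is a minimal sketch of building such a request, assuming the config server's deploy REST API (`/application/v2/tenant/default/prepareandactivate`) is reachable on port 19071 and that the application package has been zipped. The helper name `build_deploy_request` and the placeholder bytes are hypothetical, and the request is only constructed here, not sent.

```python
import urllib.request


def build_deploy_request(zip_bytes, configserver="http://localhost:19071", tenant="default"):
    """Build (but do not send) a POST that prepares and activates an application package.

    Assumption: the config server exposes the application/v2 deploy API at this path.
    """
    url = f"{configserver}/application/v2/tenant/{tenant}/prepareandactivate"
    return urllib.request.Request(
        url,
        data=zip_bytes,
        headers={"Content-Type": "application/zip"},
        method="POST",
    )


# Placeholder bytes; in practice read the zipped application package from disk.
request = build_deploy_request(b"placeholder-zip-bytes")
print(request.get_method(), request.full_url)

# To actually deploy, send it with: urllib.request.urlopen(request)
```

After a successful deploy, the `/state/v1/config` generation on the config server should no longer be 0.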

shubh9194 commented 1 year ago

We are using NAS storage and mount /opt/vespa/var/db on the NAS. Is there anything else we need to do? How do we make sure the configserver comes up with the application loaded whenever the pod is restarted? What is the expected behaviour of the application deploy status when all ZooKeeper nodes are shut down and then restored without disk failure?