vespa-engine / vespa

AI + Data, online. https://vespa.ai

Config server zookeeper and prepare application issue #22315

Closed nehajatav closed 2 years ago

nehajatav commented 2 years ago

Describe the bug
Symptom #1: Preparing session 3 using http://vespa-0.vespa-internal.mynamespace.svc.cluster.local:19071/application/v2/tenant/default/session/3/prepared fails: Request failed. HTTP status code: 504. Empty response. Command terminated with exit code 1.
Symptom #2: ZooKeeper logs suggest the ZooKeeper server is not running.

Related issues:
1) ZooKeeper warnings suggesting ZK is not up: https://github.com/vespa-engine/vespa/issues/21946
2) Since the configuration is not successfully deployed, are the config proxy and config sentinel expected to give errors like this? https://github.com/vespa-engine/vespa/issues/22291

To Reproduce
This happens consistently on certain k8s clusters, but doesn't happen on others with the same application.zip and the same cluster configuration. Maybe this can be reproduced if the ZK part of the config server is intentionally brought down?
Command: /opt/vespa/bin/vespa-deploy prepare /tmp/application.zip

Expected behavior
1) Application deploy should be successful
2) ZooKeeper should be up

Environment (please complete the following information):

Vespa version: 7.559.12, compiled with go1.16.13 on linux/amd64

Troubleshooting Information
Command: /opt/vespa/bin/vespa-deploy prepare /tmp/application.zip
Output:
Uploading application '/tmp/application.zip' using http://vespa-0.vespa-internal.mynamespace.svc.cluster.local:19071/application/v2/tenant/default/session
Session 3 for tenant 'default' created.
Preparing session 3 using http://vespa-0.vespa-internal.mynamespace.svc.cluster.local:19071/application/v2/tenant/default/session/3/prepared
Request failed. HTTP status code: 504
Empty response
command terminated with exit code 1

Command: /opt/vespa/bin/vespa-deploy -v -t 1800 prepare /tmp/application.zip
Output: # Prepare exited without a response
Uploading application '/tmp/application.zip' using http://vespa-0.vespa-internal.mynamespace.svc.cluster.local:19071/application/v2/tenant/default/session?verbose=true
{"log":[],"tenant":"default","session-id":"8","prepared":"http://vespa-0.vespa-internal.mynamespace.svc.cluster.local:19071/application/v2/tenant/default/session/8/prepared","content":"http://vespa-0.vespa-internal.mynamespace.svc.cluster.local:19071/application/v2/tenant/default/session/8/content/","message":"Session 8 for tenant 'default' created."}
Session 8 for tenant 'default' created.
Using config server URL http://vespa-0.vespa-internal.mynamespace.svc.cluster.local:19071 read from file
Preparing session 8 using http://vespa-0.vespa-internal.mynamespace.svc.cluster.local:19071/application/v2/tenant/default/session/8/prepared?verbose=true&timeout=1800

-bash-4.4$ /opt/vespa/bin/vespa-deploy -v -t 1800 activate
Using config server URL http://vespa-0.vespa-internal.mynamespace.svc.cluster.local:19071 read from file
Activating session 8 using http://vespa-0.vespa-internal.mynamespace.svc.cluster.local:19071/application/v2/tenant/default/session/8/active?verbose=true&timeout=1800
Request failed. HTTP status code: 400
{"error-code":"BAD_REQUEST","message":"Session 8 is not prepared"}
Session 8 is not prepared
command terminated with exit code 1
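
For completeness, the same prepare/activate steps can also be driven directly against the deploy REST API with curl, which exposes the raw HTTP status and response body when vespa-deploy prints nothing (session id 8 and the 1800 s timeout are reused from the run above; endpoints as documented for the deploy API v2):

```
# Prepare session 8 directly; -v shows the status line and headers even on an empty body
curl -v -X PUT \
  "http://vespa-0.vespa-internal.mynamespace.svc.cluster.local:19071/application/v2/tenant/default/session/8/prepared?timeout=1800"

# If prepare returns 200, activate the same session
curl -v -X PUT \
  "http://vespa-0.vespa-internal.mynamespace.svc.cluster.local:19071/application/v2/tenant/default/session/8/active"
```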

bash-4.2$ curl http://localhost:19071/state/v1/health
{ "time" : 1651058362751, "status" : { "code" : "up" }, "metrics" : { "snapshot" : { "from" : 1.651058302018E9, "to" : 1.651058362018E9 }, "values" : [ { "name" : "requestsPerSecond", "values" : { "count" : 4, "rate" : 0.06666666666666667 } }, { "name" : "latencySeconds", "values" : { "average" : 7.5E-4, "sum" : 0.003, "count" : 4, "last" : 0.0, "max" : 22.388, "min" : 0.0, "rate" : 0.06666666666666667 } } ] } }
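
The config server container's /ApplicationStatus handler (referred to later in this thread) can be probed the same way for more detail than /state/v1/health; a sketch, with head only there to keep the output short:

```
curl -s http://localhost:19071/ApplicationStatus | head -c 1000
```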

Zookeeper logs: 1651067470.226501 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/32 - .org.apache.zookeeper.ClientCnxnSocketNetty warning future isn't success.\nexception=\nio.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: vespa-0.vespa-internal.mynamespace.svc.cluster.local/11.1.3.228:2181\nCaused by: java.net.ConnectException: Connection refused\n\tat java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)\n\tat java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)\n\tat io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)\n\tat io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:707)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\n

1651050081.939925 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/40 - .org.apache.zookeeper.server.NettyServerCnxnFactory warning Zookeeper server is not running, close the connection before starting the TLS handshake 1651050083.433324 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/28 - .org.apache.zookeeper.ClientCnxn warning Session 0x0 for sever vespa-0.vespa-internal.mynamespace.svc.cluster.local/:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.\nexception=\nEndOfStreamException: channel for sessionid 0x0 is lost\n\tat org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:285)\n\tat org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1280)\n

Configserver logs: bash-4.2$ /opt/vespa/bin/vespa-logfmt -l error,fatal,warning|grep -v configproxy|grep -v sentinel

[2022-04-27 07:43:26.535] ERROR : configserver Container.com.yahoo.protect.Process Timed out waiting for application shutdown. Please check that all your request handlers drain their request content channels.
[2022-04-27 09:02:53.880] WARNING : configserver Container.com.yahoo.vespa.config.server.monitoring.ZKMetricUpdater Failure in retrieving monitoring data: (SocketTimeoutException) Read timed out

Other references / interpretation: https://docs.vespa.ai/en/operations/configuration-server.html#zookeeper-barrier-timeout - I believe this property is not significant since we have a single ZK node?

Config server access logs - only two types of entries are seen:
{"ip":"","time":1651061644.575,"duration":22.854,"responsesize":401,"requestsize":3949418,"code":200,"method":"POST","uri":"/application/v2/tenant/default/session","version":"HTTP/1.1","agent":"vespa-deploy","host":"vespa-0.vespa-internal.mynamespace.svc.cluster.local:19071","scheme":"http","localport":19071,"peeraddr":"","peerport":44548}
{"ip":"11.1.2.1","time":1651061680.375,"duration":0.001,"responsesize":598,"requestsize":0,"code":200,"method":"GET","uri":"/state/v1/health","version":"HTTP/1.1","agent":"kube-probe/1.21","host":":19071","scheme":"http","localport":19071,"peeraddr":"11.1.2.1","peerport":53894}

Resources already referred to:
https://docs.vespa.ai/en/cloudconfig/application-packages.html#file-distribution
https://docs.vespa.ai/en/operations/configuration-server.html
https://github.com/vespa-engine/sample-apps/tree/master/examples/operations/multinode
https://github.com/vespa-engine/sample-apps/blob/master/examples/operations/README.md#troubleshooting - already tried to set service spec.publishNotReadyAddresses: true

Prepare Troubleshooting: https://docs.vespa.ai/en/reference/application-packages-reference.html

/opt/vespa/bin/vespa-deploy prepare /tmp/application1.zip && /opt/vespa/bin/vespa-deploy activate
Uploading application '/tmp/application1.zip' using http://vespa-0.vespa-internal.mynamespace.svc.cluster.local:19071/application/v2/tenant/default/session
Session 5 for tenant 'default' created.
Preparing session 5 using http://vespa-0.vespa-internal.mynamespace.svc.cluster.local:19071/application/v2/tenant/default/session/5/prepared
Request failed. HTTP status code: 400
Invalid application package: Error loading default.default: For schema 'aycolexdocumentslexDocument', field 'downloadurl_s_attr': For expression 'attribute downloadurl_s_attr': Can not assign Array to attribute 'downloadurl_s_attr' which is Array.
command terminated with exit code 1

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[main-SendThread(localhost:2181)] WARN org.apache.zookeeper.ClientCnxn - An exception was thrown while closing send thread for session 0x1d23149c0009.
EndOfStreamException: channel for sessionid 0x1d23149c0009 is lost
	at org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:285)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1280)

WATCHER::

WatchedEvent state:Closed type:None path:null
[main] ERROR org.apache.zookeeper.util.ServiceUtils - Exiting JVM with code 0
bash-4.2$

bash-4.2$ /opt/vespa/bin/vespa-zkcli ls /config/v2/tenants/default/sessions/2/userapp
Connecting to localhost:2181

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
Node does not exist: /config/v2/tenants/default/sessions/2/userapp
[main] ERROR org.apache.zookeeper.util.ServiceUtils - Exiting JVM with code 1

Additional context
Config sentinel logs (I think these are expected):
[2022-04-27 11:23:19.464] WARNING : config-sentinel sentinel.config.frt.frtconnection Connection to tcp/localhost:19090 failed or timed out
[2022-04-27 11:23:19.608] WARNING : config-sentinel sentinel.config.frt.frtconnection FRT Connection tcp/localhost:19090 suspended until 2022-04-27 11:28:19.608 UTC
[2022-04-27 11:23:28.553] WARNING : config-sentinel sentinel.config-sentinel Timeout getting config, please check your setup. Will exit and restart: Timed out while subscribing to 'cloud.config.sentinel', configid 'hosts/vespa-0.vespa-internal.mynamespace.svc.cluster.local'

Configproxy logs: [2022-04-27 05:29:06.823] WARNING : configproxy configproxy.com.yahoo.config.subscription.impl.JRTConfigRequester Request callback failed: APPLICATION_NOT_LOADED. Connection spec: tcp/vespa-0.vespa-internal.mynamespace.svc.cluster.local:19070, error message: Failed request (No application exists) from Connection { Socket[addr=/,port=40026,localport=19070] } [2022-04-27 05:29:55.911] WARNING : configproxy stderr com.yahoo.config.subscription.ConfigInterruptedException: java.lang.InterruptedException: sleep interrupted [2022-04-27 05:29:56.064] WARNING : configproxy stderr \tat com.yahoo.config.subscription.impl.JRTConfigSubscription.subscribe(JRTConfigSubscription.java:159) [2022-04-27 05:29:56.265] WARNING : configproxy stderr \tat com.yahoo.config.subscription.ConfigSubscriber.subscribeAndHandleErrors(ConfigSubscriber.java:130) [2022-04-27 05:29:56.443] WARNING : configproxy stderr \tat com.yahoo.config.subscription.impl.GenericConfigSubscriber.subscribe(GenericConfigSubscriber.java:46) [2022-04-27 05:29:56.596] WARNING : configproxy stderr \tat com.yahoo.vespa.config.proxy.Subscriber.subscribe(Subscriber.java:36) [2022-04-27 05:29:56.749] WARNING : configproxy stderr \tat com.yahoo.vespa.config.proxy.RpcConfigSourceClient.subscribeToConfig(RpcConfigSourceClient.java:155) [2022-04-27 05:29:56.903] WARNING : configproxy stderr \tat com.yahoo.vespa.config.proxy.RpcConfigSourceClient.getConfig(RpcConfigSourceClient.java:142) [2022-04-27 05:29:57.056] WARNING : configproxy stderr \tat com.yahoo.vespa.config.proxy.ProxyServer.resolveConfig(ProxyServer.java:79) [2022-04-27 05:29:57.208] WARNING : configproxy stderr \tat com.yahoo.vespa.config.proxy.ConfigProxyRpcServer.getConfigImpl(ConfigProxyRpcServer.java:258) [2022-04-27 05:29:57.361] WARNING : configproxy stderr \tat com.yahoo.vespa.config.proxy.ConfigProxyRpcServer.lambda$getConfigV3$1(ConfigProxyRpcServer.java:128) [2022-04-27 05:29:57.516] WARNING : configproxy stderr \tat com.yahoo.vespa.config.proxy.ConfigProxyRpcServer.lambda$dispatchRpcRequest$13(ConfigProxyRpcServer.java:230) [2022-04-27 05:29:57.833] WARNING : configproxy stderr \tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [2022-04-27 05:29:57.986] WARNING : configproxy stderr \tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [2022-04-27 05:29:58.139] WARNING : configproxy stderr \tat java.base/java.lang.Thread.run(Thread.java:829) [2022-04-27 05:29:58.292] WARNING : configproxy stderr Caused by: java.lang.InterruptedException: sleep interrupted [2022-04-27 05:29:58.445] WARNING : configproxy stderr \tat java.base/java.lang.Thread.sleep(Native Method) [2022-04-27 05:29:58.598] WARNING : configproxy stderr \tat com.yahoo.config.subscription.impl.JRTConfigSubscription.subscribe(JRTConfigSubscription.java:157) [2022-04-27 05:29:58.766] WARNING : configproxy stderr \t... 12 more

nehajatav commented 2 years ago

Good ZK logs: 1651049424.222358 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/1 - .org.apache.zookeeper.ZooKeeper info Initiating client connection, connectString=vespa-0.vespa-internal.mynamespace.svc.cluster.local:2181 sessionTimeout=120000 watcher=org.apache.curator.ConnectionState@6cd7dc74 1651049424.442502 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/24 - .org.apache.zookeeper.server.NettyServerCnxnFactory info binding to port /0.0.0.0:2181 1651049424.477040 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/24 - .org.apache.zookeeper.server.NettyServerCnxnFactory info bound to port 2181 1651049425.419851 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/27 - .org.apache.zookeeper.ClientCnxn info Opening socket connection to server vespa-0.vespa-internal.mynamespace.svc.cluster.local/:2181. 1651049425.423310 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/32 - .org.apache.zookeeper.ClientCnxn info Socket connection established, initiating session, client: /:57002, server: vespa-0.vespa-internal.mynamespace.svc.cluster.local/:2181 1651049425.432135 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/32 - .org.apache.zookeeper.ClientCnxnSocketNetty info channel is connected: [id: 0x1ce60929, L:/:57002 - R:vespa-0.vespa-internal.mynamespace.svc.cluster.local/:2181] 1651049425.487202 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/32 - .org.apache.zookeeper.ClientCnxn info Session establishment complete on server vespa-0.vespa-internal.mynamespace.svc.cluster.local/:2181, session id = 0x692f82da00000, negotiated timeout = 120000 ... 1651073044.459850 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/24 - .org.apache.zookeeper.server.NettyServerCnxnFactory info configure org.apache.zookeeper.server.VespaNettyServerCnxnFactory@14a3524e secure: false on addr /0.0.0.0:2181

1651049424.749450 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/0 - .org.apache.zookeeper.server.quorum.QuorumPeer info LEADING 1651049424.785559 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/0 - .org.apache.zookeeper.server.quorum.Leader info LEADING - LEADER ELECTION TOOK - 242 MS

Bad Zk logs: 1651073044.459850 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/24 - .org.apache.zookeeper.server.NettyServerCnxnFactory info configure org.apache.zookeeper.server.VespaNettyServerCnxnFactory@14a3524e secure: false on addr /0.0.0.0:2181 1651073059.514803 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/24 - .org.apache.zookeeper.server.NettyServerCnxnFactory info binding to port /0.0.0.0:2181 1651073059.585957 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/28 - .org.apache.zookeeper.ClientCnxn info Opening socket connection to server vespa-0.vespa-internal.mynamespace.svc.cluster.local/:2181. 1651073059.708349 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/24 - .org.apache.zookeeper.server.NettyServerCnxnFactory info bound to port 2181 1651073060.130911 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/31 - .org.apache.zookeeper.ClientCnxn info Socket connection established, initiating session, client: /:44040, server: vespa-0.vespa-internal.mynamespace.svc.cluster.local/:2181 1651073060.616089 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/31 - .org.apache.zookeeper.ClientCnxnSocketNetty info channel is connected: [id: 0x685c321b, L:/:44040 - R:vespa-0.vespa-internal.mynamespace.svc.cluster.local/:2181] ... 1651073061.011869 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/31 - .org.apache.zookeeper.ClientCnxnSocketNetty info channel is disconnected: [id: 0x685c321b, L:/:44040 ! R:vespa-0.vespa-internal.mynamespace.svc.cluster.local/:2181] 1651073061.802125 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/28 - .org.apache.zookeeper.ClientCnxn warning Session 0x0 for sever vespa-0.vespa-internal.mynamespace.svc.cluster.local/:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.\nexception=\nEndOfStreamException: channel for sessionid 0x0 is lost\n\tat org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:285)\n\tat org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1280)\n ... </this happens a few times> 1651073069.959123 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/28 - .org.apache.zookeeper.ClientCnxn info Opening socket connection to server vespa-0.vespa-internal.mynamespace.svc.cluster.local/:2181. 1651073070.112922 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/31 - .org.apache.zookeeper.ClientCnxn info Socket connection established, initiating session, client: /:44302, server: vespa-0.vespa-internal.mynamespace.svc.cluster.local/:2181 1651073070.189050 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/31 - .org.apache.zookeeper.ClientCnxnSocketNetty info channel is connected: [id: 0x0949719b, L:/:44302 - R:vespa-0.vespa-internal.mynamespace.svc.cluster.local/:2181] 1651073070.549492 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/31 - .org.apache.zookeeper.ClientCnxn info Session establishment complete on server vespa-0.vespa-internal.mynamespace.svc.cluster.local/:2181, session id = 0x1dd558380000, negotiated timeout = 120000

1651073063.352647 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/0 - .org.apache.zookeeper.server.quorum.QuorumPeer info LEADING 1651073068.108425 vespa-0.vespa-internal.mynamespace.svc.cluster.local 440/0 - .org.apache.zookeeper.server.quorum.Leader info LEADING - LEADER ELECTION TOOK - 5673 MS

kkraune commented 2 years ago

Thanks for a detailed report!

Invalid application package: Error loading default.default: For schema 'aycolexdocumentslexDocument', field 'downloadurl_s_attr': For expression 'attribute downloadurl_s_attr': Can not assign Array to attribute 'downloadurl_s_attr' which is Array.

This is a problem you must resolve in the schema before you run deploy prepare (I don't have access to this schema, please provide the relevant snippet).

The second problem seems to be the ZooKeeper cluster that runs inside the config servers. As a workaround, you can run with one config server to see if that resolves the problem - then work on the actual problem, maybe based on https://kubernetes.io/docs/tutorials/stateful-application/zookeeper/ . The Vespa Team does not deploy on Kubernetes; we are grateful for contributions to improve this if you are able to work it out.

nehajatav commented 2 years ago

@kkraune regarding the below, as mentioned in my issue, this was an experiment to understand which step of prepare was timing out. Of course the schema used there is NOT the schema we are finally trying with.

This is a problem you must resolve in the schema before you deploy prepare

nehajatav commented 2 years ago

you can run with one config server

The issue reported above is already with a single config server.

nehajatav commented 2 years ago

So I think in some cases ZooKeeper takes several minutes to form a cluster, but it seems the config server gives up on it by that time? Is that possible? Which part of the code should we look into to see the config server's ZooKeeper interactions?

bash-4.2$ echo "Are you ok? $(echo ruok | nc 127.0.0.1 2181)" Are you ok? imok bash-4.2$ echo srvr | nc localhost 2181 | grep Mode Mode: leader bash-4.2$ echo srvr | nc localhost 2181 Zookeeper version: 3.7.0-e3704b390a6697bfdf4b0bef79e3da7a4f6bac4b, built on 2021-03-17 09:46 UTC Latency min/avg/max: 0/0.3389/261 Received: 11975 Sent: 13868 Connections: 2 Outstanding: 0 Zxid: 0x1000011a0 Mode: leader Node count: 41 Proposal sizes last/min/max: 145/36/459 bash-4.2$

kkraune commented 2 years ago

So let me summarize: you are running a Vespa cluster with one config server, and the config server does not start successfully.

Please run https://docs.vespa.ai/en/vespa-quick-start-kubernetes.html (but using your environment) to make sure your environment is right (it says 8G memory, but there could be other restrictions, like process limits).
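
A minimal sketch of environment checks that can be run inside the pod on both a failing and a working cluster to spot resource differences (the cgroup v1 path is an assumption; adjust for cgroup v2):

```
ulimit -a                                          # process/thread/file limits seen by Vespa
nproc                                              # CPUs visible to the pod
cat /sys/fs/cgroup/memory/memory.limit_in_bytes    # pod memory limit (cgroup v1)
free -g                                            # memory as reported by the node
```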

kkraune commented 2 years ago

Also try https://github.com/vespa-engine/sample-apps/tree/master/examples/operations/basic-search-on-gke

nehajatav commented 2 years ago

so let me summarize: You are running a Vespa cluster with one config server, and the config server does not start successfully.

The config server came up just fine - you can see the output of ApplicationStatus in the issue details. The only difference I see is a delayed ZooKeeper startup in the case where the application deploy times out.

kkraune commented 2 years ago

https://docs.vespa.ai/en/operations/configuration-server.html#start-sequence is updated, please give it a go.

Summary: your single config server starts up just fine, but deploy times out? Please post the exact steps you run, the value of VESPA_CONFIGSERVERS, and services.xml/hosts.xml.

nehajatav commented 2 years ago

Thanks @kkraune. The summary is accurate. Also, the same docker image and application.zip works perfectly on some of our k8s clusters but doesn't work on others. Providing the details you asked for below.

Steps run:
/opt/vespa/bin/vespa-start-configserver
/opt/vespa/bin/vespa-start-services
/opt/vespa/bin/vespa-deploy prepare /tmp/application.zip && /opt/vespa/bin/vespa-deploy activate

VESPA_CONFIGSERVERS: "vespa-0.vespa-internal.mynamespace.svc.cluster.local"

hosts.xml (tags not rendered):
admin-0 container-0 container-1 container-2 container-3 container-4 container-5 content-0 content-1 content-2 content-3 content-4 content-5 content-6 content-7 content-8 content-9 content-10 content-11 content-12 content-13 content-14 content-15 content-16 content-17

services.xml (tags not rendered):
false 3 0.90 3 0.7 16 8 8 3600000

nehajatav commented 2 years ago

https://docs.vespa.ai/en/operations/configuration-server.html#start-sequence is updated, please give it a go.

Both highlighted lines are present in the configserver logs:
[2022-04-29 11:33:10.052] INFO : configserver Container.com.yahoo.vespa.zookeeper.ZooKeeperRunner Starting ZooKeeper server with /opt/vespa/conf/zookeeper/zookeeper.cfg. Trying to establish ZooKeeper quorum (members: [vespa-0.vespa-internal.mynamespace.svc.cluster.local], attempt 1)
[2022-04-29 11:33:47.561] INFO : configserver Container.com.yahoo.container.handler.threadpool.DefaultContainerThreadpool Threadpool 'default-pool': min=8, max=200, queue=0
[2022-04-29 11:35:09.765] INFO : configserver Container.com.yahoo.vespa.config.server.tenant.TenantRepository Adding tenant 'default', created 2022-04-29T11:35:09.230Z. Bootstrapping in PT0.521946S
[2022-04-29 11:35:10.322] INFO : configserver Container.com.yahoo.vespa.config.server.rpc.RpcServer Rpc server will listen on port 19070
[2022-04-29 11:35:10.713] INFO : configserver Container.com.yahoo.vespa.config.server.ConfigServerBootstrap Config server upgrading from to 7.559.12. Redeploying all applications
[2022-04-29 11:35:12.590] INFO : configserver Container.com.yahoo.vespa.config.server.ConfigServerBootstrap Redeploying 0 apps [] with 4 threads
[2022-04-29 11:35:12.902] INFO : configserver Container.com.yahoo.vespa.config.server.ConfigServerBootstrap All applications redeployed successfully
[2022-04-29 11:35:13.055] INFO : configserver Container.com.yahoo.container.jdisc.state.StateMonitor Changing health status code from 'initializing' to 'up'
[2022-04-29 11:35:13.239] INFO : configserver Container.com.yahoo.jdisc.http.server.jetty.Janitor Creating janitor executor with 2 threads
[2022-04-29 11:35:13.422] INFO : configserver Container.com.yahoo.jdisc.http.server.jetty.JettyHttpServer Threadpool size: min=18, max=18
[2022-04-29 11:35:13.788] INFO : configserver Container.org.eclipse.jetty.server.Server jetty-9.4.45.v20220203; built: 2022-02-03T09:14:34.105Z; git: 4a0c91c0be53805e3fcffdcdcc9587d5301863db; jvm 11.0.14.1+1-LTS
[2022-04-29 11:35:13.964] INFO : configserver Container.org.eclipse.jetty.server.handler.ContextHandler Started o.e.j.s.ServletContextHandler@5a6d4dee{19071,/,null,AVAILABLE}
[2022-04-29 11:35:15.155] INFO : configserver Container.org.eclipse.jetty.server.AbstractConnector Started configserver@4a0ee0ba{HTTP/1.1, (http/1.1, h2c)}{0.0.0.0:19071}
[2022-04-29 11:35:16.404] INFO : configserver Container.org.eclipse.jetty.server.Server Started @191117ms
[2022-04-29 11:35:16.559] INFO : configserver Container.com.yahoo.container.jdisc.ConfiguredApplication Switching to the latest deployed set of configurations and components. Application config generation: 0

kkraune commented 2 years ago

Thanks. So we are now at this problem then, I guess:

/opt/vespa/bin/vespa-deploy -v -t 1800 prepare /tmp/application.zip
Output: # Prepare exited without a response

When you run this, does it wait for 1800 seconds or return faster? What are the logs on the config server when this happens?
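
If your vespa-logfmt supports service filtering (-S/--service, as used earlier in this thread - please verify with vespa-logfmt --help), something like this, run on vespa-0 right after the prepare returns, should show whatever the config server did log, including INFO:

```
/opt/vespa/bin/vespa-logfmt -l info,warning,error,fatal -S configserver | tail -n 200
```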

kkraune commented 2 years ago

also - is https://github.com/vespa-engine/vespa/issues/22291 the same issue as this?

nehajatav commented 2 years ago

also - is #22291 the same issue as this?

Yep, same issue - part of the same PoC.

when you run this, does it wait for 1800 seconds or return faster? what are the logs on the config server when this happens?

It returns faster than 1800s, but without any console output (not even "Empty response / command terminated with exit code 1"). That's the thing - config server logging is not helpful, no logs at all during the prepared API call.

kkraune commented 2 years ago

> Also the same docker image and application.zip works perfectly on some other k8s clusters we have but doesn't work on some

I am not sure we are able to provide more help, as it seems both the Vespa software and your configuration are OK.

Maybe the best bet is to look into the differences in environment where it fails / is OK. Also see #22085.

I suggest you try running on a stronger node with more resources and see if that works.

nehajatav commented 2 years ago

Update: @107dipan tried increasing the zookeeper barrier timeout and it helped deploys succeed, but only intermittently. Can you help us understand the documentation around this? What does 'servers' refer to in the quote below? Is it the ZooKeeper servers in a multi-node ZK cluster, or the other nodes in the Vespa cluster, like content/container nodes?

If the config servers are heavily loaded, or the applications being deployed are big, the internals of the server may time out when synchronizing with the other servers during deploy
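
One way to check whether the embedded ZooKeeper itself gets slow while a deploy runs is to sample its latency counters with the same four-letter command that already works in this setup - a sketch to run on vespa-0 during prepare:

```
# Print ZooKeeper latency and outstanding-request counters every 5 seconds
while true; do
  date
  echo srvr | nc -w 2 localhost 2181 | grep -E 'Latency|Outstanding'
  sleep 5
done
```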

kkraune commented 2 years ago

Servers in this context are the configuration servers, which in turn embed the ZooKeeper servers.

nehajatav commented 2 years ago

So in a single-config-server cluster this property is not relevant then, right?

kkraune commented 2 years ago

I am no ZooKeeper expert - I assume not ;-). Anyway, it is a good idea to allocate enough resources to the config servers to avoid timeouts.

nehajatav commented 2 years ago

We were able to get to the root cause of this issue by enabling logs for certain classes explicitly - it was due to very high read/write latency to the persistent volume in the clusters that had this issue. Our takeaway: logctl does NOT enable logging on all classes. You may want to put some kind of warning in your documentation:

component-specification specifies which sub-components of the service should be controlled. If empty, all components are controlled
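
For later readers, a sketch of the two checks that led us there - raising the log level for specific config server components with vespa-logctl (the component names are examples taken from the logs in this thread; please verify the exact syntax with vespa-logctl --help), and a crude synchronous-write probe of the persistent volume (assuming it is mounted under /opt/vespa/var):

```
# Enable debug logging per component; logctl acts on the named components only,
# not on all classes at once
/opt/vespa/bin/vespa-logctl configserver:com.yahoo.vespa.config.server debug=on
/opt/vespa/bin/vespa-logctl configserver:com.yahoo.vespa.zookeeper debug=on

# Rough write-latency probe of the persistent volume:
# 1000 synchronous 4 KiB writes; dd prints the elapsed time and throughput
dd if=/dev/zero of=/opt/vespa/var/zk-latency-test bs=4k count=1000 oflag=dsync
rm -f /opt/vespa/var/zk-latency-test
```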

kkraune commented 2 years ago

All right! It would be helpful if you could include some more details, i.e. how you want the doc updated. You can even make a PR so this is a smaller problem for the next users, or paste it here and I will update.