vespa-engine / vespa

AI + Data, online. https://vespa.ai

Cluster state is healthy but container unable to search with error 'Connection failure on nodes with distribution-keys: 0' #28833

Open nehajatav opened 9 months ago

nehajatav commented 9 months ago

Bug

  1. The container was unable to connect to searchnode and searches failed with error 'Connection failure on nodes with distribution-keys: 0'
  2. The storage node was healthy according to the cluster controller, yet this was the behaviour
  3. vespa-get-cluster-state -c testdoc -> Cluster testdoc is down. Too few nodes available.
  4. Tried vespa-sentinel-cmd restart searchnode - still the same, after confirming via the cluster controller that the node was back up
  5. Tried vespa-sentinel-cmd restart container - still the same after the container restart
  6. Finally, the steps below worked, but they cause cluster downtime (see the sketch right below): vespa-stop-services, vespa-stop-configserver, vespa-start-configserver, vespa-start-services
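
For reference, here is the full restart sequence from step 6 as a runnable sketch (these are exactly the commands listed above; on a single-node system they take the whole node down and back up):

```sh
# Full restart of all Vespa services and the config server on the node
# (causes downtime on a single-node system)
vespa-stop-services
vespa-stop-configserver
vespa-start-configserver
vespa-start-services
```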

To Reproduce: We have a single-node instance where there seems to have been a network blip, which resulted in this kind of scenario with a connection failure.

Node event log for storage.0:

```
Date (UTC)               Type      Node       Bucket space  Event
2023-10-07 05:54:44.809  CURRENT   storage.0  -             Altered node state in cluster state from 'U, b 11' to 'U, t 1696658082, b 11'
2023-10-07 05:54:44.808  CURRENT   storage.0  -             Altered min distribution bit count from 16 to 11
2023-10-07 05:54:44.808  REPORTED  storage.0  -             Now reporting state U, t 1696658082, b 11
2023-10-07 05:54:43.804  REPORTED  storage.0  -             Node got back into slobrok with same address as before: tcp/my.host.name:19106
2023-10-07 05:50:29.332  REPORTED  storage.0  -             Node is no longer in slobrok. No pending state request to node.
2023-10-07 05:50:29.326  CURRENT   storage.0  -             Altered min distribution bit count from 11 to 16
2023-10-07 05:50:29.326  REPORTED  storage.0  -             Failed to get node state: D: Connection error: java.net.SocketException: Connection reset
2023-10-07 05:50:29.326  REPORTED  storage.0  -             Failed get node state request: Connection error: java.net.SocketException: Connection reset
2023-10-06 11:34:46.325  CURRENT   storage.0  -             Altered node state in cluster state from 'U, t 1696592081, b 11' to 'U, b 11'
2023-10-06 11:34:46.105  CURRENT   storage.0  -             Altered node state in cluster state from 'M: Connection error: java.net.SocketException: Connection reset' to 'U, t 1696592081, b 11'
```

Cluster events:

```
Date (UTC)               Type             Node       Bucket space  Event
2023-10-07 05:54:44.809  SYSTEMSTATE      -          -             New cluster state version 10. Change from last: version: 9 => 10, storage: 0: startTimestamp: 0 => 1696658082
2023-10-07 05:54:44.809  CURRENT          storage.0  -             Altered node state in cluster state from 'U, b 11' to 'U, t 1696658082, b 11'
2023-10-07 05:54:44.808  CURRENT          storage.0  -             Altered min distribution bit count from 16 to 11
2023-10-07 05:54:43.804  REPORTED         storage.0  -             Node got back into slobrok with same address as before: tcp/my.host.name:19106
2023-10-07 05:50:29.332  REPORTED         storage.0  -             Node is no longer in slobrok. No pending state request to node.
2023-10-07 05:50:29.326  CURRENT          storage.0  -             Altered min distribution bit count from 11 to 16
2023-10-07 05:50:14.339  MASTER_ELECTION  -          -             This node is no longer fleetcontroller master.
2023-10-06 11:34:46.325  SYSTEMSTATE      -          -             New cluster state version 9. Change from last: version: 8 => 9, storage: 0: startTimestamp: 1696592081 => 0
2023-10-06 11:34:46.325  CURRENT          storage.0  -             Altered node state in cluster state from 'U, t 1696592081, b 11' to 'U, b 11'
```

Expected behavior

  1. The searches should have gone through, given that the state of the cluster was healthy
  2. The cluster should be restored in such scenarios with zero downtime - how can that be done?

Logs: Attaching the files below: searchnode.log, container.log, tracelevelsearch.txt

Environment:

Vespa version 8.221.29

kkraune commented 9 months ago

Thanks for reporting. My understanding is that this is a single-node system, where a content node was unavailable from 05:50:29 to 05:54:44. Queries fail after the node is up again, until all the Vespa processes are restarted.

I think you must analyze the vespa.log file to understand why the content node went offline, and what it logs after reporting up again after 05:54:44. Something made it stop at 05:50:29.

From the log messages, it seems like your file system is both slow:

WARNING searchnode proton.persistence.util Slow processing of message GetIterCommand(). Processing time: 117.2 s (>=5.0 s)

and maybe even losing files:

WARNING searchnode proton.searchlib.docstore.logdatastore We detected an empty idx file for part '/opt/vespa/var/db/vespa/search/cluster.testdoc/n0/documents/s1testdocument/1.removed/summary/1696591907655814983'. Erasing it.

Maybe you are overloading the cluster.

Generally, you cannot avoid cluster downtime with a single-node system.

nehajatav commented 9 months ago

Agree with the last statement, but even if it were a multi-node system (it will be for our QA and higher environments), the container's inability to connect to a healthy content node would result in availability loss. The content node was not offline; the pid didn't change in vespa.log. I suspect a momentary network blip may have caused the connection reset. The disk is local, and I don't expect any file loss either on a Saturday when no one is using the environment.

cc: @bratseth

kkraune commented 9 months ago

I agree with your statements; the container node should and will try to reconnect. In the attached container logs we see:

[2023-10-07 05:50:29.328] INFO container Container.com.yahoo.search.dispatch.rpc.RpcPing Pong 65946 from node 0 in group 0 with hostname my.host.name received too late, latest is 65947

[2023-10-07 05:58:38.849] WARNING container Container.com.yahoo.config.subscription.impl.JRTConfigRequester Request failed: Unknown error. Connection spec: tcp/localhost:19090, error message: Request timed out after 569.36 seconds.

indicating that the nodes try to reconnect. The first is container-to-searchnode, the second is container-to-config.

The error messages indicate that the node is very slow or overloaded (the ping messages are sent before the pong from the previous ping message is received).

nehajatav commented 9 months ago

If that were the case,

  1. a searchnode and container restart would have fixed the issue
  2. (I think) the ping and pong log messages would have been seen beyond the 05:50 timestamp

vekterli commented 9 months ago

Some questions:

Q1: What is the timeline of the observed behavior in relation to the attached logs/node events? At what points in this timeline did you observe network blips?

Q2: When you say "The storage was healthy as per clustercontroller", from where exactly was this information sourced? I see you included the output of vespa-get-cluster-state which indicates the entire cluster is down (which will be the case when 1 out of 1 node is unavailable). vespa-get-node-state -t storage -i 0 (where 0 in this case is the distribution key of the node) should give a more detailed description in such instances.
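
For reference, a quick sketch of how this state can be inspected directly on the node. The first two commands are the ones referenced above; the curl against the cluster controller's State REST API assumes the default status port 19050, which may differ in your setup:

```sh
# Overall cluster state as seen by the cluster controller
vespa-get-cluster-state -c testdoc

# Detailed reported/wanted state for the content node with distribution key 0
vespa-get-node-state -t storage -i 0

# The same node state via the cluster controller's State REST API
# (19050 is assumed to be the cluster controller status port)
curl http://localhost:19050/cluster/v2/testdoc/storage/0
```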

Q3: How many cluster controllers are you running? Based on your initial description I'm presuming 1, but the reason I'm asking is because of this log line:

2023-10-07 05:50:14.339 MASTER_ELECTION - - This node is no longer fleetcontroller master.

In a single-node setup this usually implies that connectivity was lost to ZooKeeper, which would prevent the cluster controller from performing the tasks required to mark the cluster as online.

Q4: Related to the above: in the cluster controller logs, are there messages indicating ZooKeeper connectivity loss? And if so, when does the log indicate that connectivity and cluster leadership were reestablished?

--

In the attached searchnode log I observe the restarted searchnode process emitting startup-related messages as expected at around 05:54:43:

[2023-10-07 05:54:43.068] INFO searchnode proton.persistence.filestor.manager Completed listing of 4192 buckets in 44 milliseconds
[2023-10-07 05:54:43.069] INFO searchnode proton.node.server Storage node ready. Done initializing. Giving out of sequence metric event. Config id is testdoc/storage/0

But I am not seeing a subsequent message that implies the node has actually been marked as online in the cluster. It would look something like this (with X being a published state version):

searchnode       proton.state.manager Transitioning from baseline state 'Down' to 'Up'  (cluster state version X)

The lack of such a message indicates that there is no cluster controller in charge of the cluster, which transitively prevents the search interface from coming online on the node.

Q5: When can such a log message finally be observed?
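
(A sketch for locating such a message, assuming the default log location under $VESPA_HOME/logs/vespa/; the message text is the one quoted above.)

```sh
# Search the Vespa log for the baseline state transition logged by proton
grep "Transitioning from baseline state" $VESPA_HOME/logs/vespa/vespa.log
```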

nehajatav commented 9 months ago

A1: The query issue was observed starting at 7 am UTC and stayed that way for more than 2 hours before we attempted the shutdown. Let's say something caused the connection reset at 5:50 am UTC (a network blip was a suspect, but I leave it to your interpretation).

A2: The cluster controller UI showed green, and the event logs suggest the same: Altered node state in cluster state from 'U, b 11' to 'U, t 1696658082, b 11'

A3: 1

A4/A5: You seem to be right about ZooKeeper. A few questions:

  1. What do you suspect caused the loss of the ZooKeeper connection, given that everything is on a single node?
  2. Does this mean that had I restarted the config services instead of searchnode/container, the issue could have been resolved?
  3. As long as the ZooKeeper cluster has a quorum in a multi-node cluster, this issue won't occur?
  4. Does this mean that the storage event log suggesting the state change is not enough by itself, and the distributor and SYSTEMSTATE have to detect the same?

Attaching logs from the cluster controller, from a time when the cluster was healthy until the searchnode was restarted.

clustercontroller.log

vekterli commented 9 months ago

What do you suspect caused the loss of the ZooKeeper connection, given that everything is on a single node?

When a content cluster is configured with multiple cluster controllers they use a ZooKeeper quorum to elect a leader so that only 1 cluster controller is authoritative at any given time. When there is only 1 controller configured, there is no competition for leadership, so the controller will always become a leader if it has ZooKeeper connectivity. So the only way to not get leadership in such a configuration is for ZooKeeper to be unavailable.

Does this mean that had I restarted the config services instead of searchnode/container, the issue could have been resolved?

The ZooKeeper instances used for cluster controllers are by default colocated in the same processes as the cluster controllers themselves. These both reside within the container-clustercontroller service. Assuming there was no persistent failure, restarting this particular service would likely have resolved the issue.
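
A sketch of such a targeted restart; the service name container-clustercontroller is what this service is typically called under the config sentinel, but the exact name on your node can be confirmed first:

```sh
# List the services managed by the config sentinel on this node
vespa-sentinel-cmd list

# Restart only the cluster controller container (which also hosts the
# embedded ZooKeeper server by default)
vespa-sentinel-cmd restart container-clustercontroller
```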

As long as the ZooKeeper cluster has a quorum in a multi-node cluster, this issue won't occur?

Correct. As long as there is an available majority quorum of ZooKeeper servers and cluster controllers, cluster operation will be effectively uninterrupted (aside from during the transient time periods where failover reelection happens—this will usually be in the expected range of tens of seconds at most). Requiring availability of a majority is also why it never makes sense to configure 2 or 4 cluster controllers, as this gives you no more availability than if you had configured 1 or 3.

Does this mean that the storage event log suggesting the state change is not enough by itself, and the distributor and SYSTEMSTATE have to detect the same?

When the cluster controller and/or the entire cluster is down, these logs may not necessarily reflect the current state of the cluster. This may be caused both by the controller not being in a leader state and by certain events being elided internally when the cluster is entirely down.

For a state change to be visible to the cluster, it has to be published to all nodes as part of a cluster state version change. This will be reflected in the SYSTEMSTATE log, but even more clearly in the Vespa log. Publishing a cluster state version requires a successful ZooKeeper write and will therefore not work if there's no connectivity.

Attaching logs from the cluster controller, from a time when the cluster was healthy until the searchnode was restarted.

From what I can see from these logs, ZooKeeper connectivity was lost and never restored for the duration of the log excerpt. This likely explains why the cluster did not return to an available state.

If you want to debug this further, Vespa leaves ZooKeeper logs available for inspection, both for the config server and the cluster controller. I would expect to find the latter in $VESPA_HOME/logs/vespa/zookeeper.container-clustercontroller.0.log. These logs may provide a better clue as to why ZooKeeper became unavailable.
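
A sketch for skimming those ZooKeeper logs for signs of trouble; the glob and the search terms are just assumptions about what to look for, not an exhaustive list:

```sh
# List the ZooKeeper logs written by the config server and cluster controller
ls -l $VESPA_HOME/logs/vespa/zookeeper.*

# Look for disconnects, session expiries and timeouts around the incident
grep -iE "disconnect|session expired|timeout" $VESPA_HOME/logs/vespa/zookeeper.*
```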

As a rule, we always run production clusters with 3 separate cluster controllers (meaning we also run with 3 ZooKeeper servers), ensuring that we can always tolerate the failure of 1 controller node without downtime. These are also on separate nodes from the content and container cluster nodes, to prevent overload on those nodes from affecting service availability (through spurious ZK timeouts, I/O starvation, etc.).