vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Wanted to deploy highly available Vespa database #18194

Closed yashkasat96 closed 3 years ago

yashkasat96 commented 3 years ago

I have a requirement to deploy a highly available Vespa database on 3 instances. What I need is that if any 2 of the instances are down, the database should still be able to add, delete, update, read and search across the whole data set.

I have deployed Vespa using Docker containers: 3 containers created on 3 different instances. To make it highly available, I made every instance a config node, a container node, and a content node.

The configuration (services.xml) that I am using is:

<services version="1.0">
    <admin version="2.0">
        <adminserver hostalias="admin0"/>

        <configservers>
            <configserver hostalias="admin0"/>
            <configserver hostalias="configserver1"/>
            <configserver hostalias="configserver2"/>
        </configservers>

    </admin>

    <container id="container" version="1.0">
        <document-api />
        <search/>
        <nodes>
            <node hostalias="admin0"/>
            <node hostalias="configserver1"/>
            <node hostalias="configserver2"/>
        </nodes>
    </container>

    <content id="content" version="1.0">
        <documents>
            <document type="document_name" mode="index" />
        </documents>

        <redundancy>3</redundancy>
        <engine>
            <proton>
                <searchable-copies>1</searchable-copies>
                <resource-limits>
                    <disk>0.90</disk>
                    <memory>0.90</memory>
                </resource-limits>
                <tuning>
                    <searchnode>
                        <feeding>
                            <concurrency>0.70</concurrency>
                        </feeding>
                    </searchnode>
                </tuning>
            </proton>
        </engine>

        <group name="top-group">
            <distribution partitions="*"/>
            <group name="group0" distribution-key="0">
                <node hostalias="admin0" distribution-key="0"/>
                <node hostalias="configserver1" distribution-key="1"/>
                <node hostalias="configserver2" distribution-key="2"/>
            </group>
        </group>
    </content>
</services>

The commands that I have been using to create the Docker containers are mentioned below.

Instance-A: docker run --detach --privileged --name vespa-admin --hostname vespa-admin.vespa-net --network=vespa-net --env VESPA_CONFIGSERVERS=vespa-admin.vespa-net,vespa-configserver-a.vespa-net,vespa-configserver-b.vespa-net --restart unless-stopped <other-arguments> vespaengine/vespa

Instance-B: docker run --detach --privileged --name vespa-configserver-a --hostname vespa-configserver-a.vespa-net --network=vespa-net --env VESPA_CONFIGSERVERS=vespa-admin.vespa-net,vespa-configserver-a.vespa-net,vespa-configserver-b.vespa-net --restart unless-stopped <other-arguments> vespaengine/vespa

Instance-C: docker run --detach --privileged --name vespa-configserver-b --hostname vespa-configserver-b.vespa-net --network=vespa-net --env VESPA_CONFIGSERVERS=vespa-admin.vespa-net,vespa-configserver-a.vespa-net,vespa-configserver-b.vespa-net --restart unless-stopped <other-arguments> vespaengine/vespa

Now when I deploy the application, it outputs this:

INFO: 'distribution-key' attribute on a content cluster's root group is ignored
INFO: When having content clusters and more than 1 config server it is recommended to configure cluster controllers explicitly.
WARNING: Directory searchdefinitions/ should not be used for schemas, use schemas/ instead

Does any of this log info need to be handled so that it won't create any problems?
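
For reference, I believe the explicit cluster controller configuration that the second INFO line recommends would go in the admin section, roughly like this (same host aliases as above; just a sketch, I have not applied it):

<admin version="2.0">
    <adminserver hostalias="admin0"/>

    <configservers>
        <configserver hostalias="admin0"/>
        <configserver hostalias="configserver1"/>
        <configserver hostalias="configserver2"/>
    </configservers>

    <cluster-controllers>
        <cluster-controller hostalias="admin0"/>
        <cluster-controller hostalias="configserver1"/>
        <cluster-controller hostalias="configserver2"/>
    </cluster-controllers>
</admin>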

When I stop any 1 of the instances, the requests (add, update, delete, read and search) are processed successfully on the other two nodes. But when I stop any 2 of the instances simultaneously, I am not able to get a response for the requests (add, update, read, delete), and the inactive records do not become active.

The logs from the command vespa-logfmt -l warning,error are shown below.

[2021-06-10 09:24:28.842] WARNING : container-clustercontroller stderr  SLF4J: Class path contains multiple SLF4J bindings.
[2021-06-10 09:24:28.855] WARNING : container-clustercontroller stderr  SLF4J: Found binding in [bundle://d8dffcce-c5c8-4440-82a4-f4b11e547eb3_17.0:6/org/slf4j/impl/StaticLoggerBinder.class]
[2021-06-10 09:24:28.855] WARNING : container-clustercontroller stderr  SLF4J: Found binding in [bundle://d8dffcce-c5c8-4440-82a4-f4b11e547eb3_17.0:8/org/slf4j/impl/StaticLoggerBinder.class]
[2021-06-10 09:24:28.855] WARNING : container-clustercontroller stderr  SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
[2021-06-10 09:24:28.855] WARNING : container-clustercontroller stderr  SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]
[2021-06-10 09:24:34.229] WARNING : configserver     stderr SLF4J: Class path contains multiple SLF4J bindings.
[2021-06-10 09:24:34.229] WARNING : configserver     stderr SLF4J: Found binding in [bundle://c90c10ab-d2f6-43eb-88c2-660ddaab90d1_22.0:17/org/slf4j/impl/StaticLoggerBinder.class]
[2021-06-10 09:24:34.229] WARNING : configserver     stderr SLF4J: Found binding in [bundle://c90c10ab-d2f6-43eb-88c2-660ddaab90d1_22.0:18/org/slf4j/impl/StaticLoggerBinder.class]
[2021-06-10 09:24:34.229] WARNING : configserver     stderr SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
[2021-06-10 09:24:34.231] WARNING : configserver     stderr SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]
[2021-06-10 09:24:34.274] WARNING : configserver     stderr SLF4J: Class path contains multiple SLF4J bindings.
[2021-06-10 09:24:34.274] WARNING : configserver     stderr SLF4J: Found binding in [bundle://c90c10ab-d2f6-43eb-88c2-660ddaab90d1_21.0:6/org/slf4j/impl/StaticLoggerBinder.class]
[2021-06-10 09:24:34.274] WARNING : configserver     stderr SLF4J: Found binding in [bundle://c90c10ab-d2f6-43eb-88c2-660ddaab90d1_21.0:8/org/slf4j/impl/StaticLoggerBinder.class]
[2021-06-10 09:24:34.274] WARNING : configserver     stderr SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
[2021-06-10 09:24:34.278] WARNING : configserver     stderr SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]
[2021-06-10 09:24:37.865] WARNING : container        Container.com.yahoo.search.cluster.BaseNodeMonitor Taking search node key = 1 hostname = vespa-configserver-a.vespa-net path = 1 in group 0 statusIsKnown = false working = true activeDocs = 0 out of service: Connection failure: 10: Backend communication error: Node id 1 reports being offline
[2021-06-10 09:27:25.172] WARNING : container        Container.com.yahoo.search.cluster.BaseNodeMonitor Taking search node key = 2 hostname = vespa-configserver-b.vespa-net path = 2 in group 0 statusIsKnown = true working = true activeDocs = 7 out of service: Connection failure: 10: Backend communication error: Error response from rpc node connection to vespa-configserver-b.vespa-net:19106: Connection error
[2021-06-10 09:27:38.751] WARNING : distributor      vds.vespalib.net.async_resolver    could not resolve host name: 'vespa-configserver-b.vespa-net'
[2021-06-10 09:27:38.752] WARNING : distributor      vds.vespalib.net.async_resolver    could not resolve host name: 'vespa-configserver-b.vespa-net'
[2021-06-10 09:27:38.923] WARNING : logd             logdemon.logd.rpc_forwarder    Error in rpc reply from logserver ('tcp/vespa-admin.vespa-net:19080'): '(RPC) Connection error'
[2021-06-10 09:27:39.204] WARNING : container        Container.com.yahoo.search.cluster.BaseNodeMonitor Taking search node key = 0 hostname = vespa-admin.vespa-net path = 0 in group 0 statusIsKnown = true working = true activeDocs = 6 out of service: Connection failure: 10: Backend communication error: Error response from rpc node connection to vespa-admin.vespa-net:19106: Connection error
[2021-06-10 09:27:39.927] WARNING : logd             logdemon.vespalib.net.async_resolver   could not resolve host name: 'vespa-admin.vespa-net'
[2021-06-10 09:27:39.928] WARNING : slobrok          vespa-slobrok.vespalib.net.async_resolver  could not resolve host name: 'vespa-configserver-b.vespa-net'
[2021-06-10 09:27:40.914] WARNING : container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler   Fleetcontroller 1: Failed to connect to ZooKeeper at vespa-admin.vespa-net:2181,vespa-configserver-a.vespa-net:2181,vespa-configserver-b.vespa-net:2181 with session timeout 30000: java.lang.NullPointerException\n\tat org.apache.zookeeper.ClientCnxnSocketNetty.onClosing(ClientCnxnSocketNetty.java:247)\n\tat org.apache.zookeeper.ClientCnxn$SendThread.close(ClientCnxn.java:1465)\n\tat org.apache.zookeeper.ClientCnxn.disconnect(ClientCnxn.java:1508)\n\tat org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1537)\n\tat org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:1614)\n\tat com.yahoo.vespa.clustercontroller.core.database.ZooKeeperDatabase.<init>(ZooKeeperDatabase.java:120)\n\tat com.yahoo.vespa.clustercontroller.core.database.ZooKeeperDatabaseFactory.create(ZooKeeperDatabaseFactory.java:8)\n\tat com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler.connect(DatabaseHandler.java:197)\n\tat com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler.doNextZooKeeperTask(DatabaseHandler.java:252)\n\tat com.yahoo.vespa.clustercontroller.core.FleetController.tick(FleetController.java:604)\n\tat com.yahoo.vespa.clustercontroller.core.FleetController.run(FleetController.java:1127)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\n
[2021-06-10 09:27:50.048] WARNING : container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler   Fleetcontroller 1: Failed to connect to ZooKeeper at vespa-admin.vespa-net:2181,vespa-configserver-a.vespa-net:2181,vespa-configserver-b.vespa-net:2181 with session timeout 30000: java.lang.NullPointerException\n\tat org.apache.zookeeper.ClientCnxnSocketNetty.onClosing(ClientCnxnSocketNetty.java:247)\n\tat org.apache.zookeeper.ClientCnxn$SendThread.close(ClientCnxn.java:1465)\n\tat org.apache.zookeeper.ClientCnxn.disconnect(ClientCnxn.java:1508)\n\tat org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1537)\n\tat org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:1614)\n\tat com.yahoo.vespa.clustercontroller.core.database.ZooKeeperDatabase.<init>(ZooKeeperDatabase.java:120)\n\tat com.yahoo.vespa.clustercontroller.core.database.ZooKeeperDatabaseFactory.create(ZooKeeperDatabaseFactory.java:8)\n\tat com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler.connect(DatabaseHandler.java:197)\n\tat com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler.doNextZooKeeperTask(DatabaseHandler.java:252)\n\tat com.yahoo.vespa.clustercontroller.core.FleetController.tick(FleetController.java:604)\n\tat com.yahoo.vespa.clustercontroller.core.FleetController.run(FleetController.java:1127)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\n
[2021-06-10 09:27:54.732] WARNING : slobrok          vespa-slobrok.vespalib.net.async_resolver  could not resolve host name: 'vespa-admin.vespa-net'
[2021-06-10 09:28:00.070] WARNING : container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler   Fleetcontroller 1: Failed to connect to ZooKeeper at vespa-admin.vespa-net:2181,vespa-configserver-a.vespa-net:2181,vespa-configserver-b.vespa-net:2181 with session timeout 30000: java.lang.NullPointerException\n\tat org.apache.zookeeper.ClientCnxnSocketNetty.onClosing(ClientCnxnSocketNetty.java:247)\n\tat org.apache.zookeeper.ClientCnxn$SendThread.close(ClientCnxn.java:1465)\n\tat org.apache.zookeeper.ClientCnxn.disconnect(ClientCnxn.java:1508)\n\tat org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1537)\n\tat org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:1614)\n\tat com.yahoo.vespa.clustercontroller.core.database.ZooKeeperDatabase.<init>(ZooKeeperDatabase.java:120)\n\tat com.yahoo.vespa.clustercontroller.core.database.ZooKeeperDatabaseFactory.create(ZooKeeperDatabaseFactory.java:8)\n\tat com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler.connect(DatabaseHandler.java:197)\n\tat com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler.doNextZooKeeperTask(DatabaseHandler.java:252)\n\tat com.yahoo.vespa.clustercontroller.core.FleetController.tick(FleetController.java:604)\n\tat com.yahoo.vespa.clustercontroller.core.FleetController.run(FleetController.java:1127)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\n
[2021-06-10 09:28:09.974] WARNING : container-clustercontroller Container.com.yahoo.vespa.curator.Curator   ZK connection state change: LOST
[2021-06-10 09:28:10.034] WARNING : configserver     Container.com.yahoo.vespa.curator.Curator  ZK connection state change: LOST
[2021-06-10 09:28:10.043] WARNING : container-clustercontroller Container.com.yahoo.vespa.curator.Curator   ZK connection state change: LOST
[2021-06-10 09:28:10.100] WARNING : container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler   Fleetcontroller 1: Failed to connect to ZooKeeper at vespa-admin.vespa-net:2181,vespa-configserver-a.vespa-net:2181,vespa-configserver-b.vespa-net:2181 with session timeout 30000: KeeperErrorCode = ConnectionLoss for /vespa
[2021-06-10 09:28:20.121] WARNING : container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler   Fleetcontroller 1: Failed to connect to ZooKeeper at vespa-admin.vespa-net:2181,vespa-configserver-a.vespa-net:2181,vespa-configserver-b.vespa-net:2181 with session timeout 30000: java.lang.NullPointerException\n\tat org.apache.zookeeper.ClientCnxnSocketNetty.onClosing(ClientCnxnSocketNetty.java:247)\n\tat org.apache.zookeeper.ClientCnxn$SendThread.close(ClientCnxn.java:1465)\n\tat org.apache.zookeeper.ClientCnxn.disconnect(ClientCnxn.java:1508)\n\tat org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1537)\n\tat org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:1614)\n\tat com.yahoo.vespa.clustercontroller.core.database.ZooKeeperDatabase.<init>(ZooKeeperDatabase.java:120)\n\tat com.yahoo.vespa.clustercontroller.core.database.ZooKeeperDatabaseFactory.create(ZooKeeperDatabaseFactory.java:8)\n\tat com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler.connect(DatabaseHandler.java:197)\n\tat com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler.doNextZooKeeperTask(DatabaseHandler.java:252)\n\tat com.yahoo.vespa.clustercontroller.core.FleetController.tick(FleetController.java:604)\n\tat com.yahoo.vespa.clustercontroller.core.FleetController.run(FleetController.java:1127)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\n
[2021-06-10 09:28:30.686] WARNING : container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.database.DatabaseHandler   Fleetcontroller 1: Failed to connect to ZooKeeper at vespa-admin.vespa-net:2181,vespa-configserver-a.vespa-net:2181,vespa-configserver-b.vespa-net:2181 with session timeout 30000: KeeperErrorCode = ConnectionLoss for /vespa

I was also not able to connect to ZooKeeper using 'bash /opt/vespa/bin/vespa-zkcli'.

Connecting to localhost:2181
Welcome to ZooKeeper!
JLine support is disabled
[WARN ] Session 0x0 for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
EndOfStreamException: channel for sessionid 0x0 is lost
    at org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:285)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290)
[WARN ] Session 0x0 for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
EndOfStreamException: channel for sessionid 0x0 is lost
    at org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:285)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290)
[WARN ] Session 0x0 for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
EndOfStreamException: channel for sessionid 0x0 is lost
    at org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:285)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290)

What should I do to make this a highly available system? That is, if any 2 of the nodes become inactive for some reason, the system should still be available for all operations.

kkraune commented 3 years ago

log messages like

[2021-06-10 09:27:39.927] WARNING : logd             logdemon.vespalib.net.async_resolver   could not resolve host name: 'vespa-admin.vespa-net'
[2021-06-10 09:27:39.928] WARNING : slobrok          vespa-slobrok.vespalib.net.async_resolver  could not resolve host name: 'vespa-configserver-b.vespa-net'

mean that something is wrong with the network configuration, so the hosts cannot all talk to each other. I think you must resolve this problem first. Please verify by logging into a node and pinging the other nodes by name, and configure the network accordingly (hostname files etc.)

You can review https://github.com/vespa-engine/sample-apps/tree/master/basic-search-on-docker-swarm as well; it looks similar to this setup.

kkraune commented 3 years ago

I also recommend a much simpler node configuration like in the sample app I linked above - e.g.

<nodes>
    <node hostalias="content0" distribution-key="0" />
    <node hostalias="content1" distribution-key="1" />
    <node hostalias="content2" distribution-key="2" />
</nodes>

yashkasat96 commented 3 years ago

The scenario/requirement is that if 2 of the 3 instances I am using are not available for any reason, the system should still be able to serve requests to add, update, delete, read and search documents. So, I shut down the 2 instances where the 'vespa-admin.vespa-net' and 'vespa-configserver-b.vespa-net' containers were running.

kkraune commented 3 years ago

I see! That explains the non-connectivity. To make some progress, I suggest you inspect the clustercontroller status pages, see https://docs.vespa.ai/en/operations/admin-procedures.html#status-pages - this shows the cluster state as you stop nodes.

Use <searchable-copies>3</searchable-copies> for high availability in this case - this means all documents are always indexed on all three nodes.

I also think you should use <nodes> and not <group> to make the config simpler for this experiment.
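
Concretely - just a sketch reusing the host aliases from your services.xml above, not a tested config - the content cluster could then look something like:

<content id="content" version="1.0">
    <documents>
        <document type="document_name" mode="index" />
    </documents>

    <redundancy>3</redundancy>
    <engine>
        <proton>
            <searchable-copies>3</searchable-copies>
        </proton>
    </engine>

    <nodes>
        <node hostalias="admin0" distribution-key="0"/>
        <node hostalias="configserver1" distribution-key="1"/>
        <node hostalias="configserver2" distribution-key="2"/>
    </nodes>
</content>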

I am quite sure that one node should be able to serve, but it seems like the clustercontroller is not able to contact any of the 3 ZooKeeper instances, one of which should be local to the node.

It will help my understanding if you also include hosts.xml, and maybe use a simple naming scheme like node1, node2 and node3.

yashkasat96 commented 3 years ago

<hosts>

  <host name="vespa-admin.vespa-net">
    <alias>admin0</alias>
  </host>

  <host name="vespa-configserver-a.vespa-net">
    <alias>configserver1</alias>
  </host>

  <host name="vespa-configserver-b.vespa-net">
    <alias>configserver2</alias>
  </host>

</hosts>

kkraune commented 3 years ago

Thanks. Please post your findings here after inspecting the clustercontroller status pages.

yashkasat96 commented 3 years ago

vespa-model-inspect service container-clustercontroller

container-clustercontroller @ vespa-admin.vespa-net : admin
admin/cluster-controllers/0
    tcp/vespa-admin.vespa-net:19050 (STATE EXTERNAL QUERY HTTP)
    tcp/vespa-admin.vespa-net:19115 (EXTERNAL HTTP)
    tcp/vespa-admin.vespa-net:19116 (ADMIN RPC)
container-clustercontroller @ vespa-configserver-a.vespa-net : admin
admin/cluster-controllers/1
    tcp/vespa-configserver-a.vespa-net:19050 (STATE EXTERNAL QUERY HTTP)
    tcp/vespa-configserver-a.vespa-net:19115 (EXTERNAL HTTP)
    tcp/vespa-configserver-a.vespa-net:19116 (ADMIN RPC)
container-clustercontroller @ vespa-configserver-b.vespa-net : admin
admin/cluster-controllers/2
    tcp/vespa-configserver-b.vespa-net:19050 (STATE EXTERNAL QUERY HTTP)
    tcp/vespa-configserver-b.vespa-net:19115 (EXTERNAL HTTP)
    tcp/vespa-configserver-b.vespa-net:19116 (ADMIN RPC)

curl http://vespa-configserver-a.vespa-net:19050/

{
  "handlers" : [ {
    "id" : "com.yahoo.container.usability.BindingsOverviewHandler",
    "class" : "com.yahoo.container.usability.BindingsOverviewHandler",
    "bundle" : "container-disc:7.0.0",
    "serverBindings" : [ "http://*/" ]
  }, {
    "id" : "clustercontroller-state-restapi-v2",
    "class" : "com.yahoo.vespa.clustercontroller.apps.clustercontroller.StateRestApiV2Handler",
    "bundle" : "clustercontroller-apps:7.0.0",
    "serverBindings" : [ ]
  }, {
    "id" : "reindexing-status",
    "class" : "ai.vespa.reindexing.http.ReindexingV1ApiHandler",
    "bundle" : "clustercontroller-reindexer:7.0.0",
    "serverBindings" : [ ]
  }, {
    "id" : "com.yahoo.container.jdisc.state.StateHandler",
    "class" : "com.yahoo.container.jdisc.state.StateHandler",
    "bundle" : "container-disc:7.0.0",
    "serverBindings" : [ "http://*/state/v1", "http://*/state/v1/*" ]
  }, {
    "id" : "clustercontroller-status",
    "class" : "com.yahoo.vespa.clustercontroller.apps.clustercontroller.StatusHandler",
    "bundle" : "clustercontroller-apps:7.0.0",
    "serverBindings" : [ ]
  }, {
    "id" : "com.yahoo.container.handler.observability.ApplicationStatusHandler",
    "class" : "com.yahoo.container.handler.observability.ApplicationStatusHandler",
    "bundle" : "container-search-and-docproc:7.0.0",
    "serverBindings" : [ "http://*/ApplicationStatus" ]
  }, {
    "id" : "com.yahoo.container.handler.VipStatusHandler",
    "class" : "com.yahoo.container.handler.VipStatusHandler",
    "bundle" : "container-disc:7.0.0",
    "serverBindings" : [ "http://*/status.html" ]
  } ]
}

curl http://vespa-configserver-a.vespa-net:19050/clustercontroller-status/v1/

<title>clusters</title>
<a href="./content">content</a><br>

kkraune commented 3 years ago

Yes, so http://vespa-configserver-a.vespa-net:19050/clustercontroller-status/v1/content should show a UI with cluster states and more.

yashkasat96 commented 3 years ago

content Cluster Controller 1 Status Page.pdf

kkraune commented 3 years ago

Thanks! We observe that both the distributor and the storage node are UP on vespa-configserver-a.vespa-net - that is good.

I still suggest you change the config from groups to nodes as suggested above - there is some logic for taking groups in and out of query serving. I don't think it matters here, but please do it anyway for simplicity and fewer sources of error.

You can verify that documents are available by logging into vespa-configserver-a.vespa-net and running vespa-visit -i to list the IDs of the stored documents - please post the error message here, if any.

yashkasat96 commented 3 years ago

id:name:name::001 (Last modified at 0)
id:name:name::029 (Last modified at 0)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (179.999 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (179.998 seconds expired); (RPC) Invocation timed out)
id:name:name::012 (Last modified at 0)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
id:name:name::032 (Last modified at 0)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
id:name:name::036 (Last modified at 0)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
id:name:name::018 (Last modified at 0)
id:name:name::037 (Last modified at 0)
id:name:name::005 (Last modified at 0)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
id:name:name::024 (Last modified at 0)
id:name:name::008 (Last modified at 0)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
id:name:name::027 (Last modified at 0)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)

yashkasat96 commented 3 years ago

I have tried changing the content cluster configuration from group to nodes, but the issue still persists.

The issue is that the query http://<ip>:8080/search/?yql=select * from sources * where sddocname contains 'name' limit 1; returns only 27 records even though there are 40 records in total. Also, when I try to add a document, I don't get a successful response back. The response that I get when making the ADD API call:

{
    "pathId": "/document/v1/name/name/docid/041",
    "message": "Request timeout after 175000ms"
}

However, the same record that I tried to add above is available afterward when I fetch it using the GET API.

Also, the vespa-visit -i command gives this kind of response. It contains the list of all the records present in the database.

id:name:name::031 (Last modified at 0)
id:name:name::003 (Last modified at 0)
id:name:name::021 (Last modified at 0)
id:name:name::011 (Last modified at 0)
id:name:name::013 (Last modified at 0)
id:name:name::025 (Last modified at 0)
id:name:name::038 (Last modified at 0)
id:name:name::012 (Last modified at 0)
id:name:name::019 (Last modified at 0)
id:name:name::039 (Last modified at 0)
id:name:name::036 (Last modified at 0)
id:name:name::006 (Last modified at 0)
id:name:name::007 (Last modified at 0)
id:name:name::018 (Last modified at 0)
id:name:name::024 (Last modified at 0)
id:name:name::014 (Last modified at 0)
id:name:name::010 (Last modified at 0)
id:name:name::005 (Last modified at 0)
id:name:name::032 (Last modified at 0)
id:name:name::037 (Last modified at 0)
id:name:name::004 (Last modified at 0)
id:name:name::040 (Last modified at 0)
id:name:name::008 (Last modified at 0)
id:name:name::034 (Last modified at 0)
id:name:name::027 (Last modified at 0)
id:name:name::001 (Last modified at 0)
id:name:name::009 (Last modified at 0)
id:name:name::029 (Last modified at 0)
id:name:name::028 (Last modified at 0)
id:name:name::020 (Last modified at 0)
id:name:name::030 (Last modified at 0)
id:name:name::015 (Last modified at 0)
id:name:name::023 (Last modified at 0)
id:name:name::022 (Last modified at 0)
id:name:name::016 (Last modified at 0)
id:name:name::033 (Last modified at 0)
id:name:name::026 (Last modified at 0)
id:name:name::017 (Last modified at 0)
id:name:name::035 (Last modified at 0)
id:name:name::002 (Last modified at 0)

yashkasat96 commented 3 years ago

Hi @kkraune, is there something else that I need to check to find out where the problem is?

kkraune commented 3 years ago

Hi, I am sorry for the late response. It seems you are making progress, as you get results, but there are so many different configuration / capacity variables that I have decided it is easier for me to just create a new sample application for this setup, using a laptop, Docker and 3 nodes, and detail the steps needed. That will also help the next user.

There are so many things that can fail (e.g. insufficient capacity can cause service failures), and it is difficult to enumerate them all.

It will take me some time to create this sample guide - in the meantime, a 3-node config with redundancy 3 will return the correct result set under normal conditions.

yashkasat96 commented 3 years ago

Thanks for the response. Please share the link to the guide whenever it is available.

kkraune commented 3 years ago

Update: I figured it out. I am writing up a longer story/script; the short story is that ZooKeeper cannot maintain a quorum with only 1 out of 3 servers up, so the cluster state is never updated (the data is there, but not all nodes are queried). I am looking into alternative configs as well, so it will take me more time - but now I understand why your test fails. This is a great issue report! I will keep you posted.

kkraune commented 3 years ago

@yashkasat96 Please take a look at https://github.com/vespa-engine/sample-apps/tree/master/operations/multinode and let me know things to improve or clarify - thanks!