Closed — yashkasat96 closed this issue 3 years ago
Log messages like
[2021-06-10 09:27:39.927] WARNING : logd logdemon.vespalib.net.async_resolver could not resolve host name: 'vespa-admin.vespa-net'
[2021-06-10 09:27:39.928] WARNING : slobrok vespa-slobrok.vespalib.net.async_resolver could not resolve host name: 'vespa-configserver-b.vespa-net'
mean that something is wrong with the network configuration, so the hosts cannot talk to each other. I think you must resolve this problem first. Please verify by logging into a node and pinging the other nodes by name, then configure the network accordingly (hosts files etc.).
You can also review https://github.com/vespa-engine/sample-apps/tree/master/basic-search-on-docker-swarm, as it looks similar to your setup.
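To check name resolution before digging further, something like this can be run inside each container (a sketch; the hostnames are the ones from this thread, substitute your own):

```shell
#!/bin/sh
# Check that each cluster hostname resolves from this node.
# getent queries the same resolver path (/etc/hosts + DNS) the services use.
check_resolution() {
  for host in "$@"; do
    if getent hosts "$host" > /dev/null 2>&1; then
      echo "OK: $host resolves"
    else
      echo "FAIL: $host does not resolve"
    fi
  done
}

check_resolution vespa-admin.vespa-net \
                 vespa-configserver-a.vespa-net \
                 vespa-configserver-b.vespa-net
```

Any FAIL line here means the Vespa services on that node cannot reach the others either.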
I also recommend a much simpler node configuration, like in the sample app linked above - e.g.
<nodes>
<node hostalias="content0" distribution-key="0" />
<node hostalias="content1" distribution-key="1" />
<node hostalias="content2" distribution-key="2" />
</nodes>
The scenario/requirement is this: if 2 of the 3 instances I am using become unavailable for any reason, the system should still be able to serve requests to add, update, delete, read and search documents. So I shut down the 2 instances where the 'vespa-admin.vespa-net' and 'vespa-configserver-b.vespa-net' containers were running.
I see! That explains the non-connectivity. To make some progress, I suggest you inspect the cluster controller status pages, see https://docs.vespa.ai/en/operations/admin-procedures.html#status-pages - these show the cluster state as you stop nodes.
Use <searchable-copies>3</searchable-copies> for high availability in this case - this means all documents are always indexed on all three nodes.
I also think you should use <nodes> and not <group> to keep the config simpler for this experiment.
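Putting the two suggestions together, the content cluster section of services.xml could look roughly like this (a sketch - the cluster id, hostaliases and document type "name" are assumptions based on this thread; searchable-copies sits under engine/proton):

```xml
<content id="content" version="1.0">
  <redundancy>3</redundancy>
  <engine>
    <proton>
      <searchable-copies>3</searchable-copies>
    </proton>
  </engine>
  <documents>
    <document type="name" mode="index" />
  </documents>
  <nodes>
    <node hostalias="content0" distribution-key="0" />
    <node hostalias="content1" distribution-key="1" />
    <node hostalias="content2" distribution-key="2" />
  </nodes>
</content>
```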
I am quite sure that one node should be able to serve, but it seems like the cluster controller is not able to contact any of the 3 ZooKeeper instances, one of which should be local to the node.
It will help understanding if you also include hosts.xml, and maybe use a simple naming scheme like node1, node2 and node3:
<hosts>
<host name="vespa-admin.vespa-net">
<alias>admin0</alias>
</host>
<host name="vespa-configserver-a.vespa-net">
<alias>configserver1</alias>
</host>
<host name="vespa-configserver-b.vespa-net">
<alias>configserver2</alias>
</host>
</hosts>
Thanks. Please post your findings here after inspecting the cluster controller status pages.
vespa-model-inspect service container-clustercontroller
container-clustercontroller @ vespa-admin.vespa-net : admin
admin/cluster-controllers/0
tcp/vespa-admin.vespa-net:19050 (STATE EXTERNAL QUERY HTTP)
tcp/vespa-admin.vespa-net:19115 (EXTERNAL HTTP)
tcp/vespa-admin.vespa-net:19116 (ADMIN RPC)
container-clustercontroller @ vespa-configserver-a.vespa-net : admin
admin/cluster-controllers/1
tcp/vespa-configserver-a.vespa-net:19050 (STATE EXTERNAL QUERY HTTP)
tcp/vespa-configserver-a.vespa-net:19115 (EXTERNAL HTTP)
tcp/vespa-configserver-a.vespa-net:19116 (ADMIN RPC)
container-clustercontroller @ vespa-configserver-b.vespa-net : admin
admin/cluster-controllers/2
tcp/vespa-configserver-b.vespa-net:19050 (STATE EXTERNAL QUERY HTTP)
tcp/vespa-configserver-b.vespa-net:19115 (EXTERNAL HTTP)
tcp/vespa-configserver-b.vespa-net:19116 (ADMIN RPC)
curl http://vespa-configserver-a.vespa-net:19050/
{
"handlers" : [ {
"id" : "com.yahoo.container.usability.BindingsOverviewHandler",
"class" : "com.yahoo.container.usability.BindingsOverviewHandler",
"bundle" : "container-disc:7.0.0",
"serverBindings" : [ "http://*/" ]
}, {
"id" : "clustercontroller-state-restapi-v2",
"class" : "com.yahoo.vespa.clustercontroller.apps.clustercontroller.StateRestApiV2Handler",
"bundle" : "clustercontroller-apps:7.0.0",
"serverBindings" : [ ]
}, {
"id" : "reindexing-status",
"class" : "ai.vespa.reindexing.http.ReindexingV1ApiHandler",
"bundle" : "clustercontroller-reindexer:7.0.0",
"serverBindings" : [ ]
}, {
"id" : "com.yahoo.container.jdisc.state.StateHandler",
"class" : "com.yahoo.container.jdisc.state.StateHandler",
"bundle" : "container-disc:7.0.0",
"serverBindings" : [ "http://*/state/v1", "http://*/state/v1/*" ]
}, {
"id" : "clustercontroller-status",
"class" : "com.yahoo.vespa.clustercontroller.apps.clustercontroller.StatusHandler",
"bundle" : "clustercontroller-apps:7.0.0",
"serverBindings" : [ ]
}, {
"id" : "com.yahoo.container.handler.observability.ApplicationStatusHandler",
"class" : "com.yahoo.container.handler.observability.ApplicationStatusHandler",
"bundle" : "container-search-and-docproc:7.0.0",
"serverBindings" : [ "http://*/ApplicationStatus" ]
}, {
"id" : "com.yahoo.container.handler.VipStatusHandler",
"class" : "com.yahoo.container.handler.VipStatusHandler",
"bundle" : "container-disc:7.0.0",
"serverBindings" : [ "http://*/status.html" ]
} ]
}
curl http://vespa-configserver-a.vespa-net:19050/clustercontroller-status/v1/
<title>clusters</title>
<a href="./content">content</a><br>
Yes, so http://vespa-configserver-a.vespa-net:19050/clustercontroller-status/v1/content should show a UI with cluster states and more.
Thanks! We observe that both the distributor and the storage node are UP on vespa-configserver-a.vespa-net - that is good.
I still suggest you change the config from groups to nodes as suggested above - there is some logic that takes groups in and out of query serving. I don't think it matters here, but please do it anyway for simplicity and fewer sources of error.
You can verify that documents are available by logging into vespa-configserver-a.vespa-net and running vespa-visit -i
to list the IDs of stored documents - please post any error messages here.
id:name:name::001 (Last modified at 0)
id:name:name::029 (Last modified at 0)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (179.999 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (179.998 seconds expired); (RPC) Invocation timed out)
id:name:name::012 (Last modified at 0)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
id:name:name::032 (Last modified at 0)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
id:name:name::036 (Last modified at 0)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
id:name:name::018 (Last modified at 0)
id:name:name::037 (Last modified at 0)
id:name:name::005 (Last modified at 0)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
id:name:name::024 (Last modified at 0)
id:name:name::008 (Last modified at 0)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
id:name:name::027 (Last modified at 0)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
Visitor error (2021-06-14 07:16:37 UTC): TIMEOUT: ReturnCode(TIMEOUT, [from content node 1] A timeout occured while waiting for 'tcp/vespa-configserver-a.vespa-net:42197/visitor-1-1623654041125' (180 seconds expired); (RPC) Invocation timed out)
I have tried changing the configuration of the content nodes from group to nodes, but the issue persists.
The issue is that the query http://<ip>:8080/search/?yql=select * from sources * where sddocname contains 'name' limit 1;
returns only 27 records even though there are 40 records in total.
Also, when I try to add a document, I do not get a successful response back. This is the response I get when making the ADD API call:
{
"pathId": "/document/v1/name/name/docid/041",
"message": "Request timeout after 175000ms"
}
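For reference, a write like the one that timed out corresponds to a Document v1 call along these lines (a sketch built from the pathId above; the field name "title" is an assumption, use the fields from your own schema):

```shell
#!/bin/sh
# Build the Document v1 request matching the pathId in the timeout response.
DOC_URL="http://vespa-configserver-a.vespa-net:8080/document/v1/name/name/docid/041"
PAYLOAD='{"fields":{"title":"example document"}}'

echo "POST $DOC_URL"
echo "$PAYLOAD"

# Uncomment to run against a live cluster:
# curl -s -X POST -H "Content-Type: application/json" --data "$PAYLOAD" "$DOC_URL"
```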
However, the same record that I tried to add above is available afterwards when I fetch it using the GET API.
Also, the vespa-visit -i
command gives this kind of response. It contains the list of all the records present in the database:
id:name:name::031 (Last modified at 0)
id:name:name::003 (Last modified at 0)
id:name:name::021 (Last modified at 0)
id:name:name::011 (Last modified at 0)
id:name:name::013 (Last modified at 0)
id:name:name::025 (Last modified at 0)
id:name:name::038 (Last modified at 0)
id:name:name::012 (Last modified at 0)
id:name:name::019 (Last modified at 0)
id:name:name::039 (Last modified at 0)
id:name:name::036 (Last modified at 0)
id:name:name::006 (Last modified at 0)
id:name:name::007 (Last modified at 0)
id:name:name::018 (Last modified at 0)
id:name:name::024 (Last modified at 0)
id:name:name::014 (Last modified at 0)
id:name:name::010 (Last modified at 0)
id:name:name::005 (Last modified at 0)
id:name:name::032 (Last modified at 0)
id:name:name::037 (Last modified at 0)
id:name:name::004 (Last modified at 0)
id:name:name::040 (Last modified at 0)
id:name:name::008 (Last modified at 0)
id:name:name::034 (Last modified at 0)
id:name:name::027 (Last modified at 0)
id:name:name::001 (Last modified at 0)
id:name:name::009 (Last modified at 0)
id:name:name::029 (Last modified at 0)
id:name:name::028 (Last modified at 0)
id:name:name::020 (Last modified at 0)
id:name:name::030 (Last modified at 0)
id:name:name::015 (Last modified at 0)
id:name:name::023 (Last modified at 0)
id:name:name::022 (Last modified at 0)
id:name:name::016 (Last modified at 0)
id:name:name::033 (Last modified at 0)
id:name:name::026 (Last modified at 0)
id:name:name::017 (Last modified at 0)
id:name:name::035 (Last modified at 0)
id:name:name::002 (Last modified at 0)
Hi @kkraune, is there something else I need to check to find out where the problem is?
Hi, I am sorry for the late response. It seems you are making progress, as you get results, but there are so many moving parts and different capacity configurations that I have decided it is easier for me to just create a new sample application with these steps, using a laptop, Docker and 3 nodes, and detail the steps needed. That will also help the next user.
There are so many things that can fail (e.g. insufficient capacity can cause service failures), and it is difficult to enumerate them all.
It will take me some time to create this sample guide - in the meantime, a 3-node config with redundancy 3 will return the correct result set under normal conditions.
Thanks for the response. Please share the link to the guide whenever it is available.
Update: I figured it out. I am writing up a longer story/script; the short story is that ZooKeeper cannot work with only 1 out of 3 instances up, so the cluster state is never updated (the data is there, but not all nodes are queried). I am looking into alternative configs as well, so it will take me more time - but now I understand why your test fails. This is a great issue report! I will keep you posted.
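The arithmetic behind that: a ZooKeeper ensemble of n members needs a majority of floor(n/2) + 1 to operate, so a 3-node ensemble tolerates only a single failure - with 2 of 3 nodes down there is no quorum, and the cluster state cannot be updated:

```shell
#!/bin/sh
# Majority quorum for a ZooKeeper ensemble of size n is floor(n/2) + 1.
for n in 1 3 5; do
  quorum=$(( n / 2 + 1 ))
  echo "ensemble=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
done
```

This prints tolerated_failures=1 for a 3-node ensemble, which is why the 2-instances-down test cannot work with this topology.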
@yashkasat96 Please take a look at https://github.com/vespa-engine/sample-apps/tree/master/operations/multinode and let me know things to improve or clarify - thanks!
I had a requirement to deploy a highly available Vespa database on 3 instances. What I need is that if any 2 of the instances are down, the database should still be able to add, delete, update, read and search the whole data set.
I have deployed Vespa using Docker containers: 3 containers created on 3 different instances. To make it highly available, I made every instance a config node, container node and content node.
The configuration (services.xml) that I am using is:
The commands that I have been using to create the Docker containers are listed below. Instance-A:
docker run --detach --privileged --name vespa-admin --hostname vespa-admin.vespa-net --network=vespa-net --env VESPA_CONFIGSERVERS=vespa-admin.vespa-net,vespa-configserver-a.vespa-net,vespa-configserver-b.vespa-net --restart unless-stopped <other-arguments> vespaengine/vespa
Instance-B:
docker run --detach --privileged --name vespa-configserver-a --hostname vespa-configserver-a.vespa-net --network=vespa-net --env VESPA_CONFIGSERVERS=vespa-admin.vespa-net,vespa-configserver-a.vespa-net,vespa-configserver-b.vespa-net --restart unless-stopped <other-arguments> vespaengine/vespa
Instance-C:
docker run --detach --privileged --name vespa-configserver-b --hostname vespa-configserver-b.vespa-net --network=vespa-net --env VESPA_CONFIGSERVERS=vespa-admin.vespa-net,vespa-configserver-a.vespa-net,vespa-configserver-b.vespa-net --restart unless-stopped <other-arguments> vespaengine/vespa
Now, when I deploy the application, it gives out this:
Does any of this log info need to be handled so that it won't create any problems?
When I stop any 1 of the instances, requests (add, update, delete, read and search) are processed successfully by the other two nodes. But when I stop any 2 of the instances simultaneously, I am not able to get a response for the requests (add, update, read, delete), and the inactive records on those instances do not become active.
The logs from the command vespa-logfmt -l warning,error are shown below.
I was also not able to connect to ZooKeeper using 'bash /opt/vespa/bin/vespa-zkcli'.
What should I do to make this a highly available system? That is, if any 2 of the nodes become inactive for some reason, the system should still be available for all operations.