vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

How to use node-repository when there are a lot of nodes? #5507

Closed knowmara closed 6 years ago

knowmara commented 6 years ago

Hi all, we have many instance nodes, so statically specifying the service nodes can be a hassle. I found that the node-repository comment says it can help allocate nodes. Does using node-repository in services.xml require a custom config model plugin or some special elements in services.xml? Thanks.

bratseth commented 6 years ago

Yes, it needs something like this services.xml snippet included in your services.xml: https://github.com/vespa-engine/vespa/blob/master/node-repository/src/main/config/node-repository.xml

knowmara commented 6 years ago

Thanks, I will try it.

knowmara commented 6 years ago

Hi @bratseth, I have a couple of questions:

  1. What is the detailed definition of hosted Vespa, and what is the difference between a non-hosted system and hosted Vespa?
  2. When I set cloudconfig_server.hosted_vespa in default-env.txt and redeploy an application, I am prompted: Invalid application package: [tenant.env]: Error loading model: Allocating a single host in a hosted environment is not possible. What does a single host mean here? My services.xml is attached: services.xml.txt Thanks.
bratseth commented 6 years ago
  1. Hosted Vespa is our Vespa PaaS. The core code for it is part of Vespa. It uses multitenant shared config server clusters and a node repository (using the setup pointed to above) to dynamically allocate nodes to applications. The applications do not use a hosts.xml file and node lists, but instead specify node counts in services.xml.
  2. Hosted Vespa + node lists & hosts.xml files is not a supported combination (this error message comes from attempting to allocate each node mentioned in the node list in services.xml).
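As a sketch of the difference, a hosted-style services.xml requests nodes by count and flavor instead of listing hosts (the cluster id and flavor name here are illustrative, not from the source):

```xml
<?xml version="1.0" encoding="utf-8" ?>
<services version="1.0">
  <container id="default" version="1.0">
    <search />
    <!-- Request 2 nodes from the node repository instead of enumerating hosts -->
    <nodes count="2" flavor="large"/>
  </container>
</services>
```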
knowmara commented 6 years ago

Thank you for your detailed introduction. Following the above example, I have not managed to configure it successfully. It seems to lack some important configuration items, such as the number of proton instances. Is there a sample services.xml configuration using node-repository, or some docs about how to use node-repository? I would be very grateful.

bratseth commented 6 years ago

You can find plenty of examples of both hosted and non-hosted setups in com.yahoo.config.model.provision.ModelProvisioningTest (The hosted tests are those where nodes are configured with a count instead of a node list.)

knowmara commented 6 years ago

OK, thanks.

knowmara commented 6 years ago

Invalid application package: default.default: Error loading model: Unknown flavor 'default'. Flavors are []

I encountered the problem above when deploying an application whose services.xml is configured as com.yahoo.config.model.provision.ModelProvisioningTest shows. Do I need to add a flavors configuration file in the $VESPA_HOME/conf/configserver-app/ directory? Perhaps its name is node-flavors.xml (however, I have not found any example of it)? One more question: when I use node-repository, who is responsible for adding nodes to the node repository? I found com.yahoo.vespa.hosted.provision.restapi.v2.NodesApiHandler, which offers a URI for adding nodes, but it is not clear which component would use that URI.

bratseth commented 6 years ago

Flavors:

General FYI: all the config files that have some meaning in an application package are listed in com.yahoo.config.application.api.ApplicationPackage. The totality of these causes Vespa config instances to be created for the various processes. This translation is the responsibility of the config model. In addition, you can set specific configs directly in services.xml using the general config syntax - see the bottom of http://docs.vespa.ai/documentation/reference/services.html

"flavors" is just another of these configs, specifically config-provisioning/src/main/resources/configdefinitions/flavors.def

So, you can add flavors to your (config server) services.xml using the generic config syntax to add fields according to the spec in flavors.def.
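As a sketch, a flavor declaration using the generic config syntax could look like this (field names follow flavors.def as used later in this thread; the flavor name and resource values are illustrative):

```xml
<!-- Added inside the config server's services.xml, under the jdisc/container element -->
<config name="config.provisioning.flavors">
  <flavor>
    <item>
      <name>large</name>
      <environment>VIRTUAL_MACHINE</environment>
      <minCpuCores>4</minCpuCores>
      <minMainMemoryAvailableGb>8</minMainMemoryAvailableGb>
      <minDiskAvailableGb>128</minDiskAvailableGb>
    </item>
  </flavor>
</config>
```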

who is responsible for adding nodes to Node-Repository

Yes, that's the right API. You need to add the nodes by calling that API (manually or from some system which is able to provision nodes in your internal system). Note that if you use Docker tenant nodes you only need to add the Docker hosts that will hold those tenants.
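As a rough sketch of calling that API (the hostname, flavor name, and JSON field set are assumptions based on examples later in this thread and may differ between Vespa versions):

```shell
# Describe one node to register; hostname and flavor are illustrative.
payload='[{"hostname": "vespa-container-2", "openStackId": "vespa-container-2", "flavor": "large"}]'
echo "$payload"

# With a config server listening on localhost:19071, the node would be registered with:
# curl -X POST -H "Content-Type: application/json" \
#      -d "$payload" http://localhost:19071/nodes/v2/node
```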

knowmara commented 6 years ago

@bratseth Thanks so much. After I configured the flavors in the config server's services.xml and added nodes via /nodes/v2/node (the host staging-001.vm is one of them), I encountered the below problem when I continued deploying my application.

Invalid application package: default.default: Error loading model: Multiple services/instances of the id 'staging-001.vm' under the service/instance HostSystem 'hosts'. (This is commonly caused by service/node index collisions in the config.). Existing instance: host 'staging-001.vm' Attempted to add: host 'null'

I have started up a config-server on staging-001.vm. So is config-server conflicting with a service/instance?

bratseth commented 6 years ago

Yes. In multitenant mode you cannot use the config server as a tenant node.

knowmara commented 6 years ago

Okay, thanks.

knowmara commented 6 years ago

Sorry, I have some questions to ask.

  1. I tried to add a node that is isolated from the other nodes' network, and found that this node can still be assigned to the application I deploy next. That is to say, config-server does not check new additions. When a service defined in services.xml has not started for long enough (because of the network isolation, the config-proxy on the node in question cannot receive the command), does config-server re-assign nodes to the application, assuming there are other free nodes?
  2. The addition of new nodes requires external systems; is the discovery and removal of failed nodes also done by external systems (via DELETE /nodes/v2/node/${hostname})?
bratseth commented 6 years ago
  1. Yes, if a node fails/disappears from the network it will be replaced if there are replacements available, but this is done by a background job in the config server (NodeFailer) which by default runs every half hour.
  2. The system heartbeats all the nodes so such failure discovery is automatic. (We have additional checking which is based on node-local hardware checks, which updates the node repo on problems - these are not open sourced.)
bratseth commented 6 years ago

Failed nodes are just moved to the "failed" state. Deletion is manual yes.

knowmara commented 6 years ago

Are the heartbeats done by config-proxy, or just by pinging the IP?

bratseth commented 6 years ago

There's a separate service (slobrok) which monitors the liveness of all services by heartbeating. That information ends up in the node failer.

That's for active nodes (running an application and thus having services). Ready nodes are not assigned to an application and thus are not running any services. For those, the config servers keep track of when they last requested config through the config proxy (the ready nodes stay in a pattern where they request config telling them which services to start).

knowmara commented 6 years ago

Thanks for your detailed answer; I will close the issue. Thanks.

paolodedios commented 6 years ago

@bratseth I was able to set up a multi-node system using one configserver and four service nodes via Docker containers on a single physical host. The config server node seems to start up without incident, and I added nodes via the API path '/nodes/v2/node'. But every time I deploy an application I get the following error message.

Request failed. HTTP status code: 400
Could not satisfy request for 2 nodes of flavor 'large' for container cluster 'container' group 0 6.240 in default.default.

It seems to be reporting an OUT_OF_CAPACITY error according to the configserver log snippet detailed below.

[2018-08-07 06:22:24.675] INFO    : configserver     Container.org.eclipse.jetty.server.AbstractConnector       Started configserver@66a50c93{HTTP/1.1,[http/1.1]}{0.0.0.0:19071}
[2018-08-07 06:22:24.675] INFO    : configserver     Container.org.eclipse.jetty.server.Server  Started @6780ms
[2018-08-07 06:22:24.675] INFO    : configserver     Container.com.yahoo.container.jdisc.ConfiguredApplication  Switching to the latest deployed set of configurations and components. Application switch number: 0
[2018-08-07 06:24:34.302] INFO    : configserver     Container.com.yahoo.vespa.hosted.provision.persistence.CuratorDatabaseClient       Added provisioned node vespa-container-2
[2018-08-07 06:24:39.291] INFO    : configserver     Container.com.yahoo.vespa.hosted.provision.persistence.CuratorDatabaseClient       Added provisioned node vespa-container-3
[2018-08-07 06:24:43.576] INFO    : configserver     Container.com.yahoo.vespa.hosted.provision.persistence.CuratorDatabaseClient       Added provisioned node vespa-container-4
[2018-08-07 06:24:47.726] INFO    : configserver     Container.com.yahoo.vespa.hosted.provision.persistence.CuratorDatabaseClient       Added provisioned node vespa-container-5
[2018-08-07 06:24:52.006] INFO    : configserver     Container.com.yahoo.vespa.hosted.provision.persistence.CuratorDatabaseClient       Added provisioned node vespa-container-6
[2018-08-07 06:25:05.364] INFO    : configserver     Container.com.yahoo.vespa.config.server.http.HttpErrorResponse     Returning response with response code 400, error-code:OUT_OF_CAPACITY, message=Could not satisfy request for 2 nodes of flavor 'large' for container cluster 'container' group 0 6.240 in default.default.

I added the following flavor configuration to the configserver-app/services.xml file:

<?xml version="1.0" encoding="utf-8" ?>
<!-- Copyright 2017 Yahoo Holdings. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. -->
<services version="1.0" xmlns:preprocess="properties">
  <jdisc id="configserver" jetty="true" version="1.0">
    <config name="config.provisioning.node-repository">
      <dockerImage>vespaengine/vespa</dockerImage>
    </config>
    <config name="config.provisioning.flavors">
      <flavor>
        <item>
          <name>large</name>
          <environment>DOCKER_CONTAINER</environment>
          <minCpuCores>4</minCpuCores>
          <minMainMemoryAvailableGb>4.5</minMainMemoryAvailableGb>
          <minDiskAvailableGb>128</minDiskAvailableGb>
        </item>
      </flavor>
    </config>
    <config name="container.handler.threadpool">
      <maxthreads>100</maxthreads> <!-- Reduced thread count to minimize memory consumption -->
    </config>
  </jdisc>
</services>

And I modified the basic-search application services.xml configuration to use the node-repository.

<?xml version="1.0" encoding="utf-8" ?>
<!-- Copyright 2017 Yahoo Holdings. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. -->
<services version="1.0">
  <container id="container" version="1.0">

    <component id="com.yahoo.vespa.hosted.provision.provisioning.NodeRepositoryProvisioner" bundle="node-repository" />
    <component id="NodeRepository" class="com.yahoo.vespa.hosted.provision.NodeRepository" bundle="node-repository"/>
    <component id="com.yahoo.vespa.hosted.provision.maintenance.NodeRepositoryMaintenance" bundle="node-repository"/>
    <component id="com.yahoo.config.provision.NodeFlavors" bundle="config-provisioning" />

    <rest-api path="hack" jersey2="true">
        <components bundle="node-repository" />
    </rest-api>

    <handler id="com.yahoo.vespa.hosted.provision.restapi.v2.NodesApiHandler" bundle="node-repository">
        <binding>http://*/nodes/v2/*</binding>
        <binding>https://*/nodes/v2/*</binding>
    </handler>

    <document-api />
    <search />
    <nodes count="2" flavor="large"/>
  </container>

  <content id="music" version="1.0">
    <redundancy>1</redundancy>
    <documents>
      <document type="music" mode="index" />
    </documents>
  </content>

</services>

The API reports that the nodes I added are in the 'provisioned' state but I can't tell if it needs to be made 'ready' somehow before deploying an application. I have confirmed that the service nodes have connections to the configserver on port 19070.

$ docker exec -it vespa1 curl -X GET http://localhost:19071/nodes/v2/state/provisioned
{"nodes":[{"url":"http://localhost:19071/nodes/v2/node/vespa-container-5"},{"url":"http://localhost:19071/nodes/v2/node/vespa-container-4"},{"url":"http://localhost:19071/nodes/v2/node/vespa-container-6"},{"url":"http://localhost:19071/nodes/v2/node/vespa-container-3"},{"url":"http://localhost:19071/nodes/v2/node/vespa-container-2"}]}

Are there some steps I am missing here?

bratseth commented 6 years ago

@paolodedios do you really need to do this, is it the same project? It is only relevant if you want to set up your own multi-tenant PaaS cloud, and we don't have end user documentation for that. You don't need it to create a multi-node application.

2 problems here:

paolodedios commented 6 years ago

@bratseth I am exploring how to make Vespa a more Kubernetes-native application so that it fits within our environment. I would like to deploy Vespa as Kubernetes pods that can be managed completely through kube-native tooling. Kubernetes abstracts away node and network management enough that it's cumbersome to do everything through configuration files, and I would prefer to automate the IP address discovery and environment configuration process through API calls. For Vespa, this means finding ways of configuring node lists dynamically as pods are brought up and down by Kubernetes' cluster autoscaler or its automated rolling-update & deployment system. Vespa's node-repository and node API seem like a good fit for integrating with Kubernetes, and your per-node host-admin can be automated via Kubernetes to mark nodes provisioned/dirty/ready/inactive.

bratseth commented 6 years ago

Ok, yes that makes sense. Try to move those components to the configserver app then. As I said you also need to move the nodes to ready after provisioning, but if you use Kubernetes our host-admin is probably not the way to go. If you know they are ready just move them immediately after provisioning, using the same web service API. Lmk if this helps.
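As a sketch of that step (hostnames are illustrative, and the exact state transitions may vary between Vespa versions - check whether provisioned nodes must pass through dirty before ready in yours):

```shell
# Move previously provisioned nodes to ready via the same /nodes/v2 API.
# This dry-run prints what it would do; uncomment the curl to actually call it.
for host in vespa-container-2 vespa-container-3; do
  echo "Would mark $host ready"
  # curl -X PUT http://localhost:19071/nodes/v2/state/ready/$host
done
```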

paolodedios commented 6 years ago

@bratseth Thank you for the tip. It is very helpful.

Turns out that those NodeRepository declarations do not need to be imported to configserver-app/services.xml since node-repository.xml is declared as part of a directive.

Also, since the internal node-admin provisions containers for nodes with "flavor=DOCKER_CONTAINER", I've had to change the configuration to "flavor=VIRTUAL_MACHINE" or "flavor=BARE_METAL" in order to successfully deploy an app to nodes. Are there any side-effects to declaring nodes with those flavors?

bratseth commented 6 years ago

No, that is fine.