Docker swarm hatchery "internal server error (caused by: internal server error)"

Hello guys,

I have CDS version 0.42 installed in "binary" mode on my CentOS 7 machine.

My goal was to switch from hatchery-local to hatchery-swarm. However, this seems to be impossible in binary mode (I have managed to set it up in CDS Docker mode). I would like to keep my binary-mode setup rather than migrate to Docker.

I used the DEMO project model.

Below is my CDS hatchery swarm configuration.

# Hatchery Swarm. Doc: https://ovh.github.io/cds/docs/integrations/swarm/
  [hatchery.swarm]

    # Worker default memory in MB
    defaultMemory = 1024

    # Docker Options. --add-host and --privileged supported. Example: dockerOpts="--add-host=myhost:x.x.x.x,myhost2:y.y.y.y --privileged"
    # dockerOpts = ""

    # Max Containers on Host managed by this Hatchery
    maxContainers = 10

    # if true: hatchery creates private network between services with ipv6 enabled
    networkEnableIPv6 = false

    # Percent reserved for spawning worker with service requirement
    ratioService = 75

    # Worker TTL (minutes)
    workerTTL = 10

    [hatchery.swarm.commonConfiguration]

      # Name of Hatchery
      name = "main-swarm"

      # URL of this Hatchery
      url = "http://localhost:8889"

      [hatchery.swarm.commonConfiguration.api]

        # Maximum allowed consecutive failures on heartbeat routine
        maxHeartbeatFailures = 10

        # Request CDS API: timeout in seconds
        requestTimeout = 10

        # CDS Token to reach CDS API. See https://ovh.github.io/cds/docs/components/cdsctl/token/ 
        token = "top_secret_token"

        [hatchery.swarm.commonConfiguration.api.grpc]

          # sslInsecureSkipVerify, set to true if you use a self-signed SSL on CDS API
          # insecure = false
          url = "http://localhost:8882"

        [hatchery.swarm.commonConfiguration.api.http]

          # sslInsecureSkipVerify, set to true if you use a self-signed SSL on CDS API
          # insecure = false

          # CDS API URL
          url = "http://localhost:8881"

      ######################
      # CDS Hatchery HTTP Configuration 
      #######################
      [hatchery.swarm.commonConfiguration.http]

        # Listen address without port, example: 127.0.0.1
        # addr = ""
        port = 8889

      # Hatchery Log Configuration
      [hatchery.swarm.commonConfiguration.logOptions]

        [hatchery.swarm.commonConfiguration.logOptions.spawnOptions]

          # log critical if spawn takes more than this value (in seconds)
          thresholdCritical = 480

          # log warning if spawn takes more than this value (in seconds)
          thresholdWarning = 360

      [hatchery.swarm.commonConfiguration.provision]

        # Disable provisioning. Format: true or false
        disabled = false

        # Check provisioning each n Seconds
        frequency = 30

        # if worker is queued less than this value (seconds), hatchery does not take care of it
        graceTimeQueued = 4

        # Maximum allowed simultaneous workers provisioning
        maxConcurrentProvisioning = 10

        # Maximum allowed simultaneous workers registering. -1 to disable registering on this hatchery
        maxConcurrentRegistering = 2

        # Maximum allowed simultaneous workers
        maxWorker = 10

        # Check if some worker model have to be registered each n Seconds
        registerFrequency = 60

        # Worker Log Configuration
        [hatchery.swarm.commonConfiguration.provision.workerLogsOptions]

          [hatchery.swarm.commonConfiguration.provision.workerLogsOptions.graylog]

            # Example: X-OVH-TOKEN. You can use many keys: aaa,bbb
            extraKey = ""

            # value for extraKey field. For many keys: valueaaa,valuebbb
            extraValue = ""

            # Example: thot.ovh.com
            host = ""

            # Example: 12202
            port = 0

            # tcp or udp
            protocol = "tcp"

    # List of Docker Engines
    [hatchery.swarm.dockerEngines]

      [hatchery.swarm.dockerEngines.sample-docker-engine]

        # DOCKER_API_VERSION
        APIVersion = ""

        # content of your ca.pem
        TLSCAPEM = ""

        # content of your cert.pem
        TLSCERTPEM = ""

        # content of your key.pem
        TLSKEYPEM = ""

        # DOCKER_CERT_PATH
        certPath = ""

        # DOCKER_HOST
        host = "tcp://localhost:2375"

        # DOCKER_INSECURE_SKIP_TLS_VERIFY
        insecureSkipTLSVerify = true

        # Max Containers on Host managed by this Hatchery
        maxContainers = 10
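
One thing that may be worth checking in the configuration above (a guess, not a confirmed diagnosis): every URL in it points at localhost. Inside a container on Docker's default bridge network, localhost is the container itself, not the host, so a worker that is handed http://localhost:8881 may be unable to reach the CDS API. The sketch below lists such values; `loopback_urls` and the `sample` text are hypothetical helpers written for this post, not part of CDS.

```python
import re
from urllib.parse import urlparse

# Hosts that resolve to the local network namespace.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def loopback_urls(config_text):
    """Return (key, url) pairs whose URL host is a loopback address."""
    found = []
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip commented-out settings
        match = re.match(r'(\w+)\s*=\s*"([a-z]+://[^"]+)"', stripped)
        if match is None:
            continue  # not a key = "scheme://..." line
        key, url = match.groups()
        if urlparse(url).hostname in LOOPBACK_HOSTS:
            found.append((key, url))
    return found


# A few lines copied from the configuration above.
sample = '''
name = "main-swarm"
url = "http://localhost:8889"
# insecure = false
url = "http://localhost:8881"
host = "tcp://localhost:2375"
'''

for key, url in loopback_urls(sample):
    print(f"{key} -> {url}")
```

If this turns out to be the cause, the values to change would be the ones a container has to reach, not necessarily all of them.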

And logs from the hatchery swarm service:

Starting service hatchery:swarm
2019-12-14 17:29:31 [INFO] hatchery:swarm> Service registered
2019-12-14 17:29:31 [INFO] main-swarm> Starting service main-swarm (0.42.0+cds.11338)...
2019-12-14 17:29:31 [DEBUG] main-swarm> Router initialized
2019-12-14 17:29:31 [INFO] Registering handler api.VersionHandler on GET /mon/version
2019-12-14 17:29:31 [INFO] Registering handler hatchery.getStatusHandler.1 on GET /mon/status
2019-12-14 17:29:31 [INFO] Registering handler hatchery.getWorkersPoolHandler.1 on GET /mon/workers
2019-12-14 17:29:31 [INFO] Registering handler api/observability.StatsHandler on GET /mon/metrics
2019-12-14 17:29:31 [INFO] Registering handler hatchery.(*Common).getPanicDumpListHandler on GET /mon/errors
2019-12-14 17:29:31 [INFO] Registering handler hatchery.(*Common).getPanicDumpHandler on GET /mon/errors/{id}
2019-12-14 17:29:31 [INFO] hatchery> Stats initialized on cds-hatchery-swarm
2019-12-14 17:29:31 [INFO] main-swarm> Starting HTTP Server on port 8889
2019-12-14 17:29:31 [INFO] hatchery> swarm> connecting to sample-docker-engine: tcp://localhost:2375
2019-12-14 17:29:31 [INFO] hatchery> swarm> connected to sample-docker-engine (tcp://localhost:2375)
2019-12-14 17:30:01 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:30:02 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:30:02 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:30:02 [DEBUG] hatchery> no model
2019-12-14 17:30:11 [DEBUG] job 479 already spawned in previous routine
2019-12-14 17:30:31 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:30:31 [DEBUG] hatchery> workerRegister> need register
2019-12-14 17:30:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:30:31 [DEBUG] Serve>CommonServe>Create>workerRegister>WorkerModelBook: Worker Model already booked (caused by: cannot book model go-official-1.11.4-stretch with id 3: Worker Model already booked)
2019-12-14 17:30:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:30:31 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:30:31 [DEBUG] hatchery> no model
2019-12-14 17:30:51 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:30:51 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:30:51 [DEBUG] hatchery> no model
2019-12-14 17:31:01 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:31:11 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:31:11 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:31:11 [DEBUG] hatchery> no model
2019-12-14 17:31:31 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:31:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:31:31 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:31:31 [DEBUG] hatchery> no model
2019-12-14 17:31:31 [DEBUG] hatchery> workerRegister> need register
2019-12-14 17:31:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:31:31 [DEBUG] Serve>CommonServe>Create>workerRegister>WorkerModelBook: Worker Model already booked (caused by: cannot book model go-official-1.11.4-stretch with id 3: Worker Model already booked)
2019-12-14 17:31:51 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:31:51 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:31:51 [DEBUG] hatchery> no model
2019-12-14 17:32:01 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:32:11 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:32:11 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:32:11 [DEBUG] hatchery> no model
2019-12-14 17:32:31 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:32:31 [DEBUG] hatchery> workerRegister> need register
2019-12-14 17:32:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:32:31 [DEBUG] Serve>CommonServe>Create>workerRegister>WorkerModelBook: Worker Model already booked (caused by: cannot book model go-official-1.11.4-stretch with id 3: Worker Model already booked)
2019-12-14 17:32:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:32:31 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:32:31 [DEBUG] hatchery> no model
2019-12-14 17:32:51 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:32:51 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:32:51 [DEBUG] hatchery> no model
2019-12-14 17:33:01 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:33:11 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:33:11 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:33:11 [DEBUG] hatchery> no model
2019-12-14 17:33:31 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:33:31 [DEBUG] hatchery> workerRegister> need register
2019-12-14 17:33:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:33:31 [DEBUG] Serve>CommonServe>Create>workerRegister>WorkerModelBook: Worker Model already booked (caused by: cannot book model go-official-1.11.4-stretch with id 3: Worker Model already booked)
2019-12-14 17:33:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:33:31 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:33:31 [DEBUG] hatchery> no model
2019-12-14 17:33:51 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:33:51 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:33:51 [DEBUG] hatchery> no model
2019-12-14 17:34:01 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:34:11 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:34:11 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:34:11 [DEBUG] hatchery> no model
2019-12-14 17:34:31 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:34:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:34:31 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:34:31 [DEBUG] hatchery> no model
2019-12-14 17:34:31 [DEBUG] hatchery> workerRegister> need register
2019-12-14 17:34:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:34:31 [INFO] hatchery> workerRegister> spawning model go-official-1.11.4-stretch (3)
2019-12-14 17:34:31 [DEBUG] Spawning worker for register model go-official-1.11.4-stretch
2019-12-14 17:34:31 [DEBUG] hatchery> swarm> SpawnWorker> Spawning worker register-swarmy-go-official-1.11.4-stretch-pedantic-galois - spawn for register
2019-12-14 17:34:31 [INFO] hatchery> swarm> createAndStartContainer> Create container register-swarmy-go-official-1.11.4-stretch-pedantic-galois on sample-docker-engine from golang:1.11.4-stretch (memory=128MB)
2019-12-14 17:34:41 [DEBUG] hatchery> swarm> killAwolWorker> Delete worker /register-swarmy-go-official-1.11.4-stretch-pedantic-galois on sample-docker-engine
2019-12-14 17:34:41 [DEBUG] checking last registration date of go-official-1.11.4-stretch: 0001-01-01 01:24:00 +0124 WMT (true)
2019-12-14 17:34:41 [ERROR] hatchery> swarm> killAndRemove> error on call client.WorkerModelSpawnError on worker model 3 for register: GoRoutine>routines>killAwolWorker>killAndRemove>WorkerModelSpawnError: internal server error (caused by: internal server error)
2019-12-14 17:34:41 [DEBUG] hatchery> swarm> killAndRemove> remove container 0c6978008eec538baf93110add89459e85a8194b8e10754bf413806010236008 on sample-docker-engine
2019-12-14 17:34:51 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:34:51 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:34:51 [DEBUG] hatchery> no model
2019-12-14 17:35:01 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:35:11 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:35:11 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:35:11 [DEBUG] hatchery> no model
2019-12-14 17:35:31 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:35:31 [DEBUG] hatchery> workerRegister> need register
2019-12-14 17:35:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:35:31 [DEBUG] Serve>CommonServe>Create>workerRegister>WorkerModelBook: Worker Model already booked (caused by: cannot book model go-official-1.11.4-stretch with id 3: Worker Model already booked)
2019-12-14 17:35:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:35:31 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:35:31 [DEBUG] hatchery> no model
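
Most of the log above is the same DEBUG loop repeated, so the one ERROR entry is easy to miss. A minimal sketch for pulling out entries of a given level, assuming every line follows the `date time [LEVEL] message` layout shown above (`lines_with_level` is a hypothetical helper, not a CDS tool):

```python
import re

# Layout seen in the hatchery log: "YYYY-MM-DD HH:MM:SS [LEVEL] message".
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?P<level>[A-Z]+)\] (?P<msg>.*)$"
)


def lines_with_level(log_text, level):
    """Return (timestamp, message) pairs for log lines with the given level."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match and match.group("level") == level:
            hits.append((match.group("ts"), match.group("msg")))
    return hits


# Three consecutive lines copied from the log above.
excerpt = """\
2019-12-14 17:34:41 [DEBUG] hatchery> swarm> killAwolWorker> Delete worker /register-swarmy-go-official-1.11.4-stretch-pedantic-galois on sample-docker-engine
2019-12-14 17:34:41 [ERROR] hatchery> swarm> killAndRemove> error on call client.WorkerModelSpawnError on worker model 3 for register: GoRoutine>routines>killAwolWorker>killAndRemove>WorkerModelSpawnError: internal server error (caused by: internal server error)
2019-12-14 17:34:41 [DEBUG] hatchery> swarm> killAndRemove> remove container 0c6978008eec538baf93110add89459e85a8194b8e10754bf413806010236008 on sample-docker-engine
"""

for ts, msg in lines_with_level(excerpt, "ERROR"):
    print(ts, msg)
```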

Docker events:

2019-12-14T17:34:31.538431054+01:00 container create 0c6978008eec538baf93110add89459e85a8194b8e10754bf413806010236008 (hatchery=main-swarm, image=golang:1.11.4-stretch, name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, worker_model=3, worker_name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, worker_requirements=)
2019-12-14T17:34:31.559624337+01:00 network connect 2f3cabee94e3b1a40658665815e771eb8ba5481afc842b091632a1e16de0fb1d (container=0c6978008eec538baf93110add89459e85a8194b8e10754bf413806010236008, name=bridge, type=bridge)
2019-12-14T17:34:31.785275926+01:00 container start 0c6978008eec538baf93110add89459e85a8194b8e10754bf413806010236008 (hatchery=main-swarm, image=golang:1.11.4-stretch, name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, worker_model=3, worker_name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, worker_requirements=)
2019-12-14T17:34:31.963100183+01:00 container die 0c6978008eec538baf93110add89459e85a8194b8e10754bf413806010236008 (exitCode=7, hatchery=main-swarm, image=golang:1.11.4-stretch, name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, worker_model=3, worker_name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, worker_requirements=)
2019-12-14T17:34:32.034591272+01:00 network disconnect 2f3cabee94e3b1a40658665815e771eb8ba5481afc842b091632a1e16de0fb1d (container=0c6978008eec538baf93110add89459e85a8194b8e10754bf413806010236008, name=bridge, type=bridge)
2019-12-14T17:34:41.479266413+01:00 container destroy 0c6978008eec538baf93110add89459e85a8194b8e10754bf413806010236008 (hatchery=main-swarm, image=golang:1.11.4-stretch, name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, worker_model=3, worker_name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, worker_requirements=)
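
The `container die` event above carries `exitCode=7`, and the container dies about 0.2 seconds after it starts. A minimal sketch for extracting exit codes from `docker events` output, assuming the `<time> <type> <action> <id> (key=value, ...)` layout shown above (`parse_event` and `exit_codes` are hypothetical helpers). If the model's start command downloads the worker binary with curl, which is an assumption and not something shown in this post, 7 would be curl's "Failed to connect to host" exit code.

```python
import re

# Layout seen in the events above: "<time> <type> <action> <id> (k=v, k=v, ...)".
EVENT = re.compile(
    r"^(?P<time>\S+) (?P<type>\S+) (?P<action>\S+) (?P<id>[0-9a-f]+) \((?P<attrs>.*)\)$"
)


def parse_event(line):
    """Parse one event line into a dict, or return None if it does not match."""
    match = EVENT.match(line.strip())
    if match is None:
        return None
    attrs = {}
    for pair in match.group("attrs").split(", "):
        key, _, value = pair.partition("=")
        attrs[key] = value
    event = match.groupdict()
    event["attrs"] = attrs
    return event


def exit_codes(lines):
    """Yield (container name, exit code) for every 'container die' event."""
    for line in lines:
        event = parse_event(line)
        if event and event["type"] == "container" and event["action"] == "die":
            yield event["attrs"].get("name"), int(event["attrs"]["exitCode"])


# The die event copied from the output above.
die_line = (
    "2019-12-14T17:34:31.963100183+01:00 container die "
    "0c6978008eec538baf93110add89459e85a8194b8e10754bf413806010236008 "
    "(exitCode=7, hatchery=main-swarm, image=golang:1.11.4-stretch, "
    "name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, "
    "worker_model=3, "
    "worker_name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, "
    "worker_requirements=)"
)

for name, code in exit_codes([die_line]):
    print(name, code)
```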

I don't understand why the containers die immediately after being created. What should be done to get this working?

ovh/cds #4827