ovh / cds

Enterprise-Grade Continuous Delivery & DevOps Automation Open Source Platform
https://ovh.github.io/cds/
BSD 3-Clause "New" or "Revised" License

Docker swarm hatchery "internal server error (caused by: internal server error)" #4827

Closed: AndrzejA35 closed this issue 4 years ago

AndrzejA35 commented 4 years ago

Hello guys,

I have CDS version 0.42 installed in "binary" mode on my CentOS 7 machine.

My goal was to switch from hatchery-local to hatchery-swarm. However, this seems to be impossible in binary mode (I have managed to set it up in CDS Docker mode). I would like to keep my configuration in binary mode rather than migrate to Docker.

I used the DEMO project model.

Below is my CDS hatchery swarm configuration.

# Hatchery Swarm. Doc: https://ovh.github.io/cds/docs/integrations/swarm/
  [hatchery.swarm]

    # Worker default memory in Mo
    defaultMemory = 1024

    # Docker Options. --add-host and --privileged supported. Example: dockerOpts="--add-host=myhost:x.x.x.x,myhost2:y.y.y.y --privileged"
    # dockerOpts = ""

    # Max Containers on Host managed by this Hatchery
    maxContainers = 10

    # if true: hatchery creates private network between services with ipv6 enabled
    networkEnableIPv6 = false

    # Percent reserved for spawning worker with service requirement
    ratioService = 75

    # Worker TTL (minutes)
    workerTTL = 10

    [hatchery.swarm.commonConfiguration]

      # Name of Hatchery
      name = "main-swarm"

      # URL of this Hatchery
      url = "http://localhost:8889"

      [hatchery.swarm.commonConfiguration.api]

        # Maximum allowed consecutives failures on heatbeat routine
        maxHeartbeatFailures = 10

        # Request CDS API: timeout in seconds
        requestTimeout = 10

        # CDS Token to reach CDS API. See https://ovh.github.io/cds/docs/components/cdsctl/token/ 
        token = "top_secret_token"

        [hatchery.swarm.commonConfiguration.api.grpc]

          # sslInsecureSkipVerify, set to true if you use a self-signed SSL on CDS API
          # insecure = false
          url = "http://localhost:8882"

        [hatchery.swarm.commonConfiguration.api.http]

          # sslInsecureSkipVerify, set to true if you use a self-signed SSL on CDS API
          # insecure = false

          # CDS API URL
          url = "http://localhost:8881"

      ######################
      # CDS Hatchery HTTP Configuration 
      #######################
      [hatchery.swarm.commonConfiguration.http]

        # Listen address without port, example: 127.0.0.1
        # addr = ""
        port = 8889

      # Hatchery Log Configuration
      [hatchery.swarm.commonConfiguration.logOptions]

        [hatchery.swarm.commonConfiguration.logOptions.spawnOptions]

          # log critical if spawn take more than this value (in seconds)
          thresholdCritical = 480

          # log warning if spawn take more than this value (in seconds)
          thresholdWarning = 360

      [hatchery.swarm.commonConfiguration.provision]

        # Disabled provisioning. Format:true or false
        disabled = false

        # Check provisioning each n Seconds
        frequency = 30

        # if worker is queued less than this value (seconds), hatchery does not take care of it
        graceTimeQueued = 4

        # Maximum allowed simultaneous workers provisioning
        maxConcurrentProvisioning = 10

        # Maximum allowed simultaneous workers registering. -1 to disable registering on this hatchery
        maxConcurrentRegistering = 2

        # Maximum allowed simultaneous workers
        maxWorker = 10

        # Check if some worker model have to be registered each n Seconds
        registerFrequency = 60

        # Worker Log Configuration
        [hatchery.swarm.commonConfiguration.provision.workerLogsOptions]

          [hatchery.swarm.commonConfiguration.provision.workerLogsOptions.graylog]

            # Example: X-OVH-TOKEN. You can use many keys: aaa,bbb
            extraKey = ""

            # value for extraKey field. For many keys: valueaaa,valuebbb
            extraValue = ""

            # Example: thot.ovh.com
            host = ""

            # Example: 12202
            port = 0

            # tcp or udp
            protocol = "tcp"

    # List of Docker Engines
    [hatchery.swarm.dockerEngines]

      [hatchery.swarm.dockerEngines.sample-docker-engine]

        # DOCKER_API_VERSION
        APIVersion = ""

        # content of your ca.pem
        TLSCAPEM = ""

        # content of your cert.pem
        TLSCERTPEM = ""

        # content of your key.pem
        TLSKEYPEM = ""

        # DOCKER_CERT_PATH
        certPath = ""

        # DOCKER_HOST
        host = "tcp://localhost:2375"

        # DOCKER_INSECURE_SKIP_TLS_VERIFY
        insecureSkipTLSVerify = true

        # Max Containers on Host managed by this Hatchery
        maxContainers = 10

And logs from the hatchery swarm service:

Starting service hatchery:swarm
2019-12-14 17:29:31 [INFO] hatchery:swarm> Service registered
2019-12-14 17:29:31 [INFO] main-swarm> Starting service main-swarm (0.42.0+cds.11338)...
2019-12-14 17:29:31 [DEBUG] main-swarm> Router initialized
2019-12-14 17:29:31 [INFO] Registering handler api.VersionHandler on GET /mon/version
2019-12-14 17:29:31 [INFO] Registering handler hatchery.getStatusHandler.1 on GET /mon/status
2019-12-14 17:29:31 [INFO] Registering handler hatchery.getWorkersPoolHandler.1 on GET /mon/workers
2019-12-14 17:29:31 [INFO] Registering handler api/observability.StatsHandler on GET /mon/metrics
2019-12-14 17:29:31 [INFO] Registering handler hatchery.(*Common).getPanicDumpListHandler on GET /mon/errors
2019-12-14 17:29:31 [INFO] Registering handler hatchery.(*Common).getPanicDumpHandler on GET /mon/errors/{id}
2019-12-14 17:29:31 [INFO] hatchery> Stats initialized on cds-hatchery-swarm
2019-12-14 17:29:31 [INFO] main-swarm> Starting HTTP Server on port 8889
2019-12-14 17:29:31 [INFO] hatchery> swarm> connecting to sample-docker-engine: tcp://localhost:2375
2019-12-14 17:29:31 [INFO] hatchery> swarm> connected to sample-docker-engine (tcp://localhost:2375)
2019-12-14 17:30:01 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:30:02 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:30:02 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:30:02 [DEBUG] hatchery> no model
2019-12-14 17:30:11 [DEBUG] job 479 already spawned in previous routine
2019-12-14 17:30:31 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:30:31 [DEBUG] hatchery> workerRegister> need register
2019-12-14 17:30:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:30:31 [DEBUG] Serve>CommonServe>Create>workerRegister>WorkerModelBook: Worker Model already booked (caused by: cannot book model go-official-1.11.4-stretch with id 3: Worker Model already booked)
2019-12-14 17:30:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:30:31 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:30:31 [DEBUG] hatchery> no model
2019-12-14 17:30:51 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:30:51 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:30:51 [DEBUG] hatchery> no model
2019-12-14 17:31:01 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:31:11 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:31:11 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:31:11 [DEBUG] hatchery> no model
2019-12-14 17:31:31 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:31:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:31:31 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:31:31 [DEBUG] hatchery> no model
2019-12-14 17:31:31 [DEBUG] hatchery> workerRegister> need register
2019-12-14 17:31:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:31:31 [DEBUG] Serve>CommonServe>Create>workerRegister>WorkerModelBook: Worker Model already booked (caused by: cannot book model go-official-1.11.4-stretch with id 3: Worker Model already booked)
2019-12-14 17:31:51 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:31:51 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:31:51 [DEBUG] hatchery> no model
2019-12-14 17:32:01 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:32:11 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:32:11 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:32:11 [DEBUG] hatchery> no model
2019-12-14 17:32:31 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:32:31 [DEBUG] hatchery> workerRegister> need register
2019-12-14 17:32:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:32:31 [DEBUG] Serve>CommonServe>Create>workerRegister>WorkerModelBook: Worker Model already booked (caused by: cannot book model go-official-1.11.4-stretch with id 3: Worker Model already booked)
2019-12-14 17:32:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:32:31 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:32:31 [DEBUG] hatchery> no model
2019-12-14 17:32:51 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:32:51 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:32:51 [DEBUG] hatchery> no model
2019-12-14 17:33:01 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:33:11 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:33:11 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:33:11 [DEBUG] hatchery> no model
2019-12-14 17:33:31 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:33:31 [DEBUG] hatchery> workerRegister> need register
2019-12-14 17:33:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:33:31 [DEBUG] Serve>CommonServe>Create>workerRegister>WorkerModelBook: Worker Model already booked (caused by: cannot book model go-official-1.11.4-stretch with id 3: Worker Model already booked)
2019-12-14 17:33:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:33:31 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:33:31 [DEBUG] hatchery> no model
2019-12-14 17:33:51 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:33:51 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:33:51 [DEBUG] hatchery> no model
2019-12-14 17:34:01 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:34:11 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:34:11 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:34:11 [DEBUG] hatchery> no model
2019-12-14 17:34:31 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:34:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:34:31 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:34:31 [DEBUG] hatchery> no model
2019-12-14 17:34:31 [DEBUG] hatchery> workerRegister> need register
2019-12-14 17:34:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:34:31 [INFO] hatchery> workerRegister> spawning model go-official-1.11.4-stretch (3)
2019-12-14 17:34:31 [DEBUG] Spawning worker for register model go-official-1.11.4-stretch
2019-12-14 17:34:31 [DEBUG] hatchery> swarm> SpawnWorker> Spawning worker register-swarmy-go-official-1.11.4-stretch-pedantic-galois - spawn for register
2019-12-14 17:34:31 [INFO] hatchery> swarm> createAndStartContainer> Create container register-swarmy-go-official-1.11.4-stretch-pedantic-galois on sample-docker-engine from golang:1.11.4-stretch (memory=128MB)
2019-12-14 17:34:41 [DEBUG] hatchery> swarm> killAwolWorker> Delete worker /register-swarmy-go-official-1.11.4-stretch-pedantic-galois on sample-docker-engine
2019-12-14 17:34:41 [DEBUG] checking last registration date of go-official-1.11.4-stretch: 0001-01-01 01:24:00 +0124 WMT (true)
2019-12-14 17:34:41 [ERROR] hatchery> swarm> killAndRemove> error on call client.WorkerModelSpawnError on worker model 3 for register: GoRoutine>routines>killAwolWorker>killAndRemove>WorkerModelSpawnError: internal server error (caused by: internal server error)
2019-12-14 17:34:41 [DEBUG] hatchery> swarm> killAndRemove> remove container 0c6978008eec538baf93110add89459e85a8194b8e10754bf413806010236008 on sample-docker-engine
2019-12-14 17:34:51 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:34:51 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:34:51 [DEBUG] hatchery> no model
2019-12-14 17:35:01 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:35:11 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:35:11 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:35:11 [DEBUG] hatchery> no model
2019-12-14 17:35:31 [DEBUG] hatchery> swarm> WorkersStartedByModel> go-official-1.11.4-stretch      0
2019-12-14 17:35:31 [DEBUG] hatchery> workerRegister> need register
2019-12-14 17:35:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:35:31 [DEBUG] Serve>CommonServe>Create>workerRegister>WorkerModelBook: Worker Model already booked (caused by: cannot book model go-official-1.11.4-stretch with id 3: Worker Model already booked)
2019-12-14 17:35:31 [DEBUG] hatchery> checkCapacities> 0.000 seconds elapsed
2019-12-14 17:35:31 [DEBUG] canRunJob> model go-official-1.11.4-stretch needs registration
2019-12-14 17:35:31 [DEBUG] hatchery> no model

Docker events:

2019-12-14T17:34:31.538431054+01:00 container create 0c6978008eec538baf93110add89459e85a8194b8e10754bf413806010236008 (hatchery=main-swarm, image=golang:1.11.4-stretch, name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, worker_model=3, worker_name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, worker_requirements=)
2019-12-14T17:34:31.559624337+01:00 network connect 2f3cabee94e3b1a40658665815e771eb8ba5481afc842b091632a1e16de0fb1d (container=0c6978008eec538baf93110add89459e85a8194b8e10754bf413806010236008, name=bridge, type=bridge)
2019-12-14T17:34:31.785275926+01:00 container start 0c6978008eec538baf93110add89459e85a8194b8e10754bf413806010236008 (hatchery=main-swarm, image=golang:1.11.4-stretch, name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, worker_model=3, worker_name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, worker_requirements=)
2019-12-14T17:34:31.963100183+01:00 container die 0c6978008eec538baf93110add89459e85a8194b8e10754bf413806010236008 (exitCode=7, hatchery=main-swarm, image=golang:1.11.4-stretch, name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, worker_model=3, worker_name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, worker_requirements=)
2019-12-14T17:34:32.034591272+01:00 network disconnect 2f3cabee94e3b1a40658665815e771eb8ba5481afc842b091632a1e16de0fb1d (container=0c6978008eec538baf93110add89459e85a8194b8e10754bf413806010236008, name=bridge, type=bridge)
2019-12-14T17:34:41.479266413+01:00 container destroy 0c6978008eec538baf93110add89459e85a8194b8e10754bf413806010236008 (hatchery=main-swarm, image=golang:1.11.4-stretch, name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, worker_model=3, worker_name=register-swarmy-go-official-1.11.4-stretch-pedantic-galois, worker_requirements=)

I don't understand why the containers die immediately after being created. What should be done to get this working?

AndrzejA35 commented 4 years ago

Hello again.

I have resolved this issue and would like to share the root cause so that everyone can benefit from my investigation.

The root cause was that the container couldn't reach the CDS API running on the host. At first I couldn't find proof, because the containers were constantly being killed and removed. I then shipped the container logs to Elastic, and thankfully Filebeat was able to catch all the logs from the container before it was removed. Those logs confirmed the connectivity issue.
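For anyone without a log pipeline, a rough way to check the same thing is to probe the API URL with curl from inside a container and read curl's exit status. The sketch below is only an illustration under assumptions: `explain_curl_exit` and `check_cds_api` are hypothetical helper names, not part of CDS, and only a few exit codes from the curl manual are covered. As a side note, the `exitCode=7` in the Docker events above would be consistent with curl's "failed to connect" status, assuming the worker model's command starts with a curl download like the one in the configuration shown later in this comment.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CDS) for checking API reachability.

# Translate a few curl exit codes (documented in the curl manual) into text.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    6)  echo "could not resolve host" ;;
    7)  echo "failed to connect to host" ;;
    28) echo "operation timed out" ;;
    *)  echo "curl exit code $1 (see man curl)" ;;
  esac
}

# Probe a base URL the way the worker download would reach it.
# $1: base URL, e.g. the CDS_API value given to the worker model.
# Without --fail, any HTTP response (even an error status) counts as reachable.
check_cds_api() {
  rc=0
  curl --silent --output /dev/null --max-time 5 "$1" || rc=$?
  explain_curl_exit "$rc"
}
```

Running `check_cds_api "$CDS_API"` from a shell inside a container on the same Docker engine should print `ok` when the API is reachable from the container's network, and one of the failure messages otherwise.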

Below is the final configuration of the worker model.

curl ${CDS_API}/download/worker/linux/$(uname -m) -o worker --retry 10 --retry-max-time 120 && chmod +x worker && exec ./worker

ENV VARIABLES:
...
CDS_API: http://dockerhost:XXXX
...

Bear in mind that the Docker container must be able to access your host. This can be done in many ways, depending on your configuration. I had done this previously for other containers by adding the Docker network interface to a trusted zone (CentOS) and adding the Docker host plus port forwarding.
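One possible way to wire this up, sketched here under assumptions rather than taken from my exact setup, is to map a name such as `dockerhost` to the host's address with the `--add-host` option that the `dockerOpts` comment in the hatchery configuration above says is supported. The helper names `bridge_gateway` and `docker_opts_line` are hypothetical, and the gateway of Docker's default `bridge` network is only usually the address at which containers see the host.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CDS) for preparing a dockerOpts value.

# Print the gateway of Docker's default bridge network. This is usually the
# host's address as seen from a container, but the field may be empty.
bridge_gateway() {
  docker network inspect bridge --format '{{range .IPAM.Config}}{{.Gateway}}{{end}}'
}

# Print a dockerOpts line for the [hatchery.swarm] section.
# $1: hostname the worker should use, $2: host IP as seen from containers.
docker_opts_line() {
  printf 'dockerOpts = "--add-host=%s:%s"\n' "$1" "$2"
}

# Example usage (assumes a running Docker daemon):
#   docker_opts_line dockerhost "$(bridge_gateway)"
#
# On CentOS with firewalld, the bridge interface can be placed in the trusted
# zone (the interface name docker0 is an assumption; check yours first):
#   firewall-cmd --permanent --zone=trusted --add-interface=docker0
#   firewall-cmd --reload
```

The CDS API must also listen on an address that is reachable from the bridge network, otherwise the name mapping alone will not help.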

I believe this issue can be closed.