tinkerbell / tink

Workflow Engine for provisioning Bare Metal
https://tinkerbell.org
Apache License 2.0

Containers not loading on target machine when deploying with newest tink (2021-04-09 commit) #480

Closed lukaskalvenas closed 3 years ago

lukaskalvenas commented 3 years ago

The target machine boots via PXE, successfully receives its predetermined DHCP lease, and then boots Alpine Linux. After logging in as "root" and running "docker ps -a", no containers ever start, so the deployment never begins.

I'm creating this issue because I've been using Tinkerbell since around August 2020 with the same templates and Docker images (only moving to newer and newer Tinkerbell versions) and have had no similar problems until now.

Expected Behaviour

One or more Docker containers (depending on the install template) should start and perform the deployment tasks.

Current Behaviour

I'm sharing part of my current configuration below for context:

Docker container output

CONTAINER ID   IMAGE                             COMMAND                  CREATED        STATUS                  PORTS                                  NAMES
9dc9265cd3b0   deploy_tink-cli                   "/bin/sh -c 'sleep i…"   25 hours ago   Up 24 hours                                                    deploy_tink-cli_1
6704e8678c2f   deploy_tink-server                "/usr/bin/tink-server"   25 hours ago   Up 24 hours (healthy)   0.0.0.0:42113-42114->42113-42114/tcp   deploy_tink-server_1
ab6551ff7563   quay.io/tinkerbell/hegel:latest   "/usr/bin/hegel"         25 hours ago   Up 24 hours                                                    deploy_hegel_1
ddb109c0f30b   quay.io/tinkerbell/boots:latest   "/usr/bin/boots -dhc…"   25 hours ago   Up 24 hours                                                    deploy_boots_1
a8a08abc5ff8   postgres:10-alpine                "docker-entrypoint.s…"   25 hours ago   Up 24 hours (healthy)   0.0.0.0:5432->5432/tcp                 deploy_db_1
4e7439ea5cc8   nginx:alpine                      "/docker-entrypoint.…"   25 hours ago   Up 24 hours             192.168.1.2:80->80/tcp                 deploy_nginx_1
81ada2cac06f   deploy_registry                   "/entrypoint.sh /etc…"   25 hours ago   Up 24 hours (healthy)                                          deploy_registry_1

Docker image output

root@tinkerbell-provisioner:~# docker images
REPOSITORY                       TAG            
192.168.1.1/install-root-fs      v1    
192.168.1.1/disk-partition       v1   
192.168.1.1/disk-wipe            v1      
192.168.1.1/tink-worker          latest   

Install template example No. 1

version: '0.1'
name: os-install
global_timeout: 6000
tasks:
- name: "os-installation"
  worker: "{{.device_1}}"
  volumes:
    - /dev:/dev
    - /dev/console:/dev/console
    - /lib/firmware:/lib/firmware:ro
  actions:
  - name: "disk-wipe"
    image: disk-wipe:v1
    timeout: 90
  - name: "disk-partition"
    image: disk-partition:v1
    timeout: 180
    environment:
       MIRROR_HOST: 192.168.1.1
    volumes:
      - /statedir:/statedir
  - name: "install-root-fs"
    image: install-root-fs:v1
    timeout: 600
    environment:
       MIRROR_HOST: 192.168.1.2
  - name: "install-grub"
    image: install-grub:v1
    timeout: 600
    environment:
       MIRROR_HOST: 192.168.1.2
    volumes:
      - /statedir:/statedir

Install template example No. 2

version: "0.1"
name: Windows_deployment
global_timeout: 1800
tasks:
  - name: "os-installation"
    worker: "{{.device_1}}"
    volumes:
      - /dev:/dev
      - /dev/console:/dev/console
      - /lib/firmware:/lib/firmware:ro
    actions:
      - name: "stream-windows-imageimage"
        image: image2disk:v1
        timeout: 600
        environment:
          DEST_DISK: /dev/sda
          IMG_URL: "http://192.168.1.2/misc/osie/current/windows_2012/tink-windows_2012.raw.gz"
          COMPRESSED: true
      - name: "reboot"
        image: reboot:v1
        timeout: 600
        environment:
           MIRROR_HOST: 192.168.1.2
        volumes:
          - /statedir:/statedir
          - /var/run/docker.sock:/var/run/docker.sock
          - /etc/docker:/etc/docker
          - /root:/root
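
A quick cross-check that may help here: both templates name their action images by short name (for example disk-wipe:v1), while the provisioner lists images under the 192.168.1.1/ prefix. The Python sketch below is only an illustration and not part of Tinkerbell. It pulls the image: values out of a template and reports any that have no matching row in a "docker images" listing. The function names are hypothetical, the template is trimmed to its action entries, and the sketch assumes the worker resolves each image against that registry prefix. Because the listing in this issue is only partial, a reported name may simply be missing from the paste rather than from the registry.

```python
import re


def template_images(template_text):
    """Return the values of all `image:` keys in a template, in order."""
    return re.findall(r"^\s*image:\s*(\S+)\s*$", template_text, flags=re.MULTILINE)


def listed_images(docker_images_output, registry):
    """Turn `docker images` rows for one registry prefix into 'name:tag' strings."""
    prefix = registry + "/"
    found = set()
    for line in docker_images_output.splitlines():
        parts = line.split()
        # Skip blank lines and the REPOSITORY/TAG header row.
        if len(parts) < 2 or parts[0] == "REPOSITORY":
            continue
        repo, tag = parts[0], parts[1]
        if repo.startswith(prefix):
            found.add(f"{repo[len(prefix):]}:{tag}")
    return found


def missing_images(template_text, docker_images_output, registry):
    """List template images that have no row in the listing (assumed prefix)."""
    available = listed_images(docker_images_output, registry)
    return [img for img in template_images(template_text) if img not in available]


# Action entries copied from install template No. 1 above (trimmed).
TEMPLATE = """\
actions:
- name: "disk-wipe"
  image: disk-wipe:v1
- name: "disk-partition"
  image: disk-partition:v1
- name: "install-root-fs"
  image: install-root-fs:v1
- name: "install-grub"
  image: install-grub:v1
"""

# Rows copied from the `docker images` output above.
LISTING = """\
REPOSITORY                       TAG
192.168.1.1/install-root-fs      v1
192.168.1.1/disk-partition       v1
192.168.1.1/disk-wipe            v1
192.168.1.1/tink-worker          latest
"""

if __name__ == "__main__":
    print(missing_images(TEMPLATE, LISTING, "192.168.1.1"))  # prints ['install-grub:v1']
```

Running the same check against template No. 2 would flag image2disk:v1 and reboot:v1 for the same reason, so it is worth confirming on the provisioner that every image a template references has actually been pushed.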

I'm also sharing log output from Alpine's Docker daemon:

time="2021-04-11T12:59:26.764239670Z" level=info msg="Loading containers: start."
time="2021-04-11T12:59:26.952756294Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
time="2021-04-11T12:59:26.979995781Z" level=info msg="Loading containers: done."
time="2021-04-11T12:59:26.984795421Z" level=info msg="Docker daemon" commit=48a66213fe1747e8873f849862ff3fb981899fc6 graphdriver(s)=overlay2 version=19.03.12
time="2021-04-11T12:59:26.984831305Z" level=info msg="Daemon has completed initialization"
time="2021-04-11T12:59:27.032140270Z" level=info msg="API listen on /var/run/docker.sock"
time="2021-04-11T12:59:29.588365517Z" level=info msg="Attempting next endpoint for pull after error: manifest unknown: manifest unknown"
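
Note that the only sign of trouble in this log, the "manifest unknown" pull failure on the last line, is recorded at level=info, so filtering the daemon log for level=error would miss it. In the Docker registry API, "manifest unknown" means the registry has no manifest for the requested image name and tag. As a minimal sketch (the helper names are hypothetical and the parsing is a simplified key=value reader, not an official Docker tool), the following Python scans daemon log text for pull-related errors regardless of level:

```python
import re

# key=value pairs; a value is either double-quoted (escapes allowed) or a bare token.
PAIR = re.compile(r'([^\s=]+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_line(line):
    """Parse one daemon log line into a dict of its key=value fields."""
    fields = {}
    for key, value in PAIR.findall(line):
        if value.startswith('"'):
            # Drop the surrounding quotes and unescape embedded quotes.
            value = value[1:-1].replace('\\"', '"')
        fields[key] = value
    return fields


def pull_failures(log_text):
    """Return (time, level, msg) for messages mentioning both 'pull' and 'error'."""
    hits = []
    for line in log_text.splitlines():
        fields = parse_line(line)
        msg = fields.get("msg", "")
        lowered = msg.lower()
        if "pull" in lowered and "error" in lowered:
            hits.append((fields.get("time", ""), fields.get("level", ""), msg))
    return hits


# Three lines copied from the daemon log above.
LOG = '''\
time="2021-04-11T12:59:26.979995781Z" level=info msg="Loading containers: done."
time="2021-04-11T12:59:27.032140270Z" level=info msg="API listen on /var/run/docker.sock"
time="2021-04-11T12:59:29.588365517Z" level=info msg="Attempting next endpoint for pull after error: manifest unknown: manifest unknown"
'''

if __name__ == "__main__":
    for when, level, msg in pull_failures(LOG):
        print(when, level, msg)
```

The log does not say which image the daemon tried to pull, so the next step would be to check on the provisioner which name and tag the worker requests.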

Steps to Reproduce (for bugs)

  1. Use the latest Tinkerbell (2021-04-09 commit).
  2. Contact me either here on GitHub or in the #tinkerbell channel on Equinix Metal's public Slack (my nickname is Luke), as I don't want to share too much Docker image and template information here. I'll gladly share more in a private conversation.

Your Environment

lukaskalvenas commented 3 years ago

Issue fixed by combining settings from sandbox.

EDIT: will be updated with findings.

jimmyat commented 3 years ago

@lukaskalvenas how did you end up fixing this? I'm getting the same problem.

lukaskalvenas commented 3 years ago

@jimmyat pull my fork of Tinkerbell from here: https://github.com/lukaskalvenas/tink/. The files I changed were the following:

  1. generate-envrc.sh
  2. setup.sh
  3. current_versions.sh (newly added)
  4. ./deploy/docker-compose.yml

Also, with my forked files, run ./generate-envrc.sh $iface > .env

tstromberg commented 3 years ago

@lukaskalvenas - now that https://github.com/tinkerbell/tink/issues/481 has been closed with a PR, can this issue be closed as well?

lukaskalvenas commented 3 years ago

@tstromberg probably... I fixed it on my end and provided information on what needs changing. I don't know whether that was addressed in the latest commit.