oxheadalpha / tezos-k8s

Deploy a Tezos Blockchain on Kubernetes
https://tezos-k8s.io/
MIT License

Forced re-building with devspace produces strange behavior #233

Open hai-nguyen-van opened 3 years ago

hai-nguyen-van commented 3 years ago

After discussions with @elric1, I found strange behavior when re-building the Docker image tezos-k8s-utils using devspace build -b -t dev --skip-push. I saw this warning and ignored it at first:

[warn]   Newly built image 'tezos-k8s-zerotier' has the same tag as in the last build (dev), this can lead to problems that the image during deployment is not updated

When trying to deploy the testnet, I kept getting Init:CrashLoopBackOff without further explanation:

NAME                  READY   STATUS                  RESTARTS   AGE
activate-job-7g2rg    0/1     Init:Error              0          64s
activate-job-dvqkr    0/1     Init:Error              0          94s
activate-job-kbzjc    0/1     Init:Error              0          97s
activate-job-lmck9    0/1     Init:Error              0          24s
activate-job-rlxm9    0/1     Init:Error              0          84s
tezos-baking-node-0   0/2     Init:Error              4          97s
tezos-baking-node-1   0/2     Init:CrashLoopBackOff   3          97s

The above warning message could be amended to make it clear that

  1. the Docker image will not be overwritten
  2. the Docker image must be deleted manually to properly re-build with devspace
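One way to verify whether the stale-image suspicion holds (a sketch only, using standard docker CLI commands; `tezos-k8s-utils:dev` is the tag from the build above) would be to compare the image ID before and after a forced rebuild, and only delete the tag if the ID did not change:

```shell
# Sketch: check whether `devspace build -b` actually replaced the image.
before=$(docker image inspect -f '{{.Id}}' tezos-k8s-utils:dev)
devspace build -b -t dev --skip-push
after=$(docker image inspect -f '{{.Id}}' tezos-k8s-utils:dev)
if [ "$before" = "$after" ]; then
  # Image ID unchanged: fall back to the manual workaround described above.
  docker rmi tezos-k8s-utils:dev
  devspace build -b -t dev --skip-push
fi
```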
harryttd commented 3 years ago

The above warning message could be amended in a way to insist on the fact that...

The warning message comes from devspace, not from our code, so there isn't really anything we can do about it. Also, you should not need to delete the Docker image; it should rebuild just fine, since you are using the force-rebuild flag -b.

You can skip building zerotier by commenting it out of the images list in devspace.yaml.

What are the events listed when you describe the pods?

hai-nguyen-van commented 3 years ago

Mhhhhh, that's really weird. I really had the impression that a force rebuild with the -b option did not work as expected. Here's what kubectl describe displayed:

Name:         tezos-baking-node-0
Namespace:    tezos-testnet
Priority:     0
Node:         minikube/192.168.49.2
Start Time:   Fri, 23 Jul 2021 15:11:41 +0200
Labels:       app=tezos-baking-node
              appType=tezos-node
              controller-revision-hash=tezos-baking-node-7486d4d774
              statefulset.kubernetes.io/pod-name=tezos-baking-node-0
Annotations:  <none>
Status:       Pending
IP:           172.17.0.4
IPs:
  IP:           172.17.0.4
Controlled By:  StatefulSet/tezos-baking-node
Init Containers:
  wait-for-bootstrap:
    Container ID:  docker://8828facc0f6cfa4a549cb0bf498fdd30341af45f7b25132e963b29d0654792db
    Image:         tezos-k8s-utils:dev
    Image ID:      docker://sha256:2226b4a2a75bd3cae9928a157472570ceb526c642b68d80a6d3ade68c574e16b
    Port:          <none>
    Host Port:     <none>
    Args:
      wait-for-bootstrap
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 23 Jul 2021 15:14:46 +0200
      Finished:     Fri, 23 Jul 2021 15:14:46 +0200
    Ready:          False
    Restart Count:  5
    Environment Variables from:
      tezos-config  ConfigMap  Optional: false
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nf8zz (ro)
      /var/tezos from var-volume (rw)
  config-generator:
    Container ID:  
    Image:         tezos-k8s-utils:dev
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      config-generator
      --generate-config-json
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment Variables from:
      tezos-secret  Secret     Optional: false
      tezos-config  ConfigMap  Optional: false
    Environment:
      MY_POD_IP:       (v1:status.podIP)
      MY_POD_NAME:    tezos-baking-node-0 (v1:metadata.name)
      MY_POD_TYPE:    node
      MY_NODE_CLASS:  tezos-baking-node
    Mounts:
      /etc/tezos from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nf8zz (ro)
      /var/tezos from var-volume (rw)
Containers:
  tezos-node:
    Container ID:  
    Image:         tezo:latest
    Image ID:      
    Ports:         8732/TCP, 9732/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /bin/sh
    Args:
      -c
      set -x

      set

      #
      # Not every error is fatal on start.  In particular, with zerotier,
      # the listen-addr may not yet be bound causing tezos-node to fail.
      # So, we try a few times with increasing delays:

      for d in 1 1 5 10 20 60 120; do
        /usr/local/bin/tezos-node run                               \
                                         --bootstrap-threshold 0      \
                                         --config-file /etc/tezos/config.json
        sleep $d
      done

      #
      # Keep the container alive for troubleshooting on failures:

      sleep 3600

    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/tezos from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nf8zz (ro)
      /var/tezos from var-volume (rw)
  baker-alpha:
    Container ID:  
    Image:         tezo:latest
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      set -ex

      TEZ_VAR=/var/tezos
      TEZ_BIN=/usr/local/bin
      CLIENT_DIR="$TEZ_VAR/client"
      NODE_DIR="$TEZ_VAR/node"
      NODE_DATA_DIR="$TEZ_VAR/node/data"

      proto_command="alpha"

      if [ "${DAEMON}" == "baker" ]; then
          extra_args="with local node $NODE_DATA_DIR"
      fi

      my_baker_account="$(cat /etc/tezos/baker-account )"

      CLIENT="$TEZ_BIN/tezos-client -d $CLIENT_DIR"
      CMD="$TEZ_BIN/tezos-$DAEMON-$proto_command -d $CLIENT_DIR"

      while ! $CLIENT rpc get chains/main/blocks/head; do
          sleep 5
      done

      exec $CMD run ${extra_args} ${my_baker_account}

    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment Variables from:
      tezos-config  ConfigMap  Optional: false
    Environment:
      MY_POD_IP:       (v1:status.podIP)
      MY_POD_NAME:    tezos-baking-node-0 (v1:metadata.name)
      MY_POD_TYPE:    node
      MY_NODE_CLASS:  tezos-baking-node
      DAEMON:         baker
    Mounts:
      /etc/tezos from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nf8zz (ro)
      /var/tezos from var-volume (rw)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  var-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  var-volume-tezos-baking-node-0
    ReadOnly:   false
  dev-net-tun:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/net/tun
    HostPathType:  
  config-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  default-token-nf8zz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-nf8zz
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  3m36s                  default-scheduler  0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
  Warning  FailedScheduling  3m36s                  default-scheduler  0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
  Normal   Scheduled         3m33s                  default-scheduler  Successfully assigned tezos-testnet/tezos-baking-node-0 to minikube
  Normal   Pulled            117s (x5 over 3m32s)   kubelet            Container image "tezos-k8s-utils:dev" already present on machine
  Normal   Created           117s (x5 over 3m32s)   kubelet            Created container wait-for-bootstrap
  Normal   Started           117s (x5 over 3m32s)   kubelet            Started container wait-for-bootstrap
  Warning  BackOff           106s (x10 over 3m30s)  kubelet            Back-off restarting failed container
harryttd commented 3 years ago

The way to confirm that -b works is to watch the output of the docker build as it happens. Is docker reporting that it is using cached layers, or is it building them?
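One way to check this without eyeballing the whole build log (a sketch; the classic Docker builder prints `Using cache` for cached steps, while BuildKit prints `CACHED`):

```shell
# Capture the build output, then count cache hits; 0 means every layer was rebuilt.
devspace build -b -t dev --skip-push 2>&1 | tee build.log
grep -c -e 'Using cache' -e 'CACHED' build.log
```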

It looks like there is an error in the wait-for-bootstrap init container. Do you have the logs?

hai-nguyen-van commented 3 years ago

The way to confirm that -b works is to watch the output of the docker build as it happens. Is docker reporting that it is using cached layers, or is it building them?

Ok. I will try once more to reproduce this.

It looks like there is an error in the wait-for-bootstrap init container.

Here are the logs, but I have the impression they are not relevant:

$ kubectl -n tezos-testnet logs tezos-baking-node-0 wait-for-bootstrap
+ CMD=wait-for-bootstrap
+ shift
+ exec /wait-for-bootstrap.sh
jq: error (at <stdin>:33): Cannot index array with string "is_bootstrap_node"
No bootstrap nodes were provided
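For what it's worth, that jq error can be reproduced in isolation: it is what jq prints whenever a JSON value that is an array gets indexed with a string key, which suggests the script expected an object with an is_bootstrap_node field at that point in its input and hit an array instead:

```shell
# Minimal reproduction (assumes jq is installed): indexing an array with a
# string key fails the same way as in the wait-for-bootstrap log above.
echo '[1, 2, 3]' | jq '.is_bootstrap_node'
# → Cannot index array with string "is_bootstrap_node"
```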
harryttd commented 3 years ago

Here are the logs but I have the impression it is not relevant:

This is an error. It means that none of the nodes specified in values.yaml is set as a bootstrap node, so nodes don't know which node(s) to connect to after a chain has been activated. You can mark an instance as a bootstrap node with the is_bootstrap_node property:

nodes:
  tezos-baking-node:
    storage_size: 15Gi
    runs:
      - baker
      - endorser
    instances:
      - bake_using_account: baker0
        is_bootstrap_node: true
harryttd commented 3 years ago

@hai-nguyen-van Is this resolved? Any other issues?