helm / charts

⚠️(OBSOLETE) Curated applications for Kubernetes

[stable/concourse] BUG workers keep restarting after failing to reach TSA #4812

Closed: razvan-moj-zz closed this issue 6 years ago

razvan-moj-zz commented 6 years ago

Version of Helm and Kubernetes: helm 2.8.2, k8s 1.8.6

Which chart: stable/concourse

What happened: Deployment succeeds, but the web interface reports 'no workers', and indeed the worker containers keep restarting after exiting with code 137

What you expected to happen: concourse-worker-* containers up, pipelines running

How to reproduce it (as minimally and precisely as possible):

helm install --namespace development --name concourse -f values.yaml stable/concourse

Ingress is enabled, type ClusterIP. atcPort and tsaPort are defined and forwarded back correctly. The worker containers exit with 'failed to connect to TSA: dial tcp: address concourse-web: missing port in address'.

Anything else we need to know: other charts, of varied complexity, do deploy correctly onto the same cluster
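
A quick way to check what CONCOURSE_TSA_HOST the worker statefulset actually ended up with (a diagnostic sketch; the release name concourse and the development namespace come from the install command above, and the worker statefulset for that release is named concourse-worker):

$ kubectl -n development get statefulset concourse-worker -o yaml | grep -A 1 CONCOURSE_TSA_HOST

If the value line shows only the service name without a port, it matches the 'missing port in address' error above.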

brunoban commented 6 years ago

I can confirm this.

You can get it to work by editing the statefulset's CONCOURSE_TSA_HOST to be hostname:port instead of just hostname.
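
A minimal sketch of applying that edit non-interactively (assuming the release name concourse and namespace development from the report, a worker statefulset named concourse-worker, and that CONCOURSE_TSA_HOST is the first env entry of the first container, as in the template excerpt below; 2222 is Concourse's default TSA port):

$ kubectl -n development patch statefulset concourse-worker --type=json \
    -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/env/0/value", "value": "concourse-web:2222"}]'

Or simply kubectl -n development edit statefulset concourse-worker and append the port by hand.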

Seems to come from this line, on charts/stable/concourse/templates/worker-statefulset.yaml:79:

          env:
            - name: CONCOURSE_TSA_HOST
          {{- if semverCompare "^3.10" .Values.imageTag }}
              value: "{{ template "concourse.web.fullname" . }}:{{ .Values.concourse.tsaPort}}"
          {{- else }}
              value: {{ template "concourse.web.fullname" . }}
            - name: CONCOURSE_TSA_PORT
              value: {{ .Values.concourse.tsaPort | quote }}
          {{- end }}

Not sure why the cutoff is exactly at 3.10, or whether just adjusting it is the proper way to solve this.

razvan-moj-zz commented 6 years ago

You're right, thank you.

  env:
    - name: CONCOURSE_TSA_HOST
      value: concourse-web:2222

fixed it

vikas027 commented 6 years ago

Hey @brunoban ,

I too am facing a similar issue, in which one of my two Concourse workers keeps crashing with this error; the other worker runs absolutely fine.

$ kubectl get pods -l app=concourse-worker
NAME                 READY     STATUS             RESTARTS   AGE
concourse-worker-0   0/1       CrashLoopBackOff   10         30m
concourse-worker-1   1/1       Running            2          30m
$

$ kubectl logs -f concourse-worker-0
{"timestamp":"1524835284.751764536","source":"worker","message":"worker.garden.extract-resources.extract.extracting","log_level":1,"data":{"resource-type":"tracker","session":"2.1.13"}}
overlay driver requires kernel version >= 4.0.0
{"timestamp":"1524835285.287182331","source":"baggageclaim","message":"baggageclaim.failed-to-set-up-driver","log_level":2,"data":{"error":"overlay driver requires kernel version \u003e= 4.0.0"}}

Strangely, both of my Rancher workers are exactly the same (built through Ansible) and have this kernel version:

$ uname -r
4.16.5-1.el7.elrepo.x86_64
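
For reference, one way to compare which node each worker pod landed on with the kernel Kubernetes reports for each node (a plain-kubectl diagnostic sketch, nothing chart-specific):

$ kubectl get pods -l app=concourse-worker -o wide    # NODE column shows where each worker runs
$ kubectl get nodes -o wide                           # KERNEL-VERSION column shows what the kubelet reports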

Is it the same issue, or shall I open a new ticket for it?

brunoban commented 6 years ago

Hey @vikas027,

It's not the same thing as far as I can see, so I would open a new ticket for this one. But I suspect that might come from Concourse/BaggageClaim itself and not from the chart (at least as far as my brief research went).

pclalv commented 6 years ago

The latest version of the chart seems to include the port in the CONCOURSE_TSA_HOST env var only if the version is 3.10.x, but my experience deploying 3.14.0 with the latest version of the Helm chart suggests that it's still necessary to include the TSA port in CONCOURSE_TSA_HOST.

xtremerui commented 6 years ago

On the Concourse team, when we set the imageTag to 4.0.0-rc.12, semverCompare considers that a lower version than ^3.10.x, which makes the worker fail to register with the TSA.

https://github.com/kubernetes/charts/blob/fd35ab34bfd26878fb5970fbc7c3e75760df10ec/stable/concourse/templates/worker-statefulset.yaml#L81
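
To see which branch a given tag hits, a quick sketch with helm template (assumes the chart has been fetched locally; imageTag is the same chart value used in the semverCompare above):

$ helm fetch stable/concourse --untar
$ helm template concourse --set imageTag=4.0.0-rc.12 | grep -A 1 CONCOURSE_TSA_HOST
$ helm template concourse --set imageTag=3.14.0 | grep -A 1 CONCOURSE_TSA_HOST

A prerelease tag such as 4.0.0-rc.12 does not satisfy the "^3.10" constraint, so the template renders the else branch and the value comes out without the port.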

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] commented 6 years ago

This issue is being automatically closed due to inactivity.