jfrog / charts

JFrog official Helm Charts
https://jfrog.com/integration/helm-repository/
Apache License 2.0

nginx keeps crashing with the following error message: nginx: [emerg] host not found in upstream "artifactory" #308

Closed: yamaguchim closed this issue 5 years ago

yamaguchim commented 5 years ago

artifactory version: artifactory-pro:6.9.0

chart: artifactory HA

What happened: see attached screenshot image002 (2)

What you expected to happen: nginx to pick up the configuration changes and start

How to reproduce it (as minimally and precisely as possible): Kubernetes on bare metal with MetalLB as an L2 load balancer

eldada commented 5 years ago

@yamaguchim - Does this error persist even after pod delete and restart? Or does the nginx restart eventually solve this?

Insomniak47 commented 5 years ago

It does not eventually get resolved. The pod goes into CrashLoopBackOff and never comes online.

Lykathia commented 5 years ago

Of note: Everything works as expected, until SSL is configured.

SSL is stored w/ k8s secrets.

Relevant snippets for repro:

Ansible:

      command:  
        kubectl create secret tls artifactory-tls 
        --cert={{ artifactory_config_dir.path }}/art.pem
        --key={{ artifactory_config_dir.path }}/art.key --namespace={{ tillerApplicationNamespace }}

Config:

nginx:
  enabled: true 
  tlsSecretName: artifactory-tls
  service:
    env:
      ssl: true

jainishshah17 commented 5 years ago

@Lykathia Have you configured HTTP Settings in Artifactory?

Insomniak47 commented 5 years ago

@jainishshah17 What http settings are you referring to?

(Not just a random person btw. @Lykathia and I are the ones attempting to get a deployment going for this)

If you mean the port and health check related settings, then no, there have been no changes to the default port or health check settings. The only difference between this and a working, non-TLS deployment is the secret and the added nginx config.

danielezer commented 5 years ago

@Insomniak47 @Lykathia The HTTP settings in Artifactory: https://www.jfrog.com/confluence/display/RTF/Configuring+a+Reverse+Proxy

Lykathia commented 5 years ago

@danielezer the Artifactory UI is not accessible, since the nginx pod crashes during deployment. Are these settings configurable in the chart?

Insomniak47 commented 5 years ago

@danielezer as @Lykathia said, we can't access those settings because the basic TLS load balancing/reverse proxy setup causes this issue in our cluster. Since we're trying to do this with an IaC approach, I'd rather not bring up the non-TLS deployment and then configure it by hand, but I could as an interim measure until we figure this out.

We're using all of the same values.yaml values from a working, load-balanced, non-TLS deployment of the artifactory-ha chart, changing only the ones @Lykathia mentioned, which, based on my understanding of the chart, should give us a working HTTPS setup with the reverse proxy configured.

jainishshah17 commented 5 years ago

@Lykathia @Insomniak47 Can you provide us with the Helm chart version?

jainishshah17 commented 5 years ago

Also, please provide the output of kubectl describe pod $NGINX_POD and kubectl describe deployment $NGINX_DEPLOYMENT.

Lykathia commented 5 years ago

The last chart version I tried was 0.12.12, although it was happening in earlier versions as well.

The entire automation suite being used is attached to case #101892.

Insomniak47 commented 5 years ago

@jainishshah17 were you able to get the relevant info from the case?

nimerb commented 5 years ago

@Insomniak47 Can you please provide us with the output of kubectl describe pod $NGINX_POD and kubectl describe deployment $NGINX_DEPLOYMENT? These commands will provide more information to help figure out the issue.

Insomniak47 commented 5 years ago

Pod:

Name:               artifactory-nginx-689cc8894d-gzx6m
Namespace:          applications
Priority:           0
PriorityClassName:  <none>
Node:               staging-kubernetes-2/10.200.116.11
Start Time:         Mon, 29 Apr 2019 16:32:35 -0300
Labels:             app=artifactory-ha
                    chart=artifactory-ha-0.12.19
                    component=nginx
                    heritage=Tiller
                    pod-template-hash=689cc8894d
                    release=artifactory
Annotations:        <none>
Status:             Running
IP:                 10.244.3.3
Controlled By:      ReplicaSet/artifactory-nginx-689cc8894d
Init Containers:
  wait-for-artifactory:
    Container ID:  docker://d9fc7589afaa1e9dfe3bc771d01e259daaa626d5ed526ae4972f552225e26c6c
    Image:         alpine:3.8
    Image ID:      docker-pullable://alpine@sha256:a4d41fa0d6bb5b1194189bab4234b1f2abfabb4728bda295f5c53d89766aa046
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      until nc -z -w 2 artifactory-artifactory-ha 8081 && echo artifactory ok; do
        sleep 2;
      done;

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 29 Apr 2019 16:32:38 -0300
      Finished:     Mon, 29 Apr 2019 16:39:53 -0300
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from artifactory-artifactory-ha-token-czrx4 (ro)
Containers:
  nginx:
    Container ID:   docker://3c5fb028c96c0d5cc411971ac0563b714cb443d7644ce8440a316cb655ee0bd0
    Image:          docker.bintray.io/jfrog/nginx-artifactory-pro:6.9.1
    Image ID:       docker-pullable://docker.bintray.io/jfrog/nginx-artifactory-pro@sha256:a93fcfb32b45deb69c4174d49c45d7a2184ede10950d7585a131ae8db568b853
    Ports:          80/TCP, 443/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 29 Apr 2019 16:40:49 -0300
      Finished:     Mon, 29 Apr 2019 16:40:49 -0300
    Ready:          False
    Restart Count:  3
    Liveness:       http-get http://:80/artifactory/webapp/%23/login delay=100s timeout=10s period=10s #success=1 #failure=10
    Readiness:      http-get http://:80/artifactory/webapp/%23/login delay=60s timeout=10s period=10s #success=1 #failure=10
    Environment:
      ART_BASE_URL:             http://artifactory-artifactory-ha:8081/artifactory
      SSL:                      true
      SKIP_AUTO_UPDATE_CONFIG:  false
    Mounts:
      /tmp/ssl from ssl-secret-volume (rw)
      /var/opt/jfrog/nginx from nginx-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from artifactory-artifactory-ha-token-czrx4 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  nginx-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  ssl-secret-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  artifactory-tls
    Optional:    false
  artifactory-artifactory-ha-token-czrx4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  artifactory-artifactory-ha-token-czrx4
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason               Age                From                           Message
  ----     ------               ----               ----                           -------
  Normal   Scheduled            8m24s              default-scheduler              Successfully assigned applications/artifactory-nginx-689cc8894d-gzx6m to staging-kubernetes-2
  Normal   Pulling              8m23s              kubelet, staging-kubernetes-2  Pulling image "alpine:3.8"
  Normal   Pulled               8m21s              kubelet, staging-kubernetes-2  Successfully pulled image "alpine:3.8"
  Normal   Created              8m21s              kubelet, staging-kubernetes-2  Created container wait-for-artifactory
  Normal   Started              8m21s              kubelet, staging-kubernetes-2  Started container wait-for-artifactory
  Normal   Pulling              65s                kubelet, staging-kubernetes-2  Pulling image "docker.bintray.io/jfrog/nginx-artifactory-pro:6.9.1"
  Normal   Pulled               52s                kubelet, staging-kubernetes-2  Successfully pulled image "docker.bintray.io/jfrog/nginx-artifactory-pro:6.9.1"
  Normal   Started              32s (x3 over 52s)  kubelet, staging-kubernetes-2  Started container nginx
  Warning  FailedPostStartHook  32s (x3 over 52s)  kubelet, staging-kubernetes-2  Exec lifecycle hook ([/bin/sh -c until [ -f /etc/nginx/conf.d/artifactory.conf ]; do sleep 1 ; done; if ! grep -q 'upstream' /etc/nginx/conf.d/artifactory.conf; then sed -i -e 's,proxy_pass.*http://artifactory.*/artifactory/\(.*\);,proxy_pass       http://artifactory-artifactory-ha:8081/artifactory/\1;,g' \
    -e 's,server_name .*,server_name ~(?<repo>.+)\\.artifactory-artifactory-ha artifactory-artifactory-ha;,g' \
    /etc/nginx/conf.d/artifactory.conf;
fi; if [ -f /tmp/replicator-nginx.conf ]; then cp -fv /tmp/replicator-nginx.conf /etc/nginx/conf.d/replicator-nginx.conf; fi; if [ -f /tmp/ssl/*.crt ]; then rm -rf /var/opt/jfrog/nginx/ssl/example.*; cp -fv /tmp/ssl/* /var/opt/jfrog/nginx/ssl; fi; sleep 5; nginx -s reload; touch /var/log/nginx/conf.done
]) for Container "nginx" in Pod "artifactory-nginx-689cc8894d-gzx6m_applications(8a96ea41-6ab5-11e9-9cb6-005056b31279)" failed - error: command '/bin/sh -c until [ -f /etc/nginx/conf.d/artifactory.conf ]; do sleep 1 ; done; if ! grep -q 'upstream' /etc/nginx/conf.d/artifactory.conf; then sed -i -e 's,proxy_pass.*http://artifactory.*/artifactory/\(.*\);,proxy_pass       http://artifactory-artifactory-ha:8081/artifactory/\1;,g' \
    -e 's,server_name .*,server_name ~(?<repo>.+)\\.artifactory-artifactory-ha artifactory-artifactory-ha;,g' \
    /etc/nginx/conf.d/artifactory.conf;
fi; if [ -f /tmp/replicator-nginx.conf ]; then cp -fv /tmp/replicator-nginx.conf /etc/nginx/conf.d/replicator-nginx.conf; fi; if [ -f /tmp/ssl/*.crt ]; then rm -rf /var/opt/jfrog/nginx/ssl/example.*; cp -fv /tmp/ssl/* /var/opt/jfrog/nginx/ssl; fi; sleep 5; nginx -s reload; touch /var/log/nginx/conf.done
' exited with 126: , message: "OCI runtime exec failed: exec failed: container_linux.go:345: starting container process caused \"process_linux.go:91: executing setns process caused \\\"exit status 21\\\"\": unknown\r\n"
  Normal   Killing  32s (x3 over 52s)  kubelet, staging-kubernetes-2  FailedPostStartHook
  Warning  BackOff  24s (x4 over 50s)  kubelet, staging-kubernetes-2  Back-off restarting failed container
  Normal   Created  11s (x4 over 52s)  kubelet, staging-kubernetes-2  Created container nginx
  Normal   Pulled   11s (x3 over 51s)  kubelet, staging-kubernetes-2  Container image "docker.bintray.io/jfrog/nginx-artifactory-pro:6.9.1" already present on machine

Deployment

Name:                   artifactory-nginx
Namespace:              applications
CreationTimestamp:      Mon, 29 Apr 2019 16:32:35 -0300
Labels:                 app=artifactory-ha
                        chart=artifactory-ha-0.12.19
                        component=nginx
                        heritage=Tiller
                        release=artifactory
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               app=artifactory-ha,component=nginx,release=artifactory
Replicas:               1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=artifactory-ha
                    chart=artifactory-ha-0.12.19
                    component=nginx
                    heritage=Tiller
                    release=artifactory
  Service Account:  artifactory-artifactory-ha
  Init Containers:
   wait-for-artifactory:
    Image:      alpine:3.8
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
      until nc -z -w 2 artifactory-artifactory-ha 8081 && echo artifactory ok; do
        sleep 2;
      done;

    Environment:  <none>
    Mounts:       <none>
  Containers:
   nginx:
    Image:       docker.bintray.io/jfrog/nginx-artifactory-pro:6.9.1
    Ports:       80/TCP, 443/TCP
    Host Ports:  0/TCP, 0/TCP
    Liveness:    http-get http://:80/artifactory/webapp/%23/login delay=100s timeout=10s period=10s #success=1 #failure=10
    Readiness:   http-get http://:80/artifactory/webapp/%23/login delay=60s timeout=10s period=10s #success=1 #failure=10
    Environment:
      ART_BASE_URL:             http://artifactory-artifactory-ha:8081/artifactory
      SSL:                      true
      SKIP_AUTO_UPDATE_CONFIG:  false
    Mounts:
      /tmp/ssl from ssl-secret-volume (rw)
      /var/opt/jfrog/nginx from nginx-volume (rw)
  Volumes:
   nginx-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
   ssl-secret-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  artifactory-tls
    Optional:    false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      False   MinimumReplicasUnavailable
  Progressing    True    ReplicaSetUpdated
OldReplicaSets:  artifactory-nginx-689cc8894d (1/1 replicas created)
NewReplicaSet:   <none>
Events:
  Type    Reason             Age    From                   Message
  ----    ------             ----   ----                   -------
  Normal  ScalingReplicaSet  9m18s  deployment-controller  Scaled up replica set artifactory-nginx-689cc8894d to 1
eldada commented 5 years ago

Hi all. I think I found the problem. I'm working on recreating the scenario to see if this is indeed the issue.

eldada commented 5 years ago

@Insomniak47 - can you provide the describe on the secret?

kubectl describe secret artifactory-tls -n NAMESPACE

Lykathia commented 5 years ago

@eldada I'm afraid we won't be able to provide the describe on that secret (due both to @Insomniak47 being out of office this week, and it containing our TLS cert :))

Was there something in particular you were looking for?

Here is the snippet of ansible that creates that secret:

tillerApplicationNamespace: applications
artifactoryNamespace: applications
    - name: artifactory | create kubernetes secret 
      command:  
        kubectl create secret tls artifactory-tls 
        --cert={{ artifactory_config_dir.path }}/art.pem
        --key={{ artifactory_config_dir.path }}/art.key --namespace={{tillerApplicationNamespace}}

I've double checked that both art.pem and art.key arrive un-corrupted in the secret store, and are valid.
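
For reference, a minimal sketch of how such a check could be done (hypothetical commands, not necessarily the exact ones used; the namespace and file names are taken from the snippets above):

# Compare the secret contents against the local files (the dot in the data
# key names is escaped for jsonpath):
kubectl get secret artifactory-tls -n applications -o 'jsonpath={.data.tls\.crt}' | base64 -d | diff - art.pem
kubectl get secret artifactory-tls -n applications -o 'jsonpath={.data.tls\.key}' | base64 -d | diff - art.key

# Confirm the certificate and key actually pair up (the moduli should match):
openssl x509 -noout -modulus -in art.pem | openssl md5
openssl rsa -noout -modulus -in art.key | openssl md5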

eldada commented 5 years ago

The describe does not show the actual values, so you are safe 😉. I'm trying to see if the volume that puts the key + certificate in the nginx container is working properly. We have a postStart hook in the chart that copies /tmp/ssl/* to /var/opt/jfrog/nginx/ssl, and then the entrypoint picks this up and injects it into artifactory.conf. When I do it locally, it works well, so I'm trying to see where the gap is.
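
For reference, this is the postStart hook body already visible in the FailedPostStartHook events above, reformatted into readable shell (same commands, nothing added):

# Wait for the entrypoint to write the default artifactory.conf
until [ -f /etc/nginx/conf.d/artifactory.conf ]; do sleep 1; done

# If no 'upstream' block exists yet, point proxy_pass and server_name at the
# artifactory-artifactory-ha service
if ! grep -q 'upstream' /etc/nginx/conf.d/artifactory.conf; then
  sed -i \
    -e 's,proxy_pass.*http://artifactory.*/artifactory/\(.*\);,proxy_pass       http://artifactory-artifactory-ha:8081/artifactory/\1;,g' \
    -e 's,server_name .*,server_name ~(?<repo>.+)\\.artifactory-artifactory-ha artifactory-artifactory-ha;,g' \
    /etc/nginx/conf.d/artifactory.conf
fi

# Optional replicator config
if [ -f /tmp/replicator-nginx.conf ]; then
  cp -fv /tmp/replicator-nginx.conf /etc/nginx/conf.d/replicator-nginx.conf
fi

# Replace the bundled example certificate with the one mounted from the TLS
# secret at /tmp/ssl
if [ -f /tmp/ssl/*.crt ]; then
  rm -rf /var/opt/jfrog/nginx/ssl/example.*
  cp -fv /tmp/ssl/* /var/opt/jfrog/nginx/ssl
fi

sleep 5
nginx -s reload
touch /var/log/nginx/conf.done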

Lykathia commented 5 years ago

I'll see if the cluster was left running here in a bit. It may be difficult to get an answer promptly this week however.

My hunch was that it might be tied to permissions, since we are also segmenting and auditing everything to be least privilege. Sadly, I haven't had any time to explore that further. Our Tiller doesn't have carte blanche access (I believe the configs for this are attached to the case ID mentioned above; I don't want to post them publicly).

Lykathia commented 5 years ago

Name:         artifactory-tls
Namespace:    applications
Labels:       <none>
Annotations:  <none>

Type:  kubernetes.io/tls

Data
====
tls.crt:  2455 bytes
tls.key:  1958 bytes

eldada commented 5 years ago

Thanks @Lykathia. This is exactly what my secret's describe output looks like.

  1. Besides what's noted in https://github.com/jfrog/charts/issues/308#issuecomment-483282188, did you do any other steps for the setup?
  2. Can you try again with the latest chart version?
Lykathia commented 5 years ago

The entire setup is done via Ansible and should be attached to case #101892. Nothing is being done manually.

I'll give the latest chart a spin later today and update accordingly.

Insomniak47 commented 5 years ago

Just ran a new deployment:

cnantau@staging-kubernetes-0:~$ kubectl log artifactory-nginx-8477577446-5g5qx -n applications
log is DEPRECATED and will be removed in a future version. Use logs instead.
Using deprecated password for user _internal.
2019-05-21 19:54:39  [138 entrypoint-nginx.sh] Preparing to run Nginx in Docker
2019-05-21 19:54:39   [11 entrypoint-nginx.sh] Dockerfile for this image can found inside the container.
2019-05-21 19:54:39   [12 entrypoint-nginx.sh] To view the Dockerfile: 'cat /docker/nginx-artifactory-pro/Dockerfile.nginx'.
2019-05-21 19:54:39   [68 entrypoint-nginx.sh] Setting up directories if missing
2019-05-21 19:54:39  [132 entrypoint-nginx.sh] Artifactory configuration already in /var/opt/jfrog/nginx/conf.d/artifactory.conf
2019-05-21 19:54:39   [27 entrypoint-nginx.sh] SSL is set. Setting up SSL certificate and key
2019-05-21 19:54:39   [40 entrypoint-nginx.sh] Found SSL_KEY /var/opt/jfrog/nginx/ssl/example.key
2019-05-21 19:54:39   [41 entrypoint-nginx.sh] Found SSL_CRT /var/opt/jfrog/nginx/ssl/example.crt
2019-05-21 19:54:39   [53 entrypoint-nginx.sh] Updating /var/opt/jfrog/nginx/conf.d/artifactory.conf with /var/opt/jfrog/nginx/ssl/example.key and /var/opt/jfrog/nginx/ssl/example.crt
2019-05-21 19:54:39  [149 entrypoint-nginx.sh] Starting updateConf.sh in the background
Using deprecated password for user _internal.
2019-05-21 19:54:39  [154 entrypoint-nginx.sh] Starting nginx daemon...
nginx: [emerg] host not found in upstream "artifactory" in /etc/nginx/conf.d/artifactory.conf:31

Looks like the upstream is still not being set properly. Where specifically is it sourced from?

eldada commented 5 years ago

The Artifactory service name is written into /etc/nginx/conf.d/artifactory.conf by the postStart hook (if no upstream is set); it is supposed to be updated to artifactory-artifactory-ha.

  1. Are you mounting a custom configuration to nginx?
  2. Are you configuring the HTTP settings in Artifactory?

NOTE: The Artifactory service is named artifactory-artifactory-ha. The /etc/nginx/conf.d/artifactory.conf is not being updated (it's left with the default artifactory upstream). This might indicate there is a non-default config (pre-mounted or auto-generated).
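
A sketch of how the rendered config and SSL directory could be inspected inside the nginx container, using the $NGINX_POD placeholder from earlier in the thread (the container may need to be caught during the short window before it crashes):

# What did the upstream / proxy_pass / server_name lines end up as?
kubectl exec -n applications $NGINX_POD -c nginx -- \
  grep -nE 'upstream|server_name|proxy_pass|ssl_certificate' /etc/nginx/conf.d/artifactory.conf

# Was the secret's cert/key copied over the bundled example.* files?
kubectl exec -n applications $NGINX_POD -c nginx -- ls -l /tmp/ssl /var/opt/jfrog/nginx/ssl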

Insomniak47 commented 5 years ago

We aren't mounting any custom configuration for nginx. As @Lykathia mentioned the entire deployment playbook + chart config are available on the ticket (#101892).

We never get into Artifactory because the service exposing it never comes up, so we're definitely not configuring it from there.

Are you doing anything in the process that requires a specific set of roles for Tiller? Our Tiller is least-privileged and Artifactory is in a non-default namespace (applications). Are you interacting with the K8s API in a way that would fail if Tiller is RBAC'd to specific privileges rather than being a global admin? Are there any assumptions being made with respect to the namespace?

Insomniak47 commented 5 years ago

The current chart/images are now failing on a previously successful deployment (head of master of our IaC definitions, plus changes to support NetworkPolicy RBAC through Tiller).

Unfortunately the shell in this hook isn't really debuggable, since it's all chained together.

Warning  FailedPostStartHook  6m30s (x3 over 6m51s)  kubelet, staging-kubernetes-3  Exec lifecycle hook ([/bin/sh -c until [ -f /etc/nginx/conf.d/artifactory.conf ]; do sleep 1 ; done; if ! grep -q 'upstream' /etc/nginx/conf.d/artifactory.conf; then sed -i -e 's,proxy_pass.*http://artifactory.*/artifactory/\(.*\);,proxy_pass       http://artifactory-artifactory-ha:8081/artifactory/\1;,g' \
    -e 's,server_name .*,server_name ~(?<repo>.+)\\.artifactory-artifactory-ha artifactory-artifactory-ha;,g' \
    /etc/nginx/conf.d/artifactory.conf;
fi; if [ -f /tmp/replicator-nginx.conf ]; then cp -fv /tmp/replicator-nginx.conf /etc/nginx/conf.d/replicator-nginx.conf; fi; if [ -f /tmp/ssl/*.crt ]; then rm -rf /var/opt/jfrog/nginx/ssl/example.*; cp -fv /tmp/ssl/* /var/opt/jfrog/nginx/ssl; fi; sleep 5; nginx -s reload; touch /var/log/nginx/conf.done
]) for Container "nginx" in Pod "artifactory-nginx-57f96fc755-kw7m9_applications(ddecfa82-7d84-11e9-83b8-005056b3ad60)" failed - error: command '/bin/sh -c until [ -f /etc/nginx/conf.d/artifactory.conf ]; do sleep 1 ; done; if ! grep -q 'upstream' /etc/nginx/conf.d/artifactory.conf; then sed -i -e 's,proxy_pass.*http://artifactory.*/artifactory/\(.*\);,proxy_pass       http://artifactory-artifactory-ha:8081/artifactory/\1;,g' \
    -e 's,server_name .*,server_name ~(?<repo>.+)\\.artifactory-artifactory-ha artifactory-artifactory-ha;,g' \
    /etc/nginx/conf.d/artifactory.conf;
fi; if [ -f /tmp/replicator-nginx.conf ]; then cp -fv /tmp/replicator-nginx.conf /etc/nginx/conf.d/replicator-nginx.conf; fi; if [ -f /tmp/ssl/*.crt ]; then rm -rf /var/opt/jfrog/nginx/ssl/example.*; cp -fv /tmp/ssl/* /var/opt/jfrog/nginx/ssl; fi; sleep 5; nginx -s reload; touch /var/log/nginx/conf.done
' exited with 126: , message: "OCI runtime exec failed: exec failed: container_linux.go:345: starting container process caused \"process_linux.go:91: executing setns process caused \\\"exit status 21\\\"\": unknown\r\n"
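
The "OCI runtime exec failed ... setns ... exit status 21" message suggests the runtime could not exec into the container at all, rather than the hook script itself failing. A rough node-level triage sketch, assuming SSH access to the node running the pod (the container name filter is an assumption based on kubelet's Docker naming convention):

# Find the nginx container and check whether a plain docker exec also fails:
sudo docker ps --filter name=k8s_nginx_artifactory-nginx --format '{{.ID}} {{.Status}}'
sudo docker exec <container-id> sh -c 'echo exec ok'   # <container-id> from the line above

# Look for related kernel/runtime messages:
sudo dmesg | tail -n 50
sudo docker info | grep -iE 'runtime|kernel'
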
Insomniak47 commented 5 years ago

Just checked with a clean deployment of kubernetes w/ RBAC, full cluster-admin permissions, default namespace. Same error.

Versions:

Kubernetes:

cnantau@staging-kubernetes-0:~$ kubectl version -o=yaml
clientVersion:
  buildDate: "2019-05-16T16:23:09Z"
  compiler: gc
  gitCommit: 66049e3b21efe110454d67df4fa62b08ea79a19b
  gitTreeState: clean
  gitVersion: v1.14.2
  goVersion: go1.12.5
  major: "1"
  minor: "14"
  platform: linux/amd64
serverVersion:
  buildDate: "2019-05-16T16:14:56Z"
  compiler: gc
  gitCommit: 66049e3b21efe110454d67df4fa62b08ea79a19b
  gitTreeState: clean
  gitVersion: v1.14.2
  goVersion: go1.12.5
  major: "1"
  minor: "14"
  platform: linux/amd64

Helm:

Client: &version.Version{SemVer:"v2.13.1", GitCommit:"618447cbf203d147601b4b9bd7f8c37a5d39fbb4", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.13.1", GitCommit:"618447cbf203d147601b4b9bd7f8c37a5d39fbb4", GitTreeState:"clean"}

Docker:

cnantau@staging-kubernetes-0:~$ sudo docker version
Client:
 Version:           18.09.6
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        481bc77
 Built:             Sat May  4 02:35:27 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.6
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       481bc77
  Built:            Sat May  4 01:59:36 2019
  OS/Arch:          linux/amd64
  Experimental:     false

OS:

cnantau@staging-kubernetes-0:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
[...]
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

If you have sufficient vSphere infrastructure I can provide you an IAC definition that can repro the entire cluster.

All of the other services on the cluster are running fine.

Insomniak47 commented 5 years ago

Just tested the definition on AKS with the same Kubernetes version as in my last comment and it works, which means it's probably an interaction with the bare-metal environment we've got, since we've repro'd this in two different bare-metal environments with hosts based on Ubuntu 16.04.

(Using flannel for internal networking as well)

Lykathia commented 5 years ago

I said 22 days ago that I would give everything a spin. Turns out release pressures are crazy, and I haven't had a chance to breathe for the last 22 days. :(

Today however...

Removing Terraform from the equation, to hopefully make it a bit easier to replicate (guessing you don't have access to vSphere, based on the silence around the automated environment replication).

16.04 AWS image: ami-0565af6e282977273
18.04 AWS image: ami-0a313d6098716f372

Same ansible automation, no manual steps.

Both have the same versions of all of the following:

cri-tools 1.12.0-00
kubeadm 1.14.2-00
kubectl 1.14.2-00
kubelet 1.14.2-00
kubernetes-cni 0.7.5-00
helm 2.14.0-98
docker 18.09.6 481bc77

I ran four 18.04 clusters and two 16.04 clusters (each cluster on 4 xlarge instances: 1 master, 3 nodes).

From what I can observe, there is some race condition going on in the nginx container. It crashes a few times on 18.04, but then eventually comes up.

It never comes up on 16.04, even after manually trying to restart it.

Warning  FailedPostStartHook  13m (x3 over 13m)  kubelet, ip-172-31-27-159  Exec lifecycle hook ([/bin/sh -c until [ -f /etc/nginx/conf.d/artifactory.conf ]; do sleep 1 ; done; if ! grep -q 'upstream' /etc/nginx/conf.d/artifactory.conf; then sed -i -e 's,proxy_pass.*http://artifactory.*/artifactory/\(.*\);,proxy_pass       http://artifactory-artifactory-ha:8081/artifactory/\1;,g'     -e 's,server_name .*,server_name ~(?<repo>.+)\.artifactory-artifactory-ha artifactory-artifactory-ha;,g'     /etc/nginx/conf.d/artifactory.conf;
fi; if [ -f /tmp/replicator-nginx.conf ]; then cp -fv /tmp/replicator-nginx.conf /etc/nginx/conf.d/replicator-nginx.conf; fi; if [ -f /tmp/ssl/*.crt ]; then rm -rf /var/opt/jfrog/nginx/ssl/example.*; cp -fv /tmp/ssl/* /var/opt/jfrog/nginx/ssl; fi; sleep 5; nginx -s reload; touch /var/log/nginx/conf.done
]) for Container "nginx" in Pod "artifactory-nginx-844fc748c6-j9p4t_applications(0e205ea8-8237-11e9-9937-0a8092ba037e)" failed - error: command '/bin/sh -c until [ -f /etc/nginx/conf.d/artifactory.conf ]; do sleep 1 ; done; if ! grep -q 'upstream' /etc/nginx/conf.d/artifactory.conf; then sed -i -e 's,proxy_pass.*http://artifactory.*/artifactory/\(.*\);,proxy_pass       http://artifactory-artifactory-ha:8081/artifactory/\1;,g'     -e 's,server_name .*,server_name ~(?<repo>.+)\.artifactory-artifactory-ha artifactory-artifactory-ha;,g'     /etc/nginx/conf.d/artifactory.conf;
fi; if [ -f /tmp/replicator-nginx.conf ]; then cp -fv /tmp/replicator-nginx.conf /etc/nginx/conf.d/replicator-nginx.conf; fi; if [ -f /tmp/ssl/*.crt ]; then rm -rf /var/opt/jfrog/nginx/ssl/example.*; cp -fv /tmp/ssl/* /var/opt/jfrog/nginx/ssl; fi; sleep 5; nginx -s reload; touch /var/log/nginx/conf.done
' exited with 126: , message: "OCI runtime exec failed: exec failed: container_linux.go:345: starting container process caused \"process_linux.go:91: executing setns process caused \\"exit status 21\\"\": unknown\r\n"

Lykathia commented 5 years ago

Also of potential note: it always takes 6 restarts before it starts working on the 18.04 AMI on t2.xlarge.

Insomniak47 commented 5 years ago

Hey guys,

Could we get an update on this? The last update from a maintainer was 12 days ago, and it's been 28 days since we were last asked for new information or given a progress update. In that time we've dumped repro info for vSphere and AWS. Have you made any progress? How can we facilitate this?

eldada commented 5 years ago

@Insomniak47 - sorry for the delay. We are working on reproducing the environment and errors.

alexivkin commented 5 years ago

I hit the same issue today with the latest Helm chart. It looks like a race condition, because after killing the nginx pod a couple of times it eventually started.