kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

Reoccurrence of Service does not have any active Endpoint [when it actually does] #9932

Closed scott-kausler closed 1 month ago

scott-kausler commented 1 year ago

What happened: The ingress controller reported that the "Service does not have any active Endpoint" when in fact the service did have active endpoints.

I was able to verify the service was active by execing into the nginx pod and curling the health check endpoint of the service.
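A hedged example of that check (pod, service, and path names are placeholders):

kubectl -n <ingress-namespace> exec -it <ingress-nginx-controller-pod> -- \
  curl -sv http://<service-name>.<app-namespace>.svc.cluster.local:<port>/<healthcheck-path>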

The only way I was able to recover was to reinstall the helm chart.

What you expected to happen:

The service to be added to the ingress controller.

NGINX Ingress controller version:

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.6.4
  Build:         69e8833858fb6bda12a44990f1d5eaa7b13f4b75
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.21.6

-------------------------------------------------------------------------------

Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"25+", GitVersion:"v1.25.6-eks-48e63af", GitCommit:"9f22d4ae876173884749c0701f01340879ab3f95", GitTreeState:"clean", BuildDate:"2023-01-24T19:19:02Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

Environment: AWS EKS


How was the ingress-nginx-controller installed: nginx nginx 1 2023-05-06 16:52:09.643618809 +0000 UTC deployed ingress-nginx-4.5.2 1.6.4

Values:

  ingressClassResource:
    default: true
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-internal: "true"
      service.beta.kubernetes.io/aws-load-balancer-type: nlb

How to reproduce this issue: Unknown. There was a single replica of the pod, and it was deployed for 42 days before exhibiting this problem.

However, others have recently reported this issue in https://github.com/kubernetes/ingress-nginx/issues/6135.

Anything else we need to know:

The problem was previously reported in https://github.com/kubernetes/ingress-nginx/issues/6135, but the defect was closed.

longwuyuan commented 1 year ago

@RoyAtanu your comment does not explain which of the two statements below is true:

(1) there was a one-time message in the logs, occurring after startup/init and before the first reconciliation
(2) there is an error message after several hours/days/weeks of uptime whose timestamp matches a failed HTTP/HTTPS request

@all if this is a bug, then it needs to be fixed ASAP. But without reproducing the problem, it's not certain what the problem is.

Once a normally working HTTP/HTTPS request fails and the controller logs the error message, please capture and post the following;
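As a sketch, a capture along these lines (all names are placeholders) taken at the moment of a failing request would show the relevant state:

curl -v https://<your-host>/<path>
kubectl -n <app-namespace> get pods -o wide
kubectl -n <app-namespace> describe svc <service-name>
kubectl -n <app-namespace> get endpointslices -l kubernetes.io/service-name=<service-name> -o wide
kubectl -n <app-namespace> describe ingress <ingress-name>
kubectl -n <ingress-namespace> logs <controller-pod> --since=5m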

longwuyuan commented 1 year ago

@RoyAtanu I don't know what action the project can take based on your report because ;

The above observation is based on the assumption that if a backend pod is genuinely in some kind of trouble, then logging an error message reporting no endpoint is expected behavior.

Also, I am using v1.8.0 of the controller and I see that error only once, on startup. After that I never see the error message again, and my HTTP/HTTPS requests never fail with the missing-endpoint error being logged. So I have no idea how to reproduce a problem where HTTP/HTTPS fails and the controller logs the missing-endpoint error at the same timestamp.

To repeat my earlier comments: if we can see that everything is healthy in the cluster, that the controller is the root cause of breaking/failing HTTP/HTTPS requests, and proof that the timestamp of the broken/failed HTTP/HTTPS request correlates with the timestamp of the error message in the controller logs, then we can reproduce the problem in a minikube or kind cluster to debug and fix it. Without proof that the controller is the root cause, if all the reports here only mention that an error message about endpoints was seen in the controller pod logs, then I at least don't know whether that was falsely logged by the controller, or whether there was a genuine problem outside the controller that caused a broken endpoint and hence a broken/failed HTTP/HTTPS request.

RoyAtanu commented 1 year ago

Concluding on my scenario: we have been able to RCA it and identified that this message was never an issue. We still see the message (which possibly indicates it is not an error message), and after tailing the logs for long enough, our understanding is that it is printed by nginx early in the lifecycle, most likely before it has actually loaded the Ingress definitions.

In our case, we were missing the config below in nginx-ingress; adding it resolved the issue.

set {
    name  = "controller.service.externalTrafficPolicy"
    value = "Local"
  }
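(That snippet is Terraform helm_release syntax; installing the chart directly with the helm CLI, the equivalent flag would presumably be:)

helm upgrade <release-name> ingress-nginx/ingress-nginx \
  --reuse-values \
  --set controller.service.externalTrafficPolicy=Local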

I am not sure why the absence of this config was specifically blocking HTTPS traffic (removing TLS worked fine), since the docs say this setting is about passing the client IP on to the pod. But it did the trick, so I assume there is some internal mechanism dependent on it.

longwuyuan commented 1 year ago

@RoyAtanu I don't have the kubectl outputs, the curl -v outputs, or the logs from before and after the change, but glad the problem is solved.

Check kubectl explain service.spec.externalTrafficPolicy, because that field decides how the ingress traffic is routed. Protocol-specific behaviour should not be commented on before seeing logs, so no comments on that.
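A quick way to read that field's documentation and see what is currently set on the controller Service (names below are placeholders):

kubectl explain service.spec.externalTrafficPolicy
kubectl -n <ingress-namespace> get svc <controller-service> \
  -o jsonpath='{.spec.externalTrafficPolicy}{"\n"}'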

rdb0101 commented 1 year ago

@longwuyuan I feel like this solution is applicable to @RoyAtanu's situation and not necessarily the problem at hand in this discussion. None of my HTTPS requests are failing; they just disconnect periodically with that error message while the endpoints are indeed active.

rdb0101 commented 1 year ago

@RoyAtanu and @longwuyuan The settings and commands mentioned previously are not applicable to RKE2, the environment I am currently using. externalTrafficPolicy can be set on a LoadBalancer service -- however, I am not using one. I did edit the rke2-ingress-nginx-controller admission service, changed it from ClusterIP to LoadBalancer, and added the externalTrafficPolicy. All of the services are still showing that there are no active endpoints when they are in fact active... it did not fix the issue, unfortunately. Do you have any other recommendations?

longwuyuan commented 1 year ago

@rdb0101 I think you are following your own thoughts instead of co-operating, and that is why there is little progress. Almost all your comments here are about what you think, what you see, or what you want to do.

I think I have at least repeated what info will help. You have posted many things, but you have not provided the information I asked for in one single post.

rdb0101 commented 1 year ago

@longwuyuan My sincerest apologies for coming off that way; it was not my intention. I will review the information you requested, which I must have misunderstood on my end. Again, I apologize for this; I want to resolve this as soon as possible, as it has been a blocker for a while.

rdb0101 commented 1 year ago

@longwuyuan I went through the threads and you are correct: I only provided output for the get commands and not describe. Please see below for (hopefully) the requested command output(s):

---
# kubectl -n kube-system describe daemonset.apps rke2-ingress-nginx-controller

Name:           rke2-ingress-nginx-controller
Selector:       app.kubernetes.io/component=controller,app.kubernetes.io/instance=rke2-ingress-nginx,app.kubernetes.io/name=rke2-ingress-nginx
Node-Selector:  kubernetes.io/os=linux
Labels:         app.kubernetes.io/component=controller
                app.kubernetes.io/instance=rke2-ingress-nginx
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=rke2-ingress-nginx
                app.kubernetes.io/part-of=rke2-ingress-nginx
                app.kubernetes.io/version=1.6.4
                helm.sh/chart=rke2-ingress-nginx-4.5.201
Annotations:    deprecated.daemonset.template.generation: 2
                field.cattle.io/publicEndpoints:
                  [{"nodeName":":REDACTED","addresses":["REDACTED"],"port":80,"protocol":"TCP","podName":"kube-system:rke2-ingre...
                meta.helm.sh/release-name: rke2-ingress-nginx
                meta.helm.sh/release-namespace: kube-system
Desired Number of Nodes Scheduled: 4
Current Number of Nodes Scheduled: 4
Number of Nodes Scheduled with Up-to-date Pods: 4
Number of Nodes Scheduled with Available Pods: 4
Number of Nodes Misscheduled: 0
Pods Status:  4 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app.kubernetes.io/component=controller
                    app.kubernetes.io/instance=rke2-ingress-nginx
                    app.kubernetes.io/name=rke2-ingress-nginx
  Service Account:  rke2-ingress-nginx
  Containers:
   rke2-ingress-nginx-controller:
    Image:       rancher/nginx-ingress-controller:nginx-1.6.4-hardened4
    Ports:       80/TCP, 443/TCP, 8443/TCP
    Host Ports:  80/TCP, 443/TCP, 0/TCP
    Args:
      /nginx-ingress-controller
      --election-id=rke2-ingress-nginx-leader
      --controller-class=k8s.io/ingress-nginx
      --ingress-class=nginx
      --configmap=$(POD_NAMESPACE)/rke2-ingress-nginx-controller
      --validating-webhook=:8443
      --validating-webhook-certificate=/usr/local/certificates/cert
      --validating-webhook-key=/usr/local/certificates/key
      --watch-ingress-without-class=true
      --enable-ssl-passthrough=true
    Requests:
      cpu:      100m
      memory:   90Mi
    Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
    Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:        (v1:metadata.name)
      POD_NAMESPACE:   (v1:metadata.namespace)
      LD_PRELOAD:     /usr/local/lib/libmimalloc.so
    Mounts:
      /usr/local/certificates/ from webhook-cert (ro)
  Volumes:
   webhook-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rke2-ingress-nginx-admission
    Optional:    false
Events:          <none>
---
# kubectl -n kube-system describe svc rke2-ingress-nginx-controller-admission
#
Name:              rke2-ingress-nginx-controller-admission
Namespace:         kube-system
Labels:            app.kubernetes.io/component=controller
                   app.kubernetes.io/instance=rke2-ingress-nginx
                   app.kubernetes.io/managed-by=Helm
                   app.kubernetes.io/name=rke2-ingress-nginx
                   app.kubernetes.io/part-of=rke2-ingress-nginx
                   app.kubernetes.io/version=1.6.4
                   helm.sh/chart=rke2-ingress-nginx-4.5.201
Annotations:       meta.helm.sh/release-name: rke2-ingress-nginx
                   meta.helm.sh/release-namespace: kube-system
Selector:          app.kubernetes.io/component=controller,app.kubernetes.io/instance=rke2-ingress-nginx,app.kubernetes.io/name=rke2-ingress-nginx
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                REDACTED
IPs:               REDACTED
Port:              https-webhook  443/TCP
TargetPort:        webhook/TCP
Endpoints:         CONTROLPLANE:8443,WORKER1:8443,WORKER2:8443 + 1 more...
Session Affinity:  None
Events:            <none>
---
# kubectl -n kube-system describe endpoints rke2-ingress-nginx-controller-admission

Name:         rke2-ingress-nginx-controller-admission
Namespace:    kube-system
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=rke2-ingress-nginx
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=rke2-ingress-nginx
              app.kubernetes.io/part-of=rke2-ingress-nginx
              app.kubernetes.io/version=1.6.4
              helm.sh/chart=rke2-ingress-nginx-4.5.201
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2023-06-08T10:05:18Z
Subsets:
  Addresses:          CONTROLPLANE,WORKER1,WORKER2,WORKER3
  NotReadyAddresses:  <none>
  Ports:
    Name           Port  Protocol
    ----           ----  --------
    https-webhook  8443  TCP

Events:  <none>
---
# kubectl -n kube-system describe endpointslices.discovery.k8s.io rke2-ingress-nginx-controller-admission-6hrb4

Name:         rke2-ingress-nginx-controller-admission-6hrb4
Namespace:    kube-system
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=rke2-ingress-nginx
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=rke2-ingress-nginx
              app.kubernetes.io/part-of=rke2-ingress-nginx
              app.kubernetes.io/version=1.6.4
              endpointslice.kubernetes.io/managed-by=endpointslice-controller.k8s.io
              helm.sh/chart=rke2-ingress-nginx-4.5.201
              kubernetes.io/service-name=rke2-ingress-nginx-controller-admission
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2023-06-08T10:05:18Z
AddressType:  IPv4
Ports:
  Name           Port  Protocol
  ----           ----  --------
  https-webhook  8443  TCP
Endpoints:
  - Addresses:
    Conditions:
      Ready:    true
    Hostname:   <unset>
    TargetRef:  Pod/rke2-ingress-nginx-controller-2cbgb
    NodeName:   CONTROLPLANE
    Zone:       <unset>
  - Addresses:  REDACTED
    Conditions:
      Ready:    true
    Hostname:   <unset>
    TargetRef:  Pod/rke2-ingress-nginx-controller-lrjjr
    NodeName:   WORKER2
    Zone:       <unset>
  - Addresses:  REDACTED
    Conditions:
      Ready:    true
    Hostname:   <unset>
    TargetRef:  Pod/rke2-ingress-nginx-controller-ps5gd
    NodeName:   WORKER3
    Zone:       <unset>
  - Addresses:  REDACTED
    Conditions:
      Ready:    true
    Hostname:   <unset>
    TargetRef:  Pod/rke2-ingress-nginx-controller-zql7h
    NodeName:   WORKER1
    Zone:       <unset>
Events:         <none>
---
# Apps
# kubectl -n $namespace describe ingress ingress-1
Name:             ingress-1
Labels:           app.kubernetes.io/component=app
                  app.kubernetes.io/instance=SERVICE1
                  app.kubernetes.io/name=SERVICE1
Namespace:        $NAMESPACE
Address:          WORKER1,WORKER2,WORKER3,CONTROLPLANE
Ingress Class:    nginx
Default backend:  <default>
TLS:
  tls-server terminates HOSTNAME
Rules:
  Host                              Path  Backends
  ----                              ----  --------
  HOSTNAME
                                    /   SERVICE1:8888 (REDACTED:8888)
Annotations:                        field.cattle.io/publicEndpoints:
                                      [{"addresses":["WORKER1","CONTROLPLANE","WORKER3","WORKER2"],"port":443,"protocol":"HTTPS","serviceName":"$NAMESPACE:...
                                    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
                                    nginx.ingress.kubernetes.io/ingress.class: nginx
                                    nginx.ingress.kubernetes.io/rewrite-target: /
                                    nginx.ingress.kubernetes.io/ssl-passthrough: true
Events:                             <none>

longwuyuan commented 1 year ago

@rdb0101 thanks

(1) This is not a image released by this project

rancher/nginx-ingress-controller:nginx-1.6.4-hardened4

but it is almost certain that it is built using the code from this project. Maybe it's better to discuss this issue in the rancher channel of kubernetes.slack.com.

(2) You are using the "ssl-passthrough" annotation, so TLS termination will occur in the backend pod. Hence I am not sure the "backend-protocol" annotation has any use here.

(3) Your "path" field value is "/", so I am not sure why you need to set the "rewrite-target" annotation to "/".

(4) You have redacted some information in the ingress description as "WORKER1...". That would ideally be a CNI network IP address in the backend column, and the external IP in the address field. Having a node IP address there has different implications.

(5) I actually typed out the commands for you to use for providing info, but once again you ignored that suggestion and, in the case of the endpointslices, showed the info that you think is relevant.

The most critical info is the curl command as you executed it, with -v, its response, and the log message of the controller pod when the curl request fails. So I am not even sure what the failure looks like.

If there was a pod restart, then it would be visible in the restart column of kubectl get pod .... but none of that info is provided.

So not sure how to proceed.

/kind support

matthewbrumpton commented 1 year ago

@longwuyuan, adding the following parameter to helm install has resolved my issue.

--set defaultBackend.nodeSelector."kubernetes.io/os"=linux
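(For an already-installed release, the same parameter can presumably be added without reinstalling, e.g.:)

helm upgrade <release-name> ingress-nginx/ingress-nginx \
  --reuse-values \
  --set defaultBackend.nodeSelector."kubernetes\.io/os"=linux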

longwuyuan commented 1 year ago

@matthewbrumpton thanks for the update. @rdb0101 hope this comment from @matthewbrumpton helps.

I need data and logs to correlate the problem and the solution.

rdb0101 commented 1 year ago

@longwuyuan thank you for the comment I am going to test it and see if this resolves the issue!

rdb0101 commented 1 year ago

@matthewbrumpton Since I am using the rke2-server/rke2-agent setup, which has the built-in ingress controller, I edited the daemonset.apps for the controller and added the argument you provided. So far, within the last 30-45 minutes, I have not experienced any disconnections from the services' GUIs, no 404 errors, and no reoccurrence of the original error under discussion. I am going to continue monitoring the services before I can definitively say it has been resolved, but things are looking up! I was wondering how you managed to figure this out? Why does adding that setting resolve the issue? Sorry for all of the questions - I really appreciate both of your help @longwuyuan and @matthewbrumpton !

matthewbrumpton commented 1 year ago

@rdb0101, glad the update is also working for you

After reviewing the Microsoft docs, I noticed they were using this setting:

https://learn.microsoft.com/en-us/azure/aks/ingress-basic?tabs=azure-cli#create-an-ingress-controller

Not sure why missing this setting causes the issue; I would also find an explanation useful.

longwuyuan commented 1 year ago

That setting obviously implies that Kubernetes does not see the nodes as Linux nodes... but it's so strange. Getting to the root cause and adding it to the docs would be good.
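For reference, that parameter is neither a controller CLI argument nor an annotation; in the upstream chart it renders as a nodeSelector in the default-backend pod spec. As a rough sketch (resource names are placeholders), applying the same nodeSelector by hand to a workload looks like:

kubectl -n kube-system patch daemonset <controller-daemonset> --type merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/os":"linux"}}}}}'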

rdb0101 commented 1 year ago

Too good to be true :( it seems the issue is still reoccurring. I applied it as an argument, but it seems to be an annotation. I think if I can apply the correct annotation then it should work, theoretically speaking (hopefully). @matthewbrumpton are you running this on Azure? I am currently running on EC2 with a deployment of RKE2. Did you also apply it as an annotation? What are your thoughts? @longwuyuan I also agree that it is very strange, and the behavior stopped for several hours after applying the changes - so I think we are on the right track to a resolution!

longwuyuan commented 1 year ago

@rdb0101 what I observe is a simple problem before the actual problem.

You said "the issue is still reoccurring", but that means you had one opportunity to post this information;

But the information captured when the problem happened is not posted here, so there is no way to make any analysis.

Also, the controller is deployed by too many users in too many different environments and configurations. If the controller had a major breaking bug like endpoints broken by the controller itself, then a lot of users would have reported it. So currently this looks like a problem that needs triaging. And there is a lack of resources here, while there are more resources on kubernetes.slack.com. Better to discuss this on Slack instead of discussing it here without any practically useful activity like posting data to be analyzed.

matthewbrumpton commented 1 year ago

@rdb0101, running on Azure AKS with helm install, posted an example deployment earlier in the chat

rdb0101 commented 1 year ago

@matthewbrumpton Got it thank you for clarifying that!

rdb0101 commented 1 year ago

I did run the curl command while the issue was happening; please see the output below:

 curl -I -k https://hostname.domain.org -L -v
*   Trying hostname:443...
* Connected to hostname.domain.org (REDACTED) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*  CAfile: /etc/pki/tls/certs/ca-bundle.crt
*  CApath: none
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / REDACTED
* ALPN, server accepted to use h2
* Server certificate:
*  subject: REDACTED
*  start date: Oct 18 15:52:28 2022 GMT
*  expire date: Oct 17 15:52:28 2025 GMT
*  issuer: REDACTED
*  SSL certificate verify ok.
* Using HTTP2, server supports multiplexing
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x1c425c0)
> HEAD / HTTP/2
> Host: REDACTED
> user-agent: curl/7.79.1
> accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 200
HTTP/2 200
< cache-control: no-cache
cache-control: no-cache
< date: Wed, 14 Jun 2023 14:16:16 GMT
date: Wed, 14 Jun 2023 14:16:16 GMT

<
* Connection #0 to host REDACTED left intact

longwuyuan commented 1 year ago

@rdb0101 the response code is 200, so it's not obvious what was expected and what the problem is

rdb0101 commented 1 year ago

@longwuyuan I did find a workaround for this issue, via NodePort. When using NodePort, all of the services are accessible, even while the error is occurring. So the service(s) impacted are not directly accessible without using NodePort. Do you have any idea why this might be the case, or if there is something I am doing wrong in the configuration? What are your thoughts?

matthewbrumpton commented 1 year ago

@longwuyuan , this error has returned when adding more than one user agentpool in AKS

matthewbrumpton commented 1 year ago

@longwuyuan, AKS cannot assign a public ip address to nginx when there are multiple user agentpools

2 x user agent pools

NAME                                               TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
nginx-ingress-ingress-nginx-controller             LoadBalancer   172.16.241.107   <pending>     80:30554/TCP,443:32148/TCP   17m
nginx-ingress-ingress-nginx-controller-admission   ClusterIP      172.16.85.105    <none>        443/TCP                      17m
nginx-ingress-ingress-nginx-controller-metrics     ClusterIP      172.16.159.219   <none>        10254/TCP

1 x user agent pools

NAME                                               TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
nginx-ingress-ingress-nginx-controller             LoadBalancer   172.16.241.107   10.255.32.5   80:30554/TCP,443:32148/TCP   36m
nginx-ingress-ingress-nginx-controller-admission   ClusterIP      172.16.85.105    <none>        443/TCP                      36m
nginx-ingress-ingress-nginx-controller-metrics     ClusterIP      172.16.159.219   <none>        10254/TCP                    36m

This is on AKS 1.27.3

helm install nginx-ingress ingress-nginx/ingress-nginx --version 4.7.1 \
  --create-namespace --namespace aks-namespace \
  --set controller.replicaCount=1 \
  --set controller.metrics.enabled=true \
  --set controller.nodeSelector."nodepool"=app \
  --set defaultBackend.nodeSelector."kubernetes\.io/os"=linux \
  --set controller.admissionWebhooks.patch.nodeSelector."kubernetes.io/os"=linux \
  --set controller.service.loadBalancerIP=10.255.32.5 \
  --set controller.service.annotations."service.beta.kubernetes.io/azure-load-balancer-internal"=true \
  --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz \
  --set controller.metrics.serviceMonitor.additionalLabels.release="prometheus"

I0912 09:22:11.387290 7 main.go:253] "Running in Kubernetes cluster" major="1" minor="27" git="v1.27.3" state="clean" commit="25b4e43193bcda6c7328a6d147b1fb73a33f1598" platform="linux/amd64"
I0912 09:22:11.520778 7 main.go:104] "SSL fake certificate created" file="/etc/ingress-controller/ssl/default-fake-certificate.pem"
I0912 09:22:11.539209 7 ssl.go:533] "loading tls certificate" path="/usr/local/certificates/cert" key="/usr/local/certificates/key"
I0912 09:22:11.547000 7 nginx.go:261] "Starting NGINX Ingress controller"
I0912 09:22:11.553791 7 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"bmdev-ne-nginx", Name:"nginx-ingress-ingress-nginx-controller", UID:"513bbd3a-d4cb-4bb6-9fc4-45d1218c90ed", APIVersion:"v1", ResourceVersion:"3608", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap bmdev-ne-nginx/nginx-ingress-ingress-nginx-controller
I0912 09:22:12.748232 7 leaderelection.go:248] attempting to acquire leader lease bmdev-ne-nginx/nginx-ingress-ingress-nginx-leader...
I0912 09:22:12.748231 7 nginx.go:304] "Starting NGINX process"
I0912 09:22:12.748569 7 nginx.go:324] "Starting validation webhook" address=":8443" certPath="/usr/local/certificates/cert" keyPath="/usr/local/certificates/key"
I0912 09:22:12.748763 7 controller.go:190] "Configuration changes detected, backend reload required"
I0912 09:22:12.761513 7 leaderelection.go:258] successfully acquired lease bmdev-ne-nginx/nginx-ingress-ingress-nginx-leader
I0912 09:22:12.761601 7 status.go:84] "New leader elected" identity="nginx-ingress-ingress-nginx-controller-ff58f7d45-pdw2f"
I0912 09:22:12.783885 7 controller.go:207] "Backend successfully reloaded"
I0912 09:22:12.784201 7 controller.go:218] "Initial sync, sleeping for 1 second"
I0912 09:22:12.784244 7 event.go:285] Event(v1.ObjectReference{Kind:"Pod", Namespace:"bmdev-ne-nginx", Name:"nginx-ingress-ingress-nginx-controller-ff58f7d45-pdw2f", UID:"2242a9ae-03e8-4fc3-95cd-6b6f52374f20", APIVersion:"v1", ResourceVersion:"3641", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
W0912 09:24:27.350448 7 controller.go:1207] Service "bmdev-ne-dash/bmdev-ne-dash-service" does not have any active Endpoint.
I0912 09:24:27.370717 7 admission.go:149] processed ingress via admission controller {testedIngressLength:1 testedIngressTime:0.02s renderingIngressLength:1 renderingIngressTime:0s admissionTime:18.3kBs testedConfigurationSize:0.02}
I0912 09:24:27.370743 7 main.go:110] "successfully validated configuration, accepting" ingress="bmdev-ne-dash/bmdev-ne-dash-ingress"
I0912 09:24:27.375519 7 store.go:432] "Found valid IngressClass" ingress="bmdev-ne-dash/bmdev-ne-dash-ingress" ingressclass="nginx"
I0912 09:24:27.375807 7 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"bmdev-ne-dash", Name:"bmdev-ne-dash-ingress", UID:"f207ffa1-7233-44a4-9251-caeeec73431e", APIVersion:"networking.k8s.io/v1", ResourceVersion:"4547", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
W0912 09:24:30.212619 7 controller.go:1207] Service "bmdev-ne-dash/bmdev-ne-dash-service" does not have any active Endpoint.
I0912 09:24:30.212726 7 controller.go:190] "Configuration changes detected, backend reload required"
I0912 09:24:30.260737 7 controller.go:207] "Backend successfully reloaded"
I0912 09:24:30.260847 7 event.go:285] Event(v1.ObjectReference{Kind:"Pod", Namespace:"bmdev-ne-nginx", Name:"nginx-ingress-ingress-nginx-controller-ff58f7d45-pdw2f", UID:"2242a9ae-03e8-4fc3-95cd-6b6f52374f20", APIVersion:"v1", ResourceVersion:"3641", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
W0912 09:26:49.861159 7 controller.go:1207] Service "bmdev-ne-contractservice/bmdev-ne-contractservice-service" does not have any active Endpoint.
I0912 09:26:49.882401 7 admission.go:149] processed ingress via admission controller {testedIngressLength:2 testedIngressTime:0.021s renderingIngressLength:2 renderingIngressTime:0.001s admissionTime:26.2kBs testedConfigurationSize:0.022}
I0912 09:26:49.882436 7 main.go:110] "successfully validated configuration, accepting" ingress="bmdev-ne-contractservice/bmdev-ne-contractservice-ingress"
I0912 09:26:49.886716 7 store.go:432] "Found valid IngressClass" ingress="bmdev-ne-contractservice/bmdev-ne-contractservice-ingress" ingressclass="nginx"
I0912 09:26:49.886851 7 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"bmdev-ne-contractservice", Name:"bmdev-ne-contractservice-ingress", UID:"4a1a00b1-ca45-49eb-bf8a-0936508426b0", APIVersion:"networking.k8s.io/v1", ResourceVersion:"5464", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
W0912 09:26:52.730945 7 controller.go:1207] Service "bmdev-ne-contractservice/bmdev-ne-contractservice-service" does not have any active Endpoint.
I0912 09:26:52.731041 7 controller.go:190] "Configuration changes detected, backend reload required"
I0912 09:26:52.776189 7 controller.go:207] "Backend successfully reloaded"
I0912 09:26:52.776373 7 event.go:285] Event(v1.ObjectReference{Kind:"Pod", Namespace:"bmdev-ne-nginx", Name:"nginx-ingress-ingress-nginx-controller-ff58f7d45-pdw2f", UID:"2242a9ae-03e8-4fc3-95cd-6b6f52374f20", APIVersion:"v1", ResourceVersion:"3641", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
W0912 09:32:55.690960 7 controller.go:1207] Service "bmdev-ne-dash/bmdev-ne-dash-service" does not have any active Endpoint.
W0912 09:32:59.024414 7 controller.go:1207] Service "bmdev-ne-dash/bmdev-ne-dash-service" does not have any active Endpoint.

longwuyuan commented 1 year ago

Hi @matthewbrumpton I am not able to test this as I don't have Azure logins for these tests.

But it will help to know if the documented process to install on Azure works (in lieu of the helm chart)

StefanLobbenmeierObjego commented 10 months ago

this error has returned when adding more than one user agentpool in AKS

FYI - I just saw this error in our cluster about 20 hours after changing out nodes via https://learn.microsoft.com/en-us/azure/aks/resize-node-pool?tabs=azure-cli, but in our case there was just one agentpool when the error occurred. I guess there might still be some relation.

(using helm chart 4.9.0 with image registry.k8s.io/ingress-nginx/controller:v1.9.5)

mconigliaro commented 8 months ago

I don't have anything really useful to add, but I think we're having the same problem on EKS. We're seeing these Service "<name>" does not have any active Endpoint messages in our logs around the time nginx is responding with 503s. It seems to happen randomly, then it magically fixes itself after a few seconds (which I'm guessing is why people are having a hard time running those kubectl commands while it's broken).

Our Helm chart is installed by Argo CD:

source:
  repoURL: 'https://kubernetes.github.io/ingress-nginx'
  targetRevision: 4.9.0
  helm:
    parameters:
      - name: controller.service.type
        value: NodePort
      - name: controller.service.nodePorts.http
        value: '30080'
      - name: controller.service.nodePorts.https
        value: '30443'
      - name: controller.config.proxy-buffer-size
        value: 16k
  chart: ingress-nginx

Note that we did not have to set the following options because they're already set this way by default:

StefanLobbenmeierObjego commented 8 months ago

What also makes this hard is that this message also appears when all your pods are unhealthy and marked as Non-Ready, so there might be some false positives when just looking for that message.
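One way to tell the two cases apart is to check, at the same moment, whether the endpoints behind the Service are actually marked ready (names are placeholders):

kubectl -n <app-namespace> get endpointslices \
  -l kubernetes.io/service-name=<service-name> \
  -o jsonpath='{range .items[*].endpoints[*]}{.targetRef.name}{" ready="}{.conditions.ready}{"\n"}{end}'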

mconigliaro commented 8 months ago

For what it's worth, someone in #6135 said:

The latest Helm chart that works for me is chart version 4.2.5, containing app version 1.3.1.

I tried downgrading to this version, but the problem still occurs.

mconigliaro commented 8 months ago

Wow, this may have been an issue as early as 2018: https://github.com/kubernetes/ingress-nginx/issues/3060#issuecomment-446956212

mconigliaro commented 8 months ago

The reporter of #6962 says this started happening when he added port names to his service. We're using port names, and all the manifest examples I see in this thread have port names. Does anyone have an example of this happening without port names?
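For comparison, a Service that relies purely on numeric ports (all names here are hypothetical) would look roughly like this:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: webapp            # hypothetical
  namespace: my-namespace # hypothetical
spec:
  selector:
    app: webapp
  ports:
    - port: 3000          # no "name:" field on the port
      targetPort: 3000    # numeric targetPort instead of a named container port
EOF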

longwuyuan commented 8 months ago

Until we have some way to reproduce or some helpful data that is convincing, I am not sure what a developer would do to address this issue

mconigliaro commented 8 months ago

I agree. 6 years of bug reports isn't convincing. We need a few more years. 😂

longwuyuan commented 8 months ago

It's OSS, so your sentiment is ack'd.

If you can help me reproduce, I'll appreciate it

mconigliaro commented 8 months ago

I'm just agreeing that 6 years of bug reports is not nearly enough time to be "convincing." I think people are coming here to report the same problem over and over for fun. And honestly, who can blame them? It really is great fun! 😂

I posted the helm chart I'm using with params above. Seems like a pretty basic setup. If I really wanted to reproduce this, I'd just deploy some kind of hello world app and slam it with requests until the problem occurred. I'd also pay close attention to what happens when I add/remove other hello world apps in the same cluster (all of which are being proxied by the same ingress-nginx instance of course). I just don't have the time to do that right now, and I'm guessing neither does anyone else.

In the meantime, the best clue I have is that port name thing. When I have some time, I'll try removing the port names from the helm chart in my own app and see if that makes a difference. But before I take the time to do that, hopefully someone else will chime in and let us know if they're seeing this problem without port names.

I don't know a lot about Kubernetes internals, so this is a total shot in the dark based on dealing with DNS issues for more years than I have fingers and toes, but the more I dig into this, the more this smells like yet another DNS issue to me...

SRV Records are created for named ports that are part of normal or headless services. For each named port, the SRV record has the form _port-name._port-protocol.my-svc.my-namespace.svc.cluster-domain.example. For a regular Service, this resolves to the port number and the domain name: my-svc.my-namespace.svc.cluster-domain.example. ...

https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#srv-records
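Those SRV records can be checked from inside the cluster, e.g. with a throwaway dnsutils pod (port/service/namespace names are placeholders):

kubectl run -it --rm dnsutils \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 --restart=Never -- \
  dig +short SRV _<port-name>._tcp.<service-name>.<namespace>.svc.cluster.local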

strongjz commented 8 months ago

@mconigliaro I'd be interested to see the testing without named ports.

mconigliaro commented 8 months ago

I'm sad to report that the problem still occurs when using port numbers instead of names, but I'm happy to report that it's easily reproducible. I can also say the Service "<name>" does not have any active Endpoint message definitely seems correlated.

I made a script to run all the commands in https://github.com/kubernetes/ingress-nginx/issues/9932#issuecomment-1589652087, but it takes way too long to run (20+ secs), and that's longer than the window in which the problem occurs, so I doubt most of the data will be valid. What are the most important commands I should run?
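In the meantime, a lighter loop that only records the EndpointSlices behind the backend Service (variables as in the script) might fit inside the window:

while true; do
  echo "--- $(date -u +%FT%TZ)"
  kubectl --context "$context" -n "$appnamespace" get endpointslices \
    -l "kubernetes.io/service-name=$appsvcname" -o wide
  sleep 1
done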

longwuyuan commented 8 months ago

In which resource's spec did you use port numbers instead of names for the ports?

mconigliaro commented 8 months ago

I had names in my service (as described in https://github.com/kubernetes/ingress-nginx/issues/6962), deployment, and ingress. I just tried to remove the names everywhere I could find them.

mconigliaro commented 8 months ago

I'm now able to reproduce this pretty easily with a simple bash while loop:

while curl -v --fail $curlurl; do echo; done
kubectl --context $context -n $appnamespace describe svc $appsvcname

Everything looks fine until suddenly...

10.4.150.142 - - [29/Feb/2024:18:51:06 +0000] "GET /healthcheck HTTP/1.1" 200 75 "-" "curl/8.4.0" 256 0.069 [cmd-eph-mb-503-heartbreat-webapp-3000] [] 10.4.146.28:3000 75 0.072 200 09e0e90823fb7bd53b9982d97cc10d3d
10.4.139.35 - - [29/Feb/2024:18:51:06 +0000] "GET /healthcheck HTTP/1.1" 200 75 "-" "curl/8.4.0" 256 0.074 [cmd-eph-mb-503-heartbreat-webapp-3000] [] 10.4.146.28:3000 75 0.072 200 eca8deb8bd9b54cf9a85c997a03f145f
10.4.153.22 - - [29/Feb/2024:18:51:07 +0000] "GET /healthcheck HTTP/1.1" 200 75 "-" "curl/8.4.0" 256 0.106 [cmd-eph-mb-503-heartbreat-webapp-3000] [] 10.4.146.28:3000 75 0.108 200 2ff199638f5f56aa418186c93ddd2481
10.4.157.153 - - [29/Feb/2024:18:51:07 +0000] "GET /healthcheck HTTP/1.1" 200 75 "-" "curl/8.4.0" 256 0.116 [cmd-eph-mb-503-heartbreat-webapp-3000] [] 10.4.146.28:3000 75 0.116 200 e8cbbd94df388ee5579e0c658e3fffa0
10.4.150.142 - - [29/Feb/2024:18:51:08 +0000] "GET /healthcheck HTTP/1.1" 200 75 "-" "curl/8.4.0" 256 0.254 [cmd-eph-mb-503-heartbreat-webapp-3000] [] 10.4.146.28:3000 75 0.252 200 4ffc2e0b50ca5cd3c797295ef66a407f
W0229 18:51:08.885353       8 controller.go:1112] Service "cmd-eph-mb-503-heartbreat/webapp" does not have any active Endpoint.

curl fails a second later...

* Trying 10.4.138.132:443...
* Connected to cmd-eph-mb-503-heartbreat-app.dev.redacted.io (10.4.138.132) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* (304) (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=*.dev.redacted.io
*  start date: Jul  5 00:00:00 2023 GMT
*  expire date: Aug  2 23:59:59 2024 GMT
*  subjectAltName: host "cmd-eph-mb-503-heartbreat-app.dev.redacted.io" matched cert's "*.dev.redacted.io"
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M02
*  SSL certificate verify ok.
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://cmd-eph-mb-503-heartbreat-app.dev.redacted.io/healthcheck
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: cmd-eph-mb-503-heartbreat-app.dev.redacted.io]
* [HTTP/2] [1] [:path: /healthcheck]
* [HTTP/2] [1] [user-agent: curl/8.4.0]
* [HTTP/2] [1] [accept: */*]
> GET /healthcheck HTTP/2
> Host: cmd-eph-mb-503-heartbreat-app.dev.redacted.io
> User-Agent: curl/8.4.0
> Accept: */*
>
< HTTP/2 503
< date: Thu, 29 Feb 2024 18:51:09 GMT
< content-type: text/html
< content-length: 190
< strict-transport-security: max-age=15724800; includeSubDomains
* The requested URL returned error: 503
* Connection #0 to host cmd-eph-mb-503-heartbreat-app.dev.redacted.io left intact
curl: (22) The requested URL returned error: 503

Where did my endpoint go?

Name:                     webapp
Namespace:                cmd-eph-mb-503-heartbreat
Labels:                   app.kubernetes.io/component=webapp
                          app.kubernetes.io/instance=cmd-eph-mb-503-heartbreat
                          app.kubernetes.io/managed-by=Helm
                          app.kubernetes.io/name=cmd-webapp
                          argocd.argoproj.io/instance=cmd-eph-mb-503-heartbreat
                          helm.sh/chart=cmd-webapp-0.1.0
Annotations:              <none>
Selector:                 app.kubernetes.io/component=webapp,app.kubernetes.io/instance=cmd-eph-mb-503-heartbreat,app.kubernetes.io/name=cmd-webapp
Type:                     NodePort
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.20.186.24
IPs:                      172.20.186.24
Port:                     <unset>  3000/TCP
TargetPort:               3000/TCP
NodePort:                 <unset>  30837/TCP
Endpoints:
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

But then it magically comes back a second or two later?

Name:                     webapp
Namespace:                cmd-eph-mb-503-heartbreat
Labels:                   app.kubernetes.io/component=webapp
                          app.kubernetes.io/instance=cmd-eph-mb-503-heartbreat
                          app.kubernetes.io/managed-by=Helm
                          app.kubernetes.io/name=cmd-webapp
                          argocd.argoproj.io/instance=cmd-eph-mb-503-heartbreat
                          helm.sh/chart=cmd-webapp-0.1.0
Annotations:              <none>
Selector:                 app.kubernetes.io/component=webapp,app.kubernetes.io/instance=cmd-eph-mb-503-heartbreat,app.kubernetes.io/name=cmd-webapp
Type:                     NodePort
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.20.186.24
IPs:                      172.20.186.24
Port:                     <unset>  3000/TCP
TargetPort:               3000/TCP
NodePort:                 <unset>  30837/TCP
Endpoints:                10.4.146.28:3000
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

Let me know what other info might be helpful, but note that I only have a second or two to catch it.

mconigliaro commented 8 months ago

OK, it turns out even a second or two is not small enough of a window to catch this most of the time. I now have commands running in two separate terminals:

wrk https://cmd-eph-mb-503-heartbreat-app.dev.redacted.io/healthcheck -c 20 -d 60

while kubectl --context $context -n $appnamespace describe svc $appsvcname; do echo; done

When I do this, I definitely see the Endpoint 10.4.146.28:3000 disappearing and reappearing randomly. I now believe this is load related, since it happens much more frequently if I increase the number of wrk connections (e.g. -c 200).
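A timestamped watch of the Endpoints object (same variables as above) makes the flapping easier to line up against the controller log entries:

kubectl --context "$context" -n "$appnamespace" get endpoints "$appsvcname" -w \
  | while read -r line; do echo "$(date -u +%FT%TZ)  $line"; done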

strongjz commented 8 months ago

Does this still happen on 1.9.X and 1.10.0?

mconigliaro commented 8 months ago

I just upgraded to helm chart 4.10.0 and it's still happening.

NGINX Ingress controller
  Release:       v1.10.0
  Build:         71f78d49f0a496c31d4c19f095469f3f23900f8a
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.25.3

But what I'm not sure of is whether nginx is causing the problem or just revealing it. What would cause nginx to remove endpoints from services like that? Seems unlikely, but this is also the only place we're seeing this problem (we only use nginx to proxy to our ephemeral dev environments, and we use AWS load balancers in production). And it's interesting that other people seem to be reporting similar behavior.

mconigliaro commented 8 months ago

I'm back, and I'm now 99% sure the root cause was that we were running out of IP addresses in our EKS cluster. I killed a bunch of unnecessary pods and the random 503s and the "active Endpoint" message went away. I never found any error messages about this in our EKS logs, and I never saw anything else complaining. I only figured it out when I saw a suspicious-looking message about IP addresses on one of our services while poking around the cluster with Lens. Somehow, the only clue that something was wrong at the cluster level was this error message in the nginx controller logs. I'll bet there are a whole bunch of things that might trigger this message (which would explain the six years of bug reports). Apologies for defaming DNS, and thanks to nginx for this error message!
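In case it helps anyone else on EKS, two quick checks for pod-IP exhaustion (the subnet ID is a placeholder):

# free IPs left in the subnet(s) the CNI allocates pod IPs from
aws ec2 describe-subnets --subnet-ids <subnet-id> \
  --query 'Subnets[].{id:SubnetId,freeIPs:AvailableIpAddressCount}'

# pods per node vs. each node's pod capacity (bounded by ENI/IP limits on EKS)
kubectl get nodes -o custom-columns=NAME:.metadata.name,MAXPODS:.status.allocatable.pods
kubectl get pods -A -o wide --field-selector=status.phase=Running \
  | awk 'NR>1 {print $8}' | sort | uniq -c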

longwuyuan commented 8 months ago

/assign

longwuyuan commented 8 months ago

in that case, maybe a very small subnet configured on minikube or kind, with the IP addresses manually exhausted, could potentially reproduce the error message
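An untested sketch of that idea with kind (the deliberately tiny podSubnet may need tuning for the CNI's per-node allocation):

cat <<'EOF' | kind create cluster --name tiny-pods --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  podSubnet: "10.240.0.0/26"   # very few pod IPs, to force exhaustion
nodes:
  - role: control-plane
EOF
# then deploy ingress-nginx plus a test app and scale its replicas until pod IPs run out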

akalinux commented 7 months ago

I am having the same issue. Is there any progress on this?

W0326 19:32:27.115911       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
{"err":"secrets \"ingress-nginx-admission\" not found","level":"info","msg":"no secret found","source":"k8s/k8s.go:229","time":"2024-03-26T19:32:27Z"}
{"level":"info","msg":"creating new secret","source":"cmd/create.go:28","time":"2024-03-26T19:32:27Z"}
W0326 19:32:40.821000       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
{"level":"info","msg":"patching webhook configurations 'ingress-nginx-admission' mutating=false, validating=true, failurePolicy=Fail","source":"k8s/k8s.go:118","time":"2024-03-26T19:32:40Z"}
{"level":"info","msg":"Patched hook(s)","source":"k8s/k8s.go:138","time":"2024-03-26T19:32:40Z"}
I0326 19:34:00.876850       7 event.go:298] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"namespace-socmon-common", Name:"ingress-socmon-webapps-3-dev", UID:"f4ebe45d-da26-4259-8cdf-d996b8cf3e41", APIVersion:"networking.k8s.io/v1", ResourceVersion:"773", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
W0326 19:34:04.206812       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-3" does not have any active Endpoint.
W0326 19:34:04.206855       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-2" does not have any active Endpoint.
W0326 19:34:04.206868       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-1" does not have any active Endpoint.
W0326 19:34:07.543761       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-3" does not have any active Endpoint.
W0326 19:34:07.543796       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-2" does not have any active Endpoint.
W0326 19:34:07.543809       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-1" does not have any active Endpoint.
W0326 19:34:40.233379       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-3" does not have any active Endpoint.
W0326 19:34:40.233405       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-2" does not have any active Endpoint.
W0326 19:34:40.233415       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-1" does not have any active Endpoint.
NAME                       TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
service-socmon-webapps-1   NodePort   10.97.236.80    <none>        80:32765/TCP   39m
service-socmon-webapps-2   NodePort   10.99.7.70      <none>        80:32011/TCP   39m
service-socmon-webapps-3   NodePort   10.102.15.166   <none>        80:31318/TCP   39m

Each service can be reached from inside each container, and the services have never restarted.

*   Trying 10.97.236.80:80...
* Connected to service-socmon-webapps-1 (10.97.236.80) port 80 (#0)
> GET /rest/healthcheck HTTP/1.1
> Host: service-socmon-webapps-1
> User-Agent: curl/7.74.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx
< Date: Tue, 26 Mar 2024 20:16:24 GMT
< Content-Type: text/plain; charset=UTF-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Vary: Accept-Encoding
< Set-Cookie: dancer.session=ZgMtFwAAAIts-zMs87oVwpS0U_0uNFjM; Path=/; SameSite=Lax; HttpOnly; Secure; Expires=Mon, 25-Mar-2024 20:16:24 GMT; Domain=service-socmon-webapps-1
< Cache-Control: private, no-cache, no-store, must-revalidate
< X-Frame-Options: sameorigin
< X-XSS-Protection: 1; mode=block
< 
* Connection #0 to host service-socmon-webapps-1 left intact
OK

Further, each container maintains a persistent peer-to-peer websocket connection between the nodes. All the services are up and working between the containers. So the services are working just fine, but for some reason the ingress controller thinks they are down?

netstat -an | grep tcp | grep ESTA | grep ':80'
tcp        0      0 10.244.0.9:36172        10.97.236.80:80         ESTABLISHED
tcp        0      0 10.244.0.9:57410        10.99.7.70:80           ESTABLISHED
tcp        0      0 10.244.0.9:80           10.244.0.8:40938        ESTABLISHED
tcp        0      0 10.244.0.9:80           10.244.0.7:33442        ESTABLISHED

akalinux commented 7 months ago

I have an odd update: if I remove the domain name from the ingress files, then the ingress controller starts working. I am guessing this has something to do with DNS.

example.txt

Unrelated to this issue: I am also having problems with the nginx container ignoring the TLS cert. No idea why; it just ignores the secret. (Yeah, I know this is the wrong place to mention this.)

debdutdeb commented 4 months ago

I have an odd update.. If I removed the domain name from the ingress files then the ingress server starts working.. I am guessing this has something to do with dns.

This is the exact behaviour we're seeing right now, chart 4.10, app 1.10

We've been at it for hours