linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Pods are suddenly getting stuck in crash loop (edge-20.9.1) #4973

Closed Abrishges closed 3 years ago

Abrishges commented 3 years ago

Bug Report

What is the issue?

Upgrading Linkerd from stable-2.7.1 to edge-20.9.1 resulted in application containers going into a crash loop. The crash loop hit many of our application pods across many regions, but within any given namespace it affected only a few pods.

How can it be reproduced?

  1. Upgrade Linkerd to edge-20.9.1.
  2. Redeploy your application; the crash loop may occur at random in a few of your pods (the steps are sketched below).
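
A minimal sketch of these steps, assuming the CLI-driven upgrade path and a Deployment named appxyz-dpl in the appxyz namespace (adjust names to your environment):

```sh
# 1. Upgrade the control plane to the edge release (CLI already on edge-20.9.1).
linkerd upgrade | kubectl apply -f -

# 2. Roll the workload so its pods are re-injected with the new proxy.
kubectl rollout restart deployment/appxyz-dpl -n appxyz

# 3. Watch for pods that never reach 2/2 Ready or enter CrashLoopBackOff.
kubectl get pods -n appxyz -w
```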

Logs, error output, etc

NAME                               READY   STATUS             RESTARTS   AGE
appxyz-dpl-88f47658f-2h8jf   1/2     Running            0          4d1h
appxyz-dpl-88f47658f-4msjg   1/2     Running            0          4d1h
appxyz-dpl-88f47658f-655n7   1/2     Running            0          4d1h
appxyz-dpl-88f47658f-7j49k   1/2     Running            0          3d7h
appxyz-dpl-88f47658f-dh6ct   1/2     CrashLoopBackOff   413        4d1h
appxyz-dpl-88f47658f-drsht   1/2     Running            0          4d1h
appxyz-dpl-88f47658f-g2gp2   1/2     Running            0          3d3h
appxyz-dpl-88f47658f-gl9gv   1/2     Running            0          4d1h
appxyz-dpl-88f47658f-k7fmh   1/2     Running            0          3d7h
appxyz-dpl-88f47658f-m2frb   1/2     Running            0          3d7h
appxyz-dpl-88f47658f-tj9sv   1/2     Running            0          4d1h

pod describe

Events:
  Type     Reason     Age                     From                                       Message
  ----     ------     ----                    ----                                       -------
  Normal   Pulled     36m (x406 over 29h)     kubelet, aks-services-15901661-vmss00000a  Container image "prodwus2.azurecr.io/services.appxyz:1.0.2009.0903" already present on machine
  Warning  Unhealthy  26m (x1225 over 29h)    kubelet, aks-services-15901661-vmss00000a  Liveness probe failed: Get https://172.21.12.125:10443/v1/health/live: dial tcp 172.21.12.125:10443: connect: connection refused
  Warning  BackOff    6m18s (x4980 over 29h)  kubelet, aks-services-15901661-vmss00000a  Back-off restarting failed container
  Warning  Unhealthy  77s (x4103 over 29h)    kubelet, aks-services-15901661-vmss00000a  Readiness probe failed: Get https://172.21.12.125:10443/v1/health/ready: dial tcp 172.21.12.125:10443: connect: connection refused
  Containers:
  appxyz-ctr:
    Container ID:   docker://e894dc67b7f38728eb70c92d6a3aee0352ee30cef1614242b82f3c3760ff382e
    Image:          prodwus2.azurecr.io/services.appxyz:1.0.2009.0903
    Image ID:       docker-pullable://prodwus2.azurecr.io/services.appxyz@sha256:9db1b7266775e1450f9cf32841c263b056887b8a53bea4a30e6b41f1d5aa10a9
    Ports:          443/TCP, 10443/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Mon, 14 Sep 2020 19:06:46 -0700
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 14 Sep 2020 18:59:53 -0700
      Finished:     Mon, 14 Sep 2020 19:01:41 -0700
    Ready:          False
    Restart Count:  414
  linkerd-proxy:
    Container ID:   docker://30330e1f9687fd03c5df9f9535a9069dc77b5d3de3420ed6183c8c92275e5485
    Image:          gcr.io/linkerd-io/proxy:edge-20.9.1
    Image ID:       docker-pullable://gcr.io/linkerd-io/proxy@sha256:fb0e27e7be6d5464f8a719e5ff73b32e3f97f304b39aecf54f76b9cebcac6752
    Ports:          4143/TCP, 4191/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 10 Sep 2020 17:15:13 -0700
    Ready:          True
    Restart Count:  0

proxy container log

 kubectl logs appxyz-dpl-88f47658f-dh6ct -c linkerd-proxy -n appxyz
time="2020-09-11T00:15:13Z" level=info msg="running version edge-20.9.1"
thread 'main' panicked at 'inbound failed: Transport endpoint is not connected (os error 107)', /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/macros.rs:16:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

application container log

Content root path: /app
Now listening on: https://0.0.0.0:10443
Now listening on: https://0.0.0.0:443
Now listening on: http://0.0.0.0:8080
Application started. Press Ctrl+C to shut down.
{"ver":"2.1","name":"CommonWebApi.ApplicationLifetimeEvent","time":"2020-09-15T02:08:39.4993808Z","epoch":"1","seqNum":7,"os":"Unix","osVer":"4.15.0.1095","ext":{"cloud":{"ver":"1.1","name":"appxyz","role":"appxyzfd","roleInstance":"appxyz-dpl-88f47658f-dh6ct","location":"westus2","environment":"Stress"},"sll":{"libVer":"5.2.38","level":3}},"data":{"baseType":"Ms.CorrelationContext.CorrelationContext","baseData":{},"message":"Application started"}}
{"ver":"2.1","name":"HttpClientLoggingMessageHandlerEvents.ExternalRequestCompleted","time":"2020-09-15T02:08:39.5097412Z","epoch":"1","seqNum":9,"os":"Unix","osVer":"4.15.0.1095","cV":"NEBgx7bM80K2QHy5N2DiSA.1","ext":{"cloud":{"ver":"1.1","name":"appxyz","role":"appxyzfd","roleInstance":"appxyz-dpl-88f47658f-dh6ct","location":"westus2","environment":"Stress"},"sll":{"libVer":"5.2.38","level":3}},"data":{"baseType":"Ms.Qos.OutgoingServiceRequest","baseData":{"operationName":"ExternalRequest","targetUri":"https://auth-svc.auth:10443/v1/keys","latencyMs":24,"succeeded":true,"requestMethod":"GET","protocol":"HTTP","protocolStatusCode":"OK","dependencyOperationName":"JsonWebKeyResolver","dependencyName":"auth-svc.auth","requestStatus":3},"friendlyName":"JsonWebKeyResolver","message":"GET https://auth-svc.auth:10443/v1/keys : OK (24 ms)"}}
Application is shutting down... 

linkerd check output

PS C:\Users\abwodaji\work> xCloud linkerd check commandrelay                                                             
kubernetes-api                                                                                                           
--------------                                                                                                           
√ can initialize the client                                                                                              
√ can query the Kubernetes API                                                                                           

kubernetes-version                                                                                                       
------------------                                                                                                       
√ is running the minimum Kubernetes API version                                                                          
√ is running the minimum kubectl version                                                                                 

linkerd-existence                                                                                                        
-----------------                                                                                                        
√ 'linkerd-config' config map exists                                                                                     
√ heartbeat ServiceAccount exist                                                                                         
√ control plane replica sets are ready                                                                                   
√ no unschedulable pods                                                                                                  
√ controller pod is running                                                                                              
√ can initialize the client                                                                                              
√ can query the control plane API                                                                                        

linkerd-config                                                                                                           
--------------                                                                                                           
√ control plane Namespace exists                                                                                         
√ control plane ClusterRoles exist                                                                                       
√ control plane ClusterRoleBindings exist                                                                                
√ control plane ServiceAccounts exist                                                                                    
√ control plane CustomResourceDefinitions exist                                                                          
√ control plane MutatingWebhookConfigurations exist                                                                      
√ control plane ValidatingWebhookConfigurations exist                                                                    
√ control plane PodSecurityPolicies exist                                                                                

linkerd-identity                                                                                                         
----------------                                                                                                         
√ certificate config is valid                                                                                            
√ trust roots are using supported crypto algorithm                                                                       
√ trust roots are within their validity period                                                                           
√ trust roots are valid for at least 60 days                                                                             
√ issuer cert is using supported crypto algorithm                                                                        
√ issuer cert is within its validity period                                                                              
√ issuer cert is valid for at least 60 days                                                                              
√ issuer cert is issued by the trust root                                                                                

linkerd-identity-data-plane                                                                                              
---------------------------                                                                                              
√ data plane proxies certificate match CA                                                                                

linkerd-api                                                                                                              
-----------                                                                                                              
√ control plane pods are ready                                                                                           
√ control plane self-check                                                                                               
√ [kubernetes] control plane can talk to Kubernetes                                                                      
√ [prometheus] control plane can talk to Prometheus                                                                      
√ tap api service is running                                                                                             

linkerd-version                                                                                                          
---------------                                                                                                          
√ can determine the latest version                                                                                       
‼ cli is up-to-date                                                                                                      
    is running version 2.7.1 but the latest stable version is 2.8.1                                                      
    see https://linkerd.io/checks/#l5d-version-cli for hints                                                             

linkerd-data-plane                                                                                                       
------------------                                                                                                       
√ data plane namespace exists                                                                                            
√ data plane proxies are ready                                                                                           
√ data plane proxy metrics are present in Prometheus                                                                     
‼ data plane is up-to-date                                                                                               
    commandrelay/commandrelay-dpl-88f47658f-7j49k: is running version 20.9.1 but the latest edge version is 20.9.2       
    see https://linkerd.io/checks/#l5d-data-plane-version for hints                                                      
‼ data plane and cli versions match                                                                                      
    commandrelay/commandrelay-dpl-88f47658f-2h8jf running edge-20.9.1 but cli running stable-2.7.1                       
    see https://linkerd.io/checks/#l5d-data-plane-cli-version for hints                                                  

linkerd-ha-checks                                                                                                        
-----------------                                                                                                        
√ pod injection disabled on kube-system                                                                                  

Status check results are √                                                                                               

Environment

Possible solution

I think the linkerd-proxy containers are crashing and not being restarted by their liveness probes. This breaks connectivity for the service container and can send it into a crash loop. Could this be related to the Linkerd liveness probe endpoint being changed in the edge-20.4.4 release?

To mitigate this, we restarted the linkerd-proxy container and the pod came out of the crash loop: `kubectl exec -it appxyz-dpl-88f47658f-dh6ct -n appxyz -c linkerd-proxy -- /bin/sh -c "kill 1"`
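
A rough sketch of applying that mitigation to every crash-looping pod in the namespace (assumes a bash shell; the pod and namespace names are the ones from the output above):

```sh
# Kill PID 1 of the linkerd-proxy container in each CrashLoopBackOff pod;
# the kubelet restarts the proxy container and the app's probes recover.
for pod in $(kubectl get pods -n appxyz --no-headers | awk '/CrashLoopBackOff/ {print $1}'); do
  kubectl exec "$pod" -n appxyz -c linkerd-proxy -- /bin/sh -c "kill 1"
done
```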

Additional context

zaharidichev commented 3 years ago

@Abrishges I tried to reproduce the problem and could not. You are right that the endpoints changed. This should not be a problem, though, because the pods get re-injected with the correct endpoints specified. Have any of these failing pods been injected using the --manual option?
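
One way to check how a pod was injected is to look at the annotations the injector stamps onto it; this sketch assumes the linkerd.io/created-by annotation is present (the CLI sets it on manual injection, the proxy-injector webhook on auto-injection):

```sh
# Inspect the injection provenance of a failing pod (names from the output above).
kubectl get pod appxyz-dpl-88f47658f-dh6ct -n appxyz -o yaml | grep -i "linkerd.io/created-by"
```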

Abrishges commented 3 years ago

@zaharidichev the proxies are auto-injected. I found this proxy log from a crash-looping pod:

time="2020-09-11T18:57:53Z" level=info msg="running version edge-20.9.1"
thread 'main' panicked at 'inbound failed: Transport endpoint is not connected (os error 107)', /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/macros.rs:16:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
appabc-dpl-54bdd869df-bvlc5   1/2     CrashLoopBackOff   636        4d22h
appabc-dpl-54bdd869df-dm2wg   2/2     Running            0          4d22h
appabc-dpl-54bdd869df-fmdfl   1/2     CrashLoopBackOff   969        3d20h
appabc-dpl-54bdd869df-gv7wr   1/2     Running            672        4d22h
appabc-dpl-54bdd869df-vv9pm   2/2     Running            0          3d20h
Abrishges commented 3 years ago

@zaharidichev @ihcsim In the edge-20.4.4 release, the Linkerd proxy liveness probe was changed to use the new /live endpoint instead of the /metrics endpoint. However, on the edge release I am using (edge-20.9.1), kubectl describe and the logs from the debugging sidecar show that it still uses the old /metrics liveness endpoint (see the describe output below; a quick way to check the rendered probe path is sketched after it).

 kubectl get pods -n appabc
NAME                                 READY   STATUS    RESTARTS   AGE
appabc-dpl-85bfc86c66-lf4w5   2/3     Running   10         104m
    Container ID:   docker://8ac3b19430c2150473cb4ba4bb925861032ca2471f64e7c01d9050f7cee5fa60
    Image:          xcldprodwus2.azurecr.io/services.appabc:1.0.2009.1101
    Image ID:       docker-pullable://xcldprodwus2.azurecr.io/services.appabc@sha256:a7261a6a980817ada39f0dfbc4ec4a7b0999af1aecdca4018972371594222748
    Ports:          443/TCP, 10443/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 15 Sep 2020 12:40:21 -0700
      Finished:     Tue, 15 Sep 2020 12:42:10 -0700
    Ready:          False
    Restart Count:  15

 linkerd-debug:                                                                                                                         
   Container ID:   docker://b80905fc18e4fec0a82fed929c1bc34a9d509f157a903ed7c7ebe8ea7abd08af                                          
   Image:          gcr.io/linkerd-io/debug:edge-20.9.1                                                                                
   Image ID:       docker-pullable://gcr.io/linkerd-io/debug@sha256:8750d53d6470abf98f063b0dbd73c122bac6e780f7078bd3f6c7610c71972c57  
   Port:           <none>                                                                                                             
   Host Port:      <none>                                                                                                             
   State:          Running                                                                                                            
     Started:      Tue, 15 Sep 2020 10:36:52 -0700                                                                                    
   Ready:          True                                                                                                               
   Restart Count:  0                                                                                                                  
   Environment:    <none>                                                                                                             
   Mounts:                                                                                                                            
     /var/run/secrets/kubernetes.io/serviceaccount from default-token-md6mc (ro)                                                      
 linkerd-proxy:
    Container ID:   docker://f5d2cdd227963acf8975c11444f28c45dda34d42f06165eaf59fa6908acf6f4e
    Image:          gcr.io/linkerd-io/proxy:edge-20.9.1
    Image ID:       docker-pullable://gcr.io/linkerd-io/proxy@sha256:fb0e27e7be6d5464f8a719e5ff73b32e3f97f304b39aecf54f76b9cebcac6752
    Ports:          4143/TCP, 4191/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Tue, 15 Sep 2020 10:36:53 -0700
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     125m
      memory:  128Mi
    Requests:
      cpu:      125m
      memory:   128Mi
    Liveness:   http-get http://:4191/metrics delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:4191/ready delay=2s timeout=1s period=10s #success=1 #failure=3                                                                                                                  
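
A quick way to confirm which probe path the injector actually rendered for the proxy container (a sketch; the pod name and namespace are the ones from the describe output above):

```sh
# Print the liveness probe path configured on the linkerd-proxy container.
kubectl get pod appabc-dpl-85bfc86c66-lf4w5 -n appabc \
  -o jsonpath='{.spec.containers[?(@.name=="linkerd-proxy")].livenessProbe.httpGet.path}'
# edge-20.4.4 and later templates use /live; older templates used /metrics.
```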

Logs from debugging sidecar

I also don't see any readiness probe calls in the debugging sidecar logs. Is that expected? (A capture sketch follows the packet dissection below.)

            [Group: Sequence]
        Request Method: GET
        Request URI: /metrics
        Request Version: HTTP/1.1
    host: 172.21.14.33:4191\r\n
    user-agent: Prometheus/2.15.2\r\n
    accept: application/openmetrics-text; version=0.0.1,text/plain;version=0.0.4;q=0.5,*/*;q=0.1\r\n
    accept-encoding: gzip\r\n
    x-prometheus-scrape-timeout-seconds: 15.000000\r\n
    l5d-dst-canonical: 172.21.14.33:4191\r\n
    \r\n
    [Full request URI: http://172.21.14.33:4191/metrics]
    [HTTP request 15/15]
    [Prev request in frame: 2006]
Frame 2044: 189 bytes on wire (1512 bits), 189 bytes captured (1512 bits) on interface 0
    Interface id: 0 (any)
        Interface name: any
    Encapsulation type: Linux cooked-mode capture (25)
    Arrival Time: Sep 15, 2020 19:36:39.775511034 UTC
    [Time shift for this packet: 0.000000000 seconds]
    Epoch Time: 1600198599.775511034 seconds
    [Time delta from previous captured frame: 0.000205298 seconds]
    [Time delta from previous displayed frame: 0.832482781 seconds]
    [Time since reference or first frame: 216.949834200 seconds]
    Frame Number: 2044
    Frame Length: 189 bytes (1512 bits)
    Capture Length: 189 bytes (1512 bits)
    [Frame is marked: False]
    [Frame is ignored: False]
    [Protocols in frame: sll:ethertype:ip:tcp:http]
Linux cooked capture
    Packet type: Unicast to us (0)
    Link-layer address type: 1
    Link-layer address length: 6
    Source: Microsof_c3:84:f3 (00:0d:3a:c3:84:f3)
    Unused: 0000
    Protocol: IPv4 (0x0800)
Internet Protocol Version 4, Src: 172.21.14.29, Dst: 172.21.14.33
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
        0000 00.. = Differentiated Services Codepoint: Default (0)
        .... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
    Total Length: 173
    Identification: 0x5c4a (23626)
    Flags: 0x4000, Don't fragment
        0... .... .... .... = Reserved bit: Not set
        .1.. .... .... .... = Don't fragment: Set
        ..0. .... .... .... = More fragments: Not set
        ...0 0000 0000 0000 = Fragment offset: 0
    Time to live: 64
    Protocol: TCP (6)
    Header checksum: 0x6998 [validation disabled]
    [Header checksum status: Unverified]
    Source: 172.21.14.29
    Destination: 172.21.14.33
Transmission Control Protocol, Src Port: 43276, Dst Port: 4191, Seq: 1, Ack: 1, Len: 121
    Source Port: 43276
    Destination Port: 4191
    [Stream index: 173]
    [TCP Segment Len: 121]
    Sequence number: 1    (relative sequence number)
    [Next sequence number: 122    (relative sequence number)]
    Acknowledgment number: 1    (relative ack number)
    1000 .... = Header Length: 32 bytes (8)
    Flags: 0x018 (PSH, ACK)
        000. .... .... = Reserved: Not set
        ...0 .... .... = Nonce: Not set
        .... 0... .... = Congestion Window Reduced (CWR): Not set
        .... .0.. .... = ECN-Echo: Not set
        .... ..0. .... = Urgent: Not set
        .... ...1 .... = Acknowledgment: Set
        .... .... 1... = Push: Set
        .... .... .0.. = Reset: Not set
        .... .... ..0. = Syn: Not set
        .... .... ...0 = Fin: Not set
        [TCP Flags: ???????AP???]
    Window size value: 126
    [Calculated window size: 64512]
    [Window size scaling factor: 512]
    Checksum: 0x7508 [unverified]
    [Checksum Status: Unverified]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
        TCP Option - No-Operation (NOP)
            Kind: No-Operation (1)
        TCP Option - No-Operation (NOP)
            Kind: No-Operation (1)
        TCP Option - Timestamps: TSval 67758210, TSecr 2808764296
            Kind: Time Stamp Option (8)
            Length: 10
            Timestamp value: 67758210
            Timestamp echo reply: 2808764296
    [SEQ/ACK analysis]
        [iRTT: 0.000044100 seconds]
        [Bytes in flight: 121]
        [Bytes sent since last PSH flag: 121]
    [Timestamps]
        [Time since first frame in this TCP stream: 0.000249398 seconds]
        [Time since previous frame in this TCP stream: 0.000205298 seconds]
    TCP payload (121 bytes)
Hypertext Transfer Protocol
    GET /metrics HTTP/1.1\r\n
        [Expert Info (Chat/Sequence): GET /metrics HTTP/1.1\r\n]
            [GET /metrics HTTP/1.1\r\n]
            [Severity level: Chat]
            [Group: Sequence]
        Request Method: GET
        Request URI: /metrics
        Request Version: HTTP/1.1
    Host: 172.21.14.33:4191\r\n
    User-Agent: kube-probe/1.16\r\n
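
To narrow a capture like the one above down to probe traffic only, here is a hedged sketch using the debug sidecar (the linkerd-debug image is documented to ship tshark; pod and namespace names are illustrative):

```sh
# Capture only traffic to the proxy admin port (4191) and print the request
# path and user agent, so kube-probe requests stand out from Prometheus scrapes.
kubectl exec -it appabc-dpl-85bfc86c66-lf4w5 -n appabc -c linkerd-debug -- \
  tshark -i any -f "tcp port 4191" -Y http.request -T fields \
    -e http.request.uri -e http.user_agent
```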
zaharidichev commented 3 years ago

Alright, this is quite strange to me. Your proxy image is the correct one, and on this version of Linkerd the proxy template has the correct endpoint. Somehow, though, your liveness endpoint is still pointing to /metrics, and I wonder how that is possible. Did you wait for the control plane components upgrade to fully finish before you rolled your application deployments? Could it be the case that you have two versions of the injector running somehow? @ihcsim Any other ideas?
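
A couple of sanity checks for that scenario (a sketch; the label and namespace assume a default control-plane install in the linkerd namespace):

```sh
# Is there exactly one proxy-injector, and is it running the edge image?
kubectl get pods -n linkerd -l linkerd.io/control-plane-component=proxy-injector \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'

# Are there stale mutating webhook configurations left over from an older install?
kubectl get mutatingwebhookconfigurations | grep -i linkerd
```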

Abrishges commented 3 years ago

@zaharidichev The Linkerd components upgrade had fully completed a day before, and pods start to crash loop during deployment or pod restarts. The Linkerd proxy seems to fail to start because of a strange OS panic (os error 107). Any idea why this is happening?

zaharidichev commented 3 years ago

@Abrishges I really do not have any concrete idea why this is happening. It might be environment related, or the upgrade process might have gone wrong for some reason... Nevertheless, without a concrete reproduction it is impossible for us to figure out what the problem is.

Abrishges commented 3 years ago

@zaharidichev The proxy container log shows OS error 107. The error seems to come from the Rust portion of Linkerd.

kubectl logs auth-dpl-5b7cd8bd7-h8nl9 -c linkerd-proxy -n auth
time="2020-09-17T11:26:54Z" level=info msg="running version edge-20.9.1"
thread 'main' panicked at 'inbound failed: Transport endpoint is not connected (os error 107)', /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/macros.rs:16:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

ihcsim commented 3 years ago

@Abrishges The kubectl describe output you posted at https://github.com/linkerd/linkerd2/issues/4973#issuecomment-692949190 suggests that the proxy upgrade didn't complete as expected. The liveness probe endpoint should have been updated to /live. Did you restart your workload with something like kubectl rollout restart? I just want to make sure that there isn't something like a config.linkerd.io/proxy-version annotation on your workloads or Linkerd control plane that ends up upgrading only the proxy container image without applying the latest resource YAML.

Also, make sure that your proxy injector is already upgraded to edge-20.9.1 before re-injecting the new proxy into your workload. The proxy injector's logs should also show events where workloads are successfully re-injected with the new proxy.
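
A sketch of those checks (workload and label names are illustrative; adjust to your cluster):

```sh
# Re-roll the workload so the webhook re-injects it with the current proxy template.
kubectl rollout restart deployment/appabc-dpl -n appabc

# Look for a pinned proxy version that would override the injector's default.
kubectl get deployment appabc-dpl -n appabc -o yaml | grep -i "config.linkerd.io/proxy-version"

# Check the injector logs for the re-injection events.
kubectl logs -n linkerd -l linkerd.io/control-plane-component=proxy-injector --tail=100 | grep -i inject
```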

Abrishges commented 3 years ago

@ihcsim @zaharidichev Sorry, my bad: that kubectl describe output was taken after I rolled back the Linkerd edge release, due to the pod CrashLoopBackOff that occurred while the pods were on the edge release. Yesterday I ran another test in our Stress cluster. I have confirmed that all of the control plane and the proxies are on edge-20.9.1 and that the liveness probe points to the right endpoint (/live), yet pod crash loops continue to happen randomly for a few of our services (a data-plane check sketch follows the version listing below).

 $linkerd version
Client version: edge-20.9.1
Server version: edge-20.9.1
xCloud proxy version appabc
NameSpace      POD_NAME                           PROXY
appabc   appabc-dpl-88f47658f-2ht66   edge-20.9.1
appabc   appabc-dpl-88f47658f-57xnc   edge-20.9.1
appabc   appabc-dpl-88f47658f-6xsvt   edge-20.9.1
appabc   appabc-dpl-88f47658f-c6t2x   edge-20.9.1
appabc   appabc-dpl-88f47658f-gbsgj   edge-20.9.1
appabc   appabc-dpl-88f47658f-k8hq9   edge-20.9.1
appabc   appabc-dpl-88f47658f-mp2sb   edge-20.9.1
appabc   appabc-dpl-88f47658f-ndjf4   edge-20.9.1
appabc   appabc-dpl-88f47658f-v2xb5   edge-20.9.1
appabc   appabc-dpl-88f47658f-v44np   edge-20.9.1
appabc   appabc-dpl-88f47658f-vrj4c   edge-20.9.1
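
For completeness, a hedged way to validate the data plane in that namespace from the CLI (assumes the edge-20.9.1 CLI shown above):

```sh
# Run the data-plane checks against the appabc namespace only.
linkerd check --proxy -n appabc
```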

CrashLoopBackOff happens randomly once in a while, and the logs from the proxy show the same error: OS error 107, which again seems to come from the Rust portion of Linkerd.

 kubectl logs appabc-dpl-54bdd869df-fmdfl -c linkerd-proxy -n appabc
time="2020-09-11T18:57:53Z" level=info msg="running version edge-20.9.1"
thread 'main' panicked at 'inbound failed: Transport endpoint is not connected (os error 107)', /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/macros.rs:16:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Abrishges commented 3 years ago

CrashLoopBackOff pods from today. I also noticed crash loops for services with a load balancer (external IP).

kubectl get pods --all-namespaces |sls "crashloop"

appabc          appabc-dpl-565f4d7749-6wrk5                 1/2     CrashLoopBackOff   41                    32h
appabc      appabc-dpl-c798876f8-fndr9                        1/2     CrashLoopBackOff   187                  21h

The logs from the proxies are the same:

2020-09-22T10:16:27.860550589Z time="2020-09-22T10:16:27Z" level=info msg="running version edge-20.9.1"
2020-09-23T16:08:57.587533356Z thread 'main' panicked at 'inbound failed: Transport endpoint is not connected (os error 107)', /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/macros.rs:16:9
2020-09-23T16:08:57.587563756Z note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Abrishges commented 3 years ago

@zaharidichev I changed my proxy log level to debug mode and found the following error. Any idea or suggestion about "Connection reset by peer (os error 104)"?
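
For reference, one way to raise the proxy log level is via the config.linkerd.io/proxy-log-level annotation on the pod template (a sketch; the deployment name is illustrative, and patching the template triggers a rollout so new pods pick it up):

```sh
# Set the injected proxy's log level to debug for this workload.
kubectl patch deployment appabc-dpl -n appabc --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"debug"}}}}}'
```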

kubectl get pods -n appabc -o wide |sls "crash"
appabc-dpl-766fd87c4b-nkfrx   1/2     CrashLoopBackOff   557        2d8h   172.21.13.250 

Proxy logs

kubectl logs appabc-dpl-766fd87c4b-nkfrx -c linkerd-proxy -n appabc --timestamps --tail 10
2020-09-24T16:22:49.132838685Z {"timestamp":"[202307.811706416s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:34354","name":"accept"}],"span":{"peer.addr":"172.21.13.250:34354","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:22:49.133266289Z {"timestamp":"[202307.812194021s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:47948","name":"accept"}],"span":{"peer.addr":"172.21.13.250:47948","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:29:50.742036662Z {"timestamp":"[202729.420903794s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:57222","name":"accept"}],"span":{"peer.addr":"172.21.13.250:57222","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:29:50.742617768Z {"timestamp":"[202729.421541300s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:48620","name":"accept"}],"span":{"peer.addr":"172.21.13.250:48620","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:31:41.132704436Z {"timestamp":"[202839.811467867s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:46390","name":"accept"}],"span":{"peer.addr":"172.21.13.250:46390","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:31:41.133148240Z {"timestamp":"[202839.812044372s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:43416","name":"accept"}],"span":{"peer.addr":"172.21.13.250:43416","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:38:38.721208212Z {"timestamp":"[203257.400047843s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:48308","name":"accept"}],"span":{"peer.addr":"172.21.13.250:48308","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:38:38.721500114Z {"timestamp":"[203257.400457847s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:33752","name":"accept"}],"span":{"peer.addr":"172.21.13.250:33752","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:40:19.327763927Z {"timestamp":"[203358.6597758s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:56950","name":"accept"}],"span":{"peer.addr":"172.21.13.250:56950","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:40:19.328254532Z {"timestamp":"[203358.7187264s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:40416","name":"accept"}],"span":{"peer.addr":"172.21.13.250:40416","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}

Application container log

 kubectl logs appabc-dpl-766fd87c4b-nkfrx -c appabc-ctr -n appabc --timestamps --tail 10
2020-09-24T16:57:31.229300335Z {"ver":"2.1","name":"GsLogging.KeyVaultConfigSourceConfigEvent","time":"2020-09-24T16:57:30.5172972Z","epoch":"1","seqNum":2,"os":"Unix","osVer":"4.15.0.1096","ext":{"cloud":{"ver":"1.1","name":"Microsoft.GameStreaming.Services.Common.WebApi","role":"Unknown","roleInstance":"appabc-dpl-766fd87c4b-nkfrx","location":"Unknown","environment":"Unknown"},"sll":{"libVer":"5.2.38","level":3}},"data":{"baseType":"Ms.CorrelationContext.CorrelationContext","baseData":{},"appId":"dddfaf4f-d6e9-467a-a702-ddd963ccdb72","appCertPath":"/etc/certs/kvspicert.pfx","keyVaultBaseUrl":"https://gssv-msft-strs-base-regs.vault.azure.net/"}}
2020-09-24T16:57:31.229358736Z {"ver":"2.1","name":"GsLogging.KeyVaultConfigSourceConfigEvent","time":"2020-09-24T16:57:30.7296887Z","epoch":"1","seqNum":3,"os":"Unix","osVer":"4.15.0.1096","ext":{"cloud":{"ver":"1.1","name":"Microsoft.GameStreaming.Services.Common.WebApi","role":"Unknown","roleInstance":"appabc-dpl-766fd87c4b-nkfrx","location":"Unknown","environment":"Unknown"},"sll":{"libVer":"5.2.38","level":3}},"data":{"baseType":"Ms.CorrelationContext.CorrelationContext","baseData":{},"appId":"dddfaf4f-d6e9-467a-a702-ddd963ccdb72","appCertPath":"/etc/certs/kvspicert.pfx","keyVaultBaseUrl":"https://gssv-msft-strs-wus2-regs.vault.azure.net/"}}
2020-09-24T16:57:41.226619739Z {"ver":"2.1","name":"CommonWebApi.TlsConfigEvent","time":"2020-09-24T16:57:41.1562315Z","epoch":"1","seqNum":5,"os":"Unix","osVer":"4.15.0.1096","ext":{"cloud":{"ver":"1.1","name":"appabc","role":"appabcfd","roleInstance":"appabc-dpl-766fd87c4b-nkfrx","location":"westus2","environment":"Stress"},"sll":{"libVer":"5.2.38","level":3}},"data":{"baseType":"Ms.CorrelationContext.CorrelationContext","baseData":{},"message":"Setting up HTTPS configuration with the specified TLS certificate.","tlsCertName":"Gssv-Registration-Endpoint-Cert","servicePort":"443"}}
2020-09-24T16:57:42.319642984Z Hosting environment: Stress
2020-09-24T16:57:42.319683884Z Content root path: /app
2020-09-24T16:57:42.319690784Z Now listening on: https://0.0.0.0:10443
2020-09-24T16:57:42.319694284Z Now listening on: https://0.0.0.0:443
2020-09-24T16:57:42.319697984Z Application started. Press Ctrl+C to shut down.
2020-09-24T16:57:43.228887888Z {"ver":"2.1","name":"CommonWebApi.ApplicationLifetimeEvent","time":"2020-09-24T16:57:42.2213163Z","epoch":"1","seqNum":7,"os":"Unix","osVer":"4.15.0.1096","ext":{"cloud":{"ver":"1.1","name":"appabc","role":"appabcfd","roleInstance":"registration-dpl-766fd87c4b-nkfrx","location":"westus2","environment":"Stress"},"sll":{"libVer":"5.2.38","level":3}},"data":{"baseType":"Ms.CorrelationContext.CorrelationContext","baseData":{},"message":"Application started"}}
2020-09-24T16:57:43.228921889Z {"ver":"2.1","name":"HttpClientLoggingMessageHandlerEvents.ExternalRequestCompleted","time":"2020-09-24T16:57:42.3246283Z","epoch":"1","seqNum":9,"os":"Unix","osVer":"4.15.0.1096","cV":"Ix4s+C5b9UW/wV9zafOq1w.1","ext":{"cloud":{"ver":"1.1","name":"registration","role":"appabcfd","roleInstance":"appabc-dpl-766fd87c4b-nkfrx","location":"westus2","environment":"Stress"},"sll":{"libVer":"5.2.38","level":3}},"data":{"baseType":"Ms.Qos.OutgoingServiceRequest","baseData":{"operationName":"ExternalRequest","targetUri":"https://auth-svc.auth:10443/v1/keys","latencyMs":195,"succeeded":true,"requestMethod":"GET","protocol":"HTTP","protocolStatusCode":"OK","dependencyOperationName":"JsonWebKeyResolver","dependencyName":"auth-svc.auth","requestStatus":3},"friendlyName":"JsonWebKeyResolver","message":"GET https://auth-svc.auth:10443/v1/keys : OK (195 ms)"}}
stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.