@Abrishges I tried to reproduce the problem and could not. You are right that the endpoints changed. This should not be a problem though, because the pods get reinjected with the correct endpoints specified. Were any of these failing pods injected by using the --manual option?
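One way to check that is from the injection annotations; a minimal sketch, assuming the standard linkerd.io/created-by and linkerd.io/proxy-version annotations are present on injected pods:
# Shows who injected the proxy (linkerd/proxy-injector vs. linkerd/cli, i.e. --manual)
# and the proxy version recorded at injection time.
kubectl get pod <failing-pod> -n <namespace> \
  -o jsonpath='{.metadata.annotations.linkerd\.io/created-by}{"\n"}{.metadata.annotations.linkerd\.io/proxy-version}{"\n"}'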
@zaharidichev The proxies are auto-injected. I found this proxy log from a crash-looping pod:
time="2020-09-11T18:57:53Z" level=info msg="running version edge-20.9.1"
thread 'main' panicked at 'inbound failed: Transport endpoint is not connected (os error 107)', /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/macros.rs:16:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
appabc-dpl-54bdd869df-bvlc5 1/2 CrashLoopBackOff 636 4d22h
appabc-dpl-54bdd869df-dm2wg 2/2 Running 0 4d22h
appabc-dpl-54bdd869df-fmdfl 1/2 CrashLoopBackOff 969 3d20h
appabc-dpl-54bdd869df-gv7wr 1/2 Running 672 4d22h
appabc-dpl-54bdd869df-vv9pm 2/2 Running 0 3d20h
@zaharidichev @ihcsim On release edge-20.4.4, the Linkerd proxy liveness endpoint was changed to use the new /live endpoint instead of the /metrics endpoint. However, on the edge release I am using (edge-20.9.1), kubectl describe and the logs from the debug sidecar show it still uses the old /metrics liveness endpoint.
kubectl get pods -n appabc
NAME READY STATUS RESTARTS AGE
appabc-dpl-85bfc86c66-lf4w5 2/3 Running 10 104m
Container ID: docker://8ac3b19430c2150473cb4ba4bb925861032ca2471f64e7c01d9050f7cee5fa60
Image: xcldprodwus2.azurecr.io/services.appabc:1.0.2009.1101
Image ID: docker-pullable://xcldprodwus2.azurecr.io/services.appabc@sha256:a7261a6a980817ada39f0dfbc4ec4a7b0999af1aecdca4018972371594222748
Ports: 443/TCP, 10443/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 15 Sep 2020 12:40:21 -0700
Finished: Tue, 15 Sep 2020 12:42:10 -0700
Ready: False
Restart Count: 15
linkerd-debug:
Container ID: docker://b80905fc18e4fec0a82fed929c1bc34a9d509f157a903ed7c7ebe8ea7abd08af
Image: gcr.io/linkerd-io/debug:edge-20.9.1
Image ID: docker-pullable://gcr.io/linkerd-io/debug@sha256:8750d53d6470abf98f063b0dbd73c122bac6e780f7078bd3f6c7610c71972c57
Port: <none>
Host Port: <none>
State: Running
Started: Tue, 15 Sep 2020 10:36:52 -0700
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-md6mc (ro)
linkerd-proxy:
Container ID: docker://f5d2cdd227963acf8975c11444f28c45dda34d42f06165eaf59fa6908acf6f4e
Image: gcr.io/linkerd-io/proxy:edge-20.9.1
Image ID: docker-pullable://gcr.io/linkerd-io/proxy@sha256:fb0e27e7be6d5464f8a719e5ff73b32e3f97f304b39aecf54f76b9cebcac6752
Ports: 4143/TCP, 4191/TCP
Host Ports: 0/TCP, 0/TCP
State: Running
Started: Tue, 15 Sep 2020 10:36:53 -0700
Ready: True
Restart Count: 0
Limits:
cpu: 125m
memory: 128Mi
Requests:
cpu: 125m
memory: 128Mi
Liveness: http-get http://:4191/metrics delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:4191/ready delay=2s timeout=1s period=10s #success=1 #failure=3
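The probe path each injected pod actually carries can be read straight from the pod spec; a quick sketch, assuming the proxy container is named linkerd-proxy:
# Prints pod name and the liveness probe path configured on the linkerd-proxy container.
# Pods injected from the edge-20.9.1 template should report /live; /metrics indicates an older template.
kubectl get pods -n appabc -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="linkerd-proxy")].livenessProbe.httpGet.path}{"\n"}{end}'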
I also don't see any readiness probe calls in the debug sidecar logs. Is that expected?
[Group: Sequence]
Request Method: GET
Request URI: /metrics
Request Version: HTTP/1.1
host: 172.21.14.33:4191\r\n
user-agent: Prometheus/2.15.2\r\n
accept: application/openmetrics-text; version=0.0.1,text/plain;version=0.0.4;q=0.5,*/*;q=0.1\r\n
accept-encoding: gzip\r\n
x-prometheus-scrape-timeout-seconds: 15.000000\r\n
l5d-dst-canonical: 172.21.14.33:4191\r\n
\r\n
[Full request URI: http://172.21.14.33:4191/metrics]
[HTTP request 15/15]
[Prev request in frame: 2006]
Frame 2044: 189 bytes on wire (1512 bits), 189 bytes captured (1512 bits) on interface 0
Interface id: 0 (any)
Interface name: any
Encapsulation type: Linux cooked-mode capture (25)
Arrival Time: Sep 15, 2020 19:36:39.775511034 UTC
[Time shift for this packet: 0.000000000 seconds]
Epoch Time: 1600198599.775511034 seconds
[Time delta from previous captured frame: 0.000205298 seconds]
[Time delta from previous displayed frame: 0.832482781 seconds]
[Time since reference or first frame: 216.949834200 seconds]
Frame Number: 2044
Frame Length: 189 bytes (1512 bits)
Capture Length: 189 bytes (1512 bits)
[Frame is marked: False]
[Frame is ignored: False]
[Protocols in frame: sll:ethertype:ip:tcp:http]
Linux cooked capture
Packet type: Unicast to us (0)
Link-layer address type: 1
Link-layer address length: 6
Source: Microsof_c3:84:f3 (00:0d:3a:c3:84:f3)
Unused: 0000
Protocol: IPv4 (0x0800)
Internet Protocol Version 4, Src: 172.21.14.29, Dst: 172.21.14.33
0100 .... = Version: 4
.... 0101 = Header Length: 20 bytes (5)
Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
0000 00.. = Differentiated Services Codepoint: Default (0)
.... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
Total Length: 173
Identification: 0x5c4a (23626)
Flags: 0x4000, Don't fragment
0... .... .... .... = Reserved bit: Not set
.1.. .... .... .... = Don't fragment: Set
..0. .... .... .... = More fragments: Not set
...0 0000 0000 0000 = Fragment offset: 0
Time to live: 64
Protocol: TCP (6)
Header checksum: 0x6998 [validation disabled]
[Header checksum status: Unverified]
Source: 172.21.14.29
Destination: 172.21.14.33
Transmission Control Protocol, Src Port: 43276, Dst Port: 4191, Seq: 1, Ack: 1, Len: 121
Source Port: 43276
Destination Port: 4191
[Stream index: 173]
[TCP Segment Len: 121]
Sequence number: 1 (relative sequence number)
[Next sequence number: 122 (relative sequence number)]
Acknowledgment number: 1 (relative ack number)
1000 .... = Header Length: 32 bytes (8)
Flags: 0x018 (PSH, ACK)
000. .... .... = Reserved: Not set
...0 .... .... = Nonce: Not set
.... 0... .... = Congestion Window Reduced (CWR): Not set
.... .0.. .... = ECN-Echo: Not set
.... ..0. .... = Urgent: Not set
.... ...1 .... = Acknowledgment: Set
.... .... 1... = Push: Set
.... .... .0.. = Reset: Not set
.... .... ..0. = Syn: Not set
.... .... ...0 = Fin: Not set
[TCP Flags: ???????AP???]
Window size value: 126
[Calculated window size: 64512]
[Window size scaling factor: 512]
Checksum: 0x7508 [unverified]
[Checksum Status: Unverified]
Urgent pointer: 0
Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
TCP Option - No-Operation (NOP)
Kind: No-Operation (1)
TCP Option - No-Operation (NOP)
Kind: No-Operation (1)
TCP Option - Timestamps: TSval 67758210, TSecr 2808764296
Kind: Time Stamp Option (8)
Length: 10
Timestamp value: 67758210
Timestamp echo reply: 2808764296
[SEQ/ACK analysis]
[iRTT: 0.000044100 seconds]
[Bytes in flight: 121]
[Bytes sent since last PSH flag: 121]
[Timestamps]
[Time since first frame in this TCP stream: 0.000249398 seconds]
[Time since previous frame in this TCP stream: 0.000205298 seconds]
TCP payload (121 bytes)
Hypertext Transfer Protocol
GET /metrics HTTP/1.1\r\n
[Expert Info (Chat/Sequence): GET /metrics HTTP/1.1\r\n]
[GET /metrics HTTP/1.1\r\n]
[Severity level: Chat]
[Group: Sequence]
Request Method: GET
Request URI: /metrics
Request Version: HTTP/1.1
Host: 172.21.14.33:4191\r\n
User-Agent: kube-probe/1.16\r\n
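For reference, a capture like the one above can be taken from the debug sidecar itself; a sketch, assuming tshark is available in the linkerd-debug container and the proxy admin port is 4191:
# Capture traffic to the proxy admin port (liveness, readiness and metrics requests) with full packet detail.
kubectl exec -it appabc-dpl-85bfc86c66-lf4w5 -n appabc -c linkerd-debug -- \
  tshark -i any -f "tcp port 4191" -V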
Alright. This is quite strange to me. Your proxy image is the correct one, and on this version of Linkerd the proxy template has the correct endpoint. Somehow, though, your liveness endpoint is still pointing to /metrics. I wonder how that is possible. Did you wait for the control plane components to finish upgrading fully before you rolled your application deployments? Could it be the case that you have two versions of the injector running somehow? @ihcsim Any other ideas?
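One way to rule out a second or stale injector is to compare the webhook configurations and injector pods; a sketch, assuming the default control-plane namespace and labels:
# There should normally be a single Linkerd injector webhook configuration.
kubectl get mutatingwebhookconfigurations | grep -i linkerd
# Every proxy-injector pod should be running the edge-20.9.1 controller image.
kubectl get pods -n linkerd -l linkerd.io/control-plane-component=proxy-injector \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'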
@zaharidichev The Linkerd components upgrade had fully completed a day earlier, and pods start to crash loop during a deployment or a pod restart. The Linkerd proxy seems to fail to start due to a weird OS panic (os error 107). Any idea why this is happening?
@Abrishges I really do not have any concrete idea why this is happening. It might be environment related, or the upgrade process might have gone wrong for some reason... Nevertheless, if we do not have a concrete reproduction, it is impossible for us to figure out what the problem is.
@zaharidichev The proxy container log shows OS error 107. The error seems related to the Rust portion of Linkerd.
kubectl logs auth-dpl-5b7cd8bd7-h8nl9 -c linkerd-proxy -n auth
time="2020-09-17T11:26:54Z" level=info msg="running version edge-20.9.1"
thread 'main' panicked at 'inbound failed: Transport endpoint is not connected (os error 107)', /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/macros.rs:16:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
@Abrishges The kubectl describe output you posted at https://github.com/linkerd/linkerd2/issues/4973#issuecomment-692949190 suggests that the proxy upgrade didn't complete as expected. The liveness probe endpoint should have been updated to /live. Did you restart your workload with something like kubectl rollout restart? I just want to make sure that there isn't something like a config.linkerd.io/proxy-version annotation in your workloads or Linkerd control plane that ends up only upgrading the proxy container image, but not the latest resource YAML.
Also, make sure that your proxy injector is already upgraded to edge-20.9.1 before re-injecting the new proxy into your workload. The proxy injector's logs should also show events where workloads are successfully re-injected with the new proxy.
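A sketch of those checks, assuming a Deployment named appabc-dpl and the default proxy-injector Deployment and container names in the linkerd namespace:
# Look for proxy overrides (e.g. a config.linkerd.io/proxy-version annotation) on the workload's pod template.
kubectl get deploy appabc-dpl -n appabc -o jsonpath='{.spec.template.metadata.annotations}'
# Roll the workload so the upgraded injector re-injects the proxy from the current template.
kubectl rollout restart deploy/appabc-dpl -n appabc
# Watch the injector log the re-injection events.
kubectl logs -n linkerd deploy/linkerd-proxy-injector -c proxy-injector --tail=50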
@ihcsim @zaharidichev Sorry, my bad, the output from kubectl describe was taken after I rolled back the Linkerd edge release due to the pod CrashLoopBackOff while the pods were on the edge release. Yesterday I ran another test in our Stress cluster. I have confirmed that all the control plane components and proxies are on edge-20.9.1, and the liveness endpoint also points to the right endpoint, /live, yet the pod crash loop continues to happen randomly for a few of our services.
$ linkerd version
Client version: edge-20.9.1
Server version: edge-20.9.1
xCloud proxy version appabc
NameSpace POD_NAME PROXY
appabc appabc-dpl-88f47658f-2ht66 edge-20.9.1
appabc appabc-dpl-88f47658f-57xnc edge-20.9.1
appabc appabc-dpl-88f47658f-6xsvt edge-20.9.1
appabc appabc-dpl-88f47658f-c6t2x edge-20.9.1
appabc appabc-dpl-88f47658f-gbsgj edge-20.9.1
appabc appabc-dpl-88f47658f-k8hq9 edge-20.9.1
appabc appabc-dpl-88f47658f-mp2sb edge-20.9.1
appabc appabc-dpl-88f47658f-ndjf4 edge-20.9.1
appabc appabc-dpl-88f47658f-v2xb5 edge-20.9.1
appabc appabc-dpl-88f47658f-v44np edge-20.9.1
appabc appabc-dpl-88f47658f-vrj4c edge-20.9.1
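A per-pod listing like the one above can be pulled from the injection annotations; a sketch, assuming the injector sets a linkerd.io/proxy-version annotation on meshed pods:
# Namespace, pod name and injected proxy version for every pod in the namespace.
kubectl get pods -n appabc -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.linkerd\.io/proxy-version}{"\n"}{end}'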
CrashLoopBackOff happens randomly once in a while, and the logs from the proxy show the same error. The proxy container log shows OS error 107. The error seems related to the Rust portion of Linkerd.
kubectl logs appabc-dpl-54bdd869df-fmdfl -c linkerd-proxy -n appabc
time="2020-09-11T18:57:53Z" level=info msg="running version edge-20.9.1"
thread 'main' panicked at 'inbound failed: Transport endpoint is not connected (os error 107)', /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/macros.rs:16:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
CrashLoopBackOff pods from today. I also noticed crash loops for services with a load balancer (external IP):
kubectl get pods --all-namespaces |sls "crashloop"
appabc appabc-dpl-565f4d7749-6wrk5 1/2 CrashLoopBackOff 41 32h
appabc appabc-dpl-c798876f8-fndr9 1/2 CrashLoopBackOff 187 21h
The logs from the proxies are the same:
2020-09-22T10:16:27.860550589Z time="2020-09-22T10:16:27Z" level=info msg="running version edge-20.9.1"
2020-09-23T16:08:57.587533356Z thread 'main' panicked at 'inbound failed: Transport endpoint is not connected (os error 107)', /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/macros.rs:16:9
2020-09-23T16:08:57.587563756Z note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
@zaharidichev I changed my log level to debug mode and found the following error. Any idea or suggestion about Connection reset by peer (os error 104)?
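For reference, the proxy log level can be raised per workload with an annotation; a sketch, assuming the standard config.linkerd.io/proxy-log-level annotation and a Deployment named appabc-dpl:
# Enable debug logging on the injected proxy, then roll the pods so the new level takes effect.
kubectl annotate deploy appabc-dpl -n appabc config.linkerd.io/proxy-log-level=debug --overwrite
kubectl rollout restart deploy/appabc-dpl -n appabc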
kubectl get pods -n appabc -o wide |sls "crash"
appabc-dpl-766fd87c4b-nkfrx 1/2 CrashLoopBackOff 557 2d8h 172.21.13.250
kubectl logs appabc-dpl-766fd87c4b-nkfrx -c linkerd-proxy -n appabc --timestamps --tail 10
2020-09-24T16:22:49.132838685Z {"timestamp":"[202307.811706416s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:34354","name":"accept"}],"span":{"peer.addr":"172.21.13.250:34354","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:22:49.133266289Z {"timestamp":"[202307.812194021s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:47948","name":"accept"}],"span":{"peer.addr":"172.21.13.250:47948","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:29:50.742036662Z {"timestamp":"[202729.420903794s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:57222","name":"accept"}],"span":{"peer.addr":"172.21.13.250:57222","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:29:50.742617768Z {"timestamp":"[202729.421541300s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:48620","name":"accept"}],"span":{"peer.addr":"172.21.13.250:48620","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:31:41.132704436Z {"timestamp":"[202839.811467867s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:46390","name":"accept"}],"span":{"peer.addr":"172.21.13.250:46390","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:31:41.133148240Z {"timestamp":"[202839.812044372s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:43416","name":"accept"}],"span":{"peer.addr":"172.21.13.250:43416","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:38:38.721208212Z {"timestamp":"[203257.400047843s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:48308","name":"accept"}],"span":{"peer.addr":"172.21.13.250:48308","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:38:38.721500114Z {"timestamp":"[203257.400457847s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:33752","name":"accept"}],"span":{"peer.addr":"172.21.13.250:33752","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:40:19.327763927Z {"timestamp":"[203358.6597758s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:56950","name":"accept"}],"span":{"peer.addr":"172.21.13.250:56950","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
2020-09-24T16:40:19.328254532Z {"timestamp":"[203358.7187264s]","level":"INFO","spans":[{"name":"outbound"},{"peer.addr":"172.21.13.250:40416","name":"accept"}],"span":{"peer.addr":"172.21.13.250:40416","name":"accept"},"target":"linkerd2_app_core::serve","threadId":"ThreadId(1)","fields":{"message":"Connection closed","error":"Connection reset by peer (os error 104)"}}
kubectl logs appabc-dpl-766fd87c4b-nkfrx -c appabc-ctr -n appabc --timestamps --tail 10
2020-09-24T16:57:31.229300335Z {"ver":"2.1","name":"GsLogging.KeyVaultConfigSourceConfigEvent","time":"2020-09-24T16:57:30.5172972Z","epoch":"1","seqNum":2,"os":"Unix","osVer":"4.15.0.1096","ext":{"cloud":{"ver":"1.1","name":"Microsoft.GameStreaming.Services.Common.WebApi","role":"Unknown","roleInstance":"appabc-dpl-766fd87c4b-nkfrx","location":"Unknown","environment":"Unknown"},"sll":{"libVer":"5.2.38","level":3}},"data":{"baseType":"Ms.CorrelationContext.CorrelationContext","baseData":{},"appId":"dddfaf4f-d6e9-467a-a702-ddd963ccdb72","appCertPath":"/etc/certs/kvspicert.pfx","keyVaultBaseUrl":"https://gssv-msft-strs-base-regs.vault.azure.net/"}}
2020-09-24T16:57:31.229358736Z {"ver":"2.1","name":"GsLogging.KeyVaultConfigSourceConfigEvent","time":"2020-09-24T16:57:30.7296887Z","epoch":"1","seqNum":3,"os":"Unix","osVer":"4.15.0.1096","ext":{"cloud":{"ver":"1.1","name":"Microsoft.GameStreaming.Services.Common.WebApi","role":"Unknown","roleInstance":"appabc-dpl-766fd87c4b-nkfrx","location":"Unknown","environment":"Unknown"},"sll":{"libVer":"5.2.38","level":3}},"data":{"baseType":"Ms.CorrelationContext.CorrelationContext","baseData":{},"appId":"dddfaf4f-d6e9-467a-a702-ddd963ccdb72","appCertPath":"/etc/certs/kvspicert.pfx","keyVaultBaseUrl":"https://gssv-msft-strs-wus2-regs.vault.azure.net/"}}
2020-09-24T16:57:41.226619739Z {"ver":"2.1","name":"CommonWebApi.TlsConfigEvent","time":"2020-09-24T16:57:41.1562315Z","epoch":"1","seqNum":5,"os":"Unix","osVer":"4.15.0.1096","ext":{"cloud":{"ver":"1.1","name":"appabc","role":"appabcfd","roleInstance":"appabc-dpl-766fd87c4b-nkfrx","location":"westus2","environment":"Stress"},"sll":{"libVer":"5.2.38","level":3}},"data":{"baseType":"Ms.CorrelationContext.CorrelationContext","baseData":{},"message":"Setting up HTTPS configuration with the specified TLS certificate.","tlsCertName":"Gssv-Registration-Endpoint-Cert","servicePort":"443"}}
2020-09-24T16:57:42.319642984Z Hosting environment: Stress
2020-09-24T16:57:42.319683884Z Content root path: /app
2020-09-24T16:57:42.319690784Z Now listening on: https://0.0.0.0:10443
2020-09-24T16:57:42.319694284Z Now listening on: https://0.0.0.0:443
2020-09-24T16:57:42.319697984Z Application started. Press Ctrl+C to shut down.
2020-09-24T16:57:43.228887888Z {"ver":"2.1","name":"CommonWebApi.ApplicationLifetimeEvent","time":"2020-09-24T16:57:42.2213163Z","epoch":"1","seqNum":7,"os":"Unix","osVer":"4.15.0.1096","ext":{"cloud":{"ver":"1.1","name":"appabc","role":"appabcfd","roleInstance":"registration-dpl-766fd87c4b-nkfrx","location":"westus2","environment":"Stress"},"sll":{"libVer":"5.2.38","level":3}},"data":{"baseType":"Ms.CorrelationContext.CorrelationContext","baseData":{},"message":"Application started"}}
2020-09-24T16:57:43.228921889Z {"ver":"2.1","name":"HttpClientLoggingMessageHandlerEvents.ExternalRequestCompleted","time":"2020-09-24T16:57:42.3246283Z","epoch":"1","seqNum":9,"os":"Unix","osVer":"4.15.0.1096","cV":"Ix4s+C5b9UW/wV9zafOq1w.1","ext":{"cloud":{"ver":"1.1","name":"registration","role":"appabcfd","roleInstance":"appabc-dpl-766fd87c4b-nkfrx","location":"westus2","environment":"Stress"},"sll":{"libVer":"5.2.38","level":3}},"data":{"baseType":"Ms.Qos.OutgoingServiceRequest","baseData":{"operationName":"ExternalRequest","targetUri":"https://auth-svc.auth:10443/v1/keys","latencyMs":195,"succeeded":true,"requestMethod":"GET","protocol":"HTTP","protocolStatusCode":"OK","dependencyOperationName":"JsonWebKeyResolver","dependencyName":"auth-svc.auth","requestStatus":3},"friendlyName":"JsonWebKeyResolver","message":"GET https://auth-svc.auth:10443/v1/keys : OK (195 ms)"}}
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Bug Report
What is the issue?
Upgrading our Linkerd version from stable-2.7.1 to edge-20.9.1 resulted in application containers going into a crash loop. The crash loop happened to many of our application pods across many regions, but only to a few pods in each namespace.
How can it be reproduced?
Logs, error output, etc
pod describe
proxy container log
application container log
linkerd check output
Environment
Possible solution
I think it can be the linkerd-proxy containers crashing and not being restarted by their liveness probes. This resulted in broken connectivity for the service container and could result in the service container going into a crash loop. Can it be related to the Linkerd liveness probe endpoint being changed in the edge-20.4.4 release?
To mitigate this, we restarted the linkerd-proxy container and the pod comes out of the crash loop.
kubectl exec -it appxyz-dpl-88f47658f-dh6ct -n appxyz -c linkerd-proxy -- /bin/sh -c "kill 1"
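A sketch of the same mitigation applied to every crash-looping pod in a namespace; the kill-PID-1 trick is the command above, while the loop around it is an assumption about how one might automate it:
# Restart the linkerd-proxy container of every CrashLoopBackOff pod by killing its PID 1,
# which makes kubelet restart that container and re-establish connectivity for the app container.
for pod in $(kubectl get pods -n appxyz --no-headers | awk '/CrashLoopBackOff/ {print $1}'); do
  kubectl exec "$pod" -n appxyz -c linkerd-proxy -- /bin/sh -c "kill 1"
done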
Additional context