googlecloudrobotics / core

Cloud Robotics Core: Kubernetes, Federation, App Management
Apache License 2.0

I/O Timeouts to Kubernetes API in Cloud Cluster lead to cr-syncer crash loop #67

Closed: oliver-goetz closed this issue 3 years ago

oliver-goetz commented 3 years ago

I created a little test case to check how cr-syncer reacts to connectivity issues.

The first three lines of the following log are the regular entries while the connection was available. After restarting nginx, only downstream events are left, which sounds reasonable. After a couple more events there is an i/o timeout error and cr-syncer restarts without further notice.

2021/06/24 15:54:36 Got update event from downstream for Robot mir-robot01@v9508595
2021/06/24 15:54:36 Got update event from upstream for Robot mir-robot01@v334026459
2021/06/24 15:54:36 Copied Robot mir-robot01 status@v9508595 to upstream@v334026459
2021/06/24 15:54:38 Got update event from downstream for Robot mir-robot01@v9508599
2021/06/24 15:54:40 Got update event from downstream for Robot mir-robot01@v9508602
2021/06/24 15:54:42 Got update event from downstream for Robot mir-robot01@v9508606
2021/06/24 15:54:44 Got update event from downstream for Robot mir-robot01@v9508609
2021/06/24 15:54:46 Got update event from downstream for Robot mir-robot01@v9508612
2021/06/24 15:54:48 Got update event from downstream for Robot mir-robot01@v9508616
2021/06/24 15:54:50 Got update event from downstream for Robot mir-robot01@v9508619
2021/06/24 15:54:52 Got update event from downstream for Robot mir-robot01@v9508624
2021/06/24 15:54:54 Got update event from downstream for Robot mir-robot01@v9508627
2021/06/24 15:54:56 Got update event from downstream for Robot mir-robot01@v9508630
2021/06/24 15:54:58 Got update event from downstream for Robot mir-robot01@v9508634
2021/06/24 15:55:00 Got update event from downstream for Robot mir-robot01@v9508637
2021/06/24 15:55:02 Got update event from downstream for Robot mir-robot01@v9508642
2021/06/24 15:55:04 Got update event from downstream for Robot mir-robot01@v9508645
2021/06/24 15:55:06 Got update event from downstream for Robot mir-robot01@v9508648
2021/06/24 15:55:08 Got update event from downstream for Robot mir-robot01@v9508652
2021/06/24 15:55:08 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 334026459: update status failed: Put "http://www.endpoints.sap-asimov-master.cloud.goog/apis/registry.cloudrobotics.com/v1alpha1/namespaces/default/robots/mir-robot01/status": dial tcp 34.77.126.185:443: i/o timeout
2021/06/24 15:55:10 Got update event from downstream for Robot mir-robot01@v9508655

cr-syncer is also unable to start without connectivity to the Cloud Kubernetes API, which causes a crash loop after a while:

2021/06/24 15:55:10 Syncing cache for CRDs
2021/06/24 15:55:10 Starting syncer for warehouseorders.ewm.sap.com
2021/06/24 15:55:10 Starting syncer for robottypes.registry.cloudrobotics.com
2021/06/24 15:55:10 Starting syncer for robotconfigurations.ewm.sap.com
2021/06/24 15:55:10 skipping custom resource auctioneers.ewm.sap.com: unknown spec source ""
2021/06/24 15:55:10 skipping custom resource orderreservations.ewm.sap.com: unknown spec source ""
2021/06/24 15:55:10 skipping custom resource podmonitors.monitoring.coreos.com: unknown spec source ""
2021/06/24 15:55:10 skipping custom resource prometheusrules.monitoring.coreos.com: unknown spec source ""
2021/06/24 15:55:10 skipping custom resource servicemonitors.monitoring.coreos.com: unknown spec source ""
2021/06/24 15:55:10 Starting syncer for orderauctions.ewm.sap.com
2021/06/24 15:55:10 skipping custom resource alertmanagers.monitoring.coreos.com: unknown spec source ""
2021/06/24 15:55:10 skipping custom resource resourcesets.apps.cloudrobotics.com: unknown spec source ""
2021/06/24 15:55:10 skipping custom resource prometheuses.monitoring.coreos.com: unknown spec source ""
2021/06/24 15:55:10 Starting syncer for chartassignments.apps.cloudrobotics.com
2021/06/24 15:55:10 skipping custom resource thanosrulers.monitoring.coreos.com: unknown spec source ""
2021/06/24 15:55:10 skipping custom resource runtimeestimations.ewm.sap.com: unknown spec source ""
2021/06/24 15:55:10 skipping custom resource missions.mission.cloudrobotics.com: unknown spec source ""
2021/06/24 15:55:10 skipping custom resource probes.monitoring.coreos.com: unknown spec source ""
2021/06/24 15:55:10 skipping custom resource approllouts.apps.cloudrobotics.com: unknown spec source ""
2021/06/24 15:55:10 skipping custom resource apps.apps.cloudrobotics.com: unknown spec source ""
2021/06/24 15:55:10 Starting syncer for robots.registry.cloudrobotics.com
I0624 15:55:40.762973       1 trace.go:116] Trace[605764324]: "Reflector ListAndWatch" name:external/io_k8s_client_go/tools/cache/reflector.go:105 (started: 2021-06-24 15:55:10.756470521 +0000 UTC m=+0.237934965) (total time: 30.006466637s):
Trace[605764324]: [30.006466637s] [30.006466637s] END
E0624 15:55:40.819649       1 reflector.go:153] external/io_k8s_client_go/tools/cache/reflector.go:105: Failed to list *unstructured.Unstructured: Get "http://www.endpoints.sap-asimov-master.cloud.goog/apis/ewm.sap.com/v1alpha1/namespaces/default/warehouseorders?labelSelector=cloudrobotics.com%2Frobot-name%3Dmir-robot01&limit=500&resourceVersion=0": dial tcp 34.77.126.185:443: i/o timeout
I0624 15:55:40.773688       1 trace.go:116] Trace[1693516159]: "Reflector ListAndWatch" name:external/io_k8s_client_go/tools/cache/reflector.go:105 (started: 2021-06-24 15:55:10.765553951 +0000 UTC m=+0.247018395) (total time: 30.008076121s):
Trace[1693516159]: [30.008076121s] [30.008076121s] END
I0624 15:55:40.774720       1 trace.go:116] Trace[776971353]: "Reflector ListAndWatch" name:external/io_k8s_client_go/tools/cache/reflector.go:105 (started: 2021-06-24 15:55:10.761586528 +0000 UTC m=+0.243050951) (total time: 30.013066302s):
Trace[776971353]: [30.013066302s] [30.013066302s] END
I0624 15:55:40.774989       1 trace.go:116] Trace[1890951957]: "Reflector ListAndWatch" name:external/io_k8s_client_go/tools/cache/reflector.go:105 (started: 2021-06-24 15:55:10.758547028 +0000 UTC m=+0.240011470) (total time: 30.016396594s):
Trace[1890951957]: [30.016396594s] [30.016396594s] END
I0624 15:55:40.775096       1 trace.go:116] Trace[638043721]: "Reflector ListAndWatch" name:external/io_k8s_client_go/tools/cache/reflector.go:105 (started: 2021-06-24 15:55:10.759200795 +0000 UTC m=+0.240665237) (total time: 30.015881088s):
Trace[638043721]: [30.015881088s] [30.015881088s] END
I0624 15:55:40.776090       1 trace.go:116] Trace[1920627189]: "Reflector ListAndWatch" name:external/io_k8s_client_go/tools/cache/reflector.go:105 (started: 2021-06-24 15:55:10.763329965 +0000 UTC m=+0.244794409) (total time: 30.012738232s):
Trace[1920627189]: [30.012738232s] [30.012738232s] END
E0624 15:55:40.820295       1 reflector.go:153] external/io_k8s_client_go/tools/cache/reflector.go:105: Failed to list *unstructured.Unstructured: Get "http://www.endpoints.sap-asimov-master.cloud.goog/apis/apps.cloudrobotics.com/v1alpha1/chartassignments?labelSelector=cloudrobotics.com%2Frobot-name%3Dmir-robot01&limit=500&resourceVersion=0": dial tcp 34.77.126.185:443: i/o timeout
E0624 15:55:40.820363       1 reflector.go:153] external/io_k8s_client_go/tools/cache/reflector.go:105: Failed to list *unstructured.Unstructured: Get "http://www.endpoints.sap-asimov-master.cloud.goog/apis/registry.cloudrobotics.com/v1alpha1/namespaces/default/robots?labelSelector=cloudrobotics.com%2Frobot-name%3Dmir-robot01&limit=500&resourceVersion=0": dial tcp 34.77.126.185:443: i/o timeout
E0624 15:55:40.820762       1 reflector.go:153] external/io_k8s_client_go/tools/cache/reflector.go:105: Failed to list *unstructured.Unstructured: Get "http://www.endpoints.sap-asimov-master.cloud.goog/apis/ewm.sap.com/v1alpha1/namespaces/default/orderauctions?labelSelector=cloudrobotics.com%2Frobot-name%3Dmir-robot01&limit=500&resourceVersion=0": dial tcp 34.77.126.185:443: i/o timeout
E0624 15:55:40.820813       1 reflector.go:153] external/io_k8s_client_go/tools/cache/reflector.go:105: Failed to list *unstructured.Unstructured: Get "http://www.endpoints.sap-asimov-master.cloud.goog/apis/registry.cloudrobotics.com/v1alpha1/namespaces/default/robottypes?limit=500&resourceVersion=0": dial tcp 34.77.126.185:443: i/o timeout
E0624 15:55:40.820849       1 reflector.go:153] external/io_k8s_client_go/tools/cache/reflector.go:105: Failed to list *unstructured.Unstructured: Get "http://www.endpoints.sap-asimov-master.cloud.goog/apis/ewm.sap.com/v1alpha1/namespaces/default/robotconfigurations?labelSelector=cloudrobotics.com%2Frobot-name%3Dmir-robot01&limit=500&resourceVersion=0": dial tcp 34.77.126.185:443: i/o timeout

I'm not entirely sure why cr-syncer is restarting in the first place, as the "Syncing key ... failed" error is not a panic but just a log entry. However, in both cases the pod restarts 30 seconds after cr-syncer started (re-)connecting, so it might be directly related to that event.

drigz commented 3 years ago

What does kubectl describe say about why the cr-syncer is terminating?

oliver-goetz commented 3 years ago

The new health check seems to be the reason for the restarts. The check is not really failing, it is just taking too long: the i/o timeout appears to be 30 seconds, which is probably also what the list call in the health check runs into, but the liveness probe timeout is only 10s.

Events:
  Type     Reason     Age                From     Message
  ----     ------     ----               ----     -------
  Normal   Pulled     8s (x14 over 25h)  kubelet  Container image "gcr.io/sap-asimov-master/cr-syncer@sha256:de7733bba58c3350dbe86411617c59e45dbb74ff3eae9a5507ad2c6dc86c6d47" already present on machine
  Normal   Created    8s (x14 over 25h)  kubelet  Created container cr-syncer
  Normal   Started    8s (x14 over 25h)  kubelet  Started container cr-syncer
  Warning  Unhealthy  8s (x36 over 24h)  kubelet  Liveness probe failed: Get "http://192.168.9.9:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Killing    8s (x12 over 24h)  kubelet  Container cr-syncer failed liveness probe, will be restarted
oliver-goetz commented 3 years ago

Increasing timeoutSeconds to 35 in the cr-syncer deployment solves both the reconnect and the startup issue.

However, the reconnect case is not working perfectly yet. After I blocked cr-syncer traffic for 4 minutes, it took 14 minutes to recover once I removed the block. I suspect the exponential backoff in workqueue.DefaultControllerRateLimiter is the reason for this. I'll test a different RateLimiter, as I did here earlier this year.
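
For illustration, a limiter with a much shorter maximum backoff could look roughly like this. This is only a sketch of the idea, not the change I tested; the 30s cap and the queue name are placeholder values:

package main

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// newSyncerRateLimiter mirrors workqueue.DefaultControllerRateLimiter but caps the
// per-item exponential backoff at 30s instead of 1000s, so the syncer retries again
// soon after connectivity to the cloud cluster is restored.
func newSyncerRateLimiter() workqueue.RateLimiter {
	return workqueue.NewMaxOfRateLimiter(
		// per-item backoff: 5ms base, capped at 30s (placeholder values)
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 30*time.Second),
		// overall limit of 10 qps with a burst of 100, same as the default limiter
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}

func main() {
	// drop-in replacement for workqueue.DefaultControllerRateLimiter()
	queue := workqueue.NewNamedRateLimitingQueue(newSyncerRateLimiter(), "downstream")
	defer queue.ShutDown()
}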

drigz commented 3 years ago

Thanks for the update and for testing this case! I've sent a change for timeoutSeconds. Please send a PR to change the ratelimiter if you find something that works better. I was originally skeptical about deviating from the best practice of randomized exponential backoff, but in our case I think a fixed backoff or a shorter maximum backoff would make sense, since we have a bounded number of clients and it's more important to restore functionality quickly after a network outage.
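
As a rough sketch of the fixed-backoff option: with the existing workqueue helpers, a constant per-item delay can be expressed by setting the base delay equal to the maximum (the 10s value and queue name below are purely illustrative):

package main

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// base == max yields an effectively constant 10s delay per failing item
	fixed := workqueue.NewItemExponentialFailureRateLimiter(10*time.Second, 10*time.Second)
	queue := workqueue.NewNamedRateLimitingQueue(fixed, "upstream")
	defer queue.ShutDown()
}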