tkestack / tke

Native Kubernetes container management platform supporting multi-tenant and multi-cluster

Cluster health check fails in most cases because the socket connection to the target cluster is not stable #2310

Closed pikehuang closed 10 months ago

pikehuang commented 10 months ago

What happened: The platform controller calls checkHealth to update the cluster status. In production we have 100 clusters in total, and 70 of them are reported as failed, even though those clusters are running normally when we SSH in to check them. Details are shown in the screenshots: failed-clusters, total-failed-clusters

What you expected to happen: We expect the reported cluster status to match the cluster's real status, which is Running most of the time.

How to reproduce it (as minimally and precisely as possible): Put the cluster under heavy network pressure, or move the cluster from a cloud environment to an IDC environment.

Anything else we need to know?: The health check result is incorrect in most cases. If a monitoring system is in place, the issue is easy to spot.

Environment: