karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0

karmada-scheduler add disable-scheduler-estimator-in-pull-mode flag #2064

Closed prodanlabs closed 1 year ago

prodanlabs commented 1 year ago

Signed-off-by: prodan pengshihaoren@gmail.com

What type of PR is this?

/kind bug /kind feature

What this PR does / why we need it:

In pull mode, if the network between the member cluster and Karmada does not allow two-way communication, scheduler-estimator is not available. For this scenario, we need a flag that disables scheduler-estimator in pull mode.

For example, my cluster2 is in pull mode and scheduler-estimator is not installed for it, yet the scheduler keeps looking for karmada-scheduler-estimator-cluster2:

root@dev-karmada-cluster02:~# kubectl  --kubeconfig /etc/karmada/karmada-apiserver.config get cluster
NAME       VERSION   MODE   READY   AGE
cluster1   v1.23.8   Push   True    72m
cluster2   v1.22.0   Pull   True    69m
I0625 09:19:38.584300       1 cache.go:87] Start dialing estimator server(karmada-scheduler-estimator-cluster2:10352) of cluster(cluster2).
W0625 09:19:38.587111       1 clientconn.go:1322] [core] grpc: addrConn.createTransport failed to connect to {karmada-scheduler-estimator-cluster2:10352 karmada-scheduler-estimator-cluster2:10352 <nil> <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup karmada-scheduler-estimator-cluster2 on 10.96.0.10:53: no such host"
W0625 09:19:39.590520       1 clientconn.go:1322] [core] grpc: addrConn.createTransport failed to connect to {karmada-scheduler-estimator-cluster2:10352 karmada-scheduler-estimator-cluster2:10352 <nil> <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup karmada-scheduler-estimator-cluster2 on 10.96.0.10:53: no such host"
W0625 09:19:41.431697       1 clientconn.go:1322] [core] grpc: addrConn.createTransport failed to connect to {karmada-scheduler-estimator-cluster2:10352 karmada-scheduler-estimator-cluster2:10352 <nil> <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup karmada-scheduler-estimator-cluster2 on 10.96.0.10:53: no such host"
E0625 09:19:43.584807       1 cache.go:90] Failed to dial cluster(cluster2): dial karmada-scheduler-estimator-cluster2:10352 error: context deadline exceeded.
I0625 09:19:43.745008       1 cache.go:87] Start dialing estimator server(karmada-scheduler-estimator-cluster2:10352) of cluster(cluster2).
W0625 09:19:43.748414       1 clientconn.go:1322] [core] grpc: addrConn.createTransport failed to connect to {karmada-scheduler-estimator-cluster2:10352 karmada-scheduler-estimator-cluster2:10352 <nil> <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup karmada-scheduler-estimator-cluster2 on 10.96.0.10:53: no such host"
W0625 09:19:44.752193       1 clientconn.go:1322] [core] grpc: addrConn.createTransport failed to connect to {karmada-scheduler-estimator-cluster2:10352 karmada-scheduler-estimator-cluster2:10352 <nil> <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup karmada-scheduler-estimator-cluster2 on 10.96.0.10:53: no such host"
W0625 09:19:46.094775       1 clientconn.go:1322] [core] grpc: addrConn.createTransport failed to connect to {karmada-scheduler-estimator-cluster2:10352 karmada-scheduler-estimator-cluster2:10352 <nil> <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup karmada-scheduler-estimator-cluster2 on 10.96.0.10:53: no such host"
W0625 09:19:48.258426       1 clientconn.go:1322] [core] grpc: addrConn.createTransport failed to connect to {karmada-scheduler-estimator-cluster2:10352 karmada-scheduler-estimator-cluster2:10352 <nil> <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup karmada-scheduler-estimator-cluster2 on 10.96.0.10:53: no such host"
E0625 09:19:48.745915       1 cache.go:90] Failed to dial cluster(cluster2): dial karmada-scheduler-estimator-cluster2:10352 error: context deadline exceeded.
I0625 09:19:49.066524       1 cache.go:87] Start dialing estimator server(karmada-scheduler-estimator-cluster2:10352) of cluster(cluster2).
W0625 09:19:49.069144       1 clientconn.go:1322] [core] grpc: addrConn.createTransport failed to connect to {karmada-scheduler-estimator-cluster2:10352 karmada-scheduler-estimator-cluster2:10352 <nil> <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup karmada-scheduler-estimator-cluster2 on 10.96.0.10:53: no such host"
W0625 09:19:50.072502       1 clientconn.go:1322] [core] grpc: addrConn.createTransport failed to connect to {karmada-scheduler-estimator-cluster2:10352 karmada-scheduler-estimator-cluster2:10352 <nil> <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup karmada-scheduler-estimator-cluster2 on 10.96.0.10:53: no such host"
W0625 09:19:51.583913       1 clientconn.go:1322] [core] grpc: addrConn.createTransport failed to connect to {karmada-scheduler-estimator-cluster2:10352 karmada-scheduler-estimator-cluster2:10352 <nil> <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup karmada-scheduler-estimator-cluster2 on 10.96.0.10:53: no such host"
E0625 09:19:54.066914       1 cache.go:90] Failed to dial cluster(cluster2): dial karmada-scheduler-estimator-cluster2:10352 error: context deadline exceeded.
I0625 09:19:54.707938       1 cache.go:87] Start dialing estimator server(karmada-scheduler-estimator-cluster2:10352) of cluster(cluster2).
W0625 09:19:54.710377       1 clientconn.go:1322] [core] grpc: addrConn.createTransport failed to connect to {karmada-scheduler-estimator-cluster2:10352 karmada-scheduler-estimator-cluster2:10352 <nil> <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup karmada-scheduler-estimator-cluster2 on 10.96.0.10:53: no such host"
W0625 09:19:55.713421       1 clientconn.go:1322] [core] grpc: addrConn.createTransport failed to connect to {karmada-scheduler-estimator-cluster2:10352 karmada-scheduler-estimator-cluster2:10352 <nil> <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup karmada-scheduler-estimator-cluster2 on 10.96.0.10:53: no such host"
W0625 09:19:57.567846       1 clientconn.go:1322] [core] grpc: addrConn.createTransport failed to connect to {karmada-scheduler-estimator-cluster2:10352 karmada-scheduler-estimator-cluster2:10352 <nil> <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup karmada-scheduler-estimator-cluster2 on 10.96.0.10:53: no such host"
E0625 09:19:59.708969       1 cache.go:90] Failed to dial cluster(cluster2): dial karmada-scheduler-estimator-cluster2:10352 error: context deadline exceeded.
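
Judging from these log lines, the address the scheduler dials appears to be derived from the cluster name. The following is a minimal sketch of that naming pattern, not Karmada's actual code: the prefix and port are read off the logs above, and the constant and function names are hypothetical.

```go
package main

import "fmt"

// Both values are taken from the log output above. Treating them as fixed
// constants is an assumption made for this sketch.
const (
	estimatorServicePrefix = "karmada-scheduler-estimator"
	estimatorPort          = 10352
)

// estimatorAddress builds the host:port that would be dialed for a cluster.
// The function name is hypothetical.
func estimatorAddress(clusterName string) string {
	return fmt.Sprintf("%s-%s:%d", estimatorServicePrefix, clusterName, estimatorPort)
}

func main() {
	// cluster2 is the pull-mode cluster from the example above.
	fmt.Println(estimatorAddress("cluster2"))
}
```

Because no Service with that name exists for cluster2, the DNS lookup fails with "no such host" and each dial attempt ends in "context deadline exceeded".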

Which issue(s) this PR fixes: Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

`karmada-scheduler`: Introduced `--disable-scheduler-estimator-in-pull-mode` flag to disable scheduler-estimator for pull mode clusters.

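
A boolean flag like the one in this release note could be wired up roughly as follows. This is a minimal sketch using Go's standard `flag` package. The `Options` struct and the `parseOptions` helper are hypothetical, and the actual karmada-scheduler may register its flags with a different library.

```go
package main

import (
	"flag"
	"fmt"
)

// Options is a hypothetical options struct for this sketch.
type Options struct {
	// DisableSchedulerEstimatorInPullMode turns off scheduler-estimator
	// for clusters that run in pull mode.
	DisableSchedulerEstimatorInPullMode bool
}

// parseOptions parses command-line arguments into Options.
// The flag defaults to false, so existing deployments keep their behavior.
func parseOptions(args []string) (*Options, error) {
	o := &Options{}
	fs := flag.NewFlagSet("karmada-scheduler", flag.ContinueOnError)
	fs.BoolVar(&o.DisableSchedulerEstimatorInPullMode,
		"disable-scheduler-estimator-in-pull-mode", false,
		"Disable scheduler-estimator for clusters in pull mode.")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return o, nil
}

func main() {
	o, err := parseOptions([]string{"--disable-scheduler-estimator-in-pull-mode"})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("disabled in pull mode:", o.DisableSchedulerEstimatorInPullMode)
}
```

Keeping the default at `false` means the flag is opt-in, which fits the point raised in this thread that some users run pull mode with a reachable estimator.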
Garrybest commented 1 year ago

Actually scheduler-estimator is also used for pulled clusters. If you enable scheduler estimator option in karmada-scheduler, it will look for all estimators. Refer to https://github.com/karmada-io/karmada/blob/59219f4b6adb212a757ff624162fe080cdbb53b7/docs/scheduler-estimator.md#L33

prodanlabs commented 1 year ago

> Actually scheduler-estimator is also used for pulled clusters. If you enable scheduler estimator option in karmada-scheduler, it will look for all estimators. Refer to https://github.com/karmada-io/karmada/blob/59219f4b6adb212a757ff624162fe080cdbb53b7/docs/scheduler-estimator.md#L33

Thank you for your reply.

I think pull mode exists to handle one-way network connectivity between the member cluster and Karmada. If scheduler-estimator is deployed in the host cluster in that situation, it cannot connect to the member cluster's kube-apiserver.

When the network allows two-way communication, Karmada should recommend push mode, because in its details push mode is the more complete option.

For security reasons, our private cloud's kube-apiserver is not exposed to the public network, not even through proxies, so scheduler-estimator is of no use to our pull-mode clusters. FYR.

Garrybest commented 1 year ago

Hi @prodanlabs, I got you. But some users choose pull mode for reasons other than the network. AFAIK, they do so for HA or for performance reasons. So I think we should try to use a tunnel or proxy to fix the network barrier. It may not be appropriate to just disable scheduler-estimator in pull mode. What do you think?

prodanlabs commented 1 year ago

@Garrybest Thanks for the quick response.

Yes, I admit that some users still use pull mode even when the network can communicate in both directions.

> So I think we should try to use a tunnel or proxy to fix the network barrier

In fact, Karmada also provides an ANP (apiserver-network-proxy) solution to get around network barriers. With it, karmada-kubectl subcommands such as logs and exec can still fetch logs or enter a Pod in pull mode.

But as far as we are concerned, because of security management regulations, the core business systems on the private cloud are not allowed to be accessed through the public network, nor can network agents or network tunnels be deployed.

My idea is that push mode should be recommended when there is no network barrier, and that pull mode should not require scheduler-estimator to be deployed. As a compromise, the scheduler could add a flag that turns off the scheduler-estimator event in pull mode.

Garrybest commented 1 year ago

It makes sense. We could add a flag to avoid establishing the connection with clusters in pull mode. Would you like to revise again?

prodanlabs commented 1 year ago

> It makes sense. We could add a flag to avoid establishing the connection with clusters in pull mode. Would you like to revise again?

OK

prodanlabs commented 1 year ago

Hi @Garrybest, do you have any good suggestions for the flag name?

prodanlabs commented 1 year ago

@Garrybest @RainbowMango can you please review?

Garrybest commented 1 year ago

Hi @prodanlabs, this flag is not just used to disable the event. If we don't add pull-mode clusters into schedulerEstimatorWorker, the scheduler will never establish a connection with them. So the flag is to disable scheduler estimator in pull-mode clusters.

The option could be like DisableSchedulerEstimatorInPullMode.
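
The idea of never adding pull-mode clusters to the estimator worker can be sketched as a simple filter. This is an illustration only, not the code from this PR: the `Cluster` type, the `SyncMode` constants, and `clustersToDial` are hypothetical stand-ins for whatever feeds schedulerEstimatorWorker.

```go
package main

import "fmt"

// SyncMode mirrors the MODE column of `kubectl get cluster` (Push or Pull).
type SyncMode string

const (
	Push SyncMode = "Push"
	Pull SyncMode = "Pull"
)

// Cluster is a simplified stand-in for a member cluster object.
type Cluster struct {
	Name string
	Mode SyncMode
}

// clustersToDial returns the names of clusters whose estimator should be
// dialed. When disableInPullMode is set, pull-mode clusters are skipped,
// so no connection to their estimator is ever attempted.
func clustersToDial(clusters []Cluster, disableInPullMode bool) []string {
	names := make([]string, 0, len(clusters))
	for _, c := range clusters {
		if disableInPullMode && c.Mode == Pull {
			continue
		}
		names = append(names, c.Name)
	}
	return names
}

func main() {
	clusters := []Cluster{
		{Name: "cluster1", Mode: Push},
		{Name: "cluster2", Mode: Pull},
	}
	fmt.Println(clustersToDial(clusters, true))  // pull-mode cluster2 is skipped
	fmt.Println(clustersToDial(clusters, false)) // both clusters are dialed
}
```

Filtering before enqueueing, rather than failing at dial time, would avoid the repeated "no such host" retries shown in the PR description.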

prodanlabs commented 1 year ago

> Hi @prodanlabs, this flag is not just used to disable the event. If we don't add pull-mode clusters into schedulerEstimatorWorker, the scheduler will never establish a connection with them. So the flag is to disable scheduler estimator in pull-mode clusters.
>
> The option could be like DisableSchedulerEstimatorInPullMode.

Agreed, I'll change it later. Thanks.

prodanlabs commented 1 year ago

> Hi @prodanlabs, this flag is not just used to disable the event. If we don't add pull-mode clusters into schedulerEstimatorWorker, the scheduler will never establish a connection with them. So the flag is to disable scheduler estimator in pull-mode clusters.
>
> The option could be like DisableSchedulerEstimatorInPullMode.

Done.

/cc @Garrybest @RainbowMango

Garrybest commented 1 year ago

/assign

Garrybest commented 1 year ago

/lgtm

Thanks!

prodanlabs commented 1 year ago

Hi @XiShanYongYe-Chang, can you help take a look? https://github.com/karmada-io/karmada/runs/7123991413?check_suite_focus=true

XiShanYongYe-Chang commented 1 year ago

Hi @prodanlabs, have you rebased onto the latest code in the master branch?

prodanlabs commented 1 year ago

> Hi @prodanlabs, have you rebased onto the latest code in the master branch?

My master branch is not up to date.

XiShanYongYe-Chang commented 1 year ago

There is a bug, #2072, which we fixed in #2074. Maybe you can rebase onto the master branch.

prodanlabs commented 1 year ago

How do I rerun CI?

XiShanYongYe-Chang commented 1 year ago

> How do I rerun CI?

Rebase and push again?

prodanlabs commented 1 year ago

> Rebase and push again?

I think I saw @RainbowMango use a command to rerun CI last time, but I forgot where I saw it.

RainbowMango commented 1 year ago

> I think I saw @RainbowMango use a command to rerun CI last time, but I forgot where I saw it.

No command can be used to re-trigger the test. Sad...

RainbowMango commented 1 year ago

/lgtm /approve /hold cancel

karmada-bot commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/karmada-io/karmada/blob/master/OWNERS)~~ [RainbowMango]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment