karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0
4.12k stars 806 forks source link

Support Connection to ResourceInterpretWebhook without DNS Service #2999

Closed lxtywypc closed 1 year ago

lxtywypc commented 1 year ago

What type of PR is this? /kind feature

What this PR does / why we need it: If we deploy karmada with Default dnsPolicy or deploy karmada in physical machine/virtual machine directly, and there is no coreDNS or other DNS services, it would not connect to resource-interpret-webhooks successfully with DefaultServiceResolver.

Which issue(s) this PR fixes: Fixes #2998

Special notes for your reviewer: NONE

Does this PR introduce a user-facing change?:

`karmada-controller-manager`/`karmada-agent`: support connection to resourceInterpretWebhook without DNS Service.
karmada-bot commented 1 year ago

Welcome @lxtywypc! It looks like this is your first PR to karmada-io/karmada 🎉

codecov-commenter commented 1 year ago

Codecov Report

Merging #2999 (71ffbf4) into master (7f2b660) will decrease coverage by 0.03%. The diff coverage is n/a.

:mega: This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@            Coverage Diff             @@
##           master    #2999      +/-   ##
==========================================
- Coverage   51.63%   51.61%   -0.03%     
==========================================
  Files         210      210              
  Lines       18926    18926              
==========================================
- Hits         9773     9769       -4     
- Misses       8622     8625       +3     
- Partials      531      532       +1     
Flag Coverage Δ
unittests 51.61% <ø> (-0.03%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

see 2 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

XiShanYongYe-Chang commented 1 year ago

Hi @lxtywypc, the E2E has failed in the CI, have you tested it on your site?

lxtywypc commented 1 year ago

Hi @lxtywypc, the E2E has failed in the CI, have you tested it on your site?

Humm... I don't know well about how e2e test works, but it may show different error after each retry, and it seems to be ok after some retries.

RainbowMango commented 1 year ago

The E2E still failing after retrying. @XiShanYongYe-Chang

XiShanYongYe-Chang commented 1 year ago

Hi @lxtywypc, I test with this pr on my side, after I deployed Karmada, I find that the status condition of the cluster member3 (Pull mode) is Unknown, and when I delete the old karmada-controller-manager pod, the logs of new pod report an panic:

image

lxtywypc commented 1 year ago

Hi @lxtywypc, I test with this pr on my side, after I deployed Karmada, I find that the status condition of the cluster member3 (Pull mode) is Unknown, and when I delete the old karmada-controller-manager pod, the logs of new pod report an panic:

image

Ok, I will check again later

lxtywypc commented 1 year ago

@XiShanYongYe-Chang @RainbowMango It should be ok this time.
I used a for-select loop to make sure the informer had synced, and in the early version I made it sleep 1s if the loop hadn't been done.
https://github.com/karmada-io/karmada/pull/2999/commits/4bbdf7470d510b4436f5ba2e19eaa6a1b55805b4#diff-a111978a679a9f03e49c67a43ecb913299d153e59ce48d2d30f5029f27ab601aR532
It seems that some calling of resource-interpreter comes early during this period, then it causes a terrible panic of empty interpreter pointer. This time I build service informer and wait for it synced before newing a resource-interpreter. It can work now.
But since it would cause a panic for a period of sleep during starting interpreter, I think it may still have a potential panic risk if it process too long during starting interpreter. How do you think about?

XiShanYongYe-Chang commented 1 year ago

/assign

XiShanYongYe-Chang commented 1 year ago

Thanks, @lxtywypc, can you help compress your commits into one?

Ask for @ikaven1024 @RainbowMango to retake a review. /assign @ikaven1024 @RainbowMango

lxtywypc commented 1 year ago

Thanks, @lxtywypc, can you help compress your commits into one?

Ok, but how about vendor? @XiShanYongYe-Chang

ikaven1024 commented 1 year ago

It LGTM

XiShanYongYe-Chang commented 1 year ago

/cc @RainbowMango Ask for @RainbowMango to help take a look.

karmada-bot commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/karmada-io/karmada/blob/master/OWNERS)~~ [RainbowMango] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
XiShanYongYe-Chang commented 1 year ago

Hi @lxtywypc can you help post your test result here? A simple test process report is enough.

lxtywypc commented 1 year ago

Hi @XiShanYongYe-Chang , I'm sorry for replying late. I'll be glad to raise a test report, but what kind report would you like? Could you offer me a template?

XiShanYongYe-Chang commented 1 year ago

I'll be glad to raise a test report, but what kind report would you like? Could you offer me a template?

Thanks~

I'm sorry, I don't have a specific template here either. My idea is that you can simply explain your local test process and post your test results here will be ok.

lxtywypc commented 1 year ago

My idea is that you can simply explain your local test process and post your test results here will be ok.

Ok, I'll do it later.

lxtywypc commented 1 year ago

Here is the test report:

Note: Some parts with the information of company in the example has been hidden.

  1. Prepare: We has a workload CR whose kind is Ro, and a resource-interpreter-webhook in cluster named ro-webhook and implemented the InterpretReplica point:
    $ kubectl get svc -n karmada-system ro-webhook
    NAME         TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE
    ro-webhook   ClusterIP   10.7.93.168   <none>        443/TCP   196d
    $ kubectl get resourceinterpreterwebhookconfiguration ro-webhook -o jsonpath='{"service: "}{.webhooks[0].clientConfig.service}{"\nrules: "}{.webhooks[0].rules}{"\n"}'
    service: {"name":"ro-webhook","namespace":"karmada-system","path":"/interpret-ro","port":443}
    rules: [{"apiGroups":["xxx.xxx"],"apiVersions":["v1"],"kinds":["Ro"],"operations":["InterpretReplica"]}]
  2. First, we apply a workload named test with 0 replicas in namespace test, and karmada-controller-manager with image based on branch master:
    $ kubectl get ro -n test test -o jsonpath='{.spec.replicas}{"\n"}'
    0
    $ kubectl get deploy -n karmada-system karmada-controller-manager -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
    xxx.xxx/karmada/karmada-controller-manager:master
  3. Then we scale the workload up to 3, and get it's replicas in resourcebinding:

    $ kubectl scale ro -n test test --replicas=3
    xxx.xxx/test scaled
    $ kubectl get rb -n test test-ro -o jsonpath='{.spec.replicas}{"\n"}'
    
    $

    Then we look up the log of karmada-controller-manager:

    $ kubectl exec -it karmada-controller-manager-6db886b97f-4fb5q -n karmada-system -- /bin/sh
    / # cd /var/log/karmada-controller-manager/
    /var/log/karmada-controller-manager # cat runtime.log | grep test
    I0420 07:08:18.189526       1 customized.go:74] Get replicas for object: xxx.xxx/v1, Kind=Ro test/test with webhook interpreter.
    E0420 07:08:18.195101       1 detector.go:649] Failed to customize replicas for xxx.xxx/v1, Kind=Ro(test), Internal error occurred: failed calling webhook "ro-webhook/xxx.xxx": failed to call webhook: Post "https://ro-webhook.karmada-system.svc:443/interpret-ro?timeout=3s": dial tcp: lookup ro-webhook.karmada-system.svc on 10.5.168.10:53: no such host
    E0420 07:08:18.195121       1 detector.go:472] Failed to build resourceBinding for object: xxx.xxx/v1, kind=Ro, test/test. error: Internal error occurred: failed calling webhook "ro-webhook/xxx.xxx": failed to call webhook: Post "https://ro-webhook.karmada-system.svc:443/interpret-ro?timeout=3s": dial tcp: lookup ro-webhook.karmada-system.svc on 10.5.168.10:53: no such host
    I0420 07:08:18.195198       1 recorder.go:103] "events: Apply cluster policy(ro) failed: Internal error occurred: failed calling webhook \"ro-webhook/xxx.xxx\": failed to call webhook: Post \"https://ro-webhook.karmada-system.svc:443/interpret-ro?timeout=3s\": dial tcp: lookup ro-webhook.karmada-system.svc on 10.5.168.10:53: no such host" type="Warning" object={Kind:Ro Namespace:test Name:test UID:0f9fc62b-26ea-45da-8b18-085f1b167864 APIVersion:xxx.xxx/v1 ResourceVersion:643725715 FieldPath:} reason="ApplyPolicyFailed"
  4. Now, we change the image of karmada-controller-manager to the image based on this pull request, and get the replicas in resourcebinding again:
    $ kubectl get deploy -n karmada-system karmada-controller-manager -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
    xxx.xxx/karmada/karmada-controller-manager:master-with-resolver
    $ kubectl get rb -n test test-ro -o jsonpath='{.spec.replicas}{"\n"}'
    3

    Let's look up the log of karmada-controller-manager again:

    $ kubectl exec -it karmada-controller-manager-57dcd97579-k9shq -n karmada-system -- /bin/sh
    / # cd /var/log/karmada-controller-manager/
    /var/log/karmada-controller-manager # cat runtime.log | grep test
    I0420 07:32:17.723681       1 customized.go:77] Get replicas for object: xxx.xxx/v1, Kind=Ro test/test with webhook interpreter.
    I0420 07:32:17.735193       1 detector.go:510] Update ResourceBinding(test-ro) successfully.
    I0420 07:32:17.735240       1 recorder.go:103] "events: Apply cluster policy(ro) succeed" type="Normal" object={Kind:Ro Namespace:test Name:test UID:0f9fc62b-26ea-45da-8b18-085f1b167864 APIVersion:xxx.xxx/v1 ResourceVersion:643725715 FieldPath:} reason="ApplyPolicySucceed"

The webhook can be accessed successfully, and we haven't met any other problems yet.

lxtywypc commented 1 year ago

/cc @XiShanYongYe-Chang @RainbowMango

XiShanYongYe-Chang commented 1 year ago

@lxtywypc a clear test report, thanks a lot.

/lgtm /hold cancel

Can you help update the release-note, such as:

karmada-controller-manager/karmada-agent: support connection to resourceInterpretWebhook without DNS Service
lxtywypc commented 1 year ago

@XiShanYongYe-Chang I'm glad to. But I don't know clearly what I should do.
Should I cherry pick this PR to other release branches and make tags?

XiShanYongYe-Chang commented 1 year ago

I'm glad to. But I don't know clearly what I should do.

Modify this place will be ok: image

Should I cherry pick this PR to other release branches and make tags?

We usually synchronize bugs to previous versions, and the current change is more like a feature to us. Do you need to use it in previous versions?

RainbowMango commented 1 year ago

I don't think we need to cherry-pick this to release branches.

lxtywypc commented 1 year ago

@XiShanYongYe-Chang @RainbowMango I see. Thanks a lot.