tkestack / galaxy

Providing high-performance network for Kubernetes

quick start failed #123

Closed currycan closed 3 years ago

currycan commented 3 years ago

k8s version:

# kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:09:25Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:00:47Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

galaxy version: v1.0.7
flannel version: v0.13.0

I followed the guide, changing only the command to private-cloud. The logs are as follows:

I0113 17:00:58.195570   11205 flags.go:52] FLAG: --add-dir-header="false"
I0113 17:00:58.195915   11205 flags.go:52] FLAG: --alsologtostderr="false"
I0113 17:00:58.195923   11205 flags.go:52] FLAG: --bridge-nf-call-iptables="true"
I0113 17:00:58.195932   11205 flags.go:52] FLAG: --cni-paths="[/opt/cni/galaxy/bin]"
I0113 17:00:58.195943   11205 flags.go:52] FLAG: --flannel-allocated-ip-dir="/var/lib/cni/networks,/var/lib/cni/networks/galaxy-flannel"
I0113 17:00:58.195950   11205 flags.go:52] FLAG: --flannel-gc-interval="10s"
I0113 17:00:58.195955   11205 flags.go:52] FLAG: --gc-dirs="/var/lib/cni/flannel,/var/lib/cni/galaxy,/var/lib/cni/galaxy/port"
I0113 17:00:58.195962   11205 flags.go:52] FLAG: --hostname-override=""
I0113 17:00:58.195966   11205 flags.go:52] FLAG: --ip-forward="true"
I0113 17:00:58.195971   11205 flags.go:52] FLAG: --json-config-path="/etc/galaxy/galaxy.json"
I0113 17:00:58.195978   11205 flags.go:52] FLAG: --kubeconfig=""
I0113 17:00:58.195983   11205 flags.go:52] FLAG: --log-backtrace-at=":0"
I0113 17:00:58.195991   11205 flags.go:52] FLAG: --log-dir=""
I0113 17:00:58.195996   11205 flags.go:52] FLAG: --log-file=""
I0113 17:00:58.196001   11205 flags.go:52] FLAG: --log-file-max-size="1800"
I0113 17:00:58.196006   11205 flags.go:52] FLAG: --log-flush-frequency="5s"
I0113 17:00:58.196030   11205 flags.go:52] FLAG: --logtostderr="true"
I0113 17:00:58.196044   11205 flags.go:52] FLAG: --master=""
I0113 17:00:58.196051   11205 flags.go:52] FLAG: --network-conf-dir="/etc/cni/net.d/"
I0113 17:00:58.196056   11205 flags.go:52] FLAG: --network-policy="false"
I0113 17:00:58.196061   11205 flags.go:52] FLAG: --pprof="false"
I0113 17:00:58.196066   11205 flags.go:52] FLAG: --route-eni="false"
I0113 17:00:58.196071   11205 flags.go:52] FLAG: --skip-headers="false"
I0113 17:00:58.196076   11205 flags.go:52] FLAG: --skip-log-headers="false"
I0113 17:00:58.196081   11205 flags.go:52] FLAG: --stderrthreshold="2"
I0113 17:00:58.196085   11205 flags.go:52] FLAG: --v="3"
I0113 17:00:58.196090   11205 flags.go:52] FLAG: --version="false"
I0113 17:00:58.196095   11205 flags.go:52] FLAG: --vmodule=""
I0113 17:00:58.196259   11205 galaxy.go:77] Json Config: {
  "NetworkConf":[
    {"name":"tke-route-eni","type":"tke-route-eni","eni":"eth1","routeTable":1},
    {"name":"galaxy-flannel","type":"galaxy-flannel", "delegate":{"type":"galaxy-veth"},"subnetFile":"/run/flannel/subnet.env"},
    {"name":"galaxy-k8s-vlan","type":"galaxy-k8s-vlan", "device":"eth1", "default_bridge_name": "br0"},
    {"name":"galaxy-k8s-sriov","type": "galaxy-k8s-sriov", "device": "eth1", "vf_num": 10}
  ],
  "DefaultNetworks": ["galaxy-flannel"],
  "ENIIPNetwork": "galaxy-k8s-vlan"
}
I0113 17:00:58.198627   11205 iptables.go:218] Could not connect to D-Bus system bus: dial unix /var/run/dbus/system_bus_socket: connect: no such file or directory
W0113 17:00:58.198661   11205 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0113 17:00:58.198876   11205 galaxy.go:159] QPS: 1.000000e+03, Burst: 2000
I0113 17:00:58.200330   11205 galaxy.go:165] apiserver address https://172.31.0.1:443
I0113 17:00:58.258646   11205 portmapping.go:122] listening to tcp 10027
I0113 17:00:58.258666   11205 portmapping.go:138] Opened local port tcp:10027

Creating a test pod then failed. Output of the describe command:

  Warning  FailedCreatePodSandBox  20m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "823ddaf206a4e7dc81a8c978d19ca02dc509cffadc05db016f73886eab88c55b" network for pod "test-metallb-dpl-7fb5cc5679-hnqkq": networkPlugin cni failed to set up pod "test-metallb-dpl-7fb5cc5679-hnqkq_default" network: missing network name:
  Normal   SandboxChanged          10m (x252 over 20m)  kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  49s (x506 over 20m)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "eef210c085fb8cde7397612098c735a57ea20122f5039a488744b32ead1a66f7" network for pod "test-metallb-dpl-7fb5cc5679-hnqkq": networkPlugin cni failed to set up pod "test-metallb-dpl-7fb5cc5679-hnqkq_default" network: missing network name:
chenchun commented 3 years ago

From the galaxy logs, it seems kubelet never called galaxy to set up the network for the pod; otherwise galaxy would have printed a log line (see https://github.com/tkestack/galaxy/blob/v1.0.7/pkg/galaxy/server.go#L114). Do you have any other CNI plugins installed? @currycan, can you show us the output of:

for i in `ls /etc/cni/net.d/`; do echo $i; cat /etc/cni/net.d/$i; done
currycan commented 3 years ago

@chenchun Hello, running the command returns the following:

# for i in `ls /etc/cni/net.d/`; do echo $i; cat /etc/cni/net.d/$i; done
00-galaxy.conf
{
  "type": "galaxy-sdn",
  "capabilities": {"portMappings": true},
  "cniVersion": "0.2.0"
}
10-flannel.conflist
{
  "name": "cbr0",
  "cniVersion":"0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "forceAddress": true,
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}
chenchun commented 3 years ago

You can move 10-flannel.conflist out of /etc/cni/net.d/ and try again. I believe that should resolve the problem.

currycan commented 3 years ago

After removing the 10-flannel.conflist file and recreating galaxy, nothing changed in the logs.


# for i in `ls /etc/cni/net.d/`; do echo $i; cat /etc/cni/net.d/$i; done
00-galaxy.conf
{
  "type": "galaxy-sdn",
  "capabilities": {"portMappings": true},
  "cniVersion": "0.2.0"
}
chenchun commented 3 years ago

@currycan Can you add "name": "galaxy-sdn" to 00-galaxy.conf and try again?
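The suggested change can be sketched in Python. This is only an illustration and not part of galaxy: the helper name `ensure_network_name` is hypothetical, and the input is the 00-galaxy.conf content posted earlier in this thread. It adds a "name" field when one is missing, which is what the "missing network name" error points at.

```python
import json


def ensure_network_name(conf: dict, default_name: str = "galaxy-sdn") -> dict:
    """Return a copy of a CNI network config that is guaranteed to have a "name".

    Hypothetical helper for illustration; the default name follows the
    suggestion made in this thread.
    """
    fixed = dict(conf)
    if not fixed.get("name"):
        fixed["name"] = default_name
    return fixed


# Contents of 00-galaxy.conf as posted in this thread (no "name" field).
original = json.loads("""
{
  "type": "galaxy-sdn",
  "capabilities": {"portMappings": true},
  "cniVersion": "0.2.0"
}
""")

fixed = ensure_network_name(original)
print(json.dumps(fixed, indent=2))
```

The printed JSON is what the edited file would contain; the other fields are left untouched.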

currycan commented 3 years ago

@chenchun It works now! Thanks for your help. BTW, how do I configure galaxy to use the underlay network, so that services outside the cluster can reach services inside the cluster by pod IP?

chenchun commented 3 years ago

Galaxy doesn't support automatically registering subnets with the switch via BGP or any other protocol. So first you need to manually configure a subnet on the switch for pods to use. If that is not possible, pods may also use any unallocated IPs of the machine subnet.

Then all you have to do is figure out the relationship between node subnets and pod subnets (that is, which pod subnet can be used in which node subnet), create a floatingip ConfigMap, and start galaxy-ipam.
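The node-subnet to pod-subnet relationship described above can be illustrated with a small Python sketch. It uses the field names (`nodeSubnets`, `ips`, `subnet`, `gateway`) from the floatingip-config ConfigMap posted in this thread, and it assumes that `nodeSubnets` lists the node subnets in which the `ips` ranges may be used. The function `pod_ranges_for_node` is hypothetical and is not how galaxy-ipam is implemented.

```python
import ipaddress

# Example entry using the field names from the floatingip-config in this thread.
# The gateway is written as a plain IP address.
FLOATING_IPS = [
    {
        "nodeSubnets": ["10.177.140.0/22"],
        "ips": ["10.177.140.40~10.177.140.80"],
        "subnet": "10.177.140.0/22",
        "gateway": "10.177.143.254",
    }
]


def pod_ranges_for_node(node_ip: str, entries: list) -> list:
    """Return the pod IP ranges whose nodeSubnets contain node_ip."""
    addr = ipaddress.ip_address(node_ip)
    ranges = []
    for entry in entries:
        if any(addr in ipaddress.ip_network(s) for s in entry["nodeSubnets"]):
            ranges.extend(entry["ips"])
    return ranges


# A node inside the configured node subnet, and one outside it.
print(pod_ranges_for_node("10.177.140.18", FLOATING_IPS))
print(pod_ranges_for_node("192.168.0.5", FLOATING_IPS))
```

A node that matches no entry gets an empty list, which corresponds to a node on which no floating IP could be allocated under this assumption.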

currycan commented 3 years ago

The floatingip-config ConfigMap is:

kind: ConfigMap
apiVersion: v1
metadata:
  name: floatingip-config
  namespace: kube-system
data:
  floatingips: '[{"nodeSubnets":["10.177.140.0/22"],"ips":["10.177.140.40~10.177.140.80"],"subnet":"10.177.140.0/22","gateway":"10.177.143.254/22"}]'

The galaxy-ipam-etc ConfigMap is:

apiVersion: v1
kind: ConfigMap
metadata:
  name: galaxy-ipam-etc
  namespace: kube-system
data:
  # delete cloudProviderGrpcAddr if not ENI
  galaxy-ipam.json: |
    {
      "schedule_plugin": {
      }
    }

The galaxy-ipam deployment manifest follows the guide. Error logs:

I0115 14:31:20.821374       1 flags.go:52] FLAG: --add-dir-header="false"
I0115 14:31:20.821438       1 flags.go:52] FLAG: --alsologtostderr="false"
I0115 14:31:20.821443       1 flags.go:52] FLAG: --api-port="9041"
I0115 14:31:20.821450       1 flags.go:52] FLAG: --bind="0.0.0.0"
I0115 14:31:20.821456       1 flags.go:52] FLAG: --config="/etc/galaxy/galaxy-ipam.json"
I0115 14:31:20.821462       1 flags.go:52] FLAG: --kubeconfig=""
I0115 14:31:20.821466       1 flags.go:52] FLAG: --leader-elect="true"
I0115 14:31:20.821472       1 flags.go:52] FLAG: --leader-elect-lease-duration="15s"
I0115 14:31:20.821478       1 flags.go:52] FLAG: --leader-elect-renew-deadline="10s"
I0115 14:31:20.821483       1 flags.go:52] FLAG: --leader-elect-resource-lock="endpoints"
I0115 14:31:20.821488       1 flags.go:52] FLAG: --leader-elect-retry-period="2s"
I0115 14:31:20.821492       1 flags.go:52] FLAG: --log-backtrace-at=":0"
I0115 14:31:20.821500       1 flags.go:52] FLAG: --log-dir=""
I0115 14:31:20.821505       1 flags.go:52] FLAG: --log-file=""
I0115 14:31:20.821509       1 flags.go:52] FLAG: --log-file-max-size="1800"
I0115 14:31:20.821514       1 flags.go:52] FLAG: --log-flush-frequency="5s"
I0115 14:31:20.821518       1 flags.go:52] FLAG: --logtostderr="true"
I0115 14:31:20.821523       1 flags.go:52] FLAG: --master=""
I0115 14:31:20.821528       1 flags.go:52] FLAG: --port="9040"
I0115 14:31:20.821533       1 flags.go:52] FLAG: --profiling="true"
I0115 14:31:20.821537       1 flags.go:52] FLAG: --skip-headers="false"
I0115 14:31:20.821542       1 flags.go:52] FLAG: --skip-log-headers="false"
I0115 14:31:20.821546       1 flags.go:52] FLAG: --stderrthreshold="2"
I0115 14:31:20.821552       1 flags.go:52] FLAG: --swagger="false"
I0115 14:31:20.821558       1 flags.go:52] FLAG: --v="3"
I0115 14:31:20.821562       1 flags.go:52] FLAG: --version="false"
I0115 14:31:20.821567       1 flags.go:52] FLAG: --vmodule=""
W0115 14:31:20.821716       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0115 14:31:20.821936       1 server.go:171] QPS: 1.000000e+03, Burst: 2000
I0115 14:31:20.866203       1 server.go:192] connected to apiserver &rest.Config{Host:"https://172.31.0.1:443", APIPath:"", ContentConfig:rest.ContentConfig{AcceptContentTypes:"", ContentType:"", GroupVersion:(*schema.GroupVersion)(nil), NegotiatedSerializer:runtime.NegotiatedSerializer(nil)}, Username:"", Password:"", BearerToken:"--- REDACTED ---", BearerTokenFile:"/var/run/secrets/kubernetes.io/serviceaccount/token", Impersonate:rest.ImpersonationConfig{UserName:"", Groups:[]string(nil), Extra:map[string][]string(nil)}, AuthProvider:<nil>, AuthConfigPersister:rest.AuthProviderConfigPersister(nil), ExecProvider:<nil>, TLSClientConfig:rest.sanitizedTLSClientConfig{Insecure:false, ServerName:"", CertFile:"", KeyFile:"", CAFile:"/var/run/secrets/kubernetes.io/serviceaccount/ca.crt", CertData:[]uint8(nil), KeyData:[]uint8(nil), CAData:[]uint8(nil)}, UserAgent:"", Transport:http.RoundTripper(nil), WrapTransport:(transport.WrapperFunc)(nil), QPS:1000, Burst:2000, RateLimiter:flowcontrol.RateLimiter(nil), Timeout:0, Dial:(func(context.Context, string, string) (net.Conn, error))(nil)}
I0115 14:31:20.907560       1 crd.go:79] Create CRD FloatingIP successfully.
I0115 14:31:20.951984       1 crd.go:79] Create CRD Pool successfully.
I0115 14:31:20.954454       1 floatingip_plugin.go:59] floating ip config: {[] 1 floatingip-config kube-system floatingips }
I0115 14:31:20.956263       1 leaderelection.go:235] attempting to acquire leader lease  kube-system/galaxy-ipam...
I0115 14:31:38.133187       1 leaderelection.go:245] successfully acquired lease kube-system/galaxy-ipam
I0115 14:31:38.133320       1 event.go:258] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"galaxy-ipam", UID:"bbcbd8c7-556f-4c3a-b905-e17a3fd59bae", APIVersion:"v1", ResourceVersion:"208175", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' k8s-node-01_ccf7d4e6-2aa5-4b14-9188-1d147f74fb1d became leader
I0115 14:31:38.133400       1 reflector.go:122] Starting reflector *v1.Pod (1m0s) from pkg/mod/k8s.io/client-go@v0.0.0-20190918200256-06eb1244587a/tools/cache/reflector.go:98
I0115 14:31:38.133447       1 reflector.go:160] Listing and watching *v1.Pod from pkg/mod/k8s.io/client-go@v0.0.0-20190918200256-06eb1244587a/tools/cache/reflector.go:98
I0115 14:31:38.133469       1 reflector.go:122] Starting reflector *v1.StatefulSet (1m0s) from pkg/mod/k8s.io/client-go@v0.0.0-20190918200256-06eb1244587a/tools/cache/reflector.go:98
I0115 14:31:38.133517       1 reflector.go:160] Listing and watching *v1.StatefulSet from pkg/mod/k8s.io/client-go@v0.0.0-20190918200256-06eb1244587a/tools/cache/reflector.go:98
I0115 14:31:38.133445       1 reflector.go:122] Starting reflector *v1alpha1.FloatingIP (0s) from pkg/mod/k8s.io/client-go@v0.0.0-20190918200256-06eb1244587a/tools/cache/reflector.go:98
I0115 14:31:38.133604       1 reflector.go:160] Listing and watching *v1alpha1.FloatingIP from pkg/mod/k8s.io/client-go@v0.0.0-20190918200256-06eb1244587a/tools/cache/reflector.go:98
I0115 14:31:38.133739       1 reflector.go:122] Starting reflector *v1alpha1.Pool (0s) from pkg/mod/k8s.io/client-go@v0.0.0-20190918200256-06eb1244587a/tools/cache/reflector.go:98
I0115 14:31:38.133793       1 reflector.go:160] Listing and watching *v1alpha1.Pool from pkg/mod/k8s.io/client-go@v0.0.0-20190918200256-06eb1244587a/tools/cache/reflector.go:98
I0115 14:31:38.133865       1 reflector.go:122] Starting reflector *v1.Deployment (1m0s) from pkg/mod/k8s.io/client-go@v0.0.0-20190918200256-06eb1244587a/tools/cache/reflector.go:98
I0115 14:31:38.133898       1 reflector.go:160] Listing and watching *v1.Deployment from pkg/mod/k8s.io/client-go@v0.0.0-20190918200256-06eb1244587a/tools/cache/reflector.go:98
I0115 14:31:38.634440       1 floatingip_plugin.go:82] empty floatingips from config, fetching from configmap
W0115 14:31:39.640118       1 floatingip_plugin.go:86] failed to unmarshal configmap val [{"nodeSubnets":["10.177.140.0/22"],"ips":["10.177.140.40~10.177.140.80"],"subnet":"10.177.140.0/22","gateway":"10.177.143.254/22"}] to floatingip config: invalid IP address: 10.177.143.254/22
W0115 14:31:40.643865       1 floatingip_plugin.go:86] failed to unmarshal configmap val [{"nodeSubnets":["10.177.140.0/22"],"ips":["10.177.140.40~10.177.140.80"],"subnet":"10.177.140.0/22","gateway":"10.177.143.254/22"}] to floatingip config: invalid IP address: 10.177.143.254/22
W0115 14:31:41.637906       1 floatingip_plugin.go:86] failed to unmarshal configmap val [{"nodeSubnets":["10.177.140.0/22"],"ips":["10.177.140.40~10.177.140.80"],"subnet":"10.177.140.0/22","gateway":"10.177.143.254/22"}] to floatingip config: invalid IP address: 10.177.143.254/22
W0115 14:31:42.639995       1 floatingip_plugin.go:86] failed to unmarshal configmap val [{"nodeSubnets":["10.177.140.0/22"],"ips":["10.177.140.40~10.177.140.80"],"subnet":"10.177.140.0/22","gateway":"10.177.143.254/22"}] to floatingip config: invalid IP address: 10.177.143.254/22
chenchun commented 3 years ago

The gateway address must be a plain IP address, not a CIDR.
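A quick pre-flight check for this kind of mistake can be written with Python's standard `ipaddress` module. The sketch below is not galaxy-ipam's own validation; `check_floatingip_entry` is a hypothetical helper, and it assumes ranges are written as "first~last", as in the ConfigMap in this thread.

```python
import ipaddress


def check_floatingip_entry(entry: dict) -> list:
    """Return a list of problems found in one floatingips entry (empty if none)."""
    problems = []
    subnet = ipaddress.ip_network(entry["subnet"])
    try:
        gateway = ipaddress.ip_address(entry["gateway"])
    except ValueError:
        problems.append(f"gateway {entry['gateway']!r} is not a plain IP address")
    else:
        if gateway not in subnet:
            problems.append(f"gateway {gateway} is outside subnet {subnet}")
    for ip_range in entry["ips"]:
        # Ranges in this thread are written as "first~last".
        for part in ip_range.split("~"):
            if ipaddress.ip_address(part) not in subnet:
                problems.append(f"ip {part} is outside subnet {subnet}")
    return problems


# The entry as originally posted (gateway written as a CIDR) and the corrected one.
bad = {
    "nodeSubnets": ["10.177.140.0/22"],
    "ips": ["10.177.140.40~10.177.140.80"],
    "subnet": "10.177.140.0/22",
    "gateway": "10.177.143.254/22",
}
good = dict(bad, gateway="10.177.143.254")

print(check_floatingip_entry(bad))
print(check_floatingip_entry(good))
```

The first call reports the gateway problem; the second returns an empty list because 10.177.143.254 is a valid address inside 10.177.140.0/22.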

currycan commented 3 years ago

I fixed the gateway address and then created a demo pod. This is the manifest file:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: common-nginx
  labels:
    app: common-nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: common-nginx
  template:
    metadata:
      name: common-nginx
      labels:
        app: common-nginx
      annotations:
        k8s.v1.cni.cncf.io/networks: "galaxy-k8s-vlan"
    spec:
      containers:
      - name: nginx
        image: registry.tcnp.com/library/nginx
        resources:
          requests:
            tke.cloud.tencent.com/eni-ip: "1"
          limits:
            tke.cloud.tencent.com/eni-ip: "1"

But the pod is stuck in Pending. Output of describing the pod:

  Warning  FailedScheduling  29s   default-scheduler  0/6 nodes are available: 3 node(s) were unschedulable, 3 Insufficient tke.cloud.tencent.com/eni-ip.
  Warning  FailedScheduling  29s   default-scheduler  0/6 nodes are available: 3 node(s) were unschedulable, 3 Insufficient tke.cloud.tencent.com/eni-ip.
chenchun commented 3 years ago

@currycan did you create the scheduler config following https://github.com/tkestack/galaxy/blob/master/doc/galaxy-ipam-config.md#kubernetes-scheduler-configuration and did you update urlPrefix to the galaxy-ipam service address? Also, don't forget to restart kube-scheduler so that the policy config takes effect.

currycan commented 3 years ago

@chenchun This is my scheduler ConfigMap; urlPrefix points at the galaxy-ipam service through a NodePort:

apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-policy
  namespace: kube-system
data:
  # set "ignoredByScheduler" to true if not ENI
  policy.cfg: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "extenders": [
        {
          "urlPrefix": "http://10.177.140.16:32760/v1",
          "httpTimeout": 70000000000,
          "filterVerb": "filter",
          "BindVerb": "bind",
          "weight": 1,
          "enableHttps": false,
          "managedResources": [
            {
              "name": "tke.cloud.tencent.com/eni-ip",
              "ignoredByScheduler": false
            }
          ]
        }
      ]
    }

--policy-configmap="scheduler-policy" has already been added to kube-scheduler, and I restarted it. The kube-scheduler logs are:

[root@k8s-master-01 ~]# kubectl logs -f -n kube-system kube-scheduler-10.177.140.16
I0118 08:00:42.084178       1 flags.go:59] FLAG: --add-dir-header="false"
I0118 08:00:42.084519       1 flags.go:59] FLAG: --address="0.0.0.0"
I0118 08:00:42.084535       1 flags.go:59] FLAG: --algorithm-provider=""
I0118 08:00:42.084541       1 flags.go:59] FLAG: --alsologtostderr="true"
I0118 08:00:42.084547       1 flags.go:59] FLAG: --authentication-kubeconfig="/etc/kubernetes/scheduler.conf"
I0118 08:00:42.084553       1 flags.go:59] FLAG: --authentication-skip-lookup="false"
I0118 08:00:42.084561       1 flags.go:59] FLAG: --authentication-token-webhook-cache-ttl="10s"
I0118 08:00:42.084569       1 flags.go:59] FLAG: --authentication-tolerate-lookup-failure="true"
I0118 08:00:42.084574       1 flags.go:59] FLAG: --authorization-always-allow-paths="[/healthz]"
I0118 08:00:42.084584       1 flags.go:59] FLAG: --authorization-kubeconfig="/etc/kubernetes/scheduler.conf"
I0118 08:00:42.084591       1 flags.go:59] FLAG: --authorization-webhook-cache-authorized-ttl="10s"
I0118 08:00:42.084597       1 flags.go:59] FLAG: --authorization-webhook-cache-unauthorized-ttl="10s"
I0118 08:00:42.084603       1 flags.go:59] FLAG: --bind-address="127.0.0.1"
I0118 08:00:42.084612       1 flags.go:59] FLAG: --cert-dir=""
I0118 08:00:42.084618       1 flags.go:59] FLAG: --client-ca-file=""
I0118 08:00:42.084625       1 flags.go:59] FLAG: --config=""
I0118 08:00:42.084630       1 flags.go:59] FLAG: --contention-profiling="true"
I0118 08:00:42.084637       1 flags.go:59] FLAG: --experimental-logging-sanitization="false"
I0118 08:00:42.084643       1 flags.go:59] FLAG: --feature-gates=""
I0118 08:00:42.084651       1 flags.go:59] FLAG: --hard-pod-affinity-symmetric-weight="1"
I0118 08:00:42.084659       1 flags.go:59] FLAG: --help="false"
I0118 08:00:42.084665       1 flags.go:59] FLAG: --http2-max-streams-per-connection="0"
I0118 08:00:42.084673       1 flags.go:59] FLAG: --kube-api-burst="200"
I0118 08:00:42.084679       1 flags.go:59] FLAG: --kube-api-content-type="application/vnd.kubernetes.protobuf"
I0118 08:00:42.084689       1 flags.go:59] FLAG: --kube-api-qps="500"
I0118 08:00:42.084697       1 flags.go:59] FLAG: --kubeconfig="/etc/kubernetes/scheduler.conf"
I0118 08:00:42.084705       1 flags.go:59] FLAG: --leader-elect="true"
I0118 08:00:42.084711       1 flags.go:59] FLAG: --leader-elect-lease-duration="15s"
I0118 08:00:42.084717       1 flags.go:59] FLAG: --leader-elect-renew-deadline="10s"
I0118 08:00:42.084727       1 flags.go:59] FLAG: --leader-elect-resource-lock="leases"
I0118 08:00:42.084733       1 flags.go:59] FLAG: --leader-elect-resource-name="kube-scheduler"
I0118 08:00:42.084739       1 flags.go:59] FLAG: --leader-elect-resource-namespace="kube-system"
I0118 08:00:42.084745       1 flags.go:59] FLAG: --leader-elect-retry-period="2s"
I0118 08:00:42.084751       1 flags.go:59] FLAG: --lock-object-name="kube-scheduler"
I0118 08:00:42.084757       1 flags.go:59] FLAG: --lock-object-namespace="kube-system"
I0118 08:00:42.084764       1 flags.go:59] FLAG: --log-backtrace-at=":0"
I0118 08:00:42.084773       1 flags.go:59] FLAG: --log-dir="/var/log/kubernetes/kube-scheduler"
I0118 08:00:42.084781       1 flags.go:59] FLAG: --log-file=""
I0118 08:00:42.084786       1 flags.go:59] FLAG: --log-file-max-size="1800"
I0118 08:00:42.084793       1 flags.go:59] FLAG: --log-flush-frequency="5s"
I0118 08:00:42.084799       1 flags.go:59] FLAG: --logging-format="text"
I0118 08:00:42.084805       1 flags.go:59] FLAG: --logtostderr="false"
I0118 08:00:42.084811       1 flags.go:59] FLAG: --master=""
I0118 08:00:42.084817       1 flags.go:59] FLAG: --one-output="false"
I0118 08:00:42.084823       1 flags.go:59] FLAG: --permit-port-sharing="false"
I0118 08:00:42.084829       1 flags.go:59] FLAG: --policy-config-file=""
I0118 08:00:42.084835       1 flags.go:59] FLAG: --policy-configmap="scheduler-policy"
I0118 08:00:42.084841       1 flags.go:59] FLAG: --policy-configmap-namespace="kube-system"
I0118 08:00:42.084847       1 flags.go:59] FLAG: --port="10251"
I0118 08:00:42.084854       1 flags.go:59] FLAG: --profiling="false"
I0118 08:00:42.084860       1 flags.go:59] FLAG: --requestheader-allowed-names="[]"
I0118 08:00:42.084873       1 flags.go:59] FLAG: --requestheader-client-ca-file=""
I0118 08:00:42.084879       1 flags.go:59] FLAG: --requestheader-extra-headers-prefix="[x-remote-extra-]"
I0118 08:00:42.084887       1 flags.go:59] FLAG: --requestheader-group-headers="[x-remote-group]"
I0118 08:00:42.084897       1 flags.go:59] FLAG: --requestheader-username-headers="[x-remote-user]"
I0118 08:00:42.084904       1 flags.go:59] FLAG: --scheduler-name="default-scheduler"
I0118 08:00:42.084910       1 flags.go:59] FLAG: --secure-port="10259"
I0118 08:00:42.084916       1 flags.go:59] FLAG: --show-hidden-metrics-for-version=""
I0118 08:00:42.084922       1 flags.go:59] FLAG: --skip-headers="false"
I0118 08:00:42.084928       1 flags.go:59] FLAG: --skip-log-headers="false"
I0118 08:00:42.084934       1 flags.go:59] FLAG: --stderrthreshold="2"
I0118 08:00:42.084941       1 flags.go:59] FLAG: --tls-cert-file=""
I0118 08:00:42.084947       1 flags.go:59] FLAG: --tls-cipher-suites="[]"
I0118 08:00:42.084958       1 flags.go:59] FLAG: --tls-min-version=""
I0118 08:00:42.084964       1 flags.go:59] FLAG: --tls-private-key-file=""
I0118 08:00:42.084970       1 flags.go:59] FLAG: --tls-sni-cert-key="[]"
I0118 08:00:42.084978       1 flags.go:59] FLAG: --use-legacy-policy-config="false"
I0118 08:00:42.084984       1 flags.go:59] FLAG: --v="2"
I0118 08:00:42.084991       1 flags.go:59] FLAG: --version="false"
I0118 08:00:42.085001       1 flags.go:59] FLAG: --vmodule=""
I0118 08:00:42.085008       1 flags.go:59] FLAG: --write-config-to=""
I0118 08:00:43.444542       1 serving.go:331] Generated self-signed cert in-memory
I0118 08:00:45.914304       1 requestheader_controller.go:244] Loaded a new request header values for RequestHeaderAuthRequestController
I0118 08:00:46.043083       1 factory.go:210] Creating scheduler from configuration: {{ } [] [] [{http://10.177.140.16:32760/v1 filter   1 bind false <nil> {1m10s} false [{tke.cloud.tencent.com/eni-ip false}] false}] 0 false}
I0118 08:00:46.043232       1 factory.go:219] Using predicates from algorithm provider 'DefaultProvider'
I0118 08:00:46.043256       1 factory.go:230] Using default priorities
I0118 08:00:46.043269       1 factory.go:257] Creating scheduler with fit predicates 'map[CheckNodeUnschedulable:{} CheckVolumeBinding:{} EvenPodsSpread:{} GeneralPredicates:{} MatchInterPodAffinity:{} MaxAzureDiskVolumeCount:{} MaxCSIVolumeCountPred:{} MaxEBSVolumeCount:{} MaxGCEPDVolumeCount:{} NoDiskConflict:{} NoVolumeZoneConflict:{} PodToleratesNodeTaints:{}]' and priority functions 'map[BalancedResourceAllocation:1 EvenPodsSpreadPriority:2 ImageLocalityPriority:1 InterPodAffinityPriority:1 LeastRequestedPriority:1 NodeAffinityPriority:1 NodePreferAvoidPodsPriority:10000 SelectorSpreadPriority:1 TaintTolerationPriority:1]'
chenchun commented 3 years ago

@currycan can you change ignoredByScheduler to true, so that the tke.cloud.tencent.com/eni-ip resource is not evaluated by kube-scheduler itself?
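As a rough illustration of that change, the Python sketch below loads the policy.cfg posted in this thread and flips `ignoredByScheduler` for the managed resource. The helper `set_ignored_by_scheduler` is hypothetical; in practice the ConfigMap would be edited directly and kube-scheduler restarted. The last line also converts `httpTimeout` from nanoseconds to seconds, which matches the `1m10s` that kube-scheduler printed for this extender in the log above.

```python
import json

# policy.cfg as posted in this thread.
POLICY = """
{
  "kind": "Policy",
  "apiVersion": "v1",
  "extenders": [
    {
      "urlPrefix": "http://10.177.140.16:32760/v1",
      "httpTimeout": 70000000000,
      "filterVerb": "filter",
      "BindVerb": "bind",
      "weight": 1,
      "enableHttps": false,
      "managedResources": [
        {"name": "tke.cloud.tencent.com/eni-ip", "ignoredByScheduler": false}
      ]
    }
  ]
}
"""


def set_ignored_by_scheduler(policy: dict, resource: str, ignored: bool) -> dict:
    """Return a deep copy of policy with ignoredByScheduler set for resource."""
    updated = json.loads(json.dumps(policy))  # deep copy via JSON round trip
    for extender in updated.get("extenders", []):
        for managed in extender.get("managedResources", []):
            if managed.get("name") == resource:
                managed["ignoredByScheduler"] = ignored
    return updated


policy = json.loads(POLICY)
updated = set_ignored_by_scheduler(policy, "tke.cloud.tencent.com/eni-ip", True)
print(json.dumps(updated, indent=2))

# httpTimeout is given in nanoseconds; print it in seconds.
print(updated["extenders"][0]["httpTimeout"] / 1e9, "seconds")
```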

currycan commented 3 years ago

I'm sorry to tell you it doesn't work. This is the configmap:

[root@k8s-master-01 galaxy]# kubectl get cm -n kube-system scheduler-policy -o yaml
apiVersion: v1
data:
  policy.cfg: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "extenders": [
        {
          "urlPrefix": "http://10.177.140.16:32760/v1",
          "httpTimeout": 70000000000,
          "filterVerb": "filter",
          "BindVerb": "bind",
          "weight": 1,
          "enableHttps": false,
          "managedResources": [
            {
              "name": "tke.cloud.tencent.com/eni-ip",
              "ignoredByScheduler": true
            }
          ]
        }
      ]
    }
kind: ConfigMap

After changing ignoredByScheduler to true, I restarted the kubelet service and recreated the kube-scheduler pod.

currycan commented 3 years ago

I found some more information about galaxy at http://www.iceyao.com.cn/2020/07/03/galaxy_source_code_readnote/ and then changed the galaxy-etc ConfigMap as follows:

# kubectl -n kube-system get cm galaxy-etc -o yaml
apiVersion: v1
data:
  galaxy.json: |
    {
      "NetworkConf":[
        {"name":"tke-route-eni","type":"tke-route-eni","eni":"eth1","routeTable":1},
        {"name":"galaxy-flannel","type":"galaxy-flannel", "delegate":{"type":"galaxy-veth"},"subnetFile":"/run/flannel/subnet.env"},
        {"name":"galaxy-k8s-vlan","type":"galaxy-k8s-vlan", "device":"eth0", "switch":"ipvlan", "ipvlan_mode":"l2"},
        {"name":"galaxy-k8s-sriov","type": "galaxy-k8s-sriov", "device": "eth0", "vf_num": 10}
      ],
      "DefaultNetworks": ["galaxy-k8s-vlan"],
      "ENIIPNetwork": "galaxy-k8s-vlan"
    }
kind: ConfigMap

Output of describing the pod:

  Warning  FailedCreatePodSandBox  6m23s                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "38416c5c992a971c90f5765fe74c6d204d37b7a7752c3a243d539a1231faccbc" network for pod "common-nginx-mkk62": networkPlugin cni failed to set up pod "common-nginx-mkk62_default" network: galaxy returns: fail to establish network map[ipinfos:[{"ip":"10.177.140.46/22","vlan":0,"gateway":"10.177.143.254"}]]:failed to setup bridge Error getting device eth0: Link not found
  Warning  FailedCreatePodSandBox  6m14s (x4 over 6m21s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "f2a4de16b9341ff4c6e08b263246a9a9353c89b302653097dd010ff2c8d124a3" network for pod "common-nginx-mkk62": networkPlugin cni failed to set up pod "common-nginx-mkk62_default" network: galaxy returns: fail to establish network map[ipinfos:[{"ip":"10.177.140.46/22","vlan":0,"gateway":"10.177.143.254"}]]:failed to setup bridge Error getting device eth0: Link not found
currycan commented 3 years ago

@chenchun I got it. The network device is ens192, not eth0 or eth1. Thank you very much!
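A mismatch like this can be caught before deploying by comparing the `device` values in galaxy.json against the interfaces that actually exist on the node. The Python sketch below is a hypothetical check, not something galaxy provides; `missing_devices` is an illustrative name, and the NetworkConf entries are the ones from the galaxy-etc ConfigMap in this thread.

```python
import socket


def missing_devices(network_conf: list, host_interfaces: set) -> dict:
    """Map network name -> configured device, for devices absent on the host."""
    return {
        net["name"]: net["device"]
        for net in network_conf
        if "device" in net and net["device"] not in host_interfaces
    }


# NetworkConf entries that name a device, from the galaxy-etc ConfigMap above.
NETWORK_CONF = [
    {"name": "galaxy-k8s-vlan", "type": "galaxy-k8s-vlan", "device": "eth0",
     "switch": "ipvlan", "ipvlan_mode": "l2"},
    {"name": "galaxy-k8s-sriov", "type": "galaxy-k8s-sriov", "device": "eth0",
     "vf_num": 10},
]

# Interfaces on the machine running this script.
local = {name for _, name in socket.if_nameindex()}
print(missing_devices(NETWORK_CONF, local))

# The situation from this thread: the NIC is ens192, so eth0 is reported missing.
print(missing_devices(NETWORK_CONF, {"lo", "ens192"}))
```

Run on each node, an empty result from the first call would mean every configured device exists there.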

currycan commented 3 years ago

@chenchun I found a problem with the health check probes. The manifest file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-floatingip
spec:
  strategy:
    type: Recreate
  replicas: 3
  selector:
    matchLabels:
      app: nginx-floatingip
  template:
    metadata:
      name: nginx-floatingip
      labels:
        app: nginx-floatingip
      annotations:
        k8s.v1.cni.cncf.io/networks: "galaxy-k8s-vlan"
        k8s.v1.cni.galaxy.io/release-policy: "immutable"
    spec:
      tolerations:
        - operator: "Exists"
      containers:
      - name: nginx
        image: nginx:alpine
        ports:
          - name: http-80
            containerPort: 80
        resources:
          requests:
            cpu: "0.1"
            memory: "32Mi"
            tke.cloud.tencent.com/eni-ip: "1"
          limits:
            cpu: "0.1"
            memory: "32Mi"
            tke.cloud.tencent.com/eni-ip: "1"
        livenessProbe:
          # httpGet:
          #   path: /
          #   port: 80
          #   scheme: HTTP
          tcpSocket:
            port: 80
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          failureThreshold: 3
          timeoutSeconds: 1
        readinessProbe:
          # httpGet:
          #   path: /
          #   port: 80
          #   scheme: HTTP
          tcpSocket:
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 2
          failureThreshold: 3
          timeoutSeconds: 1

The health check fails with both tcpSocket and httpGet:

  Warning  FailedScheduling  82s                default-scheduler  deployment nginx-floatingip has allocated 3 ips with replicas of 3, wait for releasing
  Warning  FailedScheduling  82s                default-scheduler  deployment nginx-floatingip has allocated 3 ips with replicas of 3, wait for releasing
  Normal   Scheduled         78s                default-scheduler  Successfully assigned default/nginx-floatingip-5cdcd7bcbd-6ql2x to 10.177.140.18
  Warning  Unhealthy         16s (x3 over 36s)  kubelet            Liveness probe failed: dial tcp 10.177.140.44:80: i/o timeout
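
One way to narrow this down is to repeat the probe by hand from the node that runs the pod. The Python sketch below is a rough stand-in for a tcpSocket probe, not kubelet's implementation; `tcp_probe` is a hypothetical helper, and the address, port and 1-second timeout are taken from the event and manifest above.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return False


if __name__ == "__main__":
    # Pod IP and port from the failing probe event; timeout matches timeoutSeconds: 1.
    print(tcp_probe("10.177.140.44", 80, timeout=1.0))
```

If this returns False from the pod's own node but True from another machine on the same subnet, the problem is more likely in host-to-pod connectivity on that node than in nginx itself.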