linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Openshift 4.5 Install Fails on IPTables #4851

Closed jkassis closed 2 years ago

jkassis commented 4 years ago

Bug Report

What is the issue?

The linkerd-controller pod won't start.

How can it be reproduced?

Install OpenShift 4.5.

Install Linkerd as follows:

> brew install linkerd
> oc login
> oc new-project linkerd
> oc adm policy add-scc-to-user privileged -z linkerd-controller -n linkerd
> oc adm policy add-scc-to-user privileged -z linkerd-prometheus -n linkerd
> oc adm policy add-scc-to-user privileged -z default -n linkerd
> oc adm policy add-scc-to-user privileged -z linkerd-destination -n linkerd
> oc adm policy add-scc-to-user privileged -z linkerd-grafana -n linkerd
> oc adm policy add-scc-to-user privileged -z linkerd-proxy-injector -n linkerd
> oc adm policy add-scc-to-user privileged -z linkerd-sp-validator -n linkerd
> oc adm policy add-scc-to-user privileged -z linkerd-tap -n linkerd
> oc adm policy add-scc-to-user privileged -z linkerd-web -n linkerd
> oc adm policy add-scc-to-user privileged -z linkerd-identity -n linkerd
> oc describe rolebinding.rbac -n linkerd
> linkerd install | oc apply -f -
> linkerd check
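
The per-service-account grants above differ only in the account name, so they can be generated from a list. A minimal sketch, assuming a POSIX shell; `scc_grant_commands` is a hypothetical helper (not part of `oc` or `linkerd`) that only prints the commands so they can be reviewed first:

```shell
#!/bin/sh
# Hypothetical helper: print one `oc adm policy add-scc-to-user` command
# per service account. Nothing is executed; pipe the output to `sh` to apply.
scc_grant_commands() {
  ns="$1"   # namespace that holds the service accounts
  shift     # remaining arguments are service account names
  for sa in "$@"; do
    printf 'oc adm policy add-scc-to-user privileged -z %s -n %s\n' "$sa" "$ns"
  done
}

# Service accounts taken from the commands listed in this report.
scc_grant_commands linkerd \
  default linkerd-controller linkerd-destination linkerd-grafana \
  linkerd-identity linkerd-prometheus linkerd-proxy-injector \
  linkerd-sp-validator linkerd-tap linkerd-web
```

Appending `| sh` to the last command would apply the grants instead of just listing them.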

Logs, error output, etc

image

Environment

> linkerd version
Client version: stable-2.8.1
Server version: unavailable

cpretzer commented 4 years ago

@jkassis thanks for the report. There have been some changes to the iptables logic in the last couple of weeks, so would you mind seeing if you can reproduce this with the latest edge release (https://github.com/linkerd/linkerd2/releases/tag/edge-20.7.5)?

As long as you're not using CNI, this error shouldn't occur.

Also, can you tell me where you're running openshift? Locally or in the cloud?

jkassis commented 4 years ago

openshift in aws. ok i will try edge.

grampelberg commented 4 years ago

Have you tried using CNI yet? AFAICT OpenShift 4.1+ uses nftables which would fail when combined with proxy-init. You'll also want to:

oc adm policy add-scc-to-group anyuid system:serviceaccounts:linkerd
oc adm policy add-scc-to-group privileged system:serviceaccounts:<application-ns>
oc adm policy add-scc-to-group anyuid system:serviceaccounts:<application-ns>
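
One way to check the nftables hypothesis is to read the backend from `iptables --version`, which since iptables 1.8 prints `(nf_tables)` or `(legacy)` after the version number. A small sketch, not from the thread; `iptables_backend` is a hypothetical helper, and how you get a shell on a node (for example `oc debug node/<node-name>`) depends on the cluster:

```shell
#!/bin/sh
# Hypothetical helper: classify the backend from an `iptables --version` string.
iptables_backend() {
  case "$1" in
    *nf_tables*) echo "nft" ;;
    *legacy*)    echo "legacy" ;;
    *)           echo "unknown" ;;   # iptables < 1.8 prints no backend suffix
  esac
}

# On a node (or inside a container that ships iptables):
#   iptables_backend "$(iptables --version)"
iptables_backend "iptables v1.8.4 (nf_tables)"
```

Comparing the result on the node with the result inside the proxy-init image would show whether the two are using different backends.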

jkassis commented 4 years ago

i tried edge with the policy group scc additions you recommended...

`
  lastState:
    terminated:
      exitCode: 1
      reason: Error
      message: >+
        mp-port-unreachable

        -A OUTPUT -d 169.254.169.254/32 -p udp -m udp ! --dport 53 -j REJECT
        --reject-with icmp-port-unreachable

        COMMIT

        # Completed on Thu Aug 13 03:00:41 2020

        configuration

        ------------------------------------------------------------

        Will ignore port [4190 4191] on chain PROXY_INIT_REDIRECT

        Will redirect all INPUT ports to proxy

        Ignoring uid 2102

        Will ignore port [443] on chain PROXY_INIT_OUTPUT

        Redirecting all OUTPUT to 4140

        adding rules

        ------------------------------------------------------------

        :; iptables -t nat -N PROXY_INIT_REDIRECT -m comment --comment
        proxy-init/redirect-common-chain/1597287641

        iptables: Chain already exists.

        Aborting firewall configuration

        Error: exit status 1

        Usage:
          proxy-init [flags]

        Flags:
          -h, --help                               help for proxy-init
              --inbound-ports-to-ignore strings    Inbound ports and/or port ranges (inclusive) to ignore and not redirect to proxy. This has higher precedence than any other parameters.
          -p, --incoming-proxy-port int            Port to redirect incoming traffic (default -1)
              --netns string                       Optional network namespace in which to run the iptables commands
              --outbound-ports-to-ignore strings   Outbound ports and/or port ranges (inclusive) to ignore and not redirect to proxy. This has higher precedence than any other parameters.
          -o, --outgoing-proxy-port int            Port to redirect outgoing traffic (default -1)
          -r, --ports-to-redirect ints             Port to redirect to proxy, if no port is specified then ALL ports are redirected
          -u, --proxy-uid int                      User ID that the proxy is running under. Any traffic coming from this user will be ignored to avoid infinite redirection loops. (default -1)
              --simulate                           Don't execute any command, just print what would be executed
              --timeout-close-wait-secs int        Sets nf_conntrack_tcp_timeout_close_wait
          -w, --use-wait-flag                      Appends the "-w" flag to the iptables commands

      startedAt: '2020-08-13T03:00:41Z'
      finishedAt: '2020-08-13T03:00:41Z'
      containerID: >-
        cri-o://6bc4aa7e2ad6419849bb3915a6c4f1729b0235117abe00a332ca88f3dee55df3
  ready: false
  restartCount: 5
  image: 'gcr.io/linkerd-io/proxy-init:v1.3.4'
  imageID: >-
    gcr.io/linkerd-io/proxy-init@sha256:5e9ce6c12258bd398f7961961ffeb6dcc725e192a37c2d2a07e919b9a7ce3101
  containerID: 'cri-o://ff223f79b1601efa8ac81bf0ee2aa7b6eaf82ea1c94c98ef84c6c708a7e305bf'

`

jkassis commented 4 years ago

169.254.169.254 is the link-local address of the AWS instance metadata service (https://stackoverflow.com/questions/42314029/whats-special-about-169-254-169-254-ip-address-for-aws), and, not surprisingly, it's the reason i'm looking at linkerd in the first place.

the openshift sdn-cni-plugin hardcodes a rule (https://github.com/openshift/origin/blob/release-4.1/cmd/sdn-cni-plugin/openshift-sdn_linux.go#L129) that blocks traffic to this address, which is causing all kinds of horror for getting my app installed. it breaks kube2iam (https://github.com/jtblin/kube2iam), kiam (https://github.com/uswitch/kiam), and now, apparently, linkerd, all of which create firewall rules to redirect traffic for this service.

i'm beginning to think the sdn-cni-plugin needs to be replaced with something else to fix all of these. i want access to the AWS API with no B.S. from OpenShift, and i want to use linkerd for my service mesh.

what do i do?

jkassis commented 4 years ago

well. looks like i can migrate off of openshift-sdn... https://docs.openshift.com/container-platform/4.5/networking/ovn_kubernetes_network_provider/migrate-from-openshift-sdn.html

jkassis commented 4 years ago

i ran the migration to OVN (https://docs.okd.io/latest/networking/ovn_kubernetes_network_provider/migrate-from-openshift-sdn.html) and then the linkerd install again and ran into the same mp-port-unreachable error.

so... frankly... i'm confused about why this would come up, since in theory i'm no longer running the sdn-cni-plugin.

here is the network status for the pod...

image

jkassis commented 4 years ago

here's the PR where they nerfed the AWS API... https://github.com/openshift/origin/pull/22826

jkassis commented 4 years ago

and here's a discussion of the AWS / Google choice to use a link-local address... https://stackoverflow.com/questions/42314029/whats-special-about-169-254-169-254-ip-address-for-aws. I'm also wondering how a link-local address resolves to an Amazon API. how does the AWS CLI on my Mac make a request through a link-local address to get to the AWS API in the cloud?

jkassis commented 4 years ago

i destroyed my Openshift cluster and recreated it with Calico networking. getting this now...

  initContainerStatuses:
    - name: linkerd-init
      state:
        waiting:
          reason: CrashLoopBackOff
          message: >-
            back-off 10s restarting failed container=linkerd-init
            pod=linkerd-controller-588d778444-7dfc5_linkerd(946ae8fa-6b67-4794-b044-56079cdff6a6)
      lastState:
        terminated:
          exitCode: 1
          reason: Error
          message: >+
            o 4140

            2020/08/13 21:11:31 Executing commands:

            2020/08/13 21:11:31 > iptables -t nat -N PROXY_INIT_REDIRECT -m
            comment --comment proxy-init/redirect-common-chain/1597353091

            2020/08/13 21:11:31 < 

            2020/08/13 21:11:31 > iptables -t nat -A PROXY_INIT_REDIRECT -p tcp
            --match multiport --dports 4190,4191 -j RETURN -m comment --comment
            proxy-init/ignore-port-4190,4191/1597353091

            2020/08/13 21:11:31 < 

            2020/08/13 21:11:31 > iptables -t nat -A PROXY_INIT_REDIRECT -p tcp
            -j REDIRECT --to-port 4143 -m comment --comment
            proxy-init/redirect-all-incoming-to-proxy-port/1597353091

            2020/08/13 21:11:31 < iptables: No chain/target/match by that name.

            2020/08/13 21:11:31 Aborting firewall configuration

            Error: exit status 1

            Usage:
              proxy-init [flags]

            Flags:
              -h, --help                               help for proxy-init
                  --inbound-ports-to-ignore strings    Inbound ports and/or port ranges (inclusive) to ignore and not redirect to proxy. This has higher precedence than any other parameters.
              -p, --incoming-proxy-port int            Port to redirect incoming traffic (default -1)
                  --netns string                       Optional network namespace in which to run the iptables commands
                  --outbound-ports-to-ignore strings   Outbound ports and/or port ranges (inclusive) to ignore and not redirect to proxy. This has higher precedence than any other parameters.
              -o, --outgoing-proxy-port int            Port to redirect outgoing traffic (default -1)
              -r, --ports-to-redirect ints             Port to redirect to proxy, if no port is specified then ALL ports are redirected
              -u, --proxy-uid int                      User ID that the proxy is running under. Any traffic coming from this user will be ignored to avoid infinite redirection loops. (default -1)
                  --simulate                           Don't execute any command, just print what would be executed
                  --timeout-close-wait-secs int        Sets nf_conntrack_tcp_timeout_close_wait
              -w, --use-wait-flag                      Appends the "-w" flag to the iptables commands

          startedAt: '2020-08-13T21:11:31Z'
          finishedAt: '2020-08-13T21:11:31Z'
          containerID: >-
            cri-o://e9d662183ed980a73f6404dfbf503eb43731f7bb712a074597af2c27493e182e
      ready: false
      restartCount: 1
      image: 'gcr.io/linkerd-io/proxy-init:v1.3.3'
      imageID: >-
        gcr.io/linkerd-io/proxy-init@sha256:e9d5d020b84c80f964449d62ea509a45b9448655d3aecd7371e54d0acd42665a
      containerID: 'cri-o://e9d662183ed980a73f6404dfbf503eb43731f7bb712a074597af2c27493e182e'

jkassis commented 4 years ago

it's not complaining about the aws api address anymore, so maybe i've dodged that bullet with calico.
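
In the Calico log above, chain creation succeeds and the first rule to fail is the one that uses `-j REDIRECT`. A commonly reported cause of `No chain/target/match by that name` is that the kernel module backing that target is not loaded on the node. The thread does not confirm that this was the cause here, so the following is a diagnostic sketch only. `missing_modules` is a hypothetical helper that reads `lsmod` output on stdin, and the module names `xt_REDIRECT` and `iptable_nat` are assumptions about what the proxy-init rules need:

```shell
#!/bin/sh
# Hypothetical helper: read `lsmod` output on stdin and print each module
# named in the arguments that does not appear in it.
missing_modules() {
  loaded="$(awk 'NR > 1 { print $1 }')"   # first column, header row skipped
  for mod in "$@"; do
    printf '%s\n' "$loaded" | grep -qxF "$mod" || echo "$mod"
  done
}

# On a node:  lsmod | missing_modules xt_REDIRECT iptable_nat
```

Modules compiled directly into the kernel do not show up in `lsmod`, so a name printed here is a hint to investigate rather than proof that the target is unavailable.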

cpretzer commented 4 years ago

@jkassis thanks for the updates. It's taking me a bit longer to get OpenShift set up, but I haven't forgotten about this.

MattPOlson commented 4 years ago

Has any work been done on this? We are attempting to install on OpenShift 4.5.9 (UPI on vSphere) and are running into the same issues. Thanks!

jkassis commented 3 years ago

@MattPOlson whatever you do, i recommend using calico as the network layer.

davidkarlsen commented 3 years ago

@MattPOlson @cpretzer did you ever get this to work on OCP? @jkassis did you get it to work on any other SDN than calico on OCP? Did you ever find out the underlying cause?

cpretzer commented 2 years ago

@MattPOlson @jkassis @davidkarlsen

I had a chance to spend some time with OpenShift and get Linkerd running on it. You can find the gist here.

Couple of notes about this:

Please give it a try and let us know how it goes

I'd also love your feedback about a different security approach. According to the docs, there is the notion of a system user, which sounds appropriate for the Linkerd control plane components. I haven't found any docs on how to go about creating one of those users and assigning it to the Linkerd components. If you all have any thoughts or know how to do that, your pointers would be helpful.

davidkarlsen commented 2 years ago

@cpretzer does this allow the sidecars to run unprivileged (i.e. which mode do they run in)? Do the webhooks need to be disabled for it to run on OCP (I see you turned them off), or was that just preference?

cpretzer commented 2 years ago

@davidkarlsen this was OCP deployed to AWS (I doubt the provider matters, though).

I didn't make any changes to the proxy privileges, so they will have the default privileges on the pod that they're injected into. Here is the template for the proxy securityContext.

The webhook labels are necessary on the Linkerd control plane to prevent those pods from being injected, and those labels/annotations are taken directly from the default Linkerd YAML files.

cpretzer commented 2 years ago

@davidkarlsen one more thought on the privileges for this deployment is that the OpenShift SDN uses CNI, so the Linkerd CNI Plugin is appropriate for use here. Using the CNI Plugin delegates the responsibility of configuring iptables to the DaemonSet that is deployed by the CNI plugin. So, the Linkerd init container is no longer necessary on each of the meshed pods.

I hope this helps, and please let us know if you end up trying this out.
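
A sketch of what a CNI-based install might look like with the `oc` workflow used earlier in this thread. `linkerd install-cni` and the `--linkerd-cni-enabled` flag come from the Linkerd 2.x CLI of that period; they are not taken from the gist mentioned above, and OpenShift may need extra options (for example CNI directory settings), so check `linkerd install-cni --help` for your version. `cni_install_plan` is a hypothetical helper that only prints the two steps:

```shell
#!/bin/sh
# Hypothetical helper: print the two install steps in order. Nothing is
# executed; review the output, then pipe it to `sh` to apply.
cni_install_plan() {
  # 1. Deploy the CNI plugin DaemonSet, which sets up iptables for meshed pods.
  echo 'linkerd install-cni | oc apply -f -'
  # 2. Install the control plane so that injected pods skip the linkerd-init container.
  echo 'linkerd install --linkerd-cni-enabled | oc apply -f -'
}

cni_install_plan
```

The order matters in this sketch: the plugin has to be in place before pods are injected without an init container, otherwise nothing redirects their traffic to the proxy.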