istio / istio

Connect, secure, control, and observe services.
https://istio.io
Apache License 2.0

Allow the use of hostNetwork=true for pilot so webhooks can be accessed when using a different CNI in EKS and other cloud-based k8s distros #45738

Closed geoffo-dev closed 1 month ago

geoffo-dev commented 1 year ago

(This is used to request new product features, please visit https://discuss.istio.io for questions on using Istio)

Describe the feature request

I am currently trying to deploy istio alongside Weave CNI on an EKS cluster. This is primarily because I like istio, but also need to use Weave as I understand it is one of the only CNIs that allow Pods to communicate using multicast internally.

I have had istio working fine using the AWS VPC CNI, but since moving to Weave, I get the following error when trying to set up the gateway:

Error creating: Internal error occurred: failed calling webhook "namespace.sidecar-injector.istio.io": failed to call webhook: Post "https://istiod.istio-system.svc:443/inject?timeout=10s": Address is not allowed

As I understand it, this is a limitation with EKS as weave cannot be installed on the control plane and therefore cannot reference pods within the cluster unless they expose their port on the host network.

I am not sure whether this is as simple as changing the helm chart to deploy the pilot pod with hostNetwork: true in the spec; when I try, I get the error:

2023-06-29T17:43:32.095327Z   info    validationController    Not ready to switch validation to fail-closed: dummy invalid config not rejected
2023-06-29T17:43:32.095354Z   info    validationController    validatingwebhookconfiguration istiod-default-validator (failurePolicy=Ignore, resourceVersion=1111127) is up-to-date. No change required.

It might be a really simple fix... it might not... and I appreciate it might be an edge case... but it would be good to get this working alongside other CNIs in cloud-based k8s distros.
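For reference, the change being discussed amounts to roughly the following patch against the istiod Deployment (a sketch only; the `istiod` Deployment name and `istio-system` namespace are the chart defaults, so adjust for your install):

```yaml
# Sketch: run istiod on the host network so the EKS control plane can reach
# the webhook without going through the pod CNI. Apply with e.g.:
#   kubectl -n istio-system patch deployment istiod --patch-file hostnetwork-patch.yaml
spec:
  template:
    spec:
      hostNetwork: true
      # With hostNetwork, pod DNS defaults to the node's resolver;
      # ClusterFirstWithHostNet keeps in-cluster Service resolution working.
      dnsPolicy: ClusterFirstWithHostNet
```

This carries the security and operational caveats raised elsewhere in this thread.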

Describe alternatives you've considered

The AWS CNI does not work for us because it does not support multicast... I am going to make the wild and dangerous assessment that this is going to be the same with any custom CNI provider in something like EKS.

Affected product area (please put an X in all that apply)

[ ] Ambient
[ ] Docs
[ ] Installation
[X] Networking
[ ] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Affected features (please put an X in all that apply)

[X] Multi Cluster
[ ] Virtual Machine
[ ] Multi Control Plane

Additional context

None

howardjohn commented 1 year ago

I don't think we should do this. Running as hostNetwork is not safe and I don't support encouraging it. You can do this on your own if you want -- and are willing to deal with the repercussions. See https://github.com/istio/istio/issues/34021 for prior discussion and a link for how to do it.

geoffo-dev commented 1 year ago

Hey @howardjohn, thanks for the quick response! I do understand, but it makes it kind of difficult to adopt anything that doesn't chain to the AWS CNI... Our cluster isn't publicly facing and we already have to use this on the OperatorHub Deployment and Metrics Server as they have a similar issue.

If it is as simple as that - would you accept a PR which would highlight in the values file that setting this value to 'true' incurs risk? If possible, it would be good to bake this into the helm chart itself, as I'd like to do the deployment through Argo CD (as part of the cluster bootstrap) and post-rendering is a bit difficult to achieve that way.

ChrisJBurns commented 11 months ago

I guess the wider question is: what is the recommended way of installing Istio on EKS if users opt for a different CNI than the provided AWS VPC CNI?

geoffo-dev commented 11 months ago

As far as I know it is the only way... EKS uses their own plugin to run over the underlying VPC and therefore this is what both the control plane and the nodes would use.

If you remove the AWS CNI from the nodes, then the only way to expose the API endpoint is to use hostNetwork=true, as I understand it.

Again I appreciate that this is an edge case... but we needed to use multicast (which AWS VPC doesn't support) and therefore had to use a custom CNI.

ChrisJBurns commented 11 months ago

Yeah, it's also quite a popular strategy for a lot of organisations to use a more advanced and flexible CNI like Calico etc, so if the only way of getting Istio to work with it is by modifying the hostNetwork option, then I guess that's just because of the way EKS mandates it. It's still not ideal, and I can see why it's not recommended by Istio itself, but it is somewhat the only way of running a custom CNI on EKS.

apryiomka commented 10 months ago

I have the same problem and patched the istiod deployment manually, setting hostNetwork=true. Unfortunately, this didn't fix the problem. Any advice on what else to try? I see two errors now. One in the istiod log:

error controllers error handling /istio-validator-istio-system, retrying (retry count: 5089): webhook is not ready, retry controller=validation

and another in the replicasets trying to inject the proxy:

replicaset-controller Error creating: Internal error occurred: failed calling webhook "namespace.sidecar-injector.istio.io": failed to call webhook: Post "https://istiod.istio-system.svc:443/inject?timeout=10s": context deadline exceeded

geoffo-dev commented 10 months ago

So have you opened the ports on the nodes? I don't think 443 is enabled by default, depending on the kubernetes distro.

linsun commented 7 months ago

Check this out - https://istio.io/latest/docs/setup/install/external-controlplane/#set-up-a-gateway-in-the-external-cluster - you can put istiod behind an Istio gateway so the mutating webhook can use this new endpoint URL via the gateway.

Can you try this approach to see if it works @geoffo-dev ?
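A rough sketch of this approach, assuming an in-cluster istio ingress gateway and an externally resolvable hostname (`istiod.example.com` is purely illustrative), loosely following the external control plane guide linked above:

```yaml
# Sketch (illustrative names): pass webhook TLS traffic through the ingress
# gateway to istiod's webhook port, so the webhook can target a reachable URL.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: istiod-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 443
        name: tls-webhook
        protocol: TLS
      tls:
        mode: PASSTHROUGH    # istiod terminates TLS itself
      hosts:
        - "istiod.example.com"   # assumed DNS name pointing at the gateway LB
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: istiod-webhook
  namespace: istio-system
spec:
  hosts:
    - "istiod.example.com"
  gateways:
    - istiod-gateway
  tls:
    - match:
        - port: 443
          sniHosts:
            - "istiod.example.com"
      route:
        - destination:
            host: istiod.istio-system.svc.cluster.local
            port:
              number: 15017   # istiod webhook port
```

The webhook clientConfig would then point at the gateway's address instead of the in-cluster `istiod.istio-system.svc` name.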

linsun commented 7 months ago

Note: I'm recommending this as I know this approach works for other cloud providers when istiod runs outside of the cluster.

obervinov commented 7 months ago

The same problem occurs with EKS + Cilium. This situation makes it difficult to operate Istio in non-stock systems. Of course, I can patch the pilot deployment, but this creates certain overhead costs for supporting such a solution.

igor-nikiforov commented 7 months ago

I don't think we should do this. Running as hostNetwork is not safe and I don't support encouraging it. You can do this on your own if you want -- and are willing to deal with the repercussions. See https://github.com/istio/istio/issues/34021 for prior discussion and a link for how to do it.

@howardjohn, sorry, but I really don't understand what repercussions you are talking about. In EKS, users decide which ports are allowed from the EKS control plane to the nodes using security groups. Since the EKS control plane is not directly accessible to users except via the API, allowing additional ports from the control plane to the nodes is secure.

The sad reality is that if you have big EKS clusters, using the default VPC CNI is not an option. This CNI has many restrictions and issues, and the only option is to use Cilium/Calico/Weave. Moreover, the hostNetwork approach is a widely adopted solution in the most popular Kubernetes projects that use webhooks, when running these CNIs:

  1. cert-manager - https://github.com/cert-manager/cert-manager/blob/main/deploy/charts/cert-manager/values.yaml#L477-L486
  2. Karpenter - https://github.com/aws/karpenter/blob/main/charts/karpenter/values.yaml#L52-L53
  3. kube-prometheus-stack https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/values.yaml#L2323-L2326
  4. AWS Load Balancer Controller - https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/helm/aws-load-balancer-controller/values.yaml#L261-L264
  5. metrics server - https://github.com/kubernetes-sigs/metrics-server/blob/master/charts/metrics-server/values.yaml#L69-L75

These rules are also adopted in the very popular Terraform module for EKS bootstrap - https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/node_groups.tf#L137-L172

For sure, hostNetwork must be set to false by default, but users should have the opportunity to override it depending on their environment, because a user who touches this option usually knows what they are doing.

I really hope the Istio maintainers reconsider this - or let me know if we can just open a PR for it.

Thanks!

howardjohn commented 7 months ago

You are free to set any field you want in the Kubernetes API with almost any install or CI/CD tool (https://istio.io/latest/docs/setup/additional-setup/customize-installation-helm/ etc).

igor-nikiforov commented 7 months ago

@howardjohn sure I can, as well as take responsibility for and maintain my own fork of the chart to avoid limitations. But we are talking here about a widely adopted approach to a very common problem, and I really don't understand the motivation for not including it in the official helm chart.

howardjohn commented 7 months ago

hostNetwork is an operational and security risk. I don't know the motivation behind other projects adopting it, but personally I don't want to encourage users in any way to run in that setup.

howardjohn commented 7 months ago

If you must run on an effectively broken Kubernetes distribution, I would put the webhook behind a load balancer that is accessible to the control plane before resorting to hostNetwork. But again, you don't have to convince me as you can simply override the value :-)

linsun commented 7 months ago

@igor-nikiforov @obervinov - can you pls explore and let us know if you can expose mutating webhook server to istio ingress gw and use that approach instead in EKS?

obervinov commented 7 months ago

@linsun, thank you for addressing our question.

It seems that, in this situation, a more cost-effective solution for us would be to fork the Istio Helm chart and introduce support for the hostNetwork parameter, enabling Istio to function correctly in EKS. While this would lead us to diverge from the official Helm chart for Istio, it aligns with a similar approach seen in other official Helm charts utilizing webhooks (such as karpenter, cert-manager, prometheus, etc.).

The solution involving the addition of an extra component in the form of a web server (or Ingress), solely to facilitate Istio's deployment in EKS, appears redundant and more cumbersome to maintain.

Unfortunately, abstaining from adding this parameter to the official Istio Helm chart does not eliminate the use of hostNetwork; it merely complicates the support and operation of Istio in EKS with a non-standard CNI plugin. Those who use hostNetwork for webhooks in EKS are likely to persist with it simply because it is the most straightforward solution and is supported in almost all Helm charts where webhooks are present.

howardjohn commented 7 months ago

https://istio.io/latest/docs/setup/additional-setup/customize-installation-helm/ (or many other approaches) can modify a chart, without forking.

appears redundant and more cumbersome to maintain.

It is. hostNetwork is a security and operational risk. Given you are running on a Kubernetes platform that doesn't support a core v1 API properly, unfortunately you need to compromise in one area.

Unfortunately, abstaining from adding this parameter to the official Istio Helm chart does not eliminate the use of hostNetwork;

My intent is to push users towards using a proxy in front of istiod rather than using hostNetwork. Or, failing that, to ensure users are very, very aware that what they are doing is considered "Bad" and not be surprised when issues arise. But again, you don't need to agree with me or convince me! You can always just set hostNetwork=true with a dozen tools. Expecting a helm chart to expose every field in k8s you may need to set (of which there are over 1000 just in Pod) for whatever reason is not realistic. Note that while you may only have an issue with this one field, there are 100s of others requesting "just 1 more field". This is exactly why we have pushed for https://istio.io/latest/docs/setup/additional-setup/customize-installation-helm/.
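Concretely, the page linked above describes customizing the rendered chart without forking it; a minimal kustomize-based sketch (the file names, release name, and `istiod` target are assumptions based on the default chart) might look like:

```yaml
# kustomization.yaml -- overlay hostNetwork onto the rendered chart without forking it.
# Assumed usage:
#   helm template istiod istio/istiod -n istio-system > istiod.yaml
#   kubectl kustomize . | kubectl apply -n istio-system -f -
resources:
  - istiod.yaml
patches:
  - target:
      kind: Deployment
      name: istiod
    patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: istiod
      spec:
        template:
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
```

The same patch can be wired into `helm install --post-renderer` so the override survives upgrades.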

sambo2021 commented 6 months ago

Hello, just FYI: even setting hostNetwork: true did not help. [screenshot]

cccsss01 commented 6 months ago

@howardjohn @linsun

"My intent is to push users towards using a proxy in front of Istiod rather than using hostNetwork." I 100% believe in this as well; however, I currently don't see it as feasible with the Fedora 34 distro, kubeadm, and Cilium.

This documentation - https://istio.io/latest/docs/setup/install/multicluster/primary-remote_multi-network/ - only works if I expose the primary istiod with hostNetwork.

"I would put the webhook behind a load balancer that is accessible to the control plane before resorting to hostNetwork."

Unless I'm mistaken, I believe these instructions are putting the webhook behind a load balancer.

Any additional information or direction to documentation/code is always appreciated.

geoffo-dev commented 6 months ago

Hello, just FYI: even setting hostNetwork: true did not help. [screenshot]

Not sure where you are deploying it, but you might have to open the port on the nodes as well to allow the traffic through.

cccsss01 commented 6 months ago

Hello, just FYI: even setting hostNetwork: true did not help. [screenshot]

Not sure where you are deploying it, but you might have to open the port on the nodes as well to allow the traffic through.

From another host, run nmap -p <nodeport> <host-ip>. The gateway needs to be deployed, and potentially a VirtualService to open those ports (which I find weird).

geoffo-dev commented 6 months ago

I appreciate that you want to try and keep this solution as secure as possible - I am a little frustrated that we have had to resort to the use of hostNetwork for a number of applications. However, we only have to do this because we are unable to control the networks on the k8s hosts, which is a result of using cloud-provided k8s solutions - not istio's problem, I understand!

@howardjohn as you mentioned, there are many ways to do this and, as you say, we could use kustomize... There is a bit of faff involved, but I fully appreciate that it is possible. Whichever way we choose, I understand doing it this way would be considered bad... but unless we can control the network between the masters and the workers (which is unlikely), it is really the only option. We could implement a proxy, but I am sure that would come with its own challenges.

I think what we recognise, though, is that many people need to use different CNIs with cloud-managed k8s distros... cert-manager, External Secrets, OPA Gatekeeper, metrics server, nginx ingress, etc. all offer the hostNetwork option.

I also appreciate there are 100s of other requests - but this one came with a PR which implemented it...

KrisJohnstone commented 6 months ago

@howardjohn, out of the box, Argo CD has no ability to kustomize a Helm chart.

Further, during kubecon my boss was assured that this contribution would be 'welcomed'.

I was doing a quick search before I reached out, but I was going to propose using hostPort, which would enable the webhook to be exposed but not the other ports (to the best of my knowledge, anyway). Would this be acceptable?

PetrMc commented 5 months ago

There is a related request open with AWS. Please upvote.

MartinKaburu commented 5 months ago

I found that my issue came from not allowing traffic through port 15017; updating the cluster firewall as described here solved the problem.

KrisJohnstone commented 5 months ago

Here's the code that I'm using to get this working in ArgoCD: https://gist.github.com/KrisJohnstone/e263e5dfdf3a6ec29cbdf822992eaf01

linsun commented 5 months ago

Nice, thank you @MartinKaburu and @KrisJohnstone for reporting back that updating firewall rules can fix this. Are you still required to set hostNetwork to true or not for AWS cloud?

cc a few parties who may be interested: @nmnellis @ilrudie @danehans

KrisJohnstone commented 5 months ago

Nice thank you @MartinKaburu and @KrisJohnstone for reporting back that updating firefox can fix this. are you still required to set hostNetwork to true or not for AWS cloud?

cc a few parties who may be interested: @nmnellis @ilrudie @danehans

I think you have misunderstood the problem, or @MartinKaburu's comment has thrown you off. (I'm also assuming that firefox was meant to be firewall and was a brain fart.)

If your security groups are too restrictive, then traffic to the istio pod would be blocked and thus webhook calls would fail. Given it's Martin's first post in the thread and the context of his post, I'm guessing he's using the VPC CNI or a CNI that makes use of AWS ENIs; they effectively share the same networking concepts. The advantage is that you do not need to use hostNetwork or hostPort functionality. There are a number of disadvantages with either.

Alternatively, if you choose to remove the default CNI (VPC CNI) and install another CNI that doesn't utilise AWS ENIs, then because Calico (insert CNI here) pods can't be run on the master nodes, changes need to be made to items such as webhooks to enable communication with the control plane nodes.

The common pattern that's suggested is using hostNetwork: true. There are three(?) main issues with this:

  1. By enabling hostNetwork, the pod's networking is indistinguishable from that of the host.
  2. Security issues around abstract sockets (this was fixed in containerd but there are other processes on the host that might still run them).
  3. k8s network policies don't apply to resources running at the host level.

This is why I was suggesting using hostPort instead of hostNetwork. In doing so, the pods aren't run at the host level, thereby negating 1 and 2 and partially 3. Instead, traffic is forwarded from the host to the container.

AFAIK this represents the best middle ground.
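A sketch of what that hostPort variant could look like as a patch to the istiod Deployment (the container name `discovery` and port 15017 are taken from the default chart, but verify against your install):

```yaml
# Sketch: expose only the webhook port on the node, rather than joining
# the pod to the host network entirely.
spec:
  template:
    spec:
      containers:
        - name: discovery            # istiod's container in the default chart
          ports:
            - containerPort: 15017   # validating/mutating webhook port
              hostPort: 15017        # forwarded from the node; the port must be
                                     # free on every node istiod can schedule to
```

Note hostPort still requires the port to be reachable from the control plane (security groups / firewall), as discussed earlier in the thread.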

linsun commented 5 months ago

Thanks @KrisJohnstone, sorry about typing firefox instead of firewall rules, it was on Sunday night. :-(

Thank you for clarifying the issue is not with the default EKS CNI, but with different CNI and the hostPort vs hostNetwork. I agree hostPort seems a very nice middle ground as long as we can guarantee that port (15017 here) is not used by anything else on the hosts where istiod is placed.

linsun commented 5 months ago

cc @stevenctl as you are looking into chaining CNIs and EKS

svz-ya commented 2 months ago

Here's the code that I'm using to get this working in ArgoCD: https://gist.github.com/KrisJohnstone/e263e5dfdf3a6ec29cbdf822992eaf01

@KrisJohnstone The link seems to be broken. Please share your solution.

istio-policy-bot commented 1 month ago

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2024-01-30. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

MarcoDelGamba commented 1 month ago

In the end, we ended up with the exact same change proposed by @geoffo-dev.

We were forced to substitute the AWS CNI with Calico because we cannot increase the subnet CIDRs (which happens to be a common issue when multiple teams are involved).

We discarded the hostPort hypothesis since the port is opened only on the node where the pod is scheduled, and this is suitable only when you have DaemonSets talking to the underlying host.

We also tried to put an nginx ingress controller in front of the Istio control plane; however, the ingress hostname would have to be public, and that is not nice either.

If hostNetwork is not going to be supported, I think it should be documented how to work around this issue (as @ChrisJBurns already said earlier).