hcloud-talos / terraform-hcloud-talos

This repository contains a Terraform module for creating a Kubernetes cluster with Talos in the Hetzner Cloud.
https://registry.terraform.io/modules/hcloud-talos/talos
MIT License

feat(cilium): added BPF/XDP and support for encryption #34

Closed M4t7e closed 1 month ago

M4t7e commented 3 months ago

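This PR adds a few Cilium features to improve network performance and security. The `cilium status` output below shows the effect on host routing, masquerading, XDP acceleration, and encryption.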

Before:

# kubectl exec -n kube-system ds/cilium -- cilium status --verbose
Host Routing: Legacy
Masquerading: IPTables [IPv4: Enabled, IPv6: Disabled]
[...]
KubeProxyReplacement Details:
  XDP Acceleration: Disabled
[...]
Encryption: Disabled

After:

# kubectl exec -n kube-system ds/cilium -- cilium status --verbose
Host Routing: BPF
Masquerading: BPF [eth0, eth1] 10.0.16.0/20 [IPv4: Enabled, IPv6: Disabled]
[...]
KubeProxyReplacement Details:
  XDP Acceleration: Native
[...]
Encryption: Wireguard [NodeEncryption: Disabled, cilium_wg0 (Pubkey: 70TbXMoHvbEtRmsvDck+io/bK1M+QA3y2cBczPWou08=, Port: 51871, Peers: 4)]

Info: endpointRoutes is excluded here due to https://github.com/cilium/cilium/issues/28812. Additionally, this configuration will automatically become the default in one of the upcoming releases of Cilium (https://github.com/cilium/cilium/issues/14955).

Force Cilium to apply changes: kubectl -n kube-system rollout restart ds/cilium
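To compare the before and after states above without reading the whole status dump, the relevant lines can be filtered out. This is only a minimal sketch: the helper name `cilium_status_summary` is hypothetical, and the field names are taken from the output shown above.

```shell
# Keep only the status lines that this PR is expected to change.
# Reads `cilium status --verbose` output on stdin.
cilium_status_summary() {
  grep -E 'Host Routing:|Masquerading:|XDP Acceleration:|Encryption:'
}

# Usage (needs access to the cluster):
#   kubectl exec -n kube-system ds/cilium -- cilium status --verbose | cilium_status_summary
```

Running it once before and once after the rollout restart gives two short outputs that are easy to diff.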

github-actions[bot] commented 3 months ago

Commitlint-Check

Thanks for your contribution :heart:

commitlint has detected that all commit messages in this PR follow the conventional commit format :tada:

github-actions[bot] commented 3 months ago

Terraform-Check: ✅

🖌 Terraform Format: ✅
```
# Outputs:
# Errors:
```

⚙️ Terraform Init: ✅
```
# Outputs:
Initializing the backend...
Initializing provider plugins...
- Finding hashicorp/http versions matching ">= 3.4.2"...
- Finding hashicorp/helm versions matching ">= 2.12.1"...
- Finding gavinbunney/kubectl versions matching "1.14.0"...
- Finding hashicorp/tls versions matching ">= 4.0.5"...
- Finding hetznercloud/hcloud versions matching "1.47.0"...
- Finding siderolabs/talos versions matching "0.5.0"...
- Installing hashicorp/http v3.4.3...
- Installed hashicorp/http v3.4.3 (signed by HashiCorp)
- Installing hashicorp/helm v2.14.0...
- Installed hashicorp/helm v2.14.0 (signed by HashiCorp)
- Installing gavinbunney/kubectl v1.14.0...
- Installed gavinbunney/kubectl v1.14.0 (self-signed, key ID AD64217B5ADD572F)
- Installing hashicorp/tls v4.0.5...
- Installed hashicorp/tls v4.0.5 (signed by HashiCorp)
- Installing hetznercloud/hcloud v1.47.0...
- Installed hetznercloud/hcloud v1.47.0 (signed by a HashiCorp partner, key ID 5219EACB3A77198B)
- Installing siderolabs/talos v0.5.0...
- Installed siderolabs/talos v0.5.0 (signed by a HashiCorp partner, key ID AF0815C7E2EC16A8)

Partner and community providers are signed by their developers.
If you'd like to know more about provider signing, you can read about it here:
https://www.terraform.io/docs/cli/plugins/signing.html

Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

# Errors:
```

🤖 Terraform Validate: ✅
```
# Outputs:
Success! The configuration is valid.
# Errors:
```

mrclrchtr commented 3 months ago

Hi @M4t7e, thank you very much for this PR!

Could you please add a DCO to your commit? (https://github.com/hcloud-talos/terraform-hcloud-talos/pull/34/checks?check_run_id=26549766936)

mrclrchtr commented 3 months ago

And maybe we should use best-effort for loadBalancer.acceleration?

> acceleration is the option to accelerate service handling via XDP. Applicable values can be: disabled (do not use XDP), native (XDP BPF program is run directly out of the networking driver's early receive path), or best-effort (use native mode XDP acceleration on devices that support it).

mrclrchtr commented 3 months ago

I played around with it a bit today. Unfortunately, bpf.masquerade=true meant that CoreDNS could no longer connect to external DNS servers (i/o timeout). Do you have any experience with this?

M4t7e commented 3 months ago

Hi @mrclrchtr, thanks for the review! :slightly_smiling_face:

> And maybe we should use best-effort for loadBalancer.acceleration?

XDP should always be supported by Talos and Hetzner VMs, but it would not hurt to use best-effort instead. If you prefer that, I can change it.

> I played around with it a bit today. Unfortunately, bpf.masquerade=true meant that CoreDNS could no longer connect to external DNS servers (i/o timeout). Do you have any experience with this?

Good catch! It seems the Talos forwardKubeDNSToHost feature does not work together with Cilium BPF-based routing (enabled by bpf.masquerade=true). See: https://github.com/siderolabs/talos/issues/8836. I would rather deactivate forwardKubeDNSToHost, as it brings only very limited benefits, whereas BPF-based routing is the number one selling point and performance boost of Cilium.
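For reference, deactivating forwardKubeDNSToHost comes down to a small Talos machine config patch. The sketch below only prints such a patch. The helper name `hostdns_patch` is hypothetical, and how the patch gets wired into this module's generated machine configuration is not covered by this thread, so treat the apply step in the comment as an assumption.

```shell
# Print a Talos machine config patch that keeps host DNS enabled
# but stops forwarding kube-dns traffic to the host resolver.
hostdns_patch() {
  cat <<'EOF'
machine:
  features:
    hostDNS:
      enabled: true
      forwardKubeDNSToHost: false
EOF
}

# Possible manual use (assumption, the node IP is a placeholder):
#   hostdns_patch > hostdns-patch.yaml
#   talosctl -n <node-ip> patch machineconfig --patch @hostdns-patch.yaml
```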

> Could you please add a DCO to your commit? (https://github.com/hcloud-talos/terraform-hcloud-talos/pull/34/checks?check_run_id=26549766936)

I'm not really into the DCO stuff, as I find it a bit redundant and, tbh, annoying here on GitHub. I would also not put a real name or working mail address into it. Additionally, many projects interpret it differently, and I have seen no definition in this project. From a legal/license perspective, it means nothing more than that there is a sign-off in the commit message, whatever that means.

mrclrchtr commented 3 months ago

> XDP should always be supported by Talos and Hetzner VMs, but it would not hurt to use best-effort instead. If you prefer that, I can change it.

Yes, you are right, we can leave it as is.

> Good catch! It seems the Talos forwardKubeDNSToHost feature does not work together with Cilium BPF-based routing (enabled by bpf.masquerade=true). See: siderolabs/talos#8836. I would rather deactivate forwardKubeDNSToHost, as it brings only very limited benefits, whereas BPF-based routing is the number one selling point and performance boost of Cilium.

Hmm... yesterday I also disabled forwardKubeDNSToHost because I already suspected that was the cause. Unfortunately, there were still i/o timeouts towards the Hetzner DNS servers; I could see from the IPs that these were being queried. I then even switched to the Cloudflare DNS servers, and there were i/o timeouts as well.

Maybe I have to try again and restart CoreDNS.

> I'm not really into the DCO stuff, as I find it a bit redundant and, tbh, annoying here on GitHub. I would also not put a real name or working mail address into it. Additionally, many projects interpret it differently, and I have seen no definition in this project. From a legal/license perspective, it means nothing more than that there is a sign-off in the commit message, whatever that means.

Ok, I think you are right. Maybe I should disable the check.

M4t7e commented 3 months ago

> Hmm... yesterday I also disabled forwardKubeDNSToHost because I already suspected that was the cause. Unfortunately, there were still i/o timeouts towards the Hetzner DNS servers; I could see from the IPs that these were being queried. I then even switched to the Cloudflare DNS servers, and there were i/o timeouts as well.

How did you find the timeouts? I tried to reproduce them, but so far everything looks fine. By the way, are you using a K8s version compatible with Cilium? I think the current default with Talos 1.7 is K8s 1.30, and Cilium 1.15 is not compatible with K8s 1.30. They will add support in the upcoming 1.16 release.

mrclrchtr commented 3 months ago

> How did you find the timeouts? I tried to reproduce them, but so far everything looks fine.

I saw timeout errors directly in the CoreDNS logs.

> By the way, are you using a K8s version compatible with Cilium? I think the current default with Talos 1.7 is K8s 1.30, and Cilium 1.15 is not compatible with K8s 1.30. They will add support in the upcoming 1.16 release.

That's a good point! I wasn't aware of that. Yes, I have an incompatible version... I will test the RC.

I definitely want to merge this PR, but I would like to test further to get more certainty that everything is working.
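For further testing, the CoreDNS timeout errors mentioned above can be counted instead of spotted by eye. This is a rough sketch: the helper name `count_dns_timeouts` is hypothetical, and the `k8s-app=kube-dns` label selector in the usage comment is an assumption about how the CoreDNS pods are labelled in this cluster.

```shell
# Count log lines that contain an i/o timeout. Reads logs on stdin.
# grep -c exits non-zero when nothing matches, so `|| true` keeps the
# exit status at zero while still printing "0".
count_dns_timeouts() {
  grep -c 'i/o timeout' || true
}

# Usage (needs access to the cluster). With a label selector, kubectl only
# returns the last 10 lines per pod by default, hence the explicit --tail:
#   kubectl -n kube-system logs -l k8s-app=kube-dns --tail=1000 | count_dns_timeouts
```

A count that stays at 0 across a CoreDNS restart and some DNS traffic is a reasonable signal that the configuration works; a growing count reproduces the problem described here.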

github-actions[bot] commented 2 months ago

Terraform-Check (version: 1.8.5): ✅

🖌 Terraform Format: ✅
```
# Outputs:
# Errors:
```

⚙️ Terraform Init: ✅
```
# Outputs:
Initializing the backend...
Initializing provider plugins...
- Finding hetznercloud/hcloud versions matching ">= 1.48.0"...
- Finding siderolabs/talos versions matching ">= 0.5.0"...
- Finding hashicorp/http versions matching ">= 3.4.4"...
- Finding hashicorp/helm versions matching ">= 2.14.0"...
- Finding gavinbunney/kubectl versions matching ">= 1.14.0"...
- Finding hashicorp/tls versions matching ">= 4.0.5"...
- Installing hetznercloud/hcloud v1.48.0...
- Installed hetznercloud/hcloud v1.48.0 (signed by a HashiCorp partner, key ID 5219EACB3A77198B)
- Installing siderolabs/talos v0.5.0...
- Installed siderolabs/talos v0.5.0 (signed by a HashiCorp partner, key ID AF0815C7E2EC16A8)
- Installing hashicorp/http v3.4.4...
- Installed hashicorp/http v3.4.4 (signed by HashiCorp)
- Installing hashicorp/helm v2.14.0...
- Installed hashicorp/helm v2.14.0 (signed by HashiCorp)
- Installing gavinbunney/kubectl v1.14.0...
- Installed gavinbunney/kubectl v1.14.0 (self-signed, key ID AD64217B5ADD572F)
- Installing hashicorp/tls v4.0.5...
- Installed hashicorp/tls v4.0.5 (signed by HashiCorp)

Partner and community providers are signed by their developers.
If you'd like to know more about provider signing, you can read about it here:
https://www.terraform.io/docs/cli/plugins/signing.html

Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

# Errors:
```

🤖 Terraform Validate: ✅
```
# Outputs:
Success! The configuration is valid.
# Errors:
```

github-actions[bot] commented 2 months ago

Terraform-Check (version: 1.9.3): ✅

🖌 Terraform Format: ✅
```
# Outputs:
# Errors:
```

⚙️ Terraform Init: ✅
```
# Outputs:
Initializing the backend...
Initializing provider plugins...
- Finding hashicorp/http versions matching ">= 3.4.3"...
- Finding hashicorp/helm versions matching ">= 2.14.0"...
- Finding gavinbunney/kubectl versions matching ">= 1.14.0"...
- Finding hashicorp/tls versions matching ">= 4.0.5"...
- Finding hetznercloud/hcloud versions matching ">= 1.48.0"...
- Finding siderolabs/talos versions matching ">= 0.5.0"...
- Installing hashicorp/http v3.4.3...
- Installed hashicorp/http v3.4.3 (signed by HashiCorp)
- Installing hashicorp/helm v2.14.0...
- Installed hashicorp/helm v2.14.0 (signed by HashiCorp)
- Installing gavinbunney/kubectl v1.14.0...
- Installed gavinbunney/kubectl v1.14.0 (self-signed, key ID AD64217B5ADD572F)
- Installing hashicorp/tls v4.0.5...
- Installed hashicorp/tls v4.0.5 (signed by HashiCorp)
- Installing hetznercloud/hcloud v1.48.0...
- Installed hetznercloud/hcloud v1.48.0 (signed by a HashiCorp partner, key ID 5219EACB3A77198B)
- Installing siderolabs/talos v0.5.0...
- Installed siderolabs/talos v0.5.0 (signed by a HashiCorp partner, key ID AF0815C7E2EC16A8)

Partner and community providers are signed by their developers.
If you'd like to know more about provider signing, you can read about it here:
https://www.terraform.io/docs/cli/plugins/signing.html

Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

# Errors:
```

🤖 Terraform Validate: ✅
```
# Outputs:
Success! The configuration is valid.
# Errors:
```

mrclrchtr commented 2 months ago

With Cilium 1.16 and K8s 1.30, everything looks good to me. forwardKubeDNSToHost seems to work, too. @M4t7e do you have any objections? Otherwise we can merge.

github-actions[bot] commented 1 month ago

Terraform-Check (version: 1.9.4): ✅

🖌 Terraform Format: ✅
```
# Outputs:
# Errors:
```

⚙️ Terraform Init: ✅
```
# Outputs:
Initializing the backend...
Initializing provider plugins...
- Finding gavinbunney/kubectl versions matching ">= 1.14.0"...
- Finding hashicorp/tls versions matching ">= 4.0.5"...
- Finding hetznercloud/hcloud versions matching ">= 1.48.0"...
- Finding siderolabs/talos versions matching ">= 0.5.0"...
- Finding hashicorp/http versions matching ">= 3.4.4"...
- Finding hashicorp/helm versions matching ">= 2.14.0"...
- Installing hashicorp/tls v4.0.5...
- Installed hashicorp/tls v4.0.5 (signed by HashiCorp)
- Installing hetznercloud/hcloud v1.48.0...
- Installed hetznercloud/hcloud v1.48.0 (signed by a HashiCorp partner, key ID 5219EACB3A77198B)
- Installing siderolabs/talos v0.5.0...
- Installed siderolabs/talos v0.5.0 (signed by a HashiCorp partner, key ID AF0815C7E2EC16A8)
- Installing hashicorp/http v3.4.4...
- Installed hashicorp/http v3.4.4 (signed by HashiCorp)
- Installing hashicorp/helm v2.14.0...
- Installed hashicorp/helm v2.14.0 (signed by HashiCorp)
- Installing gavinbunney/kubectl v1.14.0...
- Installed gavinbunney/kubectl v1.14.0 (self-signed, key ID AD64217B5ADD572F)

Partner and community providers are signed by their developers.
If you'd like to know more about provider signing, you can read about it here:
https://www.terraform.io/docs/cli/plugins/signing.html

Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

# Errors:
```

🤖 Terraform Validate: ✅
```
# Outputs:
Success! The configuration is valid.
# Errors:
```

mrclrchtr commented 1 month ago

This configuration has been running for 2 weeks without any problems. That's why I think it looks good. Thanks again!

mrclrchtr commented 1 month ago

:tada: This PR is included in version 2.10.0 :tada:

The release is available on GitHub release

Your semantic-release bot :package::rocket:

mrclrchtr commented 1 month ago

There were problems again... Unfortunately, I can't figure out what causes them or how to solve them, so I'm reverting to masquerade: false.

https://github.com/hcloud-talos/terraform-hcloud-talos/commit/b62ce697302f86687b5980d6c8dfc20fb772f251

Rhymen commented 1 month ago

What problems did you observe @mrclrchtr? I tried a similar configuration and can still see the timeout logs in coreDNS.

mrclrchtr commented 1 month ago

Yes, I observed the timeout logs 😕