traefik / mesh

Traefik Mesh - Simpler Service Mesh
https://traefik.io/traefik-mesh
Apache License 2.0

Unable to install Maesh on AWS EKS v1.17 due to a CoreDNS issue #773

Closed 0rax closed 3 years ago

0rax commented 3 years ago

Bug Report

What did you do?

Installed traefik-maesh from Helm on an AWS EKS v1.17 (eks.3) cluster with Calico networking using:

helm repo add traefik-mesh https://helm.traefik.io/mesh
helm repo update
helm install traefik-mesh traefik-mesh/traefik-mesh

What did you expect to see?

I was expecting the controller to start and maesh to be working.

What did you see instead?

The traefik-maesh-controller pod went into CrashLoopBackOff because the traefik-maesh-prepare container kept failing. The failure seems to be linked to the CoreDNS version being reported as incompatible with maesh, even though it should be supported (CoreDNS 1.3+).

Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  13m                   default-scheduler  Successfully assigned default/traefik-mesh-controller-5f48ff8f69-vrbd9 to xxx.compute.internal
  Normal   Pulled     11m (x5 over 13m)     kubelet            Container image "traefik/mesh:v1.4.0" already present on machine
  Normal   Created    11m (x5 over 13m)     kubelet            Created container traefik-mesh-prepare
  Normal   Started    11m (x5 over 13m)     kubelet            Started container traefik-mesh-prepare
  Warning  BackOff    2m51s (x49 over 13m)  kubelet            Back-off restarting failed container

Output of prepare container log: (traefik/mesh:v1.4.0)

2020/10/28 19:16:35 command prepare error: unable to find suitable DNS provider: unsupported CoreDNS version "1.6.6-eksbuild.1"

What is your environment & configuration (arguments, provider, platform, ...)?

jspdown commented 3 years ago

@0rax Thanks for your interest in Traefik Mesh!

It appears that the issue comes from one of our dependencies: https://github.com/hashicorp/go-version. Before patching the DNS configuration, we make sure the CoreDNS version is >= 1.3 and < 1.8. However, go-version constraints consider that a version with a pre-release never matches a constraint specified without a pre-release, so "1.6.6-eksbuild.1" is rejected.

An issue is already open on their repository to understand why it behaves like this: https://github.com/hashicorp/go-version/issues/59

Until this gets sorted out, we can replace goversion.NewConstraint(">= 1.3, < 1.8") with version.GreaterThanOrEqual and version.LessThan. In this type of comparison, pre-releases are handled correctly.
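To make the difference between the two checks concrete, here is a minimal standard-library-only sketch. It does not use go-version; the `version` type and the `parseVersion`, `compare`, `constraintStyleCheck`, and `inSupportedRange` helpers are hypothetical stand-ins that only model the behavior described above (a pre-release version never satisfying a constraint written without one), and the pre-release ordering is deliberately simplified.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// version is a hypothetical, simplified stand-in for a parsed version:
// up to three numeric segments plus an optional pre-release suffix.
type version struct {
	segments   [3]int
	prerelease string
}

// parseVersion splits "1.6.6-eksbuild.1" into segments [1 6 6] and the
// pre-release "eksbuild.1". Missing segments default to zero.
func parseVersion(s string) (version, error) {
	var v version
	core := s
	if i := strings.Index(s, "-"); i >= 0 {
		core, v.prerelease = s[:i], s[i+1:]
	}
	parts := strings.Split(core, ".")
	if len(parts) > len(v.segments) {
		return v, fmt.Errorf("too many segments in %q", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return v, fmt.Errorf("invalid segment %q in %q: %w", p, s, err)
		}
		v.segments[i] = n
	}
	return v, nil
}

// mustParse panics on invalid input; convenient for fixed bounds.
func mustParse(s string) version {
	v, err := parseVersion(s)
	if err != nil {
		panic(err)
	}
	return v
}

// compare orders two versions by numeric segments first. With equal
// segments, a pre-release sorts before the plain release. Comparing two
// pre-releases lexically is a simplification made for this sketch.
func compare(a, b version) int {
	for i := range a.segments {
		switch {
		case a.segments[i] < b.segments[i]:
			return -1
		case a.segments[i] > b.segments[i]:
			return 1
		}
	}
	switch {
	case a.prerelease == b.prerelease:
		return 0
	case a.prerelease == "":
		return 1
	case b.prerelease == "":
		return -1
	default:
		return strings.Compare(a.prerelease, b.prerelease)
	}
}

// inSupportedRange models the direct comparison: lower <= v < upper.
func inSupportedRange(v, lower, upper version) bool {
	return compare(v, lower) >= 0 && compare(v, upper) < 0
}

// constraintStyleCheck models the behavior described in this thread: a
// version with a pre-release is rejected outright when the bounds carry none.
func constraintStyleCheck(v, lower, upper version) bool {
	if v.prerelease != "" && lower.prerelease == "" && upper.prerelease == "" {
		return false
	}
	return inSupportedRange(v, lower, upper)
}

func main() {
	lower, upper := mustParse("1.3"), mustParse("1.8")
	for _, raw := range []string{"1.6.6", "1.6.6-eksbuild.1", "1.8.0", "1.2.6"} {
		v := mustParse(raw)
		fmt.Printf("%-18s constraint-style=%-5t direct=%t\n",
			raw, constraintStyleCheck(v, lower, upper), inSupportedRange(v, lower, upper))
	}
}
```

In this model, the EKS build string is only rejected by the constraint-style check; the direct comparison looks at the numeric segments first, which is why swapping the check is enough to accept vendor-suffixed CoreDNS builds.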

0rax commented 3 years ago

Thank you for your quick answer. This seems like an issue that could be easily fixed.

I will try to build a custom version of the docker-image with this fix to properly check Maesh compatibility with my setup.

jspdown commented 3 years ago

@0rax Could you base your changes on v1.4? Since it's a bug fix, it would be great to have it on this version. Don't hesitate to ping me if you need help with this.

0rax commented 3 years ago

Using this patch on top of refs/tags/v1.4.0, I was able to start traefik-mesh successfully.

diff --git a/pkg/dns/dns.go b/pkg/dns/dns.go
index c62d46d..0416b87 100644
--- a/pkg/dns/dns.go
+++ b/pkg/dns/dns.go
@@ -39,7 +39,11 @@ const (
        traefikMeshBlockTrailer = "#### End Traefik Mesh Block"
 )

-var versionCoreDNS17 = goversion.Must(goversion.NewVersion("1.7"))
+var (
+       versionCoreDNS17 = goversion.Must(goversion.NewVersion("1.7"))
+       versionCoreDNS13 = goversion.Must(goversion.NewVersion("1.3"))
+       versionCoreDNS18 = goversion.Must(goversion.NewVersion("1.8"))
+)

 // Client holds the client for interacting with the k8s DNS system.
 type Client struct {
@@ -103,7 +107,7 @@ func (c *Client) coreDNSMatch(ctx context.Context) (bool, error) {
                return false, err
        }

-       if !versionConstraint.Check(version) {
+       if !(version.GreaterThanOrEqual(versionCoreDNS13) && version.LessThan(versionCoreDNS18)) {
                c.logger.Debugf("CoreDNS version is not supported, must satisfy %q, got %q", versionConstraint, version)

                return false, fmt.Errorf("unsupported CoreDNS version %q", version)

Quick note: I had to create a namespace myself, as the current Helm chart seems to install into the default namespace by default. This seems inconsistent with the documentation available at https://doc.traefik.io/traefik-mesh/install/#verify-your-installation, which says to check the installation using the traefik-mesh namespace.


For people interested in how I was able to deploy it after patching the code, I had to run the following commands:

make
docker tag traefik/mesh:latest XXXXXXX.dkr.ecr.eu-west-3.amazonaws.com/traefik-mesh:v1.4.0-eks
docker push XXXXXXX.dkr.ecr.eu-west-3.amazonaws.com/traefik-mesh:v1.4.0-eks
echo "---
apiVersion: v1
kind: Namespace
metadata:
    name: traefik-mesh" | kubectl apply -f -
helm install traefik-mesh traefik-mesh/traefik-mesh \
    --set controller.image.pullPolicy=IfNotPresent \
    --set controller.image.name=XXXXXXX.dkr.ecr.eu-west-3.amazonaws.com/traefik-mesh \
    --set controller.image.tag=v1.4.0-eks \
    --namespace=traefik-mesh

jspdown commented 3 years ago

@0rax This patch sounds good :+1:

Could you please open a Pull Request to contribute the changes upstream? We will make sure to publish a patch release for v1.4.

Thanks again for your time on this.

0rax commented 3 years ago

@jspdown Just pushed it. I took the liberty of renaming the global variables to something that better matches what they do instead of what they are, and added a test case reflecting this issue.

traefiker commented 3 years ago

Closed by #774.