linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Traefik Router unable to communicate with meshed services when linkerd inbound policy is all-authenticated. #12754

Open palashbasik opened 1 week ago

palashbasik commented 1 week ago

What is the issue?

I installed Linkerd via the Helm chart in an on-prem K3S cluster, in the linkerd namespace. I am using the Traefik ingress controller, which is deployed in the traefik namespace, and I have a few microservices deployed in the default namespace. I configured a Traefik router to access the microservices from outside the cluster, for example:

routers:
  example-service:
    entryPoints:
      - websecure
    rule: "Host(`example.app.com`) && PathPrefix(`/`)"
    tls:
      certResolver: leresolver
    service: example-service
services:
  example-service:
    loadBalancer:
      servers:
        - url: http://example-service.default.svc.cluster.local:8000

When Linkerd is deployed with defaultInboundPolicy: "all-unauthenticated", I can access all the microservices from the browser.

proxy:
  defaultInboundPolicy: "all-unauthenticated"

But when it is deployed with defaultInboundPolicy: "all-authenticated", I can't access the microservices from the browser.

I am new to the Linkerd service mesh and am unsure what causes the problem described above.

How can it be reproduced?

  1. Provision a K3S cluster.
  2. Install Traefik in the traefik namespace with the following pod annotation.
    deployment:
      podAnnotations:
        linkerd.io/inject: ingress
  3. Install Linkerd in the linkerd namespace with the values below.
    proxy:
      defaultInboundPolicy: "all-authenticated"
  4. Deploy an application in the default namespace.
  5. Annotate the default namespace with linkerd.io/inject=enabled.
    kubectl annotate namespace default linkerd.io/inject=enabled
  6. Restart the pods in the default namespace so that the Linkerd sidecar is injected.
  7. Create a router for Traefik in values.yaml.
    routers:
      example-service:
        entryPoints:
          - websecure
        rule: "Host(`example.app.com`) && PathPrefix(`/`)"
        tls:
          certResolver: leresolver
        service: example-service
    services:
      example-service:
        loadBalancer:
          servers:
            - url: http://example-service.default.svc.cluster.local:8000
  8. In a browser, try to access example.app.com. The application is not reachable.
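
For reference, step 5 can also be expressed declaratively. The manifest below is a minimal sketch of the same annotation on the default namespace. Applying it with kubectl apply is an assumption about the workflow, not something stated in this report.

```yaml
# Declarative equivalent of:
#   kubectl annotate namespace default linkerd.io/inject=enabled
apiVersion: v1
kind: Namespace
metadata:
  name: default
  annotations:
    linkerd.io/inject: enabled
```

Pods that were already running still need a restart (step 6) before the proxy is injected.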

Logs, error output, etc

Logs from the linkerd-proxy container in the Traefik pod:

[  3083.814154s]  INFO ThreadId(01) inbound:server{port=8443}: linkerd_app_inbound::policy::tcp: Connection denied server.group= server.kind=default server.name=all-authenticated tls=Some(Passthru { sni: ServerId(Name("example.app.com")) }) client=10.42.0.1:54877
[  3083.814178s]  INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=unauthorized connection on default/all-authenticated client.addr=10.42.0.1:54877 server.addr=10.42.0.58:8443
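
One possible reading of these two lines: the denial happens on the inbound side of Traefik's own proxy (port 8443, presumably the websecure entrypoint), and the client 10.42.0.1 is an unmeshed caller presenting plain TLS with SNI example.app.com. Under all-authenticated, a connection without a Linkerd mesh identity is rejected by the proxy. If that reading is right, relaxing the inbound policy for the Traefik pods only might be enough. The sketch below assumes the Traefik Helm values layout shown in this issue and uses Linkerd's config.linkerd.io/default-inbound-policy annotation. It has not been tested against this cluster.

```yaml
# Sketch for the Traefik Helm override values (untested assumption).
deployment:
  podAnnotations:
    linkerd.io/inject: ingress
    # Override the cluster-wide default for Traefik pods only, so that
    # unmeshed clients such as browsers can reach the entrypoints.
    config.linkerd.io/default-inbound-policy: all-unauthenticated
```

The cluster-wide default would stay all-authenticated, so the meshed services in the default namespace would still require mesh identity from their callers.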

Output of linkerd check -o short:

linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2024-06-21T09:18:37Z
    see https://linkerd.io/2/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    unsupported version channel: stable-2.14.10
    see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running stable-2.14.10 but cli running edge-24.6.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
        * linkerd-identity-6dbb555cf7-q9c8g (stable-2.14.10)
        * metrics-api-b85485b99-2cbkt (stable-2.14.10)
        * web-58979b9448-72znj (stable-2.14.10)
        * tap-injector-c48598d4c-chc25 (stable-2.14.10)
        * tap-7999d688ff-kgzqh (stable-2.14.10)
        * linkerd-proxy-injector-7f6964c9b9-fx8vx (stable-2.14.10)
        * linkerd-destination-5dc7694bc5-t4glt (stable-2.14.10)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-identity-6dbb555cf7-q9c8g running stable-2.14.10 but cli running edge-24.6.2
    see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
        * linkerd-identity-6dbb555cf7-q9c8g (stable-2.14.10)
        * metrics-api-b85485b99-2cbkt (stable-2.14.10)
        * web-58979b9448-72znj (stable-2.14.10)
        * tap-injector-c48598d4c-chc25 (stable-2.14.10)
        * tap-7999d688ff-kgzqh (stable-2.14.10)
        * linkerd-proxy-injector-7f6964c9b9-fx8vx (stable-2.14.10)
        * linkerd-destination-5dc7694bc5-t4glt (stable-2.14.10)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    linkerd-identity-6dbb555cf7-q9c8g running stable-2.14.10 but cli running edge-24.6.2
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

$ linkerd version
Client version: edge-24.6.2
Server version: stable-2.14.10

$ helm version
version.BuildInfo{Version:"v3.11.3", GitCommit:"323249351482b3bbfc9f5004f65d400aa70f9ae7", GitTreeState:"clean", GoVersion:"go1.20.3"}

$ kubectl version --short
Client Version: v1.27.1
Kustomize Version: v5.0.1
Server Version: v1.25.6+k3s1

Cluster type: Single node on-prem K3S

Ingress Controller: Traefik v2.9.8

alpeb commented 1 week ago

You'll need some extra config to get Traefik to play nice with Linkerd. Please check the detailed instructions in the docs.

palashbasik commented 1 week ago

I tried adding the middleware, but the problem persists.

Below is a snippet of the Traefik ConfigMap.

http:
  middlewares:
    l5d-header:
      headers:
        customRequestHeaders:
          l5d-dst-override: "example-service.default.svc.cluster.local:8000"
  routers:
    example-service:
      entryPoints:
        - websecure
      rule: "Host(`example.app.com`) && PathPrefix(`/`)"
      middleware:
        - l5d-header
      tls:
        certResolver: leresolver
      service: example-service
  services:
    example-service:
      loadBalancer:
        servers:
          - url: http://example-service.default.svc.cluster.local:8000

Note: I am using the Let's Encrypt certResolver in Traefik for TLS.
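
One detail that may be worth checking in the snippet above: in Traefik's file-provider configuration, a router lists its middlewares under the plural key middlewares, so a middleware key would likely be ignored or rejected. A minimal sketch of the router with the header middleware attached, reusing the names from this thread:

```yaml
http:
  middlewares:
    l5d-header:
      headers:
        customRequestHeaders:
          # Tells the Linkerd proxy which service the request is destined for.
          l5d-dst-override: "example-service.default.svc.cluster.local:8000"
  routers:
    example-service:
      entryPoints:
        - websecure
      rule: "Host(`example.app.com`) && PathPrefix(`/`)"
      # Plural key; each entry names a middleware defined above.
      middlewares:
        - l5d-header
      tls:
        certResolver: leresolver
      service: example-service
  # The services section is unchanged from the snippet above.
```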

kflynn commented 3 days ago

@palashbasik Have you meshed Traefik using linkerd.io/inject: ingress? It's not hard to miss that bit in our docs for Traefik v2... 😐

kflynn commented 3 days ago

@palashbasik I see that you listed that you're using ingress mode above, but it's worth double-checking. 🙂 Also: instead of the Traefik configmap, can we see the YAML you're configuring Traefik with?

palashbasik commented 3 days ago

Below is the override-values.yaml file for Traefik.

deployment:
  replicas: 3
  podAnnotations:
    linkerd.io/inject: ingress
# Pod disruption budget
podDisruptionBudget:
  enabled: true
  # maxUnavailable: 1
  # maxUnavailable: 33%
  minAvailable: 1
  # minAvailable: 25%

# Enable experimental features
experimental:
  v3:
    enabled: true
  plugins:
    enabled: true

# Create an IngressRoute for the dashboard
ingressRoute:
  dashboard:
    enabled: true

## Logs
## https://docs.traefik.io/observability/logs/
logs:
  ## Traefik logs concern everything that happens to Traefik itself (startup, configuration, events, shutdown, and so on).
  general:
    # By default, the logs use a text format (common), but you can also ask for the json format in the format option
    # format: json
    # By default, the level is set to ERROR.
    # Alternative logging levels are DEBUG, PANIC, FATAL, ERROR, WARN, and INFO.
    level: INFO
  access:
    # To enable access logs
    enabled: true
    ## By default, logs are written using the Common Log Format (CLF) on stdout.
    ## To write logs in JSON, use json in the format option.
    format: json
    # filePath: "/var/log/traefik/access.log"
    ## To write the logs in an asynchronous fashion, specify a bufferingSize option.
    ## This option represents the number of log lines Traefik will keep in memory before writing
    ## them to the selected output. In some cases, this option can greatly help performances.
    # bufferingSize: 100
    ## Filtering https://docs.traefik.io/observability/access-logs/#filtering
    filters: {}
      # statuscodes: "200,300-302"
      # retryattempts: true
      # minduration: 10ms
    ## Fields
    ## https://docs.traefik.io/observability/access-logs/#limiting-the-fieldsincluding-headers
    fields:
      general:
        defaultmode: keep
        names:
          StartUTC: drop    
          StartLocal: drop   
          RouterName: drop      
          ServiceAddr: drop  
          ClientPort: drop 
          ClientUsername: drop      
          RequestHost: drop 
          RequestPort: drop     
          RequestMethod: drop 
          RequestPath: drop 
          RequestProtocol: drop 
          RequestScheme: drop  
          RequestContentSize: drop  
          OriginDuration: drop  
          OriginContentSize: drop   
          OriginStatus: drop    
          OriginStatusLine: drop        
          DownstreamStatusLine: drop        
          RequestCount: drop    
          GzipRatio: drop 
          Overhead: drop    
          TLSVersion: drop 
          TLSCipher: drop  

metrics:
  ## Prometheus is enabled by default.
  ## It can be disabled by setting "prometheus: null"
  prometheus:
    ## Entry point used to expose metrics.
    entryPoint: metrics
    addEntryPointsLabels: true
    addRoutersLabels: true
    addServicesLabels: true
    ## Buckets for latency metrics. Default="0.1,0.3,1.2,5.0"
    # buckets: "0.5,1.0,2.5"
    ## When manualRouting is true, it disables the default internal router in
    ## order to allow creating a custom router for prometheus@internal service.
    # manualRouting: true

tracing:
  jaeger:
    collector:
      endpoint: http://jaeger-collector.monitoring.svc.cluster.local:14268/api/traces

secret:
  enabled: true 

# Environment variables to be passed to Traefik's binary
env: 
  - name: CLOUDFLARE_EMAIL
    value: <your-email-id>
  - name: CLOUDFLARE_API_KEY
    valueFrom:
      secretKeyRef:
        name: traefik-secret
        key: CLOUDFLARE_API_KEY

# Configure ports
ports:
  web:
    expose: false           
  websecure:
    # Enable this entrypoint as a default entrypoint. When a service doesn't explicitly set an entrypoint, it will only use this entrypoint.
    # asDefault: true
    tls:
      enabled: true
      # this is the name of a TLSOption definition
      # options: ""
      certResolver: "leresolver"
      # domains: []    

## Create HorizontalPodAutoscaler object.
##
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60

# Enable persistence using Persistent Volume Claims
# ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
# It can be used to store TLS certificates, see `storage` in certResolvers
persistence:
  enabled: true

certResolvers: 
  leresolver:
    # for challenge options cf. https://doc.traefik.io/traefik/https/acme/
    email: <your-email-id>
    dnsChallenge:
      # also add the provider's required configuration under env
      # or expand them from secrets/configmaps with envfrom
      # cf. https://doc.traefik.io/traefik/https/acme/#providers
      provider: cloudflare
      # add further options for the dns challenge as needed
      # cf. https://doc.traefik.io/traefik/https/acme/#dnschallenge
      delayBeforeCheck: 30
      resolvers:
        - 1.1.1.1
        - 8.8.8.8
    tlsChallenge: false
    # httpChallenge:
    #   entryPoint: "web"
    # It has to match the path with a persistent volume
    storage: /data/acme.json

additionalArguments:
  - "--providers.file.filename=/config/config.yaml"
volumes:
  - name: '{{ printf "%s-configs" .Release.Name }}'
    mountPath: '/config'
    type: configMap

resources:
  requests:
    cpu: "100m"
    memory: "1Gi"
  limits:
    cpu: "500m"
    memory: "2Gi"

config: |-
  http:
    middlewares:
      corsHeader:
        headers:
          accessControlAllowCredentials: true
          accessControlAllowHeaders: 
          - Accept
          - Access-Control-Request-Headers 
          - Access-Control-Request-Method 
          - Authorization 
          - Content-Type 
          - Last-Modified 
          - Origin 
          - X-Requested-With
          - Sec-WebSocket-Key
          accessControlAllowMethods: "*"
          accessControlAllowOriginList: 
          - http://localhost:3000              
          accessControlMaxAge: 100
          addVaryHeader: true
      basic-admin-auth:
        basicAuth:
          users:
            # password - password - hashed with bcrypt
            - "admin:$2a$12$fpgiRwj7e2XBv/U4LWDvr.Jr7sRPECklDxitBdXDkBzLS6r4TU5Pm"
      strip-service-prefix:
        # Modifies "/team/hello" to "/hello"
        replacePathRegex:
          regex: '^/$1/$1/(.*)'
          #regex: '^/.*?/(.*)'
          replacement: '/$1'             
    routers:
      example-service:
        entryPoints:
          - websecure
        # Should prevent any route containing the word "internal" to be blocked
        rule: "Host(`example.app.com`) && PathPrefix(`/`)"
        middlewares:
          - strip-service-prefix
        tls:
          certResolver: leresolver          
        service: example-service
    services:
      # Define how to reach an existing service on our infrastructure
      example-service:
        loadBalancer:
          servers:
            - url: http://example-service.default.svc.cluster.local:8000

With the provided Traefik configuration and Linkerd deployed with defaultInboundPolicy set to "all-authenticated", I can't access https://example.app.com from the browser.

Note: The host example.app.com mentioned above is solely for illustrative purposes.
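
Since the proxy log earlier in this issue shows the denial on Traefik's inbound port 8443, one option that might help is to authorize unauthenticated clients on that port only, leaving the cluster default at all-authenticated. The sketch below uses Linkerd's Server, NetworkAuthentication, and AuthorizationPolicy resources. The resource names, the pod label app.kubernetes.io/name: traefik, and the 0.0.0.0/0 CIDR are illustrative assumptions. The API versions are assumed to match what stable-2.14 serves and are worth confirming with kubectl api-resources before applying.

```yaml
# Sketch only: names, labels, and CIDR are hypothetical.
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  namespace: traefik
  name: traefik-websecure
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: traefik   # assumed pod label
  port: 8443                            # inbound port from the proxy log
---
apiVersion: policy.linkerd.io/v1alpha1
kind: NetworkAuthentication
metadata:
  namespace: traefik
  name: any-client
spec:
  networks:
    - cidr: 0.0.0.0/0                   # any IPv4 client; narrow as needed
---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  namespace: traefik
  name: traefik-websecure-allow-all
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: traefik-websecure
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: NetworkAuthentication
      name: any-client
```

Compared with a pod-wide policy override, this leaves every other Traefik port under the all-authenticated default.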