lensapp / lens

Lens - The way the world runs Kubernetes
https://k8slens.dev/
MIT License
22.52k stars 1.47k forks

Metrics is not available when using AWS EKS + Weavenet #561

Open zerowebcorp opened 4 years ago

zerowebcorp commented 4 years ago

Describe the bug
The metrics feature is not available on an EKS cluster with Weave Net as the CNI. It works fine on a bare-metal installation. After removing the AWS CNI from the EKS cluster, installing the metrics server requires changing the networking to hostNetwork = true and the communication IP to Internal. I have noticed that Lens uses its own metrics server and Prometheus, so some additional tweaks may be required.
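For illustration, those two settings correspond to an excerpt like the following on the upstream metrics-server Deployment in kube-system (a sketch, not the exact manifest; Lens's bundled metrics stack may use different names):

...
spec:
  template:
    spec:
      hostNetwork: true      # "HostNetwork = true"
      containers:
        - name: metrics-server
          args:
            - '--kubelet-preferred-address-types=InternalIP'   # "communication IP = Internal"
            # ...remaining default args and fields unchanged...
...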

To Reproduce
Steps to reproduce the behavior:

  1. Set up a cluster on AWS EKS
  2. Remove the AWS CNI
  3. Install Weave Net
  4. Connect using Lens and install the built-in metrics server

Expected behavior
Lens UI shows the metrics.

Screenshots
Lens UI doesn't show metrics and complains "Metrics not available at the moment".

Environment (please complete the following information):

toabi commented 4 years ago

Funny, I just replaced the AWS VPC CNI with Calico and had the same effect… metrics don't work and the whole thing is generally more sluggish than before. It worked before, so it really has something to do with the CNI.

Lens 3.5.2 on macOS

simwak commented 4 years ago

Exactly the same with EKS and Cilium. No metrics and a lot slower than before.

Lens 3.5.2 on Windows

toabi commented 4 years ago

Yes, after a few days of using Lens with an EKS + CNI cluster, it's pretty close to unusable actually :(

I wonder if it's so slow because it constantly tries to fetch metrics that it can't reach?

simwak commented 4 years ago

Is it in overlay mode or the standard EKS CNI mode (where every pod gets a VPC-routable address)? Secondary private IPs on your worker nodes mean you are using the EKS CNI mode.

toabi commented 4 years ago

No, it's a full replacement with an overlay network. Pods get 192.168.* addresses.

Siddharthk commented 4 years ago

Same issue using Lens 3.5.3 on EKS with the AWS CNI + Calico.

toabi commented 4 years ago

The issue still exists in 3.6

mordax7 commented 3 years ago

The problem still persists in 3.6.8.

gelblars commented 3 years ago

Workaround: edit the Prometheus StatefulSet and add hostNetwork: true to the pod spec:

...
spec:
  replicas: 1
  selector:
    matchLabels:
      name: prometheus
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: prometheus
    spec:
      volumes:
        - name: config
          configMap:
            name: prometheus-config
            defaultMode: 420
        - name: rules
          configMap:
            name: prometheus-rules
            defaultMode: 420
      initContainers:
        - name: chown
          image: 'docker.io/alpine:3.9'
          command:
            - chown
            - '-R'
            - '65534:65534'
            - /var/lib/prometheus
          resources: {}
          volumeMounts:
            - name: data
              mountPath: /var/lib/prometheus
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      containers:
        - name: prometheus
          image: 'docker.io/prom/prometheus:v2.17.2'
          args:
            - '--web.listen-address=0.0.0.0:9090'
            - '--config.file=/etc/prometheus/prometheus.yaml'
            - '--storage.tsdb.path=/var/lib/prometheus'
            - '--storage.tsdb.retention.time=2d'
            - '--storage.tsdb.retention.size=5GB'
            - '--storage.tsdb.min-block-duration=2h'
            - '--storage.tsdb.max-block-duration=2h'
          ports:
            - name: web
              hostPort: 9090
              containerPort: 9090
              protocol: TCP
          resources:
            requests:
              cpu: 100m
              memory: 512Mi
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: rules
              mountPath: /etc/prometheus/rules
            - name: data
              mountPath: /var/lib/prometheus
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 9090
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 10
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 9090
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 10
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      serviceAccountName: prometheus
      serviceAccount: prometheus
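      # added for this workaround: run on the node network so Prometheus gets a
      # VPC-routable node IP that the EKS control plane (and Lens) can reach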
      hostNetwork: true
      securityContext: {}
...
randrusiak commented 3 years ago

I confirm that the problem still exists in version 4.1.2. Using hostNetwork for Prometheus is not a problem, but I also noticed that Lens works more slowly with EKS + Calico CNI. Why does the CNI have an impact on Lens performance? Does anyone know?

PatTheSilent commented 3 years ago

I've had issues with Calico and EKS as well. I don't have a clue why it happens, but it eventually forced me to revert back to the VPC CNI. Ever since I did that, everything has worked fine. I know it's not much input; I just wanted to share that custom CNIs have problems on EKS and it's very much reproducible.

randrusiak commented 3 years ago

@PatTheSilent Could you tell me what kind of issues you had with Calico? I'm asking because I haven't noticed any issues besides the ones related to Lens. I would appreciate it if you shared some details.

PatTheSilent commented 3 years ago

@randrusiak From what I could gather, what I was experiencing was a known possibility. I've mostly had issues with cross-AZ traffic and connecting to things outside the cluster, like an RDS database. I'm not excluding that I misconfigured something, but I've spent quite some time on ENI adjustments, Security Groups, IAM permissions, the Calico docs, and their GitHub issues, so I feel fairly safe saying I didn't mess it up.

Edit: oh, and the fact that basically no admission webhooks (or anything else the control plane has to call) work, because the pods are in a completely different network than the control plane, and you have to go and manage ports and hostNetwork FOR EACH AND EVERY EXTERNAL COMPONENT (a sketch of what that means per component follows below). Calling it a PITA is an understatement.
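A minimal sketch of that per-component change, with hypothetical names and image (not from any real manifest): a webhook backend forced onto the node network so the EKS control plane can reach it across the overlay.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-webhook                # hypothetical name
  namespace: example                   # hypothetical namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-webhook
  template:
    metadata:
      labels:
        app: example-webhook
    spec:
      hostNetwork: true                    # put the pod on the node's VPC-routable IP
      dnsPolicy: ClusterFirstWithHostNet   # keep in-cluster DNS while on the host network
      containers:
        - name: webhook
          image: registry.example.com/webhook:latest   # hypothetical image
          ports:
            - containerPort: 8443
              hostPort: 8443               # the port has to be picked and managed per component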

randrusiak commented 3 years ago

@PatTheSilent Maybe they improved Calico, or you misconfigured something as you said, because I tested Calico on my EKS cluster and everything is working as expected. Of course, there is still a need to use hostNetwork for services that must be reachable from the control plane, such as metrics.

Nokel81 commented 3 years ago

Is this still an issue for 4.1.4?

mvlm commented 3 years ago

Hi, just leave the aws-node DaemonSet in place and set up calico-node as well (using the Calico manifest from the AWS docs). You will get working Calico policy, and the EKS control plane will still be able to connect to the worker nodes. The Calico documentation is a little misleading here: you shouldn't delete aws-node.

Nokel81 commented 3 years ago

Okay thanks, so this sounds like it is resolved.

toabi commented 3 years ago

> Hi, just leave the aws-node DaemonSet in place and set up calico-node as well (using the Calico manifest from the AWS docs). You will get working Calico policy, and the EKS control plane will still be able to connect to the worker nodes. The Calico documentation is a little misleading here: you shouldn't delete aws-node.

Well, if you only care about network policy, that's okay. BUT there are cases in which you want the Calico CNI instead of the AWS CNI: the AWS CNI consumes an IP from your AWS subnets for every pod.

I mean, Calico itself tells you about those options: https://docs.projectcalico.org/getting-started/kubernetes/managed-public-cloud/eks

I don't use Lens anymore, so I can't tell whether it's solved. But this ticket is specifically about the Calico CNI on EKS, and "Don't use the Calico CNI" is not a valid resolution ;)

rmalchow commented 2 years ago

Another workaround:

Apply a (possibly modified) version of the manifest below, and configure Lens to point to "prometheus/haproxy:9090":



---
# haproxy config
apiVersion: v1
kind: ConfigMap
metadata:
  name: haproxy
  namespace: prometheus
data:
  haproxy.cfg: |+
    global
      log /dev/log  local0
      log /dev/log  local1 notice
      daemon

    defaults
      log  global
      mode  tcp
      option  tcplog
      option  dontlognull
      timeout connect 5000
      timeout client  50000
      timeout server  50000

    frontend haproxynode
        bind *:9090
        mode http
        default_backend backendnodes

    backend backendnodes
        mode http

        ##### your prometheus SERVICE
        server prometheus prometheus-foobar.prometheus.svc.cluster.local:9090 check
        ##### /your prometheus SERVICE

---
# haproxy for proxying to prometheus. key points:
#
#      dnsPolicy: ClusterFirstWithHostNet
#      hostNetwork: true
#
# also, this is outside the reach of the operator.
#
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: haproxy
  namespace: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus-proxy
  template:
    metadata:
      labels:
        app: prometheus-proxy
    spec:
      containers:
      - image: haproxy
        imagePullPolicy: Always
        name: haproxy
        volumeMounts:
        - mountPath: /usr/local/etc/haproxy
          name: vol1
      dnsConfig: {}
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      restartPolicy: Always
      volumes:
      - configMap:
          defaultMode: 0777
          name: haproxy
          optional: false
        name: vol1
---
# service resource pointing to the
# haproxy. this service goes into
# the manual lens configuration
apiVersion: v1
kind: Service
metadata:
  name: haproxy
  namespace: prometheus
spec:
  clusterIP: None
  type: ClusterIP
  ports:
  - name: prometheus
    port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: prometheus-proxy