kinvolk / lokomotive

🪦 DISCONTINUED: Further Lokomotive development has been discontinued. Lokomotive is a 100% open-source, easy-to-use, and secure Kubernetes distribution from the volks at Kinvolk.
https://kinvolk.io/lokomotive-kubernetes/
Apache License 2.0

Rook Ceph: When mgr pod is restarted then dashboard port forward does not work #585

Open surajssd opened 4 years ago

surajssd commented 4 years ago

This is what happens when you port-forward and try to reach localhost:8443 in a browser: the forward fails with the error below, and the browser shows "Connection Closed":

$ kubectl -n rook port-forward svc/rook-ceph-mgr-dashboard 8443:8443
Forwarding from 127.0.0.1:8443 -> 8443
Forwarding from [::1]:8443 -> 8443
Handling connection for 8443
E0609 00:11:54.021904 2368833 portforward.go:400] an error occurred forwarding 8443 -> 8443: error forwarding port 8443 to pod db763885aa7d42544815ffeda2b36e334a24a59094c7e84a7569c864475c77a1, uid : exit status 1: 2020/06/08 18:41:53 socat[1448] E connect(6, AF=2 127.0.0.1:8443, 16): Connection refused Handling connection for 8443

The specific failure is socat's connection attempt inside the pod being refused: connect(6, AF=2 127.0.0.1:8443, 16): Connection refused.
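The socat line shows the forwarder connecting to 127.0.0.1:8443 inside the mgr pod, which suggests the dashboard is no longer listening on loopback after the restart. A quick way to confirm this (a sketch; the mgr Deployment name rook-ceph-mgr-a, a recent kubectl, and curl being available in the image are assumptions):

$ kubectl -n rook exec deploy/rook-ceph-mgr-a -- curl -ksS https://127.0.0.1:8443

If this also fails with "Connection refused", the dashboard is bound to a specific address rather than loopback or 0.0.0.0.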

To fix this, start a toolbox pod and run the following command:

ceph config set mgr mgr/dashboard/server_addr 0.0.0.0
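For reference, once the toolbox Deployment below is applied, the same workaround can be run from outside the cluster; a sketch, assuming the Deployment is named rook-ceph-tools in the rook namespace and that the dashboard module has to be restarted for the new bind address to take effect:

$ kubectl -n rook exec deploy/rook-ceph-tools -- ceph config set mgr mgr/dashboard/server_addr 0.0.0.0
$ kubectl -n rook exec deploy/rook-ceph-tools -- ceph mgr module disable dashboard
$ kubectl -n rook exec deploy/rook-ceph-tools -- ceph mgr module enable dashboard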

We should figure out a way to fix this automatically, so that no such manual change is required. ~One possible fix could be to deploy multiple mgr pods, if that is possible.~

If there is no way to fix this automatically, then at least document that this error exists and that the user has to apply the workaround manually.


Toolbox pod config:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: rook-ceph-tools
  namespace: rook
  labels:
    app: rook-ceph-tools
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rook-ceph-tools
  template:
    metadata:
      labels:
        app: rook-ceph-tools
    spec:
      # affinity:
      #   nodeAffinity:
      #     requiredDuringSchedulingIgnoredDuringExecution:
      #       nodeSelectorTerms:
      #         - matchExpressions:
      #           - key: "pool.onesignal.io"
      #             operator: In
      #             values:
      #             - storage
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: rook-ceph-tools
        image: rook/ceph:master
        command: ["/tini"]
        args: ["-g", "--", "/usr/local/bin/toolbox.sh"]
        imagePullPolicy: IfNotPresent
        env:
          - name: ROOK_ADMIN_SECRET
            valueFrom:
              secretKeyRef:
                name: rook-ceph-mon
                key: admin-secret
        volumeMounts:
          - mountPath: /etc/ceph
            name: ceph-config
          - name: mon-endpoint-volume
            mountPath: /etc/rook
      volumes:
        - name: mon-endpoint-volume
          configMap:
            name: rook-ceph-mon-endpoints
            items:
            - key: data
              path: mon-endpoints
        - name: ceph-config
          emptyDir: {}
      # tolerations:
      #   - key: "pool.onesignal.io"
      #     operator: "Equal"
      #     value: storage
      #     effect: "NoSchedule"
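To use the toolbox (a sketch; the file name is hypothetical):

$ kubectl -n rook apply -f rook-ceph-tools.yaml
$ kubectl -n rook wait --for=condition=available deploy/rook-ceph-tools
$ kubectl -n rook exec -it deploy/rook-ceph-tools -- bash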
invidian commented 3 years ago

@surajssd is this fixed after recent rook ceph updates?

surajssd commented 3 years ago

This is still an issue and the mitigation mentioned in the description does not work anymore.

The server_addr setting is applied but has no effect:

[root@rook-ceph-tools-8656784d5-nkbcf /]# ceph config get mgr
WHO     MASK  LEVEL     OPTION                              VALUE    RO
global        basic     log_file                                     * 
mgr           advanced  mgr/balancer/active                 true       
mgr           advanced  mgr/balancer/mode                   upmap      
mgr           advanced  mgr/dashboard/server_addr           0.0.0.0  * 
global        advanced  mon_allow_pool_delete               true       
global        advanced  mon_cluster_log_file                           
global        advanced  mon_pg_warn_min_per_osd             0          
global        advanced  osd_pool_default_pg_autoscale_mode  on         
global        advanced  rbd_default_features                3
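One way to see which address the dashboard actually serves on is ceph mgr services, run from the toolbox (a sketch):

$ kubectl -n rook exec deploy/rook-ceph-tools -- ceph mgr services

A possible explanation (an assumption, not verified here) is that a per-instance override such as mgr/dashboard/a/server_addr, which Rook may set to the pod IP, takes precedence over the generic mgr/dashboard/server_addr key; that can be checked with:

$ kubectl -n rook exec deploy/rook-ceph-tools -- ceph config dump | grep dashboard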