gluster / gluster-kubernetes

GlusterFS Native Storage Service for Kubernetes
Apache License 2.0

centos 7.6 gk-deploy stuck in heketi - never ending error #564

Open ananbas opened 5 years ago

ananbas commented 5 years ago

I'm running Kubernetes with 3 masters and 3 workers. When running gk-deploy, it always gets stuck at the heketi topology load. The required kernel modules are loaded; checked.
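
For reference, a quick way to confirm that the kernel modules the gk-deploy prerequisites call out are actually loaded on every node (a sketch; dm_snapshot, dm_mirror and dm_thin_pool are the ones the setup guide lists):

# run on each master and worker node
for mod in dm_snapshot dm_mirror dm_thin_pool; do
  lsmod | grep -q "^${mod}" || echo "module ${mod} is not loaded"
done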

It's been two days of browsing related searches with no working solution, and many rounds of ./gk-deploy -gv --abort && ./gk-deploy -gv :)

Creating cluster ... ID: 67c906750759278fc0ce67fd7f100c43
Allowing file volumes on cluster.
Allowing block volumes on cluster.
Creating node bdprdsn0002 ... return 0
heketi topology loaded.
/bin/kubectl -n default exec -i deploy-heketi-5f6c465bb8-fk6pt -- heketi-cli -s http://localhost:8080 --user admin --secret '' setup-openshift-heketi-storage --listfile=/tmp/heketi-storage.json  2>&1
Error: Failed to allocate new volume: No nodes in cluster
command terminated with exit code 255
Failed on setup openshift heketi storage
This may indicate that the storage must be wiped and the GlusterFS nodes must be reset.
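
For anyone hitting the same "No nodes in cluster" message, it can help to dump what deploy-heketi actually has in its database at that point (a sketch reusing the same pod and credentials as above; heketi-cli topology info lists the clusters, nodes and devices it knows about):

/bin/kubectl -n default exec -i deploy-heketi-5f6c465bb8-fk6pt -- \
  heketi-cli -s http://localhost:8080 --user admin --secret '' topology info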

Running heketi topology load from inside the pod gives no further information; it just exits immediately:

[root@bdprdmn0003 deploy]# /bin/kubectl -n default exec -i deploy-heketi-5f6c465bb8-fk6pt -- heketi-cli -s http://localhost:8080 --user admin --secret '' topology load --json=/etc/heketi/topology.json
Creating cluster ... ID: ab6482fea4a49b1f0abb272e52ad595e
    Allowing file volumes on cluster.
    Allowing block volumes on cluster.
    Creating node bdprdsn0002 ...

Running heketi topology load on the master node gives Unable to create node: New Node doesn't have glusterd running:

[root@bdprdmn0003 deploy]# heketi-cli -s http://10.110.220.216:8080 --user admin --secret '' topology load --json=topology.json
Creating cluster ... ID: 65fa756407316218be19fc81fea9e516
    Allowing file volumes on cluster.
    Allowing block volumes on cluster.
    Creating node bdprdsn0002 ... Unable to create node: New Node doesn't have glusterd running
    Creating node bdprdsn0003 ... Unable to create node: New Node doesn't have glusterd running
    Creating node bdprdsn0004 ... Unable to create node: New Node doesn't have glusterd running
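
A sketch for checking glusterd directly inside the glusterfs pods (assuming the stock gluster-centos daemonset image, which runs glusterd under systemd; the pod names are in the kubectl get pod output below):

# assumes glusterd runs under systemd inside the pod (stock gluster-centos image)
/bin/kubectl -n default exec -i glusterfs-b8qms -- systemctl status glusterd
/bin/kubectl -n default exec -i glusterfs-b8qms -- gluster peer status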

kubectl get pod

NAME                             READY   STATUS    RESTARTS   AGE
deploy-heketi-5f6c465bb8-fk6pt   1/1     Running   0          19m
glusterfs-b8qms                  1/1     Running   0          20m
glusterfs-hrjfp                  1/1     Running   0          20m
glusterfs-kcqvp                  1/1     Running   0          20m

kubectl get nodes

NAME          STATUS   ROLES    AGE   VERSION
bdprdmn0003   Ready    master   19h   v1.13.3
bdprdmn0004   Ready    master   19h   v1.13.3
bdprdmn0005   Ready    master   19h   v1.13.3
bdprdsn0002   Ready    <none>   19h   v1.13.3
bdprdsn0003   Ready    <none>   19h   v1.13.3
bdprdsn0004   Ready    <none>   19h   v1.13.3

Topology file:

{
  "clusters": [ {
      "nodes": [ {
          "node": {
            "hostnames": {
              "manage": [ "bdprdsn0002" ],
              "storage": [  "10.30.225.38" ]
            },
            "zone": 1
          },
          "devices": [ "/dev/sdc" ]
        }, {
          "node": {
            "hostnames": {
              "manage": [ "bdprdsn0003" ],
              "storage": [  "10.30.225.63" ]
            },
            "zone": 1
          },
          "devices": [ "/dev/sdc" ]
        }, {
          "node": {
            "hostnames": {
              "manage": [ "bdprdsn0004" ],
              "storage": [ "10.30.225.90" ]
            },
            "zone": 1
          },
          "devices": [ "/dev/sdc" ]
        }]} ]}

/dev/sdc is a 1 TB HDD: no PVs, unformatted.
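
To double-check that each node's /dev/sdc really is bare, something like the following (a sketch; wipefs without -a only reports signatures, it does not erase anything):

lsblk /dev/sdc             # expected: 1 TB, no partitions
wipefs /dev/sdc            # list any leftover filesystem/LVM signatures
pvs 2>/dev/null | grep sdc || echo "no LVM PV on /dev/sdc"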

Edit: firewalld disabled, SELinux disabled, swap disabled.

thank you

BR, Anung

ananbas commented 5 years ago

Heketi pod log:

[heketi] INFO 2019/02/24 06:27:40 Starting Node Health Status refresh
[heketi] INFO 2019/02/24 06:27:40 Cleaned 0 nodes from health cache
[negroni] Started GET /clusters
[negroni] Completed 200 OK in 176.711µs
[negroni] Started POST /clusters
[negroni] Completed 201 Created in 643.227µs
[negroni] Started POST /nodes
[cmdexec] INFO 2019/02/24 06:29:33 Check Glusterd service status in node bdprdsn0002
[heketi] INFO 2019/02/24 06:29:40 Starting Node Health Status refresh
[heketi] INFO 2019/02/24 06:29:40 Cleaned 0 nodes from health cache
[negroni] Completed 400 Bad Request in 30.001699578s
[kubeexec] ERROR 2019/02/24 06:30:03 heketi/pkg/remoteexec/kube/target.go:134:kube.TargetDaemonSet.GetTargetPod: Get https://10.96.0.1:443/api/v1/namespaces/default/pods?labelSelector=glusterfs-node: dial tcp 10.96.0.1:443: i/o timeout
[kubeexec] ERROR 2019/02/24 06:30:03 heketi/pkg/remoteexec/kube/target.go:135:kube.TargetDaemonSet.GetTargetPod: Failed to get list of pods
[cmdexec] ERROR 2019/02/24 06:30:03 heketi/executors/cmdexec/peer.go:81:cmdexec.(*CmdExecutor).GlusterdCheck: Failed to get list of pods
[heketi] ERROR 2019/02/24 06:30:03 heketi/apps/glusterfs/app_node.go:107:glusterfs.(*App).NodeAdd: Failed to get list of pods
[heketi] ERROR 2019/02/24 06:30:03 heketi/apps/glusterfs/app_node.go:108:glusterfs.(*App).NodeAdd: New Node doesn't have glusterd running
[negroni] Started POST /nodes
[cmdexec] INFO 2019/02/24 06:30:03 Check Glusterd service status in node bdprdsn0003
[negroni] Completed 400 Bad Request in 30.001886914s
[kubeexec] ERROR 2019/02/24 06:30:33 heketi/pkg/remoteexec/kube/target.go:134:kube.TargetDaemonSet.GetTargetPod: Get https://10.96.0.1:443/api/v1/namespaces/default/pods?labelSelector=glusterfs-node: dial tcp 10.96.0.1:443: i/o timeout
[kubeexec] ERROR 2019/02/24 06:30:33 heketi/pkg/remoteexec/kube/target.go:135:kube.TargetDaemonSet.GetTargetPod: Failed to get list of pods
[cmdexec] ERROR 2019/02/24 06:30:33 heketi/executors/cmdexec/peer.go:81:cmdexec.(*CmdExecutor).GlusterdCheck: Failed to get list of pods
[heketi] ERROR 2019/02/24 06:30:33 heketi/apps/glusterfs/app_node.go:107:glusterfs.(*App).NodeAdd: Failed to get list of pods
[heketi] ERROR 2019/02/24 06:30:33 heketi/apps/glusterfs/app_node.go:108:glusterfs.(*App).NodeAdd: New Node doesn't have glusterd running
[negroni] Started POST /nodes
[cmdexec] INFO 2019/02/24 06:30:33 Check Glusterd service status in node bdprdsn0004
[kubeexec] ERROR 2019/02/24 06:31:03 heketi/pkg/remoteexec/kube/target.go:134:kube.TargetDaemonSet.GetTargetPod: Get https://10.96.0.1:443/api/v1/namespaces/default/pods?labelSelector=glusterfs-node: dial tcp 10.96.0.1:443: i/o timeout
[kubeexec] ERROR 2019/02/24 06:31:03 heketi/pkg/remoteexec/kube/target.go:135:kube.TargetDaemonSet.GetTargetPod: Failed to get list of pods
[cmdexec] ERROR 2019/02/24 06:31:03 heketi/executors/cmdexec/peer.go:81:cmdexec.(*CmdExecutor).GlusterdCheck: Failed to get list of pods
[heketi] ERROR 2019/02/24 06:31:03 heketi/apps/glusterfs/app_node.go:107:glusterfs.(*App).NodeAdd: Failed to get list of pods
[heketi] ERROR 2019/02/24 06:31:03 heketi/apps/glusterfs/app_node.go:108:glusterfs.(*App).NodeAdd: New Node doesn't have glusterd running

Trying curl:

curl https://10.96.0.1:443/api/v1/namespaces/default/pods?labelSelector=glusterfs-node -k
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "pods is forbidden: User \"system:anonymous\" cannot list resource \"pods\" in API group \"\" in the namespace \"default\"",
  "reason": "Forbidden",
  "details": {
    "kind": "pods"
  },
  "code": 403
}

Hmm, I think it is a permission error?
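
One way to separate an RBAC problem from a network one is to repeat the request from inside the heketi pod with its mounted service account token (a sketch, assuming curl and a shell are available in the image and the default token mount path):

/bin/kubectl -n default exec -i deploy-heketi-5f6c465bb8-fk6pt -- sh -c '
  TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
  # a 403 here would point at RBAC; a hang/timeout matches the i/o timeout in the heketi log
  curl -ksS -m 10 -H "Authorization: Bearer $TOKEN" \
    "https://10.96.0.1:443/api/v1/namespaces/default/pods?labelSelector=glusterfs-node"
'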

ananbas commented 5 years ago

Now I get a different error: Error: Failed to allocate new volume: No online storage devices in cluster


Checking status of pods matching '--selector=deploy-heketi=pod':
deploy-heketi-5f6c465bb8-dcr9l   1/1   Running   0     16s
OK
Determining heketi service URL ... OK
/bin/kubectl -n default exec -i deploy-heketi-5f6c465bb8-dcr9l -- heketi-cli -s http://localhost:8080 --user admin --secret '' topology load --json=/etc/heketi/topology.json 2>&1
Creating cluster ... ID: 833a3f14267b9d2708a1a36aee21752f
Allowing file volumes on cluster.
Allowing block volumes on cluster.
Creating node bdprdsn0002 ... ID: 84ee287304ae70226d6c068ad5b0021a
Adding device /dev/sdc ... return 0
heketi topology loaded.
/bin/kubectl -n default exec -i deploy-heketi-5f6c465bb8-dcr9l -- heketi-cli -s http://localhost:8080 --user admin --secret '' setup-openshift-heketi-storage --listfile=/tmp/heketi-storage.json  2>&1
Error: Failed to allocate new volume: No online storage devices in cluster
command terminated with exit code 255
Failed on setup openshift heketi storage
This may indicate that the storage must be wiped and the GlusterFS nodes must be reset.
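
When the message changes to "No online storage devices in cluster", it can help to check whether heketi actually recorded the devices and marked them online (a sketch with the same pod; heketi-cli node info shows each node's devices with their state and sizes):

/bin/kubectl -n default exec -i deploy-heketi-5f6c465bb8-dcr9l -- \
  heketi-cli -s http://localhost:8080 --user admin --secret '' node list
# then, with a node ID from the list:
/bin/kubectl -n default exec -i deploy-heketi-5f6c465bb8-dcr9l -- \
  heketi-cli -s http://localhost:8080 --user admin --secret '' node info <node-id>
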
ananbas commented 5 years ago

Executing the topology load manually on the master node gives too many open files from the heketi pod:

[root@bdprdmn0003 deploy]# heketi-cli topology load --json=topology.json
    Found node bdprdsn0002 on cluster 833a3f14267b9d2708a1a36aee21752f
        Adding device /dev/sdc ... Unable to add device: WARNING: Device /dev/sdc not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vg-root/lv_root not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/sda1 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vg-root/lv_usr not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/sda2 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vg_data/lv_opt not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/sda3 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vg_data/lv_home not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/sda5 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/sda6 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vg-root/lv_tmp not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vg-root/lv_var not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/sdb1 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/sdc not initialized in udev database even after waiting 10000000 microseconds.
  Can't initialize physical volume "/dev/sdc" of volume group "vg_f69a0ab1d705f0a84b648f25d5018809" without -ff
  /dev/sdc: physical volume not initialized.
    Creating node bdprdsn0003 ... ID: add9a6ab175d2b4f292d8c2675b84b31
        Adding device /dev/sdc ... OK
    Creating node bdprdsn0004 ... ID: bde73d725835bf12c3284e9d98c6ef08
        Adding device /dev/sdc ... Unable to add device: Get http://10.100.30.141:8080/queue/971117d58b46fc72ddc39a8d0a1f3681: dial tcp 10.100.30.141:8080: socket: too many open files

Resuming ./gk-deploy -gv gives another error, on setup-openshift-heketi-storage:

Found node bdprdsn0002 on cluster 833a3f14267b9d2708a1a36aee21752f
Found device /dev/sdc
Found node bdprdsn0003 on cluster 833a3f14267b9d2708a1a36aee21752f
Found device /dev/sdc
Found node bdprdsn0004 on cluster 833a3f14267b9d2708a1a36aee21752f
Found device /dev/sdc
heketi topology loaded.
/bin/kubectl -n default exec -i deploy-heketi-5f6c465bb8-dcr9l -- heketi-cli -s http://localhost:8080 --user admin --secret '' setup-openshift-heketi-storage --listfile=/tmp/heketi-storage.json  2>&1
/bin/kubectl -n default exec -i deploy-heketi-5f6c465bb8-dcr9l -- cat /tmp/heketi-storage.json | /bin/kubectl -n default create -f - 2>&1
cat: /tmp/heketi-storage.json: No such file or directory
command terminated with exit code 1
error: no objects passed to create
Failed on creating heketi storage resources.

just wow
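
The "Can't initialize physical volume \"/dev/sdc\" ... without -ff" warning above means the disk still carries LVM metadata from an earlier attempt, which is what the deploy script's "storage must be wiped" hint refers to. A sketch of a full reset between attempts (destructive; it assumes /dev/sdc holds nothing worth keeping and that /var/lib/glusterd and /var/lib/heketi are the host paths the daemonset bind-mounts):

./gk-deploy -gv --abort
# then, on each storage node:
pvremove -ff -y /dev/sdc 2>/dev/null || true    # drop any stale LVM metadata
wipefs -a /dev/sdc                              # clear remaining signatures
rm -rf /var/lib/glusterd/* /var/lib/heketi/*    # reset glusterd/heketi state on the host (paths assumed from the stock daemonset mounts)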

kun-qian commented 5 years ago

Got any answer for that?

charanrajt commented 5 years ago

I have a similar error. Any ideas?

/usr/local/bin/kubectl -n default exec -i deploy-heketi-5f6c465bb8-d4j7n -- heketi-cli -s http://localhost:8080 --user admin --secret '' topology load --json=/etc/heketi/topology.json 2>&1
Found node ip-10-44-10-51.us-west-1.compute.internal on cluster 9dfb7ecab0b39aae0e0e6b12c235a763
Adding device /dev/xvdf ... return 0
heketi topology loaded.
/usr/local/bin/kubectl -n default exec -i deploy-heketi-5f6c465bb8-d4j7n -- heketi-cli -s http://localhost:8080 --user admin --secret '' setup-openshift-heketi-storage --listfile=/tmp/heketi-storage.json  2>&1
/usr/local/bin/kubectl -n default exec -i deploy-heketi-5f6c465bb8-d4j7n -- cat /tmp/heketi-storage.json | /usr/local/bin/kubectl -n default create -f - 2>&1
cat: /tmp/heketi-storage.json: No such file or directory
command terminated with exit code 1
error: no objects passed to create
Failed on creating heketi storage resources.
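
The cat error just means the setup-openshift-heketi-storage step right before it never wrote /tmp/heketi-storage.json. Re-running that step by hand and tailing the heketi log usually surfaces the underlying failure (a sketch reusing the pod name from the output above):

/usr/local/bin/kubectl -n default exec -i deploy-heketi-5f6c465bb8-d4j7n -- \
  heketi-cli -s http://localhost:8080 --user admin --secret '' \
  setup-openshift-heketi-storage --listfile=/tmp/heketi-storage.json
/usr/local/bin/kubectl -n default logs deploy-heketi-5f6c465bb8-d4j7n --tail=50
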
lucian521 commented 5 years ago

@ananbas I also encountered this problem. How did you solve it?

ananbas commented 5 years ago

Hi Lucian,

I gave up on this and used OpenEBS instead.

BR,


lucian521 commented 5 years ago

@ananbas Hi, the reason for this problem is that the pod is not connected to the service network, so the pod cannot even reach the API server. When starting the virtual machines, the nodes should be started first, and the masters only after the nodes are up. The relevant heketi logs:

[kubeexec] ERROR 2019/02/24 06:30:33 heketi/pkg/remoteexec/kube/target.go:134:kube.TargetDaemonSet.GetTargetPod: Get https://10.96.0.1:443/api/v1/namespaces/default/pods?labelSelector=glusterfs-node: dial tcp 10.96.0.1:443: i/o timeout
[kubeexec] ERROR 2019/02/24 06:30:33 heketi/pkg/remoteexec/kube/target.go:135:kube.TargetDaemonSet.GetTargetPod: Failed to get list of pods
[cmdexec] ERROR 2019/02/24 06:30:33 heketi/executors/cmdexec/peer.go:81:cmdexec.(*CmdExecutor).GlusterdCheck: Failed to get list of pods
[heketi] ERROR 2019/02/24 06:30:33 heketi/apps/glusterfs/app_node.go:107:glusterfs.(*App).NodeAdd: Failed to get list of pods
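
A sketch of quick checks for that symptom, i.e. pods unable to reach the 10.96.0.1 service VIP (kube-proxy and the CNI pods have to be healthy on every node for that VIP to route):

kubectl get svc kubernetes                  # the ClusterIP service behind 10.96.0.1
kubectl get endpoints kubernetes            # should list the real apiserver address(es)
kubectl -n kube-system get pods -o wide | grep -E 'kube-proxy|flannel|calico|weave'   # whichever CNI is in use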