angapov opened this issue 4 years ago
I've added one disk to each OpenShift worker (total 3 nodes) and recreated the StorageCluster with the following spec: https://pastebin.com/raw/X2RajT8R
Pod status is like that:
# oc -n kube-system get pod
NAME READY STATUS RESTARTS AGE
autopilot-94dc45dbf-g8r95 1/1 Running 0 31m
portworx-api-b46gc 1/1 Running 0 31m
portworx-api-f8kk7 1/1 Running 0 31m
portworx-api-mn8rq 1/1 Running 0 31m
portworx-operator-8467647f7f-2w8r7 1/1 Running 1 9d
px-cluster-828b1a05-8020-4b73-9e39-6f6be66a8abc-gjtqf 1/2 Running 0 4m55s
px-cluster-828b1a05-8020-4b73-9e39-6f6be66a8abc-jbkk4 1/2 Running 0 4m55s
px-cluster-828b1a05-8020-4b73-9e39-6f6be66a8abc-lxrw9 1/2 Running 0 4m55s
px-csi-ext-7444d9b4fc-g7rk5 3/3 Running 0 31m
px-csi-ext-7444d9b4fc-rfjfk 3/3 Running 0 31m
px-csi-ext-7444d9b4fc-vz9ps 3/3 Running 0 31m
px-lighthouse-68dcd48944-bjxkp 0/3 Init:CrashLoopBackOff 7 31m
stork-6f8fc7b967-2ffwq 1/1 Running 0 31m
stork-6f8fc7b967-5xw6b 1/1 Running 0 31m
stork-6f8fc7b967-cwhlk 1/1 Running 0 31m
stork-scheduler-6847c58d8d-9vjtz 1/1 Running 0 31m
stork-scheduler-6847c58d8d-kfj6z 1/1 Running 0 31m
stork-scheduler-6847c58d8d-nfhhw 1/1 Running 0 31m
Logs of px-cluster pod: https://pastebin.com/raw/vTGVqGcQ
Something is definitely going wrong. Can you help me?
@angapov - can you share the output of lsblk and blkid from all the worker nodes?
[core@worker-0 ~]$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 120G 0 disk
├─sda1 8:1 0 384M 0 part /boot
├─sda2 8:2 0 127M 0 part /boot/efi
├─sda3 8:3 0 1M 0 part
└─sda4 8:4 0 119.5G 0 part
└─coreos-luks-root-nocrypt 253:0 0 119.5G 0 dm /sysroot
sdb 8:16 0 50G 0 disk
[core@worker-0 ~]$ blkid
/dev/mapper/coreos-luks-root-nocrypt: LABEL="root" UUID="9599ed34-a678-4e04-9bda-675bc2e8ba7b" TYPE="xfs"
[core@worker-1 ~]$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 120G 0 disk
├─sda1 8:1 0 384M 0 part /boot
├─sda2 8:2 0 127M 0 part /boot/efi
├─sda3 8:3 0 1M 0 part
└─sda4 8:4 0 119.5G 0 part
└─coreos-luks-root-nocrypt 253:0 0 119.5G 0 dm /sysroot
sdb 8:16 0 50G 0 disk
[core@worker-1 ~]$ blkid
/dev/mapper/coreos-luks-root-nocrypt: LABEL="root" UUID="9599ed34-a678-4e04-9bda-675bc2e8ba7b" TYPE="xfs"
[core@worker-2 ~]$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 120G 0 disk
├─sda1 8:1 0 384M 0 part /boot
├─sda2 8:2 0 127M 0 part /boot/efi
├─sda3 8:3 0 1M 0 part
└─sda4 8:4 0 119.5G 0 part
└─coreos-luks-root-nocrypt 253:0 0 119.5G 0 dm /sysroot
sdb 8:16 0 50G 0 disk
[core@worker-2 ~]$ blkid
/dev/mapper/coreos-luks-root-nocrypt: LABEL="root" UUID="9599ed34-a678-4e04-9bda-675bc2e8ba7b" TYPE="xfs"
@angapov It looks like your nodes have restarted enough times that Portworx is running out of node indexes. Can you destroy your Portworx cluster and re-create it? If it fails, can you paste the logs again?
There are instructions in the Portworx operator description on OpenShift about how to cleanly uninstall. Basically, add a deleteStrategy to your StorageCluster and then delete it:
spec:
  deleteStrategy:
    type: UninstallAndWipe
Also, operator 1.2 automatically uses the 17001-17020 port range for Portworx when running on OpenShift (to avoid the port conflict introduced in OpenShift 4.3). My guess is that you previously had an older version of the operator, which tried to run on port 9001. In your latest logs, it seems to be using port 17001.
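If you want to pin the port range explicitly rather than rely on the operator's OpenShift default, the StorageCluster spec has a startPort field. A minimal sketch (the apiVersion shown may differ by operator version, so verify against your installed CRD):

```yaml
# Hypothetical fragment: pins Portworx to the 17001+ port range explicitly.
# Verify apiVersion and field names against the CRD your operator version installs.
apiVersion: core.libopenstorage.org/v1
kind: StorageCluster
metadata:
  name: px-cluster
  namespace: kube-system
spec:
  startPort: 17001
```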
@piyush-nimbalkar you are right, I added the deleteStrategy, recreated the StorageCluster, and it worked nicely. Thank you very much!
Now the cluster is running using dynamically provisioned VMDK volumes. However, I noticed that I have 3 worker nodes but only 2 drive sets.
[root@jumphost ~]# oc -n kube-system exec px-cluster-5s82k -- /opt/pwx/bin/pxctl clouddrive list
Defaulting container name to portworx.
Use 'oc describe pod/px-cluster-5s82k -n kube-system' to see all of the containers in this pod.
Cloud Drives Summary
Number of nodes in the cluster: 3
Number of drive sets in use: 2
List of storage nodes: [90cbbeaf-6157-4b52-bf44-7622f6e08c2f 9e17d9e7-c81f-43d7-ac4f-653c9d8b4a71]
List of storage less nodes: [7b83b83d-7cbb-486f-812b-07609044c096]
Drive Set List
NodeIndex NodeID InstanceID Zone State Drive IDs
2 7b83b83d-7cbb-486f-812b-07609044c096 422d8683-19a6-c332-55ef-daa2210fd7d2 default In Use -
0 90cbbeaf-6157-4b52-bf44-7622f6e08c2f 422dc8c7-946f-ab50-7d13-df7059440f84 default In Use [datastore-10] osd-provisioned-disks/PX-DO-NOT-DELETE-e5051cb5-d323-4a80-ab97-1e880827ccb0.vmdk(data)
1 9e17d9e7-c81f-43d7-ac4f-653c9d8b4a71 422d9bcb-22d2-d373-f25d-1495f26cbe50 default In Use [datastore-34] osd-provisioned-disks/PX-DO-NOT-DELETE-8881c670-1639-4cad-90df-5c932536823b.vmdk(data)
Do you know how I can add a disk to the storageless node, again using dynamic VMDK provisioning?
I tried expanding the pool like this but it gave an error:
[root@jumphost ~]# oc -n kube-system exec px-cluster-5s82k -- /opt/pwx/bin/pxctl service pool expand -s 100 -u 919993f4-1034-47dd-a60a-feafad8c39c6 -o add-disk
Defaulting container name to portworx.
Use 'oc describe pod/px-cluster-5s82k -n kube-system' to see all of the containers in this pod.
Request to expand pool: 919993f4-1034-47dd-a60a-feafad8c39c6 to size: 100 using operation: add-disk
service pool expand: resizing pool with an auto journal device is not supported
command terminated with exit code 1
@angapov - Can you share the logs from this node? We need to see why it was not able to create/attach the disk to this node.
2 7b83b83d-7cbb-486f-812b-07609044c096 422d8683-19a6-c332-55ef-daa2210fd7d2
You can get that information from pxctl status.
Looks like you have the journal configured on the data disk; from the previous logs I see -j auto is specified. During installation you can instead request a 3 GB journal partition, which will be on a different disk than your data.
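For a vSphere cloud-drive install, the journal can be requested as its own small cloud drive instead of -j auto. A sketch of the relevant StorageCluster fragment (field names follow the operator's cloudStorage spec; the exact vSphere disk type string is an assumption to verify against your datastore type):

```yaml
# Hypothetical fragment: separate 3 GB journal drive instead of -j auto.
spec:
  cloudStorage:
    deviceSpecs:
    - type=lazyzeroedthick,size=50          # data drive, provisioned as a VMDK
    journalDeviceSpec: type=lazyzeroedthick,size=3   # dedicated journal drive
```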
@sanjaynaikwadi here are the logs: https://pastebin.com/raw/5xPJRFqV
I am interested in installing Portworx on OCP 4.3 running on VMware vSphere 6.5 with dynamic VMDK provisioning using the Operator. I tried https://central.portworx.com/specGen but there is no option to specify the VMware VMDK backend. It gives me errors like this:
PX version is 2.3.6; operator version is 1.2, which is the default for OpenShift 4.3, installed from OperatorHub.
I tried https://docs.portworx.com/cloud-references/auto-disk-provisioning/vsphere/ but the cluster failed to initialize due to a port 9001 conflict with the OpenShift oauth-proxy.
Is there any instruction on how to do that?
A little background: currently I have three bare-metal hosts running ESXi 6.5 with SSD drives and no shared storage. Each SSD is an independent datastore for its ESXi host. I've installed vanilla OpenShift 4.3 with dynamic VMware PV provisioning.
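For reference, the vSphere cloud-drive setup generally needs vCenter credentials in a secret plus connection details as environment variables on the StorageCluster. A hedged sketch (the px-vsphere-secret name and VSPHERE_* variable names follow the Portworx vSphere docs; all values here are placeholders to replace with your own):

```yaml
# Placeholder credentials secret for vSphere cloud drives.
apiVersion: v1
kind: Secret
metadata:
  name: px-vsphere-secret
  namespace: kube-system
type: Opaque
stringData:
  VSPHERE_USER: "portworx@vsphere.local"   # placeholder
  VSPHERE_PASSWORD: "<password>"           # placeholder
---
# StorageCluster env fragment pointing Portworx at vCenter
# (apiVersion/kind omitted; merge into your existing spec).
spec:
  env:
  - name: VSPHERE_VCENTER
    value: "vcenter.example.local"         # placeholder
  - name: VSPHERE_VCENTER_PORT
    value: "443"
  - name: VSPHERE_DATASTORE_PREFIX
    value: "datastore-"                    # matches datastore-10 / datastore-34 above
  - name: VSPHERE_INSECURE
    value: "true"
  - name: VSPHERE_USER
    valueFrom:
      secretKeyRef:
        name: px-vsphere-secret
        key: VSPHERE_USER
  - name: VSPHERE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: px-vsphere-secret
        key: VSPHERE_PASSWORD
```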