It's possible to switch to host networking on a running cluster. The CephCluster CR needs to be updated to set spec.network.provider: host (on older versions, spec.network.hostNetwork: true).
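For illustration, a minimal sketch of the relevant CR fields, assuming the resource name and namespace from this thread (only the network section matters here):

```yaml
# Hypothetical excerpt of the CephCluster CR from this thread.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  network:
    provider: host   # switch the ceph daemons to host networking
    # On older Rook releases the equivalent field was:
    # hostNetwork: true
```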
Once the above change is made, all the ceph daemons (except the mons) will restart and use the host network. For the mons you need to fail them over manually; you can refer to these steps for mon failover.
There is a PR in review to automate the mon failover when switching to host network, but that will be for 1.12 and I don't think it will be backported to 1.7, so I suggest using the latest Rook Ceph version.
In my test cluster, I configured one mon. When I execute the command kubectl scale deployment rook-ceph-mon-a --replicas=0 -n rook-ceph and set the timeout to 0, no mons are restored. Have you ever encountered this situation?
Can you share the following details:
- kubectl get cephclusters.ceph.rook.io rook-ceph -o yaml
- kubectl get pods -n rook-ceph
- ceph status output from the toolbox pod

Before modifying:
```yaml
apiVersion: v1
items:
- apiVersion: ceph.rook.io/v1
  kind: CephCluster
  metadata:
    creationTimestamp: "2023-09-22T04:17:47Z"
    finalizers:
    - cephcluster.ceph.rook.io
    generation: 2
    managedFields:
    - apiVersion: ceph.rook.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:cephVersion:
            .: {}
            f:image: {}
          f:cleanupPolicy:
            .: {}
            f:sanitizeDisks:
              .: {}
              f:dataSource: {}
              f:iteration: {}
              f:method: {}
          f:crashCollector: {}
          f:dashboard:
            .: {}
            f:enabled: {}
            f:ssl: {}
          f:dataDirHostPath: {}
          f:disruptionManagement:
            .: {}
            f:machineDisruptionBudgetNamespace: {}
            f:managePodBudgets: {}
            f:osdMaintenanceTimeout: {}
          f:healthCheck:
            .: {}
            f:daemonHealth:
              .: {}
              f:mon:
                .: {}
                f:interval: {}
              f:osd: {}
              f:status: {}
            f:livenessProbe:
              .: {}
              f:mgr: {}
              f:mon: {}
              f:osd: {}
            f:startupProbe:
              .: {}
              f:mgr: {}
              f:mon: {}
              f:osd: {}
          f:mgr:
            .: {}
            f:count: {}
            f:modules: {}
          f:mon:
            .: {}
            f:count: {}
          f:monitoring: {}
          f:network:
            .: {}
            f:connections:
              .: {}
              f:compression: {}
              f:encryption: {}
            f:provider: {}
            f:selectors:
              .: {}
              f:cluster: {}
              f:public: {}
          f:placement:
            .: {}
            f:all:
              .: {}
              f:nodeAffinity:
                .: {}
                f:requiredDuringSchedulingIgnoredDuringExecution:
                  .: {}
                  f:nodeSelectorTerms: {}
            f:mgr:
              .: {}
              f:tolerations: {}
            f:mon:
              .: {}
              f:tolerations: {}
            f:osd:
              .: {}
              f:tolerations: {}
          f:priorityClassNames:
            .: {}
            f:mgr: {}
            f:mon: {}
            f:osd: {}
          f:removeOSDsIfOutAndSafeToRemove: {}
          f:storage:
            .: {}
            f:config:
              .: {}
              f:storeType: {}
            f:useAllDevices: {}
          f:waitTimeoutForHealthyOSDInMinutes: {}
      manager: kubectl-create
      operation: Update
      time: "2023-09-22T04:17:47Z"
    - apiVersion: ceph.rook.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers: {}
        f:spec:
          f:external: {}
          f:healthCheck:
            f:daemonHealth:
              f:osd:
                f:interval: {}
              f:status:
                f:interval: {}
          f:logCollector: {}
          f:security:
            .: {}
            f:kms: {}
          f:storage:
            f:nodes: {}
        f:status:
          .: {}
          f:ceph:
            .: {}
            f:capacity:
              .: {}
              f:bytesAvailable: {}
              f:bytesTotal: {}
              f:bytesUsed: {}
              f:lastUpdated: {}
            f:fsid: {}
            f:health: {}
            f:lastChanged: {}
            f:lastChecked: {}
            f:previousHealth: {}
            f:versions:
              .: {}
              f:mgr:
                .: {}
                f:ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable): {}
              f:mon:
                .: {}
                f:ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable): {}
              f:osd:
                .: {}
                f:ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable): {}
              f:overall:
                .: {}
                f:ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable): {}
          f:conditions: {}
          f:message: {}
          f:observedGeneration: {}
          f:phase: {}
          f:state: {}
          f:storage:
            .: {}
            f:deviceClasses: {}
          f:version:
            .: {}
            f:image: {}
            f:version: {}
      manager: rook
      operation: Update
      time: "2023-09-22T04:34:24Z"
    name: rook-ceph
    namespace: rook-ceph
    resourceVersion: "3916613"
    selfLink: /apis/ceph.rook.io/v1/namespaces/rook-ceph/cephclusters/rook-ceph
    uid: 17dd4990-fb78-499f-a77c-7a577ac52376
  spec:
    cephVersion:
      image: quay.io/ceph/ceph:v17.2.0
    cleanupPolicy:
      sanitizeDisks:
        dataSource: zero
        iteration: 1
        method: quick
    crashCollector: {}
    dashboard:
      enabled: true
      ssl: true
    dataDirHostPath: /var/lib/rook
    disruptionManagement:
      machineDisruptionBudgetNamespace: openshift-machine-api
      managePodBudgets: true
      osdMaintenanceTimeout: 30
    external: {}
    healthCheck:
      daemonHealth:
        mon:
          interval: 45s
        osd:
          interval: 1m0s
        status:
          interval: 1m0s
      livenessProbe:
        mgr: {}
        mon: {}
        osd: {}
      startupProbe:
        mgr: {}
        mon: {}
        osd: {}
    logCollector: {}
    mgr:
      count: 1
      modules:
      - enabled: true
        name: pg_autoscaler
    mon:
      count: 1
    monitoring: {}
    network:
      connections:
        compression: {}
        encryption: {}
      provider: multus
      selectors:
        cluster: rook-ceph/rook-cluster-nad
        public: rook-ceph/rook-public-nad
    placement:
      all:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: role
                operator: In
                values:
                - storage-node
      mgr:
        tolerations:
        - key: storage-node
          operator: Exists
      mon:
        tolerations:
        - key: storage-node
          operator: Exists
      osd:
        tolerations:
        - key: storage-node
          operator: Exists
    priorityClassNames:
      mgr: system-cluster-critical
      mon: system-node-critical
      osd: system-node-critical
    removeOSDsIfOutAndSafeToRemove: true
    security:
      kms: {}
    storage:
      config:
        storeType: bluestore
      nodes:
      - config:
          metadataDevice: sda
        devices:
        - name: sdb
        - name: sdc
        - name: sdd
        - name: sde
        - name: sdf
        - name: sdg
        - name: sdh
        - name: sdi
        - name: sdj
        - name: sdk
        - name: sdm
        - name: sdl
        name: dell-34
        resources: {}
      useAllDevices: false
    waitTimeoutForHealthyOSDInMinutes: 10
  status:
    ceph:
      capacity:
        bytesAvailable: 192010751717376
        bytesTotal: 208011641389056
        bytesUsed: 16000889671680
        lastUpdated: "2023-09-22T04:35:24Z"
      fsid: 0fd74bc0-1e01-4c48-8400-fd05eda63089
      health: HEALTH_OK
      lastChanged: "2023-09-22T04:34:24Z"
      lastChecked: "2023-09-22T04:35:24Z"
      previousHealth: HEALTH_WARN
      versions:
        mgr:
          ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable): 1
        mon:
          ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable): 1
        osd:
          ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable): 12
        overall:
          ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable): 14
    conditions:
    - lastHeartbeatTime: "2023-09-22T04:35:25Z"
      lastTransitionTime: "2023-09-22T04:33:22Z"
      message: Cluster created successfully
      reason: ClusterCreated
      status: "True"
      type: Ready
    message: Cluster created successfully
    observedGeneration: 2
    phase: Ready
    state: Created
    storage:
      deviceClasses:
      - name: hdd
    version:
      image: quay.io/ceph/ceph:v17.2.0
      version: 17.2.0-0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```
After modifying:
```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  creationTimestamp: "2023-09-22T04:17:47Z"
  finalizers:
  - cephcluster.ceph.rook.io
  generation: 3
  managedFields:
  - apiVersion: ceph.rook.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:cephVersion:
          .: {}
          f:image: {}
        f:cleanupPolicy:
          .: {}
          f:sanitizeDisks:
            .: {}
            f:dataSource: {}
            f:iteration: {}
            f:method: {}
        f:crashCollector: {}
        f:dashboard:
          .: {}
          f:enabled: {}
          f:ssl: {}
        f:dataDirHostPath: {}
        f:disruptionManagement:
          .: {}
          f:machineDisruptionBudgetNamespace: {}
          f:managePodBudgets: {}
          f:osdMaintenanceTimeout: {}
        f:healthCheck:
          .: {}
          f:daemonHealth:
            .: {}
            f:mon:
              .: {}
              f:interval: {}
            f:osd: {}
            f:status: {}
          f:livenessProbe:
            .: {}
            f:mgr: {}
            f:mon: {}
            f:osd: {}
          f:startupProbe:
            .: {}
            f:mgr: {}
            f:mon: {}
            f:osd: {}
        f:mgr:
          .: {}
          f:count: {}
          f:modules: {}
        f:mon:
          .: {}
          f:count: {}
        f:monitoring: {}
        f:network:
          .: {}
          f:connections:
            .: {}
            f:compression: {}
            f:encryption: {}
        f:placement:
          .: {}
          f:all:
            .: {}
            f:nodeAffinity:
              .: {}
              f:requiredDuringSchedulingIgnoredDuringExecution:
                .: {}
                f:nodeSelectorTerms: {}
          f:mgr:
            .: {}
            f:tolerations: {}
          f:mon:
            .: {}
            f:tolerations: {}
          f:osd:
            .: {}
            f:tolerations: {}
        f:priorityClassNames:
          .: {}
          f:mgr: {}
          f:mon: {}
          f:osd: {}
        f:removeOSDsIfOutAndSafeToRemove: {}
        f:storage:
          .: {}
          f:config:
            .: {}
            f:storeType: {}
          f:useAllDevices: {}
        f:waitTimeoutForHealthyOSDInMinutes: {}
    manager: kubectl-create
    operation: Update
    time: "2023-09-22T04:17:47Z"
  - apiVersion: ceph.rook.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers: {}
      f:spec:
        f:external: {}
        f:healthCheck:
          f:daemonHealth:
            f:osd:
              f:interval: {}
            f:status:
              f:interval: {}
        f:logCollector: {}
        f:security:
          .: {}
          f:kms: {}
        f:storage:
          f:nodes: {}
      f:status:
        .: {}
        f:ceph:
          .: {}
          f:capacity:
            .: {}
            f:bytesAvailable: {}
            f:bytesTotal: {}
            f:bytesUsed: {}
            f:lastUpdated: {}
          f:fsid: {}
          f:health: {}
          f:lastChanged: {}
          f:lastChecked: {}
          f:previousHealth: {}
          f:versions:
            .: {}
            f:mgr:
              .: {}
              f:ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable): {}
            f:mon:
              .: {}
              f:ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable): {}
            f:osd:
              .: {}
              f:ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable): {}
            f:overall:
              .: {}
              f:ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable): {}
        f:conditions: {}
        f:message: {}
        f:observedGeneration: {}
        f:phase: {}
        f:state: {}
        f:storage:
          .: {}
          f:deviceClasses: {}
        f:version:
          .: {}
          f:image: {}
          f:version: {}
    manager: rook
    operation: Update
    time: "2023-09-22T04:34:24Z"
  - apiVersion: ceph.rook.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:healthCheck:
          f:daemonHealth:
            f:mon:
              f:timeout: {}
        f:network:
          f:provider: {}
    manager: kubectl-edit
    operation: Update
    time: "2023-09-22T04:39:59Z"
  name: rook-ceph
  namespace: rook-ceph
  resourceVersion: "3917817"
  selfLink: /apis/ceph.rook.io/v1/namespaces/rook-ceph/cephclusters/rook-ceph
  uid: 17dd4990-fb78-499f-a77c-7a577ac52376
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v17.2.0
  cleanupPolicy:
    sanitizeDisks:
      dataSource: zero
      iteration: 1
      method: quick
  crashCollector: {}
  dashboard:
    enabled: true
    ssl: true
  dataDirHostPath: /var/lib/rook
  disruptionManagement:
    machineDisruptionBudgetNamespace: openshift-machine-api
    managePodBudgets: true
    osdMaintenanceTimeout: 30
  external: {}
  healthCheck:
    daemonHealth:
      mon:
        interval: 45s
        timeout: 0s
      osd:
        interval: 1m0s
      status:
        interval: 1m0s
    livenessProbe:
      mgr: {}
      mon: {}
      osd: {}
    startupProbe:
      mgr: {}
      mon: {}
      osd: {}
  logCollector: {}
  mgr:
    count: 1
    modules:
    - enabled: true
      name: pg_autoscaler
  mon:
    count: 1
  monitoring: {}
  network:
    connections:
      compression: {}
      encryption: {}
    provider: host
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: role
              operator: In
              values:
              - storage-node
    mgr:
      tolerations:
      - key: storage-node
        operator: Exists
    mon:
      tolerations:
      - key: storage-node
        operator: Exists
    osd:
      tolerations:
      - key: storage-node
        operator: Exists
  priorityClassNames:
    mgr: system-cluster-critical
    mon: system-node-critical
    osd: system-node-critical
  removeOSDsIfOutAndSafeToRemove: true
  security:
    kms: {}
  storage:
    config:
      storeType: bluestore
    nodes:
    - config:
        metadataDevice: sda
      devices:
      - name: sdb
      - name: sdc
      - name: sdd
      - name: sde
      - name: sdf
      - name: sdg
      - name: sdh
      - name: sdi
      - name: sdj
      - name: sdk
      - name: sdm
      - name: sdl
      name: dell-34
      resources: {}
    useAllDevices: false
  waitTimeoutForHealthyOSDInMinutes: 10
status:
  ceph:
    capacity:
      bytesAvailable: 192010751717376
      bytesTotal: 208011641389056
      bytesUsed: 16000889671680
      lastUpdated: "2023-09-22T04:40:00Z"
    fsid: 0fd74bc0-1e01-4c48-8400-fd05eda63089
    health: HEALTH_OK
    lastChanged: "2023-09-22T04:34:24Z"
    lastChecked: "2023-09-22T04:40:00Z"
    previousHealth: HEALTH_WARN
    versions:
      mgr:
        ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable): 1
      mon:
        ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable): 1
      osd:
        ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable): 12
      overall:
        ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable): 14
  conditions:
  - lastHeartbeatTime: "2023-09-22T04:40:00Z"
    lastTransitionTime: "2023-09-22T04:33:22Z"
    message: Cluster created successfully
    reason: ClusterCreated
    status: "True"
    type: Ready
  - lastHeartbeatTime: "2023-09-22T04:40:04Z"
    lastTransitionTime: "2023-09-22T04:40:04Z"
    message: Configuring Ceph Mons
    reason: ClusterProgressing
    status: "True"
    type: Progressing
  message: Configuring Ceph Mons
  observedGeneration: 2
  phase: Progressing
  state: Creating
  storage:
    deviceClasses:
    - name: hdd
  version:
    image: quay.io/ceph/ceph:v17.2.0
    version: 17.2.0-0
```
Exec command: kubectl scale -n rook-ceph deployment --replicas=0 rook-ceph-mon-a
Can you also share the rook-ceph-operator-* pod logs after modifying?
Is it because of a single node with a single mon?
Most likely. Can you try with 3 mons on a single node? (A sketch of the implied mon settings follows.)
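For reference, the mon settings that suggestion implies would look roughly like the sketch below, assuming the standard CephCluster spec layout (though, as the next replies show, the operator refuses this combination once host networking is enabled):

```yaml
# Hypothetical mon section for a single-node test cluster.
spec:
  mon:
    count: 3
    allowMultiplePerNode: true   # required to co-locate mons on one node
```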
I tried to start 3 mons on a single node, but even with allowMultiplePerNode: true it doesn't work. Logs:

```
rook-ceph-cluster-controller: failed to reconcile CephCluster "rook-ceph/rook-ceph". failed to reconcile cluster "rook-ceph": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: refusing to deploy 3 monitors on the same host with host networking and allowMultiplePerNode is true. only one monitor per node is allowed
```
Oh right. My bad. That won't work. With host networking the mons will get the IP of the node, so we can't have more than one mon on a node.
Do you know how to rebuild a cluster? Can I rebuild a cluster with my old OSDs so that the data stays the same?
You might want to check the Rook disaster recovery guide to see the best option available for your cluster.
I deployed a 3-node environment, but still no mon-d is created automatically, and the interval is still 10m even though I configured it in healthCheck.daemonHealth.mon.timeout.
Strange. After kubectl scale deployment rook-ceph-mon-a --replicas=0 -n rook-ceph, the operator should fail over mon-a to mon-d after 10 minutes. Can you share the rook operator logs after you scale down mon-a?
> In my test cluster, I configured one mon. When I execute the command kubectl scale deployment rook-ceph-mon-a --replicas=0 -n rook-ceph and set the timeout to 0, no mons are restored. Have you ever encountered this situation?
In a single-mon cluster, I don't think restoration is possible. More than half of the mons must be available to restore a cluster, and with a single mon, losing it leaves no surviving majority to recover quorum from.
> I tried to start 3 mons on a single node, but even with allowMultiplePerNode: true it doesn't work.
The ability to change from non-host to host networking was added in Rook v1.10.5.
> Do you know how to rebuild a cluster? Can I rebuild a cluster with my old OSDs so that the data stays the same?
Rebuilding a cluster is possible but risky, especially with only a single monitor. It would be safest to upgrade from v1.7 -> 1.8 -> 1.9 -> 1.10 at minimum. That would allow Rook to change the networking internally. I would recommend upgrading to at least 1.11, which is currently the lowest version under active upstream support.
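Each step of that path amounts to bumping the operator image one minor version at a time; a minimal sketch, where the patch-level tags are illustrative assumptions (check the Rook release notes for the recommended tags of each series):

```yaml
# Hypothetical excerpt of the rook-ceph-operator Deployment.
# Apply one minor version per step: v1.7.x -> v1.8.x -> v1.9.x -> v1.10.x -> v1.11.x
spec:
  template:
    spec:
      containers:
      - name: rook-ceph-operator
        image: rook/ceph:v1.8.10   # then e.g. v1.9.13, v1.10.13, v1.11.11
```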
I upgraded the 3-node cluster to v1.11.11, then switched the network to "host". mgr and osd automatically completed the update, and for the mons I followed the "Failing over a Monitor" steps manually, but it did not work as expected and no new mon was started. Operator log:
ceph status:
@wanghui-devops Can you share the following details?
@sp98 my steps:
- exec kubectl edit -n rook-ceph cephclusters.ceph.rook.io rook-ceph to change network.provider;
- exec kubectl -n rook-ceph scale deployment --replicas=0 rook-ceph-mon-a, then wait for a new mon to start ...

This is the complete rook-ceph-operator log: operator.log. And ceph status:
```
  cluster:
    id:     d8233e3f-24d6-46c5-9490-36aa6b73d896
    health: HEALTH_WARN
            1/3 mons down, quorum f,g
            Reduced data availability: 14 pgs inactive

  services:
    mon: 3 daemons, quorum f,g (age 17m), out of quorum: a
    mgr: a(active, since 49m), standbys: b
    osd: 3 osds: 3 up (since 81m), 3 in (since 26h)

  data:
    pools:   2 pools, 33 pgs
    objects: 11 objects, 4.2 MiB
    usage:   15 TiB used, 15 TiB / 29 TiB avail
    pgs:     42.424% pgs unknown
             19 active+clean
             14 unknown
```
> @sp98 my steps:
> - exec kubectl edit -n rook-ceph cephclusters.ceph.rook.io rook-ceph to change network.provider;

Looks like the cluster was not in good shape even before this step was performed. Can you confirm whether you had 14 pgs inactive even before you updated the host network?
@wanghui-devops The cluster seems to be stuck at:

```
2023-09-27 07:02:18.656149 I | op-osd: OSD 1 is not ok-to-stop. will try updating it again later
```

So there are two possibilities here:
1. The PGs were already inactive before the host network was added, or
2. The PGs became inactive (14 pgs inactive) after the host network was added.

In either case, the cluster is stuck because it is not able to stop an OSD, and because of this it is not failing over mon-a.
I'm sure it's the second case, because I checked the cluster state before changing the network. Is this caused by data already existing in the PGs before the change? What should I do when that happens?
@wanghui-devops can you share the complete rook-ceph operator logs in a text file? I need to check the complete logs from the beginning, when the operator was first created.
@sp98 My complete steps:
- the cluster status was HEALTH_OK;
- exec kubectl -n rook-ceph scale deployment --replicas=0 rook-ceph-mon-a.

Complete log: operator.txt
Thanks @wanghui-devops. Looks like ceph is complaining that the OSD is not ok-to-stop after updating the CR to use the host network.
Is this a test cluster? If yes, can you update the CephCluster CR to add skipUpgradeChecks: true (a sketch follows the log excerpt below)? This will skip any upgrade checks for now. After that you should see mon-a failing over and a new mon being created with host networking. The logs might look something like:
```
2023-09-28 07:09:25.825051 W | cephclient: skipping adding mon "a" to config file, detected out of quorum
2023-09-28 07:09:25.829594 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2023-09-28 07:09:25.830345 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2023-09-28 07:09:25.852498 W | op-mon: mon "a" not found in quorum, waiting for timeout (599 seconds left) before failover
2023-09-28 07:10:11.371954 W | op-mon: mon "a" not found in quorum, waiting for timeout (554 seconds left) before failover
```
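A minimal sketch of the CR change being suggested here, assuming the resource name and namespace from this thread:

```yaml
# Hypothetical excerpt; only skipUpgradeChecks is being added.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  skipUpgradeChecks: true   # test clusters only: bypasses the ok-to-stop check
```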
@sp98 Do I have to execute the command kubectl -n rook-ceph scale deployment --replicas=0 rook-ceph-mon-a manually?
Now the operator logs:
```
2023-09-28 08:54:47.751942 W | op-mon: mon "a" NOT found in quorum and health timeout is 0, mon will never fail over
2023-09-28 08:54:47.751973 W | op-mon: monitor failover is disabled
2023-09-28 08:55:33.104082 W | op-mon: mon "a" NOT found in quorum and health timeout is 0, mon will never fail over
2023-09-28 08:55:33.104103 W | op-mon: monitor failover is disabled
2023-09-28 08:56:18.466629 W | op-mon: mon "a" NOT found in quorum and health timeout is 0, mon will never fail over
2023-09-28 08:56:18.466676 W | op-mon: monitor failover is disabled
2023-09-28 08:57:03.808799 W | op-mon: mon "a" NOT found in quorum and health timeout is 0, mon will never fail over
2023-09-28 08:57:03.808820 W | op-mon: monitor failover is disabled
2023-09-28 08:57:49.148884 W | op-mon: mon "a" NOT found in quorum and health timeout is 0, mon will never fail over
2023-09-28 08:57:49.148904 W | op-mon: monitor failover is disabled
```
@wanghui-devops you should not change the mon health monitor timeout to 0s. A timeout of 0 means we don't want the mons to fail over; the default value is 10 minutes. You can reduce that value to less than 10 minutes, say 1m, but don't change it to 0s.
Basically you need to revert this change that you made:

```diff
  Disabled: false,
  Interval: &{Duration: s"45s"},
- Timeout: "",
+ Timeout: "0s",
```

If you keep the timeout empty, it will take 10 minutes for the mons to fail over. Zero seconds means no failover. A corrected healthCheck section is sketched below.
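A minimal sketch of the corrected CR fields, based on the healthCheck section in the dumps above; the 10m value is the stated default, written explicitly here only for illustration:

```yaml
# Hypothetical excerpt of the CephCluster CR.
spec:
  healthCheck:
    daemonHealth:
      mon:
        interval: 45s
        timeout: 10m   # or leave empty for the default; never 0s, which disables failover
```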
@sp98 The good news is that the network switch succeeded and the mon failover is complete. The bad news is that my test PV cannot be mounted. Do I need to reboot the host?
```
Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Normal   Scheduled               6m57s                default-scheduler        Successfully assigned default/busybox-rbd-pool-1 to dell-34
  Normal   SuccessfulAttachVolume  6m57s                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-3dcb9f4d-51d8-4076-a474-72260ef9a71d"
  Warning  FailedMount             4m55s                kubelet                  MountVolume.MountDevice failed for volume "pvc-3dcb9f4d-51d8-4076-a474-72260ef9a71d" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             45s (x9 over 4m55s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-3dcb9f4d-51d8-4076-a474-72260ef9a71d" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-62c32932-33e0-405b-90c1-056a4cb06d67 already exists
  Warning  FailedMount             20s (x3 over 4m54s)  kubelet                  Unable to attach or mount volumes: unmounted volumes=[my-pv-volume], unattached volumes=[my-pv-volume default-token-9vnkn]: timed out waiting for the condition
```
Can you share the kubectl get pods -n rook-ceph -o wide output?
```
root@dell-34:~# kubectl get pod -n rook-ceph -o wide
NAME                                                READY   STATUS      RESTARTS   AGE   IP              NODE      NOMINATED NODE   READINESS GATES
csi-cephfsplugin-9k4d5                              2/2     Running     0          80m   192.168.60.42   dell-42   <none>           <none>
csi-cephfsplugin-ctckr                              2/2     Running     0          79m   192.168.60.41   dell-41   <none>           <none>
csi-cephfsplugin-holder-rook-ceph-6pxjp             1/1     Running     0          8h    10.244.0.78     dell-34   <none>           <none>
csi-cephfsplugin-holder-rook-ceph-lwgfv             1/1     Running     0          8h    10.244.2.132    dell-42   <none>           <none>
csi-cephfsplugin-holder-rook-ceph-ttwxg             1/1     Running     0          8h    10.244.1.113    dell-41   <none>           <none>
csi-cephfsplugin-lkcp7                              2/2     Running     0          79m   192.168.60.34   dell-34   <none>           <none>
csi-cephfsplugin-provisioner-58948fc785-64nrw       5/5     Running     0          80m   10.244.1.175    dell-41   <none>           <none>
csi-cephfsplugin-provisioner-58948fc785-cnx5m       5/5     Running     0          80m   10.244.0.254    dell-34   <none>           <none>
csi-rbdplugin-92rvv                                 2/2     Running     0          79m   192.168.60.34   dell-34   <none>           <none>
csi-rbdplugin-holder-rook-ceph-5stln                1/1     Running     0          8h    10.244.1.114    dell-41   <none>           <none>
csi-rbdplugin-holder-rook-ceph-6hbr4                1/1     Running     0          8h    10.244.0.77     dell-34   <none>           <none>
csi-rbdplugin-holder-rook-ceph-h9n2t                1/1     Running     0          8h    10.244.2.133    dell-42   <none>           <none>
csi-rbdplugin-kfbbm                                 2/2     Running     0          79m   192.168.60.41   dell-41   <none>           <none>
csi-rbdplugin-provisioner-5486f64f-n29bw            5/5     Running     0          80m   10.244.0.253    dell-34   <none>           <none>
csi-rbdplugin-provisioner-5486f64f-vhnrb            5/5     Running     0          80m   10.244.1.174    dell-41   <none>           <none>
csi-rbdplugin-rwzsp                                 2/2     Running     0          80m   192.168.60.42   dell-42   <none>           <none>
rook-ceph-crashcollector-dell-34-dcb8f9b64-gfws7    1/1     Running     0          80m   192.168.60.34   dell-34   <none>           <none>
rook-ceph-crashcollector-dell-41-89858dfd4-4j6vw    1/1     Running     0          80m   192.168.60.41   dell-41   <none>           <none>
rook-ceph-crashcollector-dell-42-5859d9f6df-l4sfq   1/1     Running     0          80m   192.168.60.42   dell-42   <none>           <none>
rook-ceph-mgr-a-548cd8479b-dtmhp                    2/2     Running     1          74m   192.168.60.34   dell-34   <none>           <none>
rook-ceph-mgr-b-8c6b55bdd-6kpsf                     2/2     Running     1          72m   192.168.60.41   dell-41   <none>           <none>
rook-ceph-mon-g-6ff656cbf4-lhc52                    1/1     Running     0          42m   192.168.60.34   dell-34   <none>           <none>
rook-ceph-mon-h-568f9b87f7-s6ff6                    1/1     Running     0          37m   192.168.60.42   dell-42   <none>           <none>
rook-ceph-mon-i-bf66847bb-krqrl                     1/1     Running     0          26m   192.168.60.41   dell-41   <none>           <none>
rook-ceph-operator-647948646c-892gk                 1/1     Running     0          8h    10.244.0.124    dell-34   <none>           <none>
rook-ceph-osd-0-656fcc879b-h5q9b                    1/1     Running     0          71m   192.168.60.41   dell-41   <none>           <none>
rook-ceph-osd-1-b8888465f-lvsxh                     1/1     Running     0          69m   192.168.60.34   dell-34   <none>           <none>
rook-ceph-osd-2-786d49fc44-fcqf4                    1/1     Running     0          68m   192.168.60.42   dell-42   <none>           <none>
rook-ceph-osd-prepare-dell-34-gj46v                 0/1     Completed   0          25m   192.168.60.34   dell-34   <none>           <none>
rook-ceph-osd-prepare-dell-41-nc7k6                 0/1     Completed   0          25m   192.168.60.41   dell-41   <none>           <none>
rook-ceph-osd-prepare-dell-42-9pgsx                 0/1     Completed   0          24m   192.168.60.42   dell-42   <none>           <none>
rook-ceph-tools-555c879675-pk597                    1/1     Running     0          8h    10.244.1.112    dell-41   <none>           <none>
```
ceph status:

```
root@dell-34:~# kubectl exec -it -n rook-ceph rook-ceph-tools-555c879675-pk597 -- ceph -s
  cluster:
    id:     fd15bc40-b2ff-446d-a033-cf28ee416873
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum g,h,i (age 29m)
    mgr: a(active, since 28m), standbys: b
    osd: 3 osds: 3 up (since 70m), 3 in (since 8h)

  data:
    pools:   4 pools, 97 pgs
    objects: 10.42k objects, 36 GiB
    usage:   29 TiB used, 29 TiB / 58 TiB avail
    pgs:     97 active+clean
```
@wanghui-devops you can try rebooting the nodes. That should fix the pv mount issue.
A Velero backup task is in progress, so I can't restart the node yet. I'll try it after the backup is completed.
@wanghui-devops were you able to get this working?
@sp98 After the reboot, it's still the same error.
```
Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Normal   Scheduled               9m25s                default-scheduler        Successfully assigned default/busybox-rbd-pool-1 to dell-34
  Normal   SuccessfulAttachVolume  9m25s                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-2150e55e-4b06-4e96-aba8-0ebad8d8726c"
  Warning  FailedMount             7m9s                 kubelet                  MountVolume.MountDevice failed for volume "pvc-2150e55e-4b06-4e96-aba8-0ebad8d8726c" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             5m8s                 kubelet                  Unable to attach or mount volumes: unmounted volumes=[my-pv-volume], unattached volumes=[default-token-9vnkn my-pv-volume]: timed out waiting for the condition
  Warning  FailedMount             57s (x10 over 7m9s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-2150e55e-4b06-4e96-aba8-0ebad8d8726c" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000008-b81422a1-1535-4777-ac61-3bcb7eef0900 already exists
  Warning  FailedMount             34s (x3 over 7m22s)  kubelet                  Unable to attach or mount volumes: unmounted volumes=[my-pv-volume], unattached volumes=[my-pv-volume default-token-9vnkn]: timed out waiting for the condition
```
The dmesg output at this point:
```
[Sun Oct  8 15:11:28 2023] libceph: mon2 192.168.60.41:6789 session established
[Sun Oct  8 15:11:28 2023] libceph: mon2 192.168.60.41:6789 socket closed (con state OPEN)
[Sun Oct  8 15:11:28 2023] libceph: mon2 192.168.60.41:6789 session lost, hunting for new mon
[Sun Oct  8 15:11:28 2023] libceph: mon1 192.168.60.41:6789 session established
[Sun Oct  8 15:11:28 2023] libceph: client994145 fsid fd15bc40-b2ff-446d-a033-cf28ee416873
[Sun Oct  8 15:11:30 2023] libceph: osd1 192.168.60.34:6801 socket closed (con state CONNECTING)
[Sun Oct  8 15:11:33 2023] libceph: osd1 192.168.60.34:6801 socket closed (con state CONNECTING)
[Sun Oct  8 15:11:36 2023] libceph: osd1 192.168.60.34:6801 socket closed (con state CONNECTING)
[Sun Oct  8 15:11:39 2023] libceph: osd1 192.168.60.34:6801 socket closed (con state CONNECTING)
[Sun Oct  8 15:11:47 2023] libceph: osd1 192.168.60.34:6801 socket closed (con state CONNECTING)
[Sun Oct  8 15:11:58 2023] libceph: osd1 192.168.60.34:6801 socket closed (con state CONNECTING)
[Sun Oct  8 15:12:16 2023] libceph: osd1 192.168.60.34:6801 socket closed (con state CONNECTING)
[Sun Oct  8 15:12:53 2023] libceph: osd1 192.168.60.34:6801 socket closed (con state CONNECTING)
[Sun Oct  8 15:14:01 2023] libceph: osd1 192.168.60.34:6801 socket closed (con state CONNECTING)
[Sun Oct  8 15:14:24 2023] INFO: task mapper:5375 blocked for more than 120 seconds.
[Sun Oct  8 15:14:24 2023]       Tainted: G        W        4.15.0-175-generic #184-Ubuntu
[Sun Oct  8 15:14:24 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
```
> MountVolume.MountDevice failed for volume "pvc-2150e55e-4b06-4e96-aba8-0ebad8d8726c" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000008-b81422a1-1535-4777-ac61-3bcb7eef0900 already exists

Hi @wanghui-devops, a cluster error or network problems can cause some commands to hang, and this message is reported when that happens. So you can try the following steps.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
rook-ceph version: v1.7. I'm trying to switch to host networking on a single-node ceph cluster that uses multus. Is there any guidance or documentation to help me do this? Alternatively, if I rebuild on top of the original cluster with my old OSDs, can the data stay intact after reconstruction?