rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

[1.14.8] OSD X is not ok-to-stop while creating cluster #14425

Open pasztorl opened 1 month ago

pasztorl commented 1 month ago

I've created a cluster with 3 nodes, and the block pool is replicated across all 3 nodes. By default the .mgr pool has only one replica, so the operator can't finish the cluster installation. This log entry keeps repeating:

2024-07-05 13:08:07.357184 I | op-osd: OSD 0 is not ok-to-stop. will try updating it again later

After I run `ceph osd pool set .mgr size 3`, the operator finishes the installation:

2024-07-05 13:08:08.902544 I | op-osd: updating OSD 0 on node "xxx"
2024-07-05 13:08:09.312462 I | cephclient: successfully disallowed pre-reef osds and enabled all new reef-only functionality
2024-07-05 13:08:09.494508 I | op-osd: finished running OSDs in namespace "storage"
2024-07-05 13:08:09.494516 I | ceph-cluster-controller: done reconciling ceph cluster in namespace "storage"
2024-07-05 13:08:09.505474 I | ceph-cluster-controller: reporting cluster telemetry
2024-07-05 13:08:12.917768 I | ceph-cluster-controller: reporting node telemetry

I couldn't find an option in the Helm chart to set the replication level of the .mgr pool. What is the best practice for this situation?

travisn commented 1 month ago

By default pools should have 3 replicas. Did you change some setting so it would default to 1 replica?
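One Ceph setting that controls the replica count of pools created without an explicit size (which can include `.mgr`) is `osd_pool_default_size`. The sketch below assumes the `rook-ceph-cluster` chart's top-level `configOverride` value, which Rook writes into the ceph.conf override, is where such a setting would appear in the values; if it were set to 1 there, that could explain a single-replica `.mgr` pool.

```yaml
# rook-ceph-cluster Helm values (sketch, assumes the top-level configOverride key).
# Rook applies this text as a ceph.conf override for the cluster's daemons.
# A value of 1 for osd_pool_default_size here would make pools created
# without an explicit size start with a single replica.
configOverride: |
  [global]
  osd_pool_default_size = 3
  osd_pool_default_min_size = 2
```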

To make sure the replication of the .mgr pool is set explicitly, specify a CephBlockPool for it under `cephBlockPools:` in the Helm values. An example of the spec is in pool-builtin-mgr.yaml.
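A `.mgr` entry in the Helm values might look like the sketch below. It assumes the chart renders each `cephBlockPools` item as a CephBlockPool named after `name` and copies `spec` through unchanged. The `builtin-mgr` resource name and the `spec.name: .mgr` override follow the pattern of pool-builtin-mgr.yaml, since a Kubernetes resource name cannot start with a dot. Because a list in a values file replaces the chart's default list instead of merging with it, the existing pool entries must stay in the list.

```yaml
cephBlockPools:
  # Keep your existing entries (for example ceph-blockpool) in this list:
  # overriding cephBlockPools replaces the chart's default list entirely.
  - name: builtin-mgr          # Kubernetes resource name (cannot be ".mgr")
    spec:
      name: .mgr               # actual Ceph pool name used for mgr data
      failureDomain: host      # place each replica on a different node
      replicated:
        size: 3
        requireSafeReplicaSize: true
    storageClass:
      enabled: false           # no StorageClass should use the mgr pool
```

Declaring the pool at install time lets Rook apply `size: 3` from the start instead of relying on a manual `ceph osd pool set` afterwards.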

pasztorl commented 1 month ago

@travisn thanks for the quick answer. I have `replicated: 3` in the values, and all the other pools got 3 replicas, but .mgr did not.

sp98 commented 1 month ago

@pasztorl How are you deploying the cluster? Which Rook version is it? By default, the .mgr pool is created with 3 replicas.


sh-4.4$ ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 19 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 3.03

Did you deploy using [cluster-test.yaml](https://github.com/rook/rook/blob/master/deploy/examples/cluster-test.yaml#L65)?

pasztorl commented 1 month ago

Hi,

I used Helm chart version 1.14.2.

I've attached the values: ceph.values.txt

pasztorl commented 4 weeks ago

@sp98 did I miss something in the values?