rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

Why are only half of the OSDs used? #13817

Closed wanghui-devops closed 4 months ago

wanghui-devops commented 7 months ago

I have 6 SSD OSDs, but only 3 of them are being used.

[rook@rook-ceph-tools-555c879675-s84nt tmp]$ ceph osd df | grep ssd
 9    ssd   1.81940   0.95001  1.8 TiB  1.7 TiB  1.7 TiB   35 KiB  3.2 GiB   94 GiB  94.97  2.74    1      up
12    ssd   1.81940   1.00000  1.8 TiB   64 MiB   16 MiB   10 KiB   48 MiB  1.8 TiB   0.00     0    1      up
10    ssd   1.81940   1.00000  1.8 TiB   22 MiB   16 MiB    8 KiB  6.2 MiB  1.8 TiB   0.00     0    1      up
13    ssd   1.81940   0.95001  1.8 TiB  1.7 TiB  1.7 TiB   13 KiB  3.2 GiB   94 GiB  94.97  2.74    2      up
11    ssd   1.81940   1.00000  1.8 TiB   28 MiB   16 MiB   23 KiB   13 MiB  1.8 TiB   0.00     0    0      up
14    ssd   1.81940   0.95001  1.8 TiB  1.7 TiB  1.7 TiB    3 KiB  3.1 GiB   94 GiB  94.97  2.74    1      up

osd.12, osd.10, and osd.11 are hardly used.

The rule:

rule ssd-ec-data-pool-middle-core {
        id 16
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class ssd
        step chooseleaf indep 0 type osd
        step emit
}
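
To check how many PGs the pool has and which OSDs they actually map to, something like the following can be used (the pool name is a placeholder):

ceph osd pool get <pool-name> pg_num      # current PG count for the pool
ceph pg ls-by-pool <pool-name>            # lists each PG and its acting set of OSDs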

One more question: why is the pool only 3.5 TiB?
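
A rough back-of-the-envelope on the capacity, assuming this pool uses a k=2, m=1 erasure-coded profile (consistent with exactly 3 OSDs holding data, but not confirmed in the thread):

# 1 PG mapped to 3 of the 6 SSD OSDs -> raw capacity = 3 x 1.82 TiB ~ 5.46 TiB
# usable with k=2, m=1               -> 5.46 TiB x k/(k+m) = 5.46 x 2/3 ~ 3.6 TiB, roughly the 3.5 TiB reported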

travisn commented 7 months ago

The PG count is very low. Try increasing the number of PGs in the pools and check if the autoscaler needs to be enabled.
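
A rough sketch of the relevant commands (the pool name is a placeholder):

ceph osd pool autoscale-status                       # per-pool PG targets and whether autoscaling is on
ceph osd pool ls detail                              # current pg_num / pgp_num / autoscale_mode per pool
ceph osd pool set <pool-name> pg_autoscale_mode on   # enable the autoscaler on a pool if it is off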

wanghui-devops commented 7 months ago

Yes, the root cause is that pg_num and pgp_num were set to 1, so data was stored on only 3 OSDs. What I don't understand is that PG autoscaling has been enabled on this pool the whole time; when does it not take effect? I fixed the problem by manually setting pg_num and pgp_num and enabling the balancer:

ceph osd pool set mypool pg_num 128
ceph osd pool set mypool pgp_num 128
ceph balancer mode upmap
ceph balancer on
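
To confirm the change took effect, checks along these lines can be used (mypool matches the placeholder above):

ceph osd pool get mypool pg_num
ceph osd pool get mypool pgp_num
ceph balancer status
ceph osd df | grep ssd     # the PGS column should now be spread across all six SSD OSDs
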
subhamkrai commented 7 months ago

I have also observed pg_num/pgp_num being 1, but that is only the initial value; later it changes to at least 32. Can you verify this?

subhamkrai commented 7 months ago

ceph osd pool ls detail

pool 2 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 20 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 5 'replicapool' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 58 lfor 0/0/56 flags hashpspool,selfmanaged_snaps stripe_width 0 compression_mode none application rbd
pool 6 'replicapool1' replicated size 1 min_size 1 crush_rule 5 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 69 lfor 0/0/67 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 7 'replicapool2' replicated size 1 min_size 1 crush_rule 6 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 76 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

As you can see above, replicapool2 has pg_num/pgp_num of 1.

Here are the mgr logs:

debug 2024-02-22T10:49:58.219+0000 7f80f7fd7640  0 [pg_autoscaler INFO root] Pool 'replicapool2' root_id -1 using 1.152026622245709e-11 of space, bias 1.0, pg target 8.640199666842818e-10 quantized to 32 (current 1)

But later it got the right number:

ceph osd pool ls detail
pool 2 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 20 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 5 'replicapool' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 58 lfor 0/0/56 flags hashpspool,selfmanaged_snaps stripe_width 0 compression_mode none application rbd
pool 6 'replicapool1' replicated size 1 min_size 1 crush_rule 5 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 69 lfor 0/0/67 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 7 'replicapool2' replicated size 1 min_size 1 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 81 lfor 0/0/79 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

mgr logs

debug 2024-02-22T10:51:58.294+0000 7f80f7fd7640  0 [pg_autoscaler INFO root] Pool 'replicapool2' root_id -1 using 1.152026622245709e-11 of space, bias 1.0, pg target 8.640199666842818e-10 quantized to 32 (current 32)

@wanghui-devops

wanghui-devops commented 7 months ago

I see from the logs that this should be caused by overlapping roots. How do I fix it?

debug 2024-02-27T03:05:47.174+0000 7f7e47b87700  0 [pg_autoscaler ERROR root] pool 20 has overlapping roots: {-12, -1, -2}
debug 2024-02-27T03:05:47.178+0000 7f7e47b87700  0 [pg_autoscaler WARNING root] pool 20 contains an overlapping root -12... skipping scaling
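
The bucket IDs in that message (-12, -1, -2) can be mapped back to CRUSH roots and per-class shadow roots with, for example:

ceph osd crush tree --show-shadow          # lists shadow roots such as default~ssd (-12) and default~hdd (-2)
ceph osd crush rule dump replicated_rule   # shows which root / device class each rule takes
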
wanghui-devops commented 7 months ago

@subhamkrai

wanghui-devops commented 7 months ago

This is my CRUSH map. Can you see what caused the problem?

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host k8s-rook2 {
    id -3       # do not change unnecessarily
    id -4 class hdd     # do not change unnecessarily
    id -9 class ssd     # do not change unnecessarily
    # weight 45.91138
    alg straw2
    hash 0  # rjenkins1
    item osd.1 weight 43.91138
    item osd.13 weight 1.00000
    item osd.10 weight 1.00000
}
host k8s-rook3 {
    id -5       # do not change unnecessarily
    id -6 class hdd     # do not change unnecessarily
    id -10 class ssd        # do not change unnecessarily
    # weight 45.91138
    alg straw2
    hash 0  # rjenkins1
    item osd.2 weight 43.91138
    item osd.11 weight 1.00000
    item osd.14 weight 1.00000
}
host k8s-rook1 {
    id -7       # do not change unnecessarily
    id -8 class hdd     # do not change unnecessarily
    id -11 class ssd        # do not change unnecessarily
    # weight 45.91138
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 43.91138
    item osd.12 weight 1.00000
    item osd.9 weight 1.00000
}
root default {
    id -1       # do not change unnecessarily
    id -2 class hdd     # do not change unnecessarily
    id -12 class ssd        # do not change unnecessarily
    # weight 137.73413
    alg straw2
    hash 0  # rjenkins1
    item k8s-rook2 weight 45.91138
    item k8s-rook3 weight 45.91138
    item k8s-rook1 weight 45.91138
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule replicated-metadata-pool-middle-server {
    id 1
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule ec-data-pool-middle-server {
    id 2
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step chooseleaf indep 0 type host
    step emit
}
rule myfs-metadata {
    id 3
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule myfs-replicated {
    id 4
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule replicated-metadata-pool-kubesphere-system {
    id 5
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule ec-data-pool-kubesphere-system {
    id 6
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step chooseleaf indep 0 type host
    step emit
}
rule replicated-metadata-pool-nacos-system {
    id 7
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule ec-data-pool-nacos-system {
    id 8
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}
rule replicated-metadata-pool-middle-system {
    id 9
    type replicated
    step take default
    step chooseleaf firstn 0 type osd
    step emit
}
rule ec-data-pool-middle-system {
    id 10
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step chooseleaf indep 0 type host
    step emit
}
rule replicated-metadata-pool-middle-core {
    id 11
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule ec-data-pool-middle-core {
    id 12
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step chooseleaf indep 0 type host
    step emit
}
rule ssd-rule {
    id 13
    type replicated
    step take default class ssd
    step chooseleaf firstn 0 type host
    step emit
}
rule hdd-rule {
    id 14
    type replicated
    step take default class hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule ssd-replicated-metadata-pool-middle-core {
    id 15
    type replicated
    step take default class ssd
    step chooseleaf firstn 0 type host
    step emit
}
rule ssd-ec-data-pool-middle-core {
    id 16
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class ssd
    step chooseleaf indep 0 type osd
    step emit
}
rule replicated-metadata-pool-sso-system {
    id 17
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule ec-data-pool-sso-system {
    id 18
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}
# end crush map

@subhamkrai

wanghui-devops commented 7 months ago

I've fixed the overlapping-roots problem. It was caused by at least one pool still being assigned "replicated_rule", which does not restrict selection to a device class and therefore overlaps with the per-class roots. After I modified the crush_rule for those pools, pg_autoscale works. @subhamkrai
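
For reference, a sketch of how a class-agnostic rule can be replaced with class-specific ones (the rule and pool names below are placeholders, not necessarily the ones used here):

# create replicated rules restricted to one device class each
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd crush rule create-replicated replicated_ssd default host ssd

# point each pool at a class-specific rule so the autoscaler no longer sees overlapping roots
ceph osd pool set <pool-name> crush_rule replicated_ssd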

subhamkrai commented 7 months ago

Great to know, @wanghui-devops. Are we good to close the issue?

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 4 months ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.