molpako opened this issue 6 months ago
From https://docs.ceph.com/en/quincy/rados/operations/health-checks/#pg-degraded:
Some data does not have the required number of replicas (or, for an EC pool, the required number of erasure-coded chunks).
In most cases the cause seems to be one or more OSDs being down.
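Since a down OSD is the usual cause, a quick first check could look like the following (a minimal sketch; in this cluster all OSDs turned out to be up):

ceph osd stat        # "N osds: N up, N in" summary
ceph osd tree down   # show only OSDs currently marked down, if any
ceph health detail   # lists the degraded/undersized PGs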
Let's check the state of pgid 7.0. When CRUSH cannot find an OSD to fill a slot in the PG mapping, 2147483647 (2^31 - 1) is shown; in ceph health detail the same slot shows up as NONE.
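To find which PGs are affected in the first place, something like this works (a sketch, not taken from the session below):

ceph pg ls undersized          # list PGs currently in the undersized state
ceph pg dump_stuck undersized  # same, but only PGs stuck in that state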
root@mon2:/# ceph tell 7.0 query
{
"snap_trimq": "[]",
"snap_trimq_len": 0,
"state": "active+undersized",
"epoch": 895,
"up": [
5,
9,
3,
2147483647
],
"acting": [
5,
9,
3,
2147483647
],
"acting_recovery_backfill": [
"3(2)",
"5(0)",
"9(1)"
],
The pool a PG belongs to can be read straight off the pgid, since the format is
{pool-num}.{pg-id}
so the pool-num here is 7.
Pool 7 is cephfs_data_ec:
root@mon2:/# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 677 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 11.11
pool 5 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 677 lfor 0/0/522 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 2.06
pool 6 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 860 lfor 0/0/608 flags hashpspool stripe_width 0 application cephfs read_balance_score 1.63
pool 7 'cephfs_data_ec' erasure profile default size 4 min_size 3 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 677 lfor 0/0/541 flags hashpspool stripe_width 8192 application cephfs
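As an aside, if only the pool id-to-name mapping is needed, there is a shorter command:

ceph osd lspools   # prints "<id> <name>" per pool, e.g. pool 7 -> cephfs_data_ec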
root@mon2:/# ceph osd pool stats cephfs_data_ec
pool cephfs_data_ec id 7
nothing is going on
The pool was created with the default erasure code profile, which here gives size 4 (presumably k=2 data chunks + m=2 coding chunks, consistent with min_size 3 and stripe_width 8192), and the EC CRUSH rule created for it places each chunk on a different host (the default failure domain). Each PG therefore needed 4 OSDs on 4 distinct hosts, but only 3 OSD hosts were available, so one shard could not be placed; that appears to be why PG_DEGRADED was raised.
refs: adding an EC pool to CephFS
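The assumed chunk counts and failure domain can be confirmed like this (a sketch; the profile name "default" and rule id 2 come from the pool listing above):

ceph osd erasure-code-profile get default   # k, m and crush-failure-domain of the profile
ceph osd crush rule dump                    # look for the rule with rule_id 2 used by pool 7
ceph osd tree                               # how many OSD hosts the cluster actually has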
root@mon2:/# rados -p cephfs_data_ec ls
root@mon2:/# ceph pg map 7.0
osdmap e895 pg 7.0 (7.0) -> up [5,9,3,2147483647] acting [5,9,3,2147483647]
root@mon2:/# ^C
root@mon2:/# rados -p cephfs_data_ec ls
root@mon2:/#
No objects have been stored in the pool yet.
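The same thing can be seen from the per-pool statistics instead of listing objects (sketch):

ceph df detail   # the OBJECTS column should show 0 for cephfs_data_ec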
I added OSDs to the cluster by adding a new host (osd4).
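For reference, with cephadm/orchestrator the steps look roughly like this (a sketch; the exact commands were not captured in this session, and /dev/sdX is a placeholder device path):

ssh-copy-id -f -i /etc/ceph/ceph.pub root@osd4        # let cephadm reach the new host
ceph orch host add osd4 192.168.68.111 --labels osd   # register the host with an osd label
ceph orch device ls                                   # check which disks on osd4 are usable
ceph orch daemon add osd osd4:/dev/sdX                # create an OSD on a specific device
# or, if an all-available-devices spec is in use, OSDs are created automatically:
# ceph orch apply osd --all-available-devices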
root@mon2:/# ceph orch host ls
HOST ADDR LABELS STATUS
client1 192.168.68.122 clients
mds1 192.168.68.119 mds
mds2 192.168.68.118 mds
mds3 192.168.68.125 mds
mon2 192.168.68.116 monitor,_admin
mon3 192.168.68.112 monitor
osd1 192.168.68.127 osd
osd2 192.168.68.123 osd
osd3 192.168.68.121 osd
osd4 192.168.68.111 osd
10 hosts in cluster
osd.11, osd.12 and osd.13 were added, all on the new host osd4.
root@mon2:/# ceph orch ps --daemon_type osd
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
osd.0 osd2 running (9d) 8m ago 9d 361M 4096M 18.2.2 6dc5f0faebb2 66899c35de74
osd.1 osd3 running (9d) 4m ago 9d 312M 4096M 18.2.2 6dc5f0faebb2 fee07f50fef9
osd.2 osd1 running (15h) 8m ago 9d 205M 4096M 18.2.2 6dc5f0faebb2 c7471d8ed7c8
osd.3 osd2 running (9d) 8m ago 9d 405M 4096M 18.2.2 6dc5f0faebb2 492b1e9a0b47
osd.4 osd3 running (9d) 4m ago 9d 314M 4096M 18.2.2 6dc5f0faebb2 f306c6bbc039
osd.5 osd1 running (15h) 8m ago 9d 197M 4096M 18.2.2 6dc5f0faebb2 8f45b04ae751
osd.6 osd2 running (18h) 8m ago 9d 353M 4096M 18.2.2 6dc5f0faebb2 0be213ec5fa0
osd.7 osd3 running (9d) 4m ago 9d 297M 4096M 18.2.2 6dc5f0faebb2 3b3a879f1203
osd.8 osd1 running (15h) 8m ago 9d 207M 4096M 18.2.2 6dc5f0faebb2 018f7fe5840f
osd.9 osd3 running (9d) 4m ago 9d 408M 4096M 18.2.2 6dc5f0faebb2 ad0f5df41e64
osd.10 osd1 running (15h) 8m ago 9d 201M 4096M 18.2.2 6dc5f0faebb2 44e8d1eea7f2
osd.11 osd4 running (4m) 27s ago 4m 373M 4096M 18.2.2 6dc5f0faebb2 1c36f2be65d5
osd.12 osd4 running (4m) 27s ago 4m 363M 4096M 18.2.2 6dc5f0faebb2 a85f666f1310
osd.13 osd4 running (4m) 27s ago 4m 357M 4096M 18.2.2 6dc5f0faebb2 e66a78d74d4c
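What actually unblocks the EC placement is that the CRUSH map now contains a fourth OSD host. That can be checked with (sketch):

ceph osd tree   # osd.11-13 should now appear under a new host bucket "osd4"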
Logs captured while the OSDs were being added:
During the addition, the PG_DEGRADED entry for pool 7 was already gone, and instead PGs in pool 6 (cephfs_data) showed up as degraded/remapped while data rebalanced onto the new OSDs.
root@mon2:/# ceph health detail
HEALTH_WARN Reduced data availability: 1 pg peering; Degraded data redundancy: 9540/29934 objects degraded (31.870%), 91 pgs degraded
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg peering
pg 6.7 is stuck peering for 15h, current state peering, last acting [11,3,8]
[WRN] PG_DEGRADED: Degraded data redundancy: 9540/29934 objects degraded (31.870%), 91 pgs degraded
pg 6.0 is active+recovery_wait+undersized+degraded+remapped, acting [5,1]
pg 6.1 is active+recovery_wait+undersized+degraded+remapped, acting [1,10]
pg 6.2 is active+recovery_wait+undersized+degraded+remapped, acting [5,4]
pg 6.3 is active+recovery_wait+undersized+degraded+remapped, acting [8,1]
pg 6.4 is active+recovery_wait+undersized+degraded+remapped, acting [10,6]
Presumably because the remapping had finished, the WARN was gone:
root@mon2:/# ceph health detail
HEALTH_OK
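To keep an eye on the recovery/rebalance while it runs, something like the following works (sketch):

ceph -s          # overall cluster status, including recovery I/O
ceph progress    # progress of ongoing rebalance/recovery events
ceph pg stat     # one-line summary of PG states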
osd df before adding the OSDs:
root@mon2:/# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
2 hdd 0.02910 1.00000 30 GiB 9.2 GiB 9.2 GiB 6 KiB 54 MiB 21 GiB 30.99 0.66 49 up
5 hdd 0.02910 1.00000 30 GiB 10 GiB 10 GiB 11 KiB 44 MiB 20 GiB 33.62 0.72 49 up
8 hdd 0.02910 1.00000 30 GiB 11 GiB 11 GiB 4 KiB 49 MiB 19 GiB 36.42 0.78 49 up
10 hdd 0.02910 1.00000 30 GiB 8.7 GiB 8.7 GiB 9 KiB 42 MiB 21 GiB 29.30 0.63 46 up
0 hdd 0.01819 1.00000 19 GiB 12 GiB 12 GiB 0 B 136 MiB 6.4 GiB 65.59 1.40 57 up
3 hdd 0.01819 1.00000 19 GiB 14 GiB 13 GiB 0 B 144 MiB 5.1 GiB 72.72 1.55 84 up
6 hdd 0.01819 1.00000 19 GiB 13 GiB 13 GiB 5 KiB 102 MiB 5.4 GiB 71.17 1.52 61 up
1 hdd 0.01819 1.00000 19 GiB 8.6 GiB 8.5 GiB 0 B 109 MiB 10 GiB 46.32 0.99 45 up
4 hdd 0.01819 1.00000 19 GiB 10 GiB 10 GiB 0 B 113 MiB 8.4 GiB 55.04 1.17 42 up
7 hdd 0.01819 1.00000 19 GiB 10 GiB 10 GiB 0 B 114 MiB 8.6 GiB 54.07 1.15 58 up
9 hdd 0.01819 1.00000 19 GiB 10 GiB 10 GiB 0 B 122 MiB 8.5 GiB 54.52 1.16 55 up
TOTAL 250 GiB 117 GiB 116 GiB 38 KiB 1.0 GiB 133 GiB 46.86
MIN/MAX VAR: 0.63/1.55 STDDEV: 15.47
osd df after adding the OSDs:
root@mon2:/# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
2 hdd 0.02910 1.00000 30 GiB 9.2 GiB 9.1 GiB 6 KiB 77 MiB 21 GiB 30.89 0.88 51 up
5 hdd 0.02910 1.00000 30 GiB 8.9 GiB 8.8 GiB 11 KiB 83 MiB 21 GiB 29.81 0.85 40 up
8 hdd 0.02910 1.00000 30 GiB 9.3 GiB 9.3 GiB 4 KiB 88 MiB 20 GiB 31.36 0.89 44 up
10 hdd 0.02910 1.00000 30 GiB 9.2 GiB 9.1 GiB 9 KiB 85 MiB 21 GiB 30.74 0.87 46 up
0 hdd 0.01819 1.00000 19 GiB 6.9 GiB 6.8 GiB 0 B 140 MiB 12 GiB 37.08 1.05 36 up
3 hdd 0.01819 1.00000 19 GiB 8.0 GiB 7.8 GiB 0 B 179 MiB 11 GiB 42.92 1.22 49 up
6 hdd 0.01819 1.00000 19 GiB 7.3 GiB 7.2 GiB 5 KiB 107 MiB 11 GiB 39.02 1.11 38 up
1 hdd 0.01819 1.00000 19 GiB 6.8 GiB 6.7 GiB 0 B 114 MiB 12 GiB 36.51 1.04 37 up
4 hdd 0.01819 1.00000 19 GiB 7.8 GiB 7.7 GiB 0 B 118 MiB 11 GiB 41.86 1.19 39 up
7 hdd 0.01819 1.00000 19 GiB 6.9 GiB 6.8 GiB 0 B 136 MiB 12 GiB 36.97 1.05 41 up
9 hdd 0.01819 1.00000 19 GiB 7.2 GiB 7.0 GiB 0 B 144 MiB 11 GiB 38.39 1.09 35 up
11 hdd 0.02730 1.00000 28 GiB 9.9 GiB 9.8 GiB 0 B 97 MiB 18 GiB 35.44 1.01 46 up
12 hdd 0.02730 1.00000 28 GiB 10 GiB 10 GiB 0 B 79 MiB 18 GiB 36.26 1.03 61 up
13 hdd 0.02730 1.00000 28 GiB 10 GiB 10 GiB 0 B 95 MiB 18 GiB 36.15 1.03 48 up
TOTAL 333 GiB 118 GiB 116 GiB 38 KiB 1.5 GiB 216 GiB 35.25
MIN/MAX VAR: 0.85/1.22 STDDEV: 3.97
Usage has been evened out onto the newly added OSDs.
The PG that previously had no OSD to fill its fourth slot is now using one of the new OSDs as well:
root@mon2:/# ceph tell 7.0 query
{
"snap_trimq": "[]",
"snap_trimq_len": 0,
"state": "active+clean",
"epoch": 1028,
"up": [
5,
9,
3,
12
],
"acting": [
5,
9,
3,
12
],
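A couple of other ways to confirm the same thing (sketch): the PG map should now show [5,9,3,12] with no 2147483647 placeholder, and grouping osd df by host makes the new host's share easy to see.

ceph pg map 7.0                    # up/acting should no longer contain 2147483647
ceph pg ls-by-pool cephfs_data_ec  # all PGs of pool 7 should be active+clean
ceph osd df tree                   # usage broken down per host, including osd4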
With that, the PG_DEGRADED warning that prompted this whole exercise is resolved.