molpako opened this issue 6 months ago
From https://docs.ceph.com/en/quincy/rados/operations/health-checks/#pg-degraded:
Some data does not have the required number of replicas (or, for an EC pool, the required number of erasure-coded chunks).
In most cases the cause seems to be one or more OSDs being down.
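Since a down OSD is the usual cause, a quick first check could look like the following (a minimal sketch; in this cluster all OSDs turned out to be up):

ceph osd stat        # "N osds: N up, N in" summary
ceph osd tree down   # show only OSDs currently marked down, if any
ceph health detail   # lists the degraded/undersized PGs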
Let's check the state of pgid 7.0. When CRUSH cannot find an OSD to fill a slot in the PG mapping, 2147483647 (2^31 - 1) is shown; in ceph health detail the same slot shows up as NONE.
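To find which PGs are affected in the first place, something like this works (a sketch, not taken from the session below):

ceph pg ls undersized          # list PGs currently in the undersized state
ceph pg dump_stuck undersized  # same, but only PGs stuck in that state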
root@mon2:/# ceph tell 7.0 query
{
"snap_trimq": "[]",
"snap_trimq_len": 0,
"state": "active+undersized",
"epoch": 895,
"up": [
5,
9,
3,
2147483647
],
"acting": [
5,
9,
3,
2147483647
],
"acting_recovery_backfill": [
"3(2)",
"5(0)",
"9(1)"
],
The pool a PG belongs to can be read straight off the pgid, since the format is
{pool-num}.{pg-id}
so the pool-num here is 7.
Pool 7 is cephfs_data_ec:
root@mon2:/# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 677 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 11.11
pool 5 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 677 lfor 0/0/522 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 2.06
pool 6 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 860 lfor 0/0/608 flags hashpspool stripe_width 0 application cephfs read_balance_score 1.63
pool 7 'cephfs_data_ec' erasure profile default size 4 min_size 3 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 677 lfor 0/0/541 flags hashpspool stripe_width 8192 application cephfs
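As an aside, if only the pool id-to-name mapping is needed, there is a shorter command:

ceph osd lspools   # prints "<id> <name>" per pool, e.g. pool 7 -> cephfs_data_ec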
root@mon2:/# ceph osd pool stats cephfs_data_ec
pool cephfs_data_ec id 7
nothing is going on
The pool was created with the default erasure code profile, which here gives size 4 (presumably k=2 data chunks + m=2 coding chunks, consistent with min_size 3 and stripe_width 8192), and the EC CRUSH rule created for it places each chunk on a different host (the default failure domain). Each PG therefore needed 4 OSDs on 4 distinct hosts, but only 3 OSD hosts were available, so one shard could not be placed; that appears to be why PG_DEGRADED was raised.
refs: adding an EC pool to CephFS
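The assumed chunk counts and failure domain can be confirmed like this (a sketch; the profile name "default" and rule id 2 come from the pool listing above):

ceph osd erasure-code-profile get default   # k, m and crush-failure-domain of the profile
ceph osd crush rule dump                    # look for the rule with rule_id 2 used by pool 7
ceph osd tree                               # how many OSD hosts the cluster actually has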
root@mon2:/# rados -p cephfs_data_ec ls
root@mon2:/# ceph pg map 7.0
osdmap e895 pg 7.0 (7.0) -> up [5,9,3,2147483647] acting [5,9,3,2147483647]
root@mon2:/# ^C
root@mon2:/# rados -p cephfs_data_ec ls
root@mon2:/#
No objects have been stored in the pool yet.
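The same thing can be seen from the per-pool statistics instead of listing objects (sketch):

ceph df detail   # the OBJECTS column should show 0 for cephfs_data_ec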
I added OSDs to the cluster by adding a new host (osd4).
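For reference, with cephadm/orchestrator the steps look roughly like this (a sketch; the exact commands were not captured in this session, and /dev/sdX is a placeholder device path):

ssh-copy-id -f -i /etc/ceph/ceph.pub root@osd4        # let cephadm reach the new host
ceph orch host add osd4 192.168.68.111 --labels osd   # register the host with an osd label
ceph orch device ls                                   # check which disks on osd4 are usable
ceph orch daemon add osd osd4:/dev/sdX                # create an OSD on a specific device
# or, if an all-available-devices spec is in use, OSDs are created automatically:
# ceph orch apply osd --all-available-devices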
root@mon2:/# ceph orch host ls
HOST ADDR LABELS STATUS
client1 192.168.68.122 clients
mds1 192.168.68.119 mds
mds2 192.168.68.118 mds
mds3 192.168.68.125 mds
mon2 192.168.68.116 monitor,_admin
mon3 192.168.68.112 monitor
osd1 192.168.68.127 osd
osd2 192.168.68.123 osd
osd3 192.168.68.121 osd
osd4 192.168.68.111 osd
10 hosts in cluster
osd.11, osd.12 and osd.13 were added, all on the new host osd4.
root@mon2:/# ceph orch ps --daemon_type osd
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
osd.0 osd2 running (9d) 8m ago 9d 361M 4096M 18.2.2 6dc5f0faebb2 66899c35de74
osd.1 osd3 running (9d) 4m ago 9d 312M 4096M 18.2.2 6dc5f0faebb2 fee07f50fef9
osd.2 osd1 running (15h) 8m ago 9d 205M 4096M 18.2.2 6dc5f0faebb2 c7471d8ed7c8
osd.3 osd2 running (9d) 8m ago 9d 405M 4096M 18.2.2 6dc5f0faebb2 492b1e9a0b47
osd.4 osd3 running (9d) 4m ago 9d 314M 4096M 18.2.2 6dc5f0faebb2 f306c6bbc039
osd.5 osd1 running (15h) 8m ago 9d 197M 4096M 18.2.2 6dc5f0faebb2 8f45b04ae751
osd.6 osd2 running (18h) 8m ago 9d 353M 4096M 18.2.2 6dc5f0faebb2 0be213ec5fa0
osd.7 osd3 running (9d) 4m ago 9d 297M 4096M 18.2.2 6dc5f0faebb2 3b3a879f1203
osd.8 osd1 running (15h) 8m ago 9d 207M 4096M 18.2.2 6dc5f0faebb2 018f7fe5840f
osd.9 osd3 running (9d) 4m ago 9d 408M 4096M 18.2.2 6dc5f0faebb2 ad0f5df41e64
osd.10 osd1 running (15h) 8m ago 9d 201M 4096M 18.2.2 6dc5f0faebb2 44e8d1eea7f2
osd.11 osd4 running (4m) 27s ago 4m 373M 4096M 18.2.2 6dc5f0faebb2 1c36f2be65d5
osd.12 osd4 running (4m) 27s ago 4m 363M 4096M 18.2.2 6dc5f0faebb2 a85f666f1310
osd.13 osd4 running (4m) 27s ago 4m 357M 4096M 18.2.2 6dc5f0faebb2 e66a78d74d4c
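What actually unblocks the EC placement is that the CRUSH map now contains a fourth OSD host. That can be checked with (sketch):

ceph osd tree   # osd.11-13 should now appear under a new host bucket "osd4"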
Logs captured while the OSDs were being added:
During the addition, the PG_DEGRADED entry for pool 7 was already gone, and instead PGs in pool 6 (cephfs_data) showed up as degraded/remapped while data rebalanced onto the new OSDs.
root@mon2:/# ceph health detail
HEALTH_WARN Reduced data availability: 1 pg peering; Degraded data redundancy: 9540/29934 objects degraded (31.870%), 91 pgs degraded
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg peering
pg 6.7 is stuck peering for 15h, current state peering, last acting [11,3,8]
[WRN] PG_DEGRADED: Degraded data redundancy: 9540/29934 objects degraded (31.870%), 91 pgs degraded
pg 6.0 is active+recovery_wait+undersized+degraded+remapped, acting [5,1]
pg 6.1 is active+recovery_wait+undersized+degraded+remapped, acting [1,10]
pg 6.2 is active+recovery_wait+undersized+degraded+remapped, acting [5,4]
pg 6.3 is active+recovery_wait+undersized+degraded+remapped, acting [8,1]
pg 6.4 is active+recovery_wait+undersized+degraded+remapped, acting [10,6]
Presumably because the remapping had finished, the WARN was gone:
root@mon2:/# ceph health detail
HEALTH_OK
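To keep an eye on the recovery/rebalance while it runs, something like the following works (sketch):

ceph -s          # overall cluster status, including recovery I/O
ceph progress    # progress of ongoing rebalance/recovery events
ceph pg stat     # one-line summary of PG states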
osd df before adding the OSDs:
root@mon2:/# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
2 hdd 0.02910 1.00000 30 GiB 9.2 GiB 9.2 GiB 6 KiB 54 MiB 21 GiB 30.99 0.66 49 up
5 hdd 0.02910 1.00000 30 GiB 10 GiB 10 GiB 11 KiB 44 MiB 20 GiB 33.62 0.72 49 up
8 hdd 0.02910 1.00000 30 GiB 11 GiB 11 GiB 4 KiB 49 MiB 19 GiB 36.42 0.78 49 up
10 hdd 0.02910 1.00000 30 GiB 8.7 GiB 8.7 GiB 9 KiB 42 MiB 21 GiB 29.30 0.63 46 up
0 hdd 0.01819 1.00000 19 GiB 12 GiB 12 GiB 0 B 136 MiB 6.4 GiB 65.59 1.40 57 up
3 hdd 0.01819 1.00000 19 GiB 14 GiB 13 GiB 0 B 144 MiB 5.1 GiB 72.72 1.55 84 up
6 hdd 0.01819 1.00000 19 GiB 13 GiB 13 GiB 5 KiB 102 MiB 5.4 GiB 71.17 1.52 61 up
1 hdd 0.01819 1.00000 19 GiB 8.6 GiB 8.5 GiB 0 B 109 MiB 10 GiB 46.32 0.99 45 up
4 hdd 0.01819 1.00000 19 GiB 10 GiB 10 GiB 0 B 113 MiB 8.4 GiB 55.04 1.17 42 up
7 hdd 0.01819 1.00000 19 GiB 10 GiB 10 GiB 0 B 114 MiB 8.6 GiB 54.07 1.15 58 up
9 hdd 0.01819 1.00000 19 GiB 10 GiB 10 GiB 0 B 122 MiB 8.5 GiB 54.52 1.16 55 up
TOTAL 250 GiB 117 GiB 116 GiB 38 KiB 1.0 GiB 133 GiB 46.86
MIN/MAX VAR: 0.63/1.55 STDDEV: 15.47
osd df after adding the OSDs:
root@mon2:/# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
2 hdd 0.02910 1.00000 30 GiB 9.2 GiB 9.1 GiB 6 KiB 77 MiB 21 GiB 30.89 0.88 51 up
5 hdd 0.02910 1.00000 30 GiB 8.9 GiB 8.8 GiB 11 KiB 83 MiB 21 GiB 29.81 0.85 40 up
8 hdd 0.02910 1.00000 30 GiB 9.3 GiB 9.3 GiB 4 KiB 88 MiB 20 GiB 31.36 0.89 44 up
10 hdd 0.02910 1.00000 30 GiB 9.2 GiB 9.1 GiB 9 KiB 85 MiB 21 GiB 30.74 0.87 46 up
0 hdd 0.01819 1.00000 19 GiB 6.9 GiB 6.8 GiB 0 B 140 MiB 12 GiB 37.08 1.05 36 up
3 hdd 0.01819 1.00000 19 GiB 8.0 GiB 7.8 GiB 0 B 179 MiB 11 GiB 42.92 1.22 49 up
6 hdd 0.01819 1.00000 19 GiB 7.3 GiB 7.2 GiB 5 KiB 107 MiB 11 GiB 39.02 1.11 38 up
1 hdd 0.01819 1.00000 19 GiB 6.8 GiB 6.7 GiB 0 B 114 MiB 12 GiB 36.51 1.04 37 up
4 hdd 0.01819 1.00000 19 GiB 7.8 GiB 7.7 GiB 0 B 118 MiB 11 GiB 41.86 1.19 39 up
7 hdd 0.01819 1.00000 19 GiB 6.9 GiB 6.8 GiB 0 B 136 MiB 12 GiB 36.97 1.05 41 up
9 hdd 0.01819 1.00000 19 GiB 7.2 GiB 7.0 GiB 0 B 144 MiB 11 GiB 38.39 1.09 35 up
11 hdd 0.02730 1.00000 28 GiB 9.9 GiB 9.8 GiB 0 B 97 MiB 18 GiB 35.44 1.01 46 up
12 hdd 0.02730 1.00000 28 GiB 10 GiB 10 GiB 0 B 79 MiB 18 GiB 36.26 1.03 61 up
13 hdd 0.02730 1.00000 28 GiB 10 GiB 10 GiB 0 B 95 MiB 18 GiB 36.15 1.03 48 up
TOTAL 333 GiB 118 GiB 116 GiB 38 KiB 1.5 GiB 216 GiB 35.25
MIN/MAX VAR: 0.85/1.22 STDDEV: 3.97
Usage has been evened out onto the newly added OSDs.
The PG that previously had no OSD to fill its fourth slot is now using one of the new OSDs as well:
root@mon2:/# ceph tell 7.0 query
{
"snap_trimq": "[]",
"snap_trimq_len": 0,
"state": "active+clean",
"epoch": 1028,
"up": [
5,
9,
3,
12
],
"acting": [
5,
9,
3,
12
],
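A couple of other ways to confirm the same thing (sketch): the PG map should now show [5,9,3,12] with no 2147483647 placeholder, and grouping osd df by host makes the new host's share easy to see.

ceph pg map 7.0                    # up/acting should no longer contain 2147483647
ceph pg ls-by-pool cephfs_data_ec  # all PGs of pool 7 should be active+clean
ceph osd df tree                   # usage broken down per host, including osd4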
With that, the PG_DEGRADED warning that prompted this whole exercise is resolved.