gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Not self healing hanging in "Possibly undergoing heal" #4296

Closed hellofesjoe closed 4 months ago

hellofesjoe commented 4 months ago

Description of problem: We have Proxmox running with 2 servers that run virtual machines and a third server that serves as the third leg of our replicated Gluster FS. After I updated the switch connected to 2 of the servers, they both restarted, and now GlusterFS is no longer healing the files. It has been in this state for a few hours:

Brick 192.168.40.11:/data/proxmox/vm_data
/images/100/vm-100-disk-0.qcow2 - Possibly undergoing heal
/images/113/vm-113-disk-0.qcow2 - Possibly undergoing heal
/images/104/vm-104-disk-0.qcow2
/images/102/vm-102-disk-0.qcow2 - Possibly undergoing heal
/images/101/vm-101-disk-0.qcow2
/images/111/vm-111-disk-0.qcow2
/images/110/vm-110-disk-0.qcow2 - Possibly undergoing heal
/images/106/vm-106-disk-0.qcow2
Status: Connected
Number of entries: 9

Brick vm-host-01.fes:/data/proxmox/vm_data
/images/100/vm-100-disk-0.qcow2 - Possibly undergoing heal
- Possibly undergoing heal
- Possibly undergoing heal
/images/110/vm-110-disk-0.qcow2 - Possibly undergoing heal
Status: Connected
Number of entries: 9

Brick vm-data-01.fes:/data/proxmox/vm_data
Status: Connected
Number of entries: 0

I tried following the self-heal troubleshooting documentation and looked into /var/log/glusterfs/glustershd.log with log level DEBUG, and this is the output, which I do not understand:

[2024-01-04 15:39:13.005632 +0000] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-gv_vmdata-replicate-0: option data-self-heal using set value on
[2024-01-04 15:39:13.005657 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option entry-self-heal using set value on
[2024-01-04 15:39:13.005779 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option granular-entry-heal using set value on
[2024-01-04 15:39:13.005875 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option self-heal-daemon using set value enable
[2024-01-04 15:39:13.005901 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option iam-self-heal-daemon using set value yes
[2024-01-04 15:39:13.006042 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option use-anonymous-inode using set value yes
[2024-01-04 15:39:13.006042 +0000] D [dict.c:2830:dict_get_str_boolean] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/protocol/client.so(+0x5addd) [0x7fdd54b97ddd] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/cluster/replicate.so(+0x38dd2) [0x7fdd54ad8dd2] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get_str_boolean+0xcf) [0x7fdd5a3561ff] ) 0-dict: key buf-has-zeroes, integer type asked, has unsigned integer type [Invalid argument]
[2024-01-04 15:39:11.412604 +0000] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-gv_vmdata: option log-level using set value INFO
[2024-01-04 15:39:13.004257 +0000] I [MSGID: 101221] [common-utils.c:3851:gf_set_volfile_server_common] 0-gluster: duplicate entry for volfile-server [{errno=17}, {error=File exists}]
[2024-01-04 15:39:13.006068 +0000] D [MSGID: 0] [options.c:1114:xlator_reconfigure_rec] 0-gv_vmdata-replicate-0: reconfigured
[2024-01-04 15:39:37.790284 +0000] I [io-stats.c:3706:ios_sample_buf_size_configure] 0-gv_vmdata: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
[2024-01-04 15:39:37.790369 +0000] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-gv_vmdata: option log-level using set value INFO
[2024-01-04 15:39:37.790435 +0000] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-gv_vmdata: option threads using set value 16
[2024-01-04 15:39:37.792371 +0000] I [glusterfsd-mgmt.c:2170:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: vm-data-01.fes:24007 192.168.40.11:24007
[2024-01-04 15:39:37.792444 +0000] I [MSGID: 101221] [common-utils.c:3851:gf_set_volfile_server_common] 0-gluster: duplicate entry for volfile-server [{errno=17}, {error=File exists}]
[2024-01-04 15:39:37.792527 +0000] I [MSGID: 100040] [glusterfsd-mgmt.c:108:mgmt_process_volfile] 0-glusterfs: No change in volfile, countinuing []
[2024-01-04 15:39:38.050243 +0000] I [glusterfsd-mgmt.c:77:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2024-01-04 15:39:38.063828 +0000] I [glusterfsd-mgmt.c:77:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2024-01-04 15:39:38.074118 +0000] I [glusterfsd-mgmt.c:2170:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: vm-data-01.fes:24007 192.168.40.11:24007
[2024-01-04 15:39:38.075177 +0000] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-gv_vmdata-client-3: option ping-timeout using set value 42
[2024-01-04 15:39:38.075336 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-client-3: option strict-locks using set value off
[2024-01-04 15:39:38.075400 +0000] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-gv_vmdata-client-4: option ping-timeout using set value 42
[2024-01-04 15:39:38.075479 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-client-4: option strict-locks using set value off
[2024-01-04 15:39:38.075525 +0000] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-gv_vmdata-client-5: option ping-timeout using set value 42
[2024-01-04 15:39:38.075602 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-client-5: option strict-locks using set value off
[2024-01-04 15:39:38.075682 +0000] I [MSGID: 0] [options.c:1239:xlator_option_reconf_uint32] 0-gv_vmdata-replicate-0: option background-self-heal-count using set value 0
[2024-01-04 15:39:38.075737 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option metadata-self-heal using set value on
[2024-01-04 15:39:38.075779 +0000] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-gv_vmdata-replicate-0: option data-self-heal using set value on
[2024-01-04 15:39:38.075823 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option entry-self-heal using set value on
[2024-01-04 15:39:38.076004 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option granular-entry-heal using set value on
[2024-01-04 15:39:38.076104 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option self-heal-daemon using set value enable
[2024-01-04 15:39:38.076130 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option iam-self-heal-daemon using set value yes
[2024-01-04 15:39:38.076259 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option use-anonymous-inode using set value yes
[2024-01-04 15:39:38.076320 +0000] I [io-stats.c:3706:ios_sample_buf_size_configure] 0-gv_vmdata: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
[2024-01-04 15:39:38.076367 +0000] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-gv_vmdata: option log-level using set value DEBUG
[2024-01-04 15:39:38.076427 +0000] D [io-stats.c:3811:reconfigure] 0-gv_vmdata: reconfigure returning 0
[2024-01-04 15:39:38.076681 +0000] D [logging.c:1858:_gf_msg_internal] 0-logging-infra: Buffer overflow of a buffer whose size limit is 5. About to flush least recently used log message to disk
[2024-01-04 15:39:38.076424 +0000] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-gv_vmdata: option threads using set value 16
[2024-01-04 15:39:38.076676 +0000] D [MSGID: 0] [afr-self-heal-data.c:258:afr_selfheal_data_block] 0-gv_vmdata-replicate-0: gfid:acf6d110-db73-4609-a5e5-960ea34424c2, offset=73502162944, size=131072
[2024-01-04 15:39:38.077337 +0000] I [glusterfsd-mgmt.c:2170:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: vm-data-01.fes:24007 192.168.40.11:24007
[2024-01-04 15:39:38.080387 +0000] D [dict.c:2830:dict_get_str_boolean] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/protocol/client.so(+0x5addd) [0x7fdd54b97ddd] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/cluster/replicate.so(+0x38dd2) [0x7fdd54ad8dd2] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get_str_boolean+0xcf) [0x7fdd5a3561ff] ) 0-dict: key buf-has-zeroes, integer type asked, has unsigned integer type [Invalid argument]
[2024-01-04 15:39:38.081023 +0000] D [dict.c:2830:dict_get_str_boolean] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/protocol/client.so(+0x5addd) [0x7fdd54b97ddd] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/cluster/replicate.so(+0x38dd2) [0x7fdd54ad8dd2] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get_str_boolean+0xcf) [0x7fdd5a3561ff] ) 0-dict: key buf-has-zeroes, integer type asked, has unsigned integer type [Invalid argument]
[2024-01-04 15:39:38.082406 +0000] D [logging.c:1858:_gf_msg_internal] 0-logging-infra: Buffer overflow of a buffer whose size limit is 5. About to flush least recently used log message to disk
[2024-01-04 15:39:38.077455 +0000] I [MSGID: 100040] [glusterfsd-mgmt.c:108:mgmt_process_volfile] 0-glusterfs: No change in volfile, countinuing []
[2024-01-04 15:39:38.082402 +0000] D [MSGID: 0] [afr-self-heal-data.c:258:afr_selfheal_data_block] 0-gv_vmdata-replicate-0: gfid:acf6d110-db73-4609-a5e5-960ea34424c2, offset=73502294016, size=131072
[2024-01-04 15:39:38.085188 +0000] D [dict.c:2830:dict_get_str_boolean] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/protocol/client.so(+0x5addd) [0x7fdd54b97ddd] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/cluster/replicate.so(+0x38dd2) [0x7fdd54ad8dd2] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get_str_boolean+0xcf) [0x7fdd5a3561ff] ) 0-dict: key buf-has-zeroes, integer type asked, has unsigned integer type [Invalid argument]
[2024-01-04 15:39:38.086849 +0000] D [dict.c:2830:dict_get_str_boolean] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/protocol/client.so(+0x5addd) [0x7fdd54b97ddd] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/cluster/replicate.so(+0x38dd2) [0x7fdd54ad8dd2] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get_str_boolean+0xcf) [0x7fdd5a3561ff] ) 0-dict: key buf-has-zeroes, integer type asked, has unsigned integer type [Invalid argument]
[2024-01-04 15:39:38.087378 +0000] D [MSGID: 0] [afr-self-heal-data.c:258:afr_selfheal_data_block] 0-gv_vmdata-replicate-0: gfid:acf6d110-db73-4609-a5e5-960ea34424c2, offset=73502425088, size=131072
[2024-01-04 15:39:38.090374 +0000] D [dict.c:2830:dict_get_str_boolean] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/protocol/client.so(+0x5addd) [0x7fdd54b97ddd] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/cluster/replicate.so(+0x38dd2) [0x7fdd54ad8dd2] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get_str_boolean+0xcf) [0x7fdd5a3561ff] ) 0-dict: key buf-has-zeroes, integer type asked, has unsigned integer type [Invalid argument]
[2024-01-04 15:39:38.091218 +0000] D [dict.c:2830:dict_get_str_boolean] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/protocol/client.so(+0x5addd) [0x7fdd54b97ddd] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/cluster/replicate.so(+0x38dd2) [0x7fdd54ad8dd2] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get_str_boolean+0xcf) [0x7fdd5a3561ff] ) 0-dict: key buf-has-zeroes, integer type asked, has unsigned integer type [Invalid argument]

This last bit is repeated forever:

[2024-01-04 15:39:38.091218 +0000] D [dict.c:2830:dict_get_str_boolean] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/protocol/client.so(+0x5addd) [0x7fdd54b97ddd] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/cluster/replicate.so(+0x38dd2) [0x7fdd54ad8dd2] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get_str_boolean+0xcf) [0x7fdd5a3561ff] ) 0-dict: key buf-has-zeroes, integer type asked, has unsigned integer type [Invalid argument]

How do I fix this problem? The files are no longer synced (since they are undergoing healing) and some of them seem broken.
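For anyone triaging similar output: the heal info listing above can be summarized per brick with a small script. This is only an illustrative sketch, not a Gluster tool; it assumes the plain-text layout of `gluster volume heal <vol> info` shown above, and the sample data is a shortened version of my output.

```python
def summarize_heal_info(text: str) -> dict:
    """Map each brick to (reported entry count, entries marked 'Possibly undergoing heal')."""
    summary = {}
    brick, healing = None, 0
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Brick "):
            brick = line.removeprefix("Brick ")
            healing = 0
        elif line.endswith("- Possibly undergoing heal"):
            healing += 1
        elif line.startswith("Number of entries:"):
            reported = int(line.split(":")[1])
            summary[brick] = (reported, healing)
    return summary

# Shortened sample in the same format as the heal info output above.
sample = """\
Brick 192.168.40.11:/data/proxmox/vm_data
/images/100/vm-100-disk-0.qcow2 - Possibly undergoing heal
/images/104/vm-104-disk-0.qcow2
Status: Connected
Number of entries: 2
Brick vm-data-01.fes:/data/proxmox/vm_data
Status: Connected
Number of entries: 0
"""

print(summarize_heal_info(sample))
# {'192.168.40.11:/data/proxmox/vm_data': (2, 1), 'vm-data-01.fes:/data/proxmox/vm_data': (0, 0)}
```

Watching whether the per-brick counts shrink over time is a quick way to tell a slow-but-progressing heal from a genuinely stuck one.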

Mandatory info:

- The output of the gluster volume info command:

Volume Name: gv_vmdata
Type: Replicate
Volume ID: 7085d24d-c7cd-4fc0-9013-10056de09057
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.40.11:/data/proxmox/vm_data
Brick2: vm-host-01.fes:/data/proxmox/vm_data
Brick3: vm-data-01.fes:/data/proxmox/vm_data
Options Reconfigured:
diagnostics.client-log-level: WARNING
cluster.self-heal-daemon: enable
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

- The output of the gluster volume status command:

Status of volume: gv_vmdata
Gluster process                              TCP Port  RDMA Port  Online  Pid
Brick 192.168.40.11:/data/proxmox/vm_data    49152     0          Y       1487
Brick vm-host-01.fes:/data/proxmox/vm_data   49152     0          Y       1315
Brick vm-data-01.fes:/data/proxmox/vm_data   49152     0          Y       1028
Self-heal Daemon on localhost                N/A       N/A        Y       43836
Self-heal Daemon on vm-data-01.fes           N/A       N/A        Y       5797
Self-heal Daemon on 192.168.40.11            N/A       N/A        Y       21357

Task Status of Volume gv_vmdata

There are no active volume tasks

- The output of the gluster volume heal command:

Brick 192.168.40.11:/data/proxmox/vm_data
/images/100/vm-100-disk-0.qcow2 - Possibly undergoing heal
/images/113/vm-113-disk-0.qcow2 - Possibly undergoing heal
/images/104/vm-104-disk-0.qcow2
/images/102/vm-102-disk-0.qcow2 - Possibly undergoing heal
/images/101/vm-101-disk-0.qcow2
/images/111/vm-111-disk-0.qcow2
/images/110/vm-110-disk-0.qcow2 - Possibly undergoing heal
/images/106/vm-106-disk-0.qcow2
Status: Connected
Number of entries: 9

Brick vm-host-01.fes:/data/proxmox/vm_data
/images/100/vm-100-disk-0.qcow2 - Possibly undergoing heal
- Possibly undergoing heal
- Possibly undergoing heal
/images/110/vm-110-disk-0.qcow2 - Possibly undergoing heal
Status: Connected
Number of entries: 9

Brick vm-data-01.fes:/data/proxmox/vm_data
Status: Connected
Number of entries: 0

- The operating system / glusterfs version:

gluster --version
glusterfs 9.2


hellofesjoe commented 4 months ago

After around 18 hours it healed itself. I guess it just took a very long time.
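For anyone hitting the same symptom: the DEBUG log above actually hints at why it takes so long. The `afr_selfheal_data_block` entries show the same gfid being healed in 131072-byte blocks, one block roughly every 5-6 ms (offsets 73502162944 → 73502294016 → 73502425088 between 15:39:38.076676 and 15:39:38.087378). A back-of-the-envelope check, assuming the healer keeps copying sequentially at that observed rate (the 200 GiB image size below is hypothetical, for illustration):

```python
# Rough heal-time estimate from the afr_selfheal_data_block DEBUG entries above.
# Assumption: data self-heal copies 128 KiB blocks at the observed rate.
block = 131072                 # size= field in the log entries
t0, t1 = 38.076676, 38.082402  # timestamps (seconds within 15:39) of two consecutive blocks
rate = block / (t1 - t0)       # observed bytes per second, roughly 23 MB/s
disk = 200 * 1024**3           # hypothetical 200 GiB qcow2 image
hours = disk / rate / 3600
print(f"~{rate / 1e6:.0f} MB/s -> ~{hours:.1f} h for a 200 GiB image")
```

At roughly 23 MB/s, healing several multi-hundred-GiB disk images back to back can plausibly add up to the ~18 hours observed, so a slow heal rather than a hang is consistent with the logs.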