Description of problem:
We run Proxmox on two servers that host virtual machines, plus a third server that serves as the third leg of our replicated GlusterFS volume. After I updated the switch connected to two of the servers, both of them restarted, and GlusterFS is no longer healing the files; it has been stuck in this state for a few hours:
Brick 192.168.40.11:/data/proxmox/vm_data
/images/111/vm-111-disk-0.qcow2
/images/110/vm-110-disk-0.qcow2 - Possibly undergoing heal
/images/106/vm-106-disk-0.qcow2
Status: Connected
Number of entries: 9
Brick vm-host-01.fes:/data/proxmox/vm_data
/images/100/vm-100-disk-0.qcow2 - Possibly undergoing heal
- Possibly undergoing heal
- Possibly undergoing heal
/images/110/vm-110-disk-0.qcow2 - Possibly undergoing heal
Status: Connected
Number of entries: 9
Brick vm-data-01.fes:/data/proxmox/vm_data
Status: Connected
Number of entries: 0
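As an aside, here is a small helper I used to tally that heal info output per brick (nothing gluster-specific, just awk over the text format shown above; the `heal_summary` name is mine):

```shell
# Summarize `gluster volume heal <vol> info` output read from stdin:
# print each brick followed by the number of entries listed under it.
# A sketch assuming the output format pasted above.
heal_summary() {
    awk '
        /^Brick /      { if (brick != "") print brick ": " n; brick = $2; n = 0; next }
        /^Status:/     { next }
        /^Number /     { next }
        /^\// || /^- / { n++ }
        END            { if (brick != "") print brick ": " n }
    '
}

# Usage: gluster volume heal gv_vmdata info | heal_summary
```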
I followed the self-heal troubleshooting guide and looked at /var/log/glusterfs/glustershd.log with the log level set to DEBUG. This is the output, which I do not understand:
[2024-01-04 15:39:13.005632 +0000] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-gv_vmdata-replicate-0: option data-self-heal using set value on
[2024-01-04 15:39:13.005657 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option entry-self-heal using set value on
[2024-01-04 15:39:13.005779 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option granular-entry-heal using set value on
[2024-01-04 15:39:13.005875 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option self-heal-daemon using set value enable
[2024-01-04 15:39:13.005901 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option iam-self-heal-daemon using set value yes
[2024-01-04 15:39:13.006042 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option use-anonymous-inode using set value yes
[2024-01-04 15:39:13.006042 +0000] D [dict.c:2830:dict_get_str_boolean] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/protocol/client.so(+0x5addd) [0x7fdd54b97ddd] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/cluster/replicate.so(+0x38dd2) [0x7fdd54ad8dd2] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get_str_boolean+0xcf) [0x7fdd5a3561ff] ) 0-dict: key buf-has-zeroes, integer type asked, has unsigned integer type [Invalid argument]
[2024-01-04 15:39:11.412604 +0000] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-gv_vmdata: option log-level using set value INFO
[2024-01-04 15:39:13.004257 +0000] I [MSGID: 101221] [common-utils.c:3851:gf_set_volfile_server_common] 0-gluster: duplicate entry for volfile-server [{errno=17}, {error=File exists}]
[2024-01-04 15:39:13.006068 +0000] D [MSGID: 0] [options.c:1114:xlator_reconfigure_rec] 0-gv_vmdata-replicate-0: reconfigured
[2024-01-04 15:39:37.790284 +0000] I [io-stats.c:3706:ios_sample_buf_size_configure] 0-gv_vmdata: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
[2024-01-04 15:39:37.790369 +0000] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-gv_vmdata: option log-level using set value INFO
[2024-01-04 15:39:37.790435 +0000] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-gv_vmdata: option threads using set value 16
[2024-01-04 15:39:37.792371 +0000] I [glusterfsd-mgmt.c:2170:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: vm-data-01.fes:24007 192.168.40.11:24007
[2024-01-04 15:39:37.792444 +0000] I [MSGID: 101221] [common-utils.c:3851:gf_set_volfile_server_common] 0-gluster: duplicate entry for volfile-server [{errno=17}, {error=File exists}]
[2024-01-04 15:39:37.792527 +0000] I [MSGID: 100040] [glusterfsd-mgmt.c:108:mgmt_process_volfile] 0-glusterfs: No change in volfile, countinuing []
[2024-01-04 15:39:38.050243 +0000] I [glusterfsd-mgmt.c:77:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2024-01-04 15:39:38.063828 +0000] I [glusterfsd-mgmt.c:77:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2024-01-04 15:39:38.074118 +0000] I [glusterfsd-mgmt.c:2170:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: vm-data-01.fes:24007 192.168.40.11:24007
[2024-01-04 15:39:38.075177 +0000] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-gv_vmdata-client-3: option ping-timeout using set value 42
[2024-01-04 15:39:38.075336 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-client-3: option strict-locks using set value off
[2024-01-04 15:39:38.075400 +0000] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-gv_vmdata-client-4: option ping-timeout using set value 42
[2024-01-04 15:39:38.075479 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-client-4: option strict-locks using set value off
[2024-01-04 15:39:38.075525 +0000] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-gv_vmdata-client-5: option ping-timeout using set value 42
[2024-01-04 15:39:38.075602 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-client-5: option strict-locks using set value off
[2024-01-04 15:39:38.075682 +0000] I [MSGID: 0] [options.c:1239:xlator_option_reconf_uint32] 0-gv_vmdata-replicate-0: option background-self-heal-count using set value 0
[2024-01-04 15:39:38.075737 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option metadata-self-heal using set value on
[2024-01-04 15:39:38.075779 +0000] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-gv_vmdata-replicate-0: option data-self-heal using set value on
[2024-01-04 15:39:38.075823 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option entry-self-heal using set value on
[2024-01-04 15:39:38.076004 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option granular-entry-heal using set value on
[2024-01-04 15:39:38.076104 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option self-heal-daemon using set value enable
[2024-01-04 15:39:38.076130 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option iam-self-heal-daemon using set value yes
[2024-01-04 15:39:38.076259 +0000] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-gv_vmdata-replicate-0: option use-anonymous-inode using set value yes
[2024-01-04 15:39:38.076320 +0000] I [io-stats.c:3706:ios_sample_buf_size_configure] 0-gv_vmdata: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
[2024-01-04 15:39:38.076367 +0000] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-gv_vmdata: option log-level using set value DEBUG
[2024-01-04 15:39:38.076427 +0000] D [io-stats.c:3811:reconfigure] 0-gv_vmdata: reconfigure returning 0
[2024-01-04 15:39:38.076681 +0000] D [logging.c:1858:_gf_msg_internal] 0-logging-infra: Buffer overflow of a buffer whose size limit is 5. About to flush least recently used log message to disk
[2024-01-04 15:39:38.076424 +0000] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-gv_vmdata: option threads using set value 16
[2024-01-04 15:39:38.076676 +0000] D [MSGID: 0] [afr-self-heal-data.c:258:afr_selfheal_data_block] 0-gv_vmdata-replicate-0: gfid:acf6d110-db73-4609-a5e5-960ea34424c2, offset=73502162944, size=131072
[2024-01-04 15:39:38.077337 +0000] I [glusterfsd-mgmt.c:2170:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: vm-data-01.fes:24007 192.168.40.11:24007
[2024-01-04 15:39:38.080387 +0000] D [dict.c:2830:dict_get_str_boolean] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/protocol/client.so(+0x5addd) [0x7fdd54b97ddd] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/cluster/replicate.so(+0x38dd2) [0x7fdd54ad8dd2] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get_str_boolean+0xcf) [0x7fdd5a3561ff] ) 0-dict: key buf-has-zeroes, integer type asked, has unsigned integer type [Invalid argument]
[2024-01-04 15:39:38.081023 +0000] D [dict.c:2830:dict_get_str_boolean] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/protocol/client.so(+0x5addd) [0x7fdd54b97ddd] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/cluster/replicate.so(+0x38dd2) [0x7fdd54ad8dd2] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get_str_boolean+0xcf) [0x7fdd5a3561ff] ) 0-dict: key buf-has-zeroes, integer type asked, has unsigned integer type [Invalid argument]
[2024-01-04 15:39:38.082406 +0000] D [logging.c:1858:_gf_msg_internal] 0-logging-infra: Buffer overflow of a buffer whose size limit is 5. About to flush least recently used log message to disk
[2024-01-04 15:39:38.077455 +0000] I [MSGID: 100040] [glusterfsd-mgmt.c:108:mgmt_process_volfile] 0-glusterfs: No change in volfile, countinuing []
[2024-01-04 15:39:38.082402 +0000] D [MSGID: 0] [afr-self-heal-data.c:258:afr_selfheal_data_block] 0-gv_vmdata-replicate-0: gfid:acf6d110-db73-4609-a5e5-960ea34424c2, offset=73502294016, size=131072
[2024-01-04 15:39:38.085188 +0000] D [dict.c:2830:dict_get_str_boolean] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/protocol/client.so(+0x5addd) [0x7fdd54b97ddd] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/cluster/replicate.so(+0x38dd2) [0x7fdd54ad8dd2] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get_str_boolean+0xcf) [0x7fdd5a3561ff] ) 0-dict: key buf-has-zeroes, integer type asked, has unsigned integer type [Invalid argument]
[2024-01-04 15:39:38.086849 +0000] D [dict.c:2830:dict_get_str_boolean] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/protocol/client.so(+0x5addd) [0x7fdd54b97ddd] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/cluster/replicate.so(+0x38dd2) [0x7fdd54ad8dd2] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get_str_boolean+0xcf) [0x7fdd5a3561ff] ) 0-dict: key buf-has-zeroes, integer type asked, has unsigned integer type [Invalid argument]
[2024-01-04 15:39:38.087378 +0000] D [MSGID: 0] [afr-self-heal-data.c:258:afr_selfheal_data_block] 0-gv_vmdata-replicate-0: gfid:acf6d110-db73-4609-a5e5-960ea34424c2, offset=73502425088, size=131072
[2024-01-04 15:39:38.090374 +0000] D [dict.c:2830:dict_get_str_boolean] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/protocol/client.so(+0x5addd) [0x7fdd54b97ddd] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/cluster/replicate.so(+0x38dd2) [0x7fdd54ad8dd2] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get_str_boolean+0xcf) [0x7fdd5a3561ff] ) 0-dict: key buf-has-zeroes, integer type asked, has unsigned integer type [Invalid argument]
[2024-01-04 15:39:38.091218 +0000] D [dict.c:2830:dict_get_str_boolean] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/protocol/client.so(+0x5addd) [0x7fdd54b97ddd] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/cluster/replicate.so(+0x38dd2) [0x7fdd54ad8dd2] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get_str_boolean+0xcf) [0x7fdd5a3561ff] ) 0-dict: key buf-has-zeroes, integer type asked, has unsigned integer type [Invalid argument]
This last line then repeats forever:
[2024-01-04 15:39:38.091218 +0000] D [dict.c:2830:dict_get_str_boolean] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/protocol/client.so(+0x5addd) [0x7fdd54b97ddd] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/cluster/replicate.so(+0x38dd2) [0x7fdd54ad8dd2] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get_str_boolean+0xcf) [0x7fdd5a3561ff] ) 0-dict: key buf-has-zeroes, integer type asked, has unsigned integer type [Invalid argument]
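For what it's worth, the DEBUG lines from afr_selfheal_data_block above carry the healed offset, so I tried pulling those out to see whether the heal is actually advancing through the file (a rough sketch; the log path, the log format, and the `heal_offsets` name are just what I assumed from the output above):

```shell
# Extract the offset= values from afr_selfheal_data_block DEBUG lines.
# If the numbers keep growing over time, the data self-heal is still
# making progress on that gfid.
heal_offsets() {
    awk -F'offset=' '/afr_selfheal_data_block/ { split($2, a, ","); print a[1] }' "$1"
}

# Example: heal_offsets /var/log/glusterfs/glustershd.log | tail -n 5
log=/var/log/glusterfs/glustershd.log
[ -r "$log" ] && heal_offsets "$log" | tail -n 5
```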
How do I fix this problem? The files are no longer in sync (they are reported as possibly undergoing heal), and some of them appear to be corrupted.
Mandatory info:
- The output of the gluster volume info command:
Volume Name: gv_vmdata
Type: Replicate
Volume ID: 7085d24d-c7cd-4fc0-9013-10056de09057
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.40.11:/data/proxmox/vm_data
Brick2: vm-host-01.fes:/data/proxmox/vm_data
Brick3: vm-data-01.fes:/data/proxmox/vm_data
Options Reconfigured:
diagnostics.client-log-level: WARNING
cluster.self-heal-daemon: enable
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
- The output of the gluster volume status command:
Status of volume: gv_vmdata
Gluster process TCP Port RDMA Port Online Pid
Brick 192.168.40.11:/data/proxmox/vm_data 49152 0 Y 1487
Brick vm-host-01.fes:/data/proxmox/vm_data 49152 0 Y 1315
Brick vm-data-01.fes:/data/proxmox/vm_data 49152 0 Y 1028
Self-heal Daemon on localhost N/A N/A Y 43836
Self-heal Daemon on vm-data-01.fes N/A N/A Y 5797
Self-heal Daemon on 192.168.40.11 N/A N/A Y 21357
- The output of the gluster volume heal command: see the heal info output at the top of this report.
- The operating system / glusterfs version: GlusterFS 9.2 (judging by the /usr/lib/x86_64-linux-gnu/glusterfs/9.2/ xlator paths in the log) on Proxmox VE (Debian-based).