dalwise opened 2 years ago
I've done a quick check of the self-heal logs, and it seems there are no issues. Healing seems to be progressing. There have been around 8 million files healed since 12th of December.
How many files are there in the volume ? is it possible that there are more than 40 million files with many directories ?
Can you check if the number of files in host_name20 is increasing ? if so, it means that self-heal is doing its job, but it's running too slowly. If that's the case, probably you could use the recommendations from issue #3025. You should disable least priority:
# gluster volume set <volname> performance.enable-least-priority no
And you can also start additional self-heal processes as specified in that issue.
IMPORTANT: Issue #3025 is for Gluster 9.4. It should also work in your version, but I recommend that you don't try this in production without having done tests in a lab environment.
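For reference, the extra self-heal processes in that issue are plain glustershd-style glusterfs processes started by hand. A rough sketch of such an invocation, with the volume name, pid/log file names and node UUID as placeholders that must match your own node:
# one extra self-heal daemon per invocation; the pattern follows the node's existing glustershd process (see #3025)
/usr/sbin/glusterfs -s localhost --volfile-id shd/<volname> \
    -p /var/run/gluster/shd/<volname>/<volname>-extra1.pid \
    -l /var/log/glusterfs/glustershd-extra1.log \
    --xlator-option '*replicate*.node-uuid=<node-uuid>' \
    --process-name glustershd --client-pid=-6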
Thanks for your help, Xavi.
How many files are there in the volume ? is it possible that there are more than 40 million files with many directories ?
Yes, from the brick directory on each host:
[root@host_name20 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
46536502
[root@host_name20 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
27520629
[root@host_name21 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
113235113
[root@host_name21 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
30124557
[root@host_name22 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
113281980
[root@host_name22 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
30169389
I am not sure what would explain the discrepancy between host_name21 and host_name22, since there were no files in need of healing on them when host_name20 got removed and nothing outside of gluster should currently be writing to the volume.
Can you check if the number of files in host_name20 is increasing ?
The number of files shown using the command above is slowly increasing, but after two weeks it is still less than half what the other two servers have.
One thing I have noticed and was not expecting is that the current brick size on host_name20 (the arbiter) is already larger than it was on the backup of the brick that we made before wiping it:
[root@host_name20 .brick]# du -sh /shared/dot_brick_backup/
4.0G /shared/dot_brick_backup/
[root@host_name20 .brick]# du -sh /shared/.brick/
5.0G /shared/.brick/
I'll try the suggestions from #3025. I need the data on the volume, but this stack of servers has been taken out of production until we recover it. So I'm thinking of first backing up the data and then trying out the improvements from #3025 on the affected stack itself. Worst case scenario I have the backed up data and can copy it back to a fresh volume. Would it be OK to back up the data directly from the host_name21 or host_name22 brick? I assume this would be faster than backing up from the mounted volume and there is nothing currently being written on the volume.
One thing that I am uncertain about when following the guidance in #3025 is what server I should be monitoring for IO stats: host_name20 is the one that re-joined the volume and is in need of repair, so I'm not quite sure why heal count shows no files needing healing on it and instead shows them on the remaining two servers.
Thanks for your help, Xavi.
How many files are there in the volume ? is it possible that there are more than 40 million files with many directories ?
Yes, from the brick directory on each host:
[root@host_name20 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
46536502
[root@host_name20 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
27520629
[root@host_name21 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
113235113
[root@host_name21 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
30124557
[root@host_name22 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
113281980
[root@host_name22 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
30169389
I am not sure what would explain the discrepancy between host_name21 and host_name22, since there were no files in need of healing on them when host_name20 got removed and nothing outside of gluster should currently be writing to the volume.
If there's new data being added to the volume, it could be explained by the different times the find command was run. Otherwise it's something we will need to check once self-heal completes.
Can you check if the number of files in host_name20 is increasing ?
The number of files shown using the command above is slowly increasing, but after two weeks it is still less than half what the other two servers have.
Gluster needs a significant amount of metadata and locking operations per entry. Given the number of files and directories you have, it's likely that most of the time is spent doing those operations. I think parallelizing the self-heal should help here.
One thing I have noticed and was not expecting is that the current brick size on host_name20 (the arbiter) is already larger than it was on the backup of the brick that we made before wiping it:
[root@host_name20 .brick]# du -sh /shared/dot_brick_backup/
4.0G /shared/dot_brick_backup/
[root@host_name20 .brick]# du -sh /shared/.brick/
5.0G /shared/.brick/
Are you using XFS as the brick filesystem ? we have seen that XFS does preallocation of disk blocks in some cases to optimize future writes. Since currently self-heal is creating a lot of new entries, this extra reservation of blocks could be causing the overhead. If that's the case, it should fix itself over time once the volume is healed.
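If you want a rough check of whether preallocation explains the extra space, you can compare the allocated size with the apparent size on the brick (only an approximation; the path is your brick path):
du -sh /shared/.brick                    # allocated blocks (includes any speculative preallocation)
du -sh --apparent-size /shared/.brick    # logical file sizes only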
I'll try the suggestions from #3025. I need the data on the volume, but this stack of servers has been taken out of production until we recover it. So I'm thinking of first backing up the data and then trying out the improvements from #3025 on the affected stack itself. Worst case scenario I have the backed up data and can copy it back to a fresh volume. Would it be OK to back up the data directly from the host_name21 or host_name22 brick? I assume this would be faster than backing up from the mounted volume and there is nothing currently being written on the volume.
If you plan to restore the data only to a new volume and you are sure that the data in those bricks is healthy, then yes, it should be enough to copy one brick. However, given the discrepancy seen in the outputs of find command, first I would try to understand why that happened to be sure we are not missing something by only copying one brick.
One thing that I am uncertain about when following the guidance in #3025 is what server I should be monitoring for IO stats: host_name20 is the one that re-joined the volume and is in need of repair, so I'm not quite sure why heal count shows no files needing healing on it and instead shows them on the remaining two servers.
The number that heal count shows is not the number of entries that need healing in each brick. It's the number of entries updated in each brick that failed to update in another brick (so needing heal in another brick). When a brick goes down, the heal count of the surviving bricks will start to increase when changes are done to entries that are also present in the missing brick. When the bad brick comes back again, the heal count of that brick will be 0, but the other bricks will have the list of entries that need to be healed in the bad brick. So most of the write activity will happen in the bad brick, even if it has heal count = 0.
If there's new data being added to the volume, it could be explained by the different times the find command was run. Otherwise it's something we will need to check once self-heal completes.
The hosts were taken out of production soon after the initial problem, on December 10. There should be no data being written to the volume.
Are you using XFS as the brick filesystem ? we have seen that XFS does preallocation of disk blocks in some cases to optimize future writes. Since currently self-heal is creating a lot of new entries, this extra reservation of blocks could be causing the overhead. If that's the case, it should fix itself over time once the volume is healed.
Yes, the brick's filesystem is indeed XFS.
If you plan to restore the data only to a new volume and you are sure that the data in those bricks is healthy, then yes, it should be enough to copy one brick. However, given the discrepancy seen in the outputs of find command, first I would try to understand why that happened to be sure we are not missing something by only copying one brick.
I have been doing an rsync from the mounted volume, but it is progressing at only ~50GB/day. Since the volume has 1.3T the backup would take almost a month. Perhaps I can try backing up both bricks rather than just one before trying the repairs from #3025?
The number that heal count shows is not the number of entries that need healing in each brick. It's the number of entries updated in each brick that failed to update in another brick (so needing heal in another brick). When a brick goes down, the heal count of the surviving bricks will start to increase when changes are done to entries that are also present in the missing brick. When the bad brick comes back again, the heal count of that brick will be 0, but the other bricks will have the list of entries that need to be healed in the bad brick. So most of the write activity will happen in the bad brick, even if it has heal count = 0.
Thank you for your very clear explanation. If most of the activity will happen on the brick of host_name20 then I think it may be challenging to evaluate the effect of changes by looking at I/O stats as in #3025: since host_name20 is an arbiter the amount of data written to it as far as I know is minimal. Do you think we may be better off monitoring the effect of changes with other metrics? Maybe the ones above like the number of entries in heal-count or the results of the find on the brick?
The hosts were taken out of production soon after the initial problem, on December 10. There should be no data being written to the volume.
In that case the difference in number of files and directories seems suspicious. Maybe you could compare the output of two finds to see where the differences are.
I have been doing an rsync from the mounted volume, but it is progressing at only ~50GB/day. Since the volume has 1.3T the backup would take almost a month. Perhaps I can try backing up both bricks rather than just one before trying the repairs from #3025?
I think that will be the safest way until we understand where the difference comes from.
Thank you for your very clear explanation. If most of the activity will happen on the brick of host_name20 then I think it may be challenging to evaluate the effect of changes by looking at I/O stats as in #3025: since host_name20 is an arbiter the amount of data written to it as far as I know is minimal. Do you think we may be better off monitoring the effect of changes with other metrics? Maybe the ones above like the number of entries in heal-count or the results of the find on the brick?
Probably the easiest way to monitor the progress of self-heal and be able to compare different configurations is to check the number of used inodes of the brick periodically (df -i <brick path>). It can't be used to know exactly how many files have been created because Gluster uses 2 inodes for directories and 1 inode for other entries, but it's enough to do a comparison, for example using the average number of inodes created per second.
Thanks Xavi! I'll do the following then:
1) Look for differences in the find of bricks in host_name21 and host_name22
2) Back up the bricks on both hosts
3) Set up monitoring of df -i <brick path> so that we can have a baseline before making changes (rough sketch below).
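For item 3 I'm thinking of something along these lines (just a sketch; the interval and the log file path are arbitrary choices):
# append a timestamp and the used-inode count of the brick every minute
while true; do
    echo "$(date -u '+%F %T') $(df -i /shared/.brick | awk 'NR==2 {print $3}')" >> /root/brick_inodes.log
    sleep 60
done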
Hi Xavi, I'm just providing a quick update. The backups are still ongoing, even when backing up directly from the bricks. I ran out of inodes for the backup, so I'm removing the .glusterfs directory from the backup.
Hi @dalwise. How is it going ?
Is the self-heal still progressing on the arbiter brick or have you stopped it ? if it's still running, probably it should have already identified all pending files to be healed and the number of pending files should be near 80 or 90 million and slowly decreasing. Is that the case ?
Is the self-heal log still clean of errors ?
Hi @xhernandez, thanks for checking in!
Self-heal is ongoing. It is currently at over 64M and increasing by about ~50k per hour.
I had some problems backing up the bricks. I first tried backing up to the same partition as the brick but I ran out of inodes. I'm not sure why this happened as previous to the backup only 44% of the inodes were being used, so even doubling it should not have been an issue. My only theory is that rsync creates temporary inodes and those pushed the inode count.
I have since completed a local backup excluding the .glusterfs directory, which worked ok. I am now in the process of transferring that directory to a different host for offline backup. That has currently backed up ~92M inodes out of ~113M (but I'm using rsync, so based on the previous paragraph I'm not sure whether it will go somewhat over 113M before completing).
As for errors I don't see any on /var/log/glusterfs/glfsheal-vol_name.log in the arbiter node and in fact the last message is old:
[2021-12-29 18:09:48.459408] I [MSGID: 104041] [glfs-resolve.c:954:__glfs_active_subvol] 0-vol_name: switched to graph 716d6e32-302d-3133-3539-39342d323032 (0)
"/var/log/glusterfs/glfsheal-vol_name.log" 952L, 45317C
The remaining two nodes have that file empty.
Self-heal is ongoing. It is currently at over 64M and increasing by about ~50k per hour.
I thought it would have discovered all directories by now. I was too optimistic...
I had some problems backing up the bricks. I first tried backing up to the same partition as the brick but I ran out of inodes. I'm not sure why this happened as previous to the backup only 44% of the inodes were being used, so even doubling it should not have been an issue. My only theory is that rsync creates temporary inodes and those pushed the inode count.
Many of the files inside .glusterfs are hardlinks to entries outside it, so unless you use the -H option in rsync, the number of inodes created will be much higher.
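For example, a brick-level copy that keeps hardlinks (and the extended attributes Gluster stores on every entry) could look roughly like this, with source and destination paths as placeholders:
rsync -aHX --numeric-ids /shared/.brick/ /backup/brick-copy/   # -H preserves hardlinks, -X preserves xattrs (needs root)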
I have since completed a local backup excluding the .glusterfs directory, which worked ok. I am now in the process of transferring that directory to a different host for offline backup. That has currently backed up ~92M inodes out of ~113M (but I'm using rsync, so based on the previous paragraph I'm not sure whether it will go somewhat over 113M before completing).
Just for reference, directories use two inodes, one for the real directory and one for a special symlink created inside .glusterfs. All other entries use a single inode with two hardlinks.
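This can be checked directly on a brick if you are curious; for example (the file path here is hypothetical, and <inode> is the number printed by stat):
stat -c '%h %i %n' /shared/.brick/path/to/some_file      # link count (2) and inode number
find /shared/.brick/.glusterfs -inum <inode>             # locates the second hardlink under .glusterfs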
As for errors I don't see any on /var/log/glusterfs/glfsheal-vol_name.log in the arbiter node and in fact the last message is old:
[2021-12-29 18:09:48.459408] I [MSGID: 104041] [glfs-resolve.c:954:__glfs_active_subvol] 0-vol_name: switched to graph 716d6e32-302d-3133-3539-39342d323032 (0)
"/var/log/glusterfs/glfsheal-vol_name.log" 952L, 45317C
The remaining two nodes have that file empty.
You need to look at glustershd.log. That's the log used by self-heal during reconstruction of the arbiter.
Many of the files inside .glusterfs are hardlinks to entries outside it, so unless you use the -H option in rsync, the number of inodes created will be much higher.
Just for reference, directories use two inodes, one for the real directory and one for a special symlink created inside .glusterfs. All other entries use a single inode with two hardlinks.
Thanks for the tips!
You need to look at glustershd.log. That's the log used by self-heal during reconstruction of the arbiter.
The file on the arbiter is not showing any errors in the last few days. It seems to be healing fine. Here are the last few messages:
[2022-01-06 06:52:35.799445] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 5-vol_name-replicate-0: performing metadata selfheal on e64041e0-7d9b-4ef0-b094-072a52d55588
[2022-01-06 06:52:35.801009] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 5-vol_name-replicate-0: performing metadata selfheal on 0f9bd30f-2fe8-4ddd-90c0-c7eee2882f9f
[2022-01-06 06:52:35.840371] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 5-vol_name-replicate-0: performing metadata selfheal on 749d8c7c-5348-40a3-83ab-f8fd3d357462
[2022-01-06 06:52:35.841280] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on e13f9126-3cf6-45c9-8dc8-34791ff12917. sources=[2] sinks=0 1
[2022-01-06 06:52:35.877509] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on dc89a327-a633-4be1-b8b1-363e7a9bfaf4. sources=[2] sinks=0 1
[2022-01-06 06:52:35.883342] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on 6aa9108f-b5c0-44d7-a438-53b274522157. sources=[2] sinks=0 1
[2022-01-06 06:52:35.884929] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on e64041e0-7d9b-4ef0-b094-072a52d55588. sources=[2] sinks=0 1
[2022-01-06 06:52:35.885091] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on 0f9bd30f-2fe8-4ddd-90c0-c7eee2882f9f. sources=[2] sinks=0 1
[2022-01-06 06:52:35.892500] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on 749d8c7c-5348-40a3-83ab-f8fd3d357462. sources=[2] sinks=0 1
[2022-01-06 06:54:10.963909] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed entry selfheal on 5ed1e5c6-e8a6-4903-9525-da8ac0dc0467. sources= sinks=0 1 2
[2022-01-09 03:43:13.033691] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-09 03:43:13.033783] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-09 03:43:13.035032] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2022-01-09 03:43:13.035141] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2022-01-12 03:10:13.376457] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-12 03:10:13.376474] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-12 03:10:13.377881] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2022-01-12 03:10:13.378020] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2022-01-16 03:21:14.222491] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-16 03:21:14.222612] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-16 03:21:14.224039] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2022-01-16 03:21:14.224112] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
The same file on the data nodes is much more chatty, but seems to be fine also. It contains entries like:
[2022-01-18 22:47:43.073863] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 5-vol_name-replicate-0: performing metadata selfheal on 812e4ac8-7ecd-4a29-8490-cfe3475839c1
[2022-01-18 22:47:43.154375] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on 812e4ac8-7ecd-4a29-8490-cfe3475839c1. sources=0 [1] sinks=2
Both data nodes seem to have no errors:
# egrep -c '\]\sE\s\[' /var/log/glusterfs/glustershd.log
0
Whereas the arbiter just has some errors from January 6 when one of the data nodes was left without available inodes.
Hi @xhernandez,
The backup completed and I have been applying the solutions that you suggested from #3025, while monitoring the increase in inodes (via df -i). Unfortunately I haven't seen much noticeable improvement:
The black vertical line is when I ran gluster volume set vol_name performance.enable-least-priority no. It made a small improvement (from ~900 to ~920 inodes/minute), but it disappeared over time.
The green vertical lines are for when I started each of 5 additional glusterfs processes.
The purple vertical line is when I ran gluster volume set vol_name cluster.data-self-heal-algorithm full.
The dips every hour are from the other monitoring which runs gluster vol heal vol_name statistics heal-count.
ps aux | grep glustershd shows the following. I have followed the pattern from the original glustershd I had running, even though it is slightly different from the one in #3025, which I assume is due to the different Gluster version (mine has asterisks in the --xlator-option argument):
root 9478 0.0 0.0 675796 5848 ? Ssl Jan28 0:03 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra2.pid -l /var/log/glusterfs/glustershd-extra2.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root 59414 1.0 0.0 1034436 35664 ? Ssl 2021 756:10 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/f438aa9983d7f1fd.socket --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root 65548 0.0 0.0 675796 7800 ? Ssl Jan28 0:03 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra1.pid -l /var/log/glusterfs/glustershd-extra1.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root 71572 0.0 0.0 675796 7788 ? Ssl Jan28 0:02 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra5.pid -l /var/log/glusterfs/glustershd-extra5.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root 128316 0.0 0.0 675796 5828 ? Ssl Jan28 0:02 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra4.pid -l /var/log/glusterfs/glustershd-extra4.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root 184165 0.0 0.0 675796 7872 ? Ssl Jan28 0:02 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra3.pid -l /var/log/glusterfs/glustershd-extra3.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root 216441 0.0 0.0 112712 976 pts/0 S+ 04:28 0:00 grep --color=auto glustershd
@dalwise when you do configuration changes, you should also restart the self-heal processes you started manually, because they are not under the control of glusterd and won't pick up the changes.
In any case, if that doesn't help, I would also try to modify these settings (restarting shd processes after that; see the sketch below):
Increase cluster.self-heal-window-size to 4 or 8
Increase server.event-threads and client.event-threads to 4 or 6
Can you post the current configuration settings ?
Are logs from extra self-heal processes showing activity ?
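For reference, those options are changed with the usual volume-set syntax (volume name is a placeholder; the values are the ones suggested above):
gluster volume set <volname> cluster.self-heal-window-size 8
gluster volume set <volname> server.event-threads 4
gluster volume set <volname> client.event-threads 4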
Thank you @xhernandez for your help and quick response!
@dalwise when you do configuration changes, you should also restart the self-heal processes you started manually, because they are not under the control of glusterd and won't pick up the changes.
Thanks for the tip
In any case, if that doesn't help, I would also try to modify these settings (restarting shd processes after that):
Increase cluster.self-heal-window-size to 4 or 8
Increase server.event-threads and client.event-threads to 4 or 6.
I did all of these. Not much difference though:
Can you post the current configuration settings ?
Sure. I assume you mean the output of gluster volume get vol_name all. Here it is:
gluster-settings-220201.txt
Are logs from extra self-heal processes showing activity ?
As far as I can tell they are not. Here are the logs from /var/log/glusterfs/glustershd-extra1.log:
The process that is generating it is:
root 152776 0.0 0.0 757724 5724 ? Ssl 03:18 0:00 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra1.pid -l /var/log/glusterfs/glustershd-extra1.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
Just to double check that we are on the same page: I have been running the commands you suggested and starting the extra glusterfs processes on the arbiter node, which is the one in need of healing. I haven't made any changes on the existing data nodes. My understanding is that the settings propagate to all nodes since they apply to the volume, but I want to be sure that we are on the same page regarding the extra glusterfs processes.
Just to double check that we are on the same page: I have been running the commands you suggested and starting the extra glusterfs processes on the arbiter node, which is the one in need of healing. I haven't made any changes on the existing data nodes. My understanding is that the settings propagate to all nodes since they apply to the volume, but I want to be sure that we are on the same page regarding the extra glusterfs processes.
I missed commenting on this before... sorry.
Self-heal processes are run in each cluster node, and each one takes care of the list of pending files present in its own node only. This means that you should start extra self-heal processes in the other nodes (the ones that have a long list of pending files in gluster volume heal info). Since the arbiter brick was empty initially, it doesn't have damaged files, it's the other nodes that have damaged files pending to be healed into the arbiter brick.
I hope that by starting extra processes on the other nodes you will see some increase in throughput. At least you should see activity in the logs.
Hi @xhernandez, I'm definitely glad that we clarified that and I'm sorry for not bringing it up earlier. Adding extra processes in the other two nodes caused a huge increase in rate of new inodes created on the arbiter for about two hours. After that it has come down again, but is still considerably more than it was before the new processes (~2200/minute vs ~900/minute):
Each green line is when a new process was started on both of the data nodes.
As you can see just adding one additional process improved the stats by over an order of magnitude. The second one also seemed to help, but it's unclear that additional ones made any difference.
I'm not sure why a second process would have such a great effect (much more than double). Is it possible that the settings that I had been applying were not being applied to the process that was already running on each node?
The other metric that I was tracking (files needing healing) peaked while starting the additional processes and is now decreasing and changing much faster than previously. If it continues at this rate it would come down to zero within a couple of days:
Wed Feb 2 15:45:01 UTC 2022 Number of entries: 77363403
Wed Feb 2 16:45:01 UTC 2022 Number of entries: 77395054
Wed Feb 2 17:45:01 UTC 2022 Number of entries: 77426850
Wed Feb 2 18:45:01 UTC 2022 Number of entries: 77458946
Wed Feb 2 19:45:01 UTC 2022 Number of entries: 77490137
Wed Feb 2 20:45:01 UTC 2022 Number of entries: 76807711
Wed Feb 2 21:45:01 UTC 2022 Number of entries: 74170119
Wed Feb 2 22:45:01 UTC 2022 Number of entries: 71708505
Wed Feb 2 23:45:01 UTC 2022 Number of entries: 69943704
Thu Feb 3 00:45:01 UTC 2022 Number of entries: 68211289
Thank you very much for your help in speeding up the healing process. I'll continue to monitor the progress. Would you have any guidance as to why the original problem happened and whether there was a better remediation available that may have saved us from the millions of entries in need of healing?
Hi @xhernandez, I'm definitely glad that we clarified that and I'm sorry for not bringing it up earlier. Adding extra processes in the other two nodes caused a huge increase in rate of new inodes created on the arbiter for about two hours. After that it has come down again, but is still considerably more than it was before the new processes (~2200/minute vs ~900/minute):
That's great :)
Each green line is when a new process was started on both of the data nodes.
As you can see just adding one additional process improved the stats by over an order of magnitude. The second one also seemed to help, but it's unclear that additional ones made any difference.
I'm not sure why a second process would have such a great effect (much more than double). Is it possible that the settings that I had been applying were not being applied to the process that was already running on each node?
Self-heal daemons controlled by glusterd should have applied the changes immediately, so all settings should have been already in effect. It might be related to the way self-heal works. When a directory is healed, all entries inside it are created, but not healed yet. This means that after healing a directory, many files may have been created and added to the pending heal queue, increasing the pending heal count. This can cause bursts of file creations that could explain why you saw that increment (though I can't tell for sure why the improvement was so huge at the beginning). Once the number of directories pending heal is reduced, file creation rate drops and healers process more files than directories.
The other metric that I was tracking (files needing healing) peaked while starting the additional processes and is now decreasing and changing much faster than previously. If it continues at this rate it would come down to zero within a couple of days:
Wed Feb 2 15:45:01 UTC 2022 Number of entries: 77363403
Wed Feb 2 16:45:01 UTC 2022 Number of entries: 77395054
Wed Feb 2 17:45:01 UTC 2022 Number of entries: 77426850
Wed Feb 2 18:45:01 UTC 2022 Number of entries: 77458946
Wed Feb 2 19:45:01 UTC 2022 Number of entries: 77490137
Wed Feb 2 20:45:01 UTC 2022 Number of entries: 76807711
Wed Feb 2 21:45:01 UTC 2022 Number of entries: 74170119
Wed Feb 2 22:45:01 UTC 2022 Number of entries: 71708505
Wed Feb 2 23:45:01 UTC 2022 Number of entries: 69943704
Thu Feb 3 00:45:01 UTC 2022 Number of entries: 68211289
The healing rate seems quite good now. Even though there are not so many file creations, most of the heals are just to set the metadata to the files created during directory heal. Since this is an arbiter brick, no data transfer is needed, so each heal is quite fast and files get removed from the pending list at a good pace.
Thank you very much for your help in speeding up the healing process. I'll continue to monitor the progress. Would you have any guidance as to why the original problem happened and whether there was a better remediation available that may have saved us from the millions of entries in need of healing?
When there's an issue with a specific directory, normally it's because some metadata or extended attribute doesn't match and self heal is unable to find enough consistent bricks to decide how to fix it in an automatic way (it requires more intelligence than what is implemented inside it). However, after manual inspection of the metadata from each brick, it's normally easy to identify what's wrong and fix it (especially for directories; for files it may be harder if the contents don't match).
In your particular case, I can't tell what happened because I don't have the data from before the brick was removed. However, in your first comment, you said that the directory was in split brain. This means that there were inconsistent changes between bricks, without a clear "clean" version of that directory from which to repair the other bricks, so self-heal wasn't able to progress and the directory wasn't accessible.
When you deal with such big volumes, with tens of millions of directories and files, volume-wide operations like a full heal or a rebalance are going to be slow. So if they can be avoided, they should be avoided. In this case, trying to fix whatever was wrong in that directory would (most probably) have been much faster. It should always be the first attempt.
Also, you are using a pretty old version. There have been several heal and consistency improvements over time, besides bug fixes, that make some problems harder to hit and allow more cases to be healed. I would recommend keeping the version more up to date to avoid some of the known problems, if possible.
Thanks for the feedback @xhernandez.
As we were expecting the count of files in need of healing decreased quickly over the last couple of days. But it seems to have now stopped at around 2475. It has remained there for the last 7 hours:
# gluster vol heal vol_name statistics heal-count
Gathering count of entries to be healed on volume vol_name has been successful
Brick host_name21:/shared/.brick
Number of entries: 2475
Brick host_name22:/shared/.brick
Number of entries: 1
Brick host_name20:/shared/.brick
Number of entries: 1
I have now killed the extra glusterfs processes on all three nodes, since we no longer seem to have a performance issue and they could complicate checking for errors in logs.
I tried running gluster volume heal vol_name full, but that didn't seem to have any effect.
I am looking for errors in the logs using egrep '\]\sE\s\[' <file>. I've only found recent errors in /var/log/glusterfs/<mount_point>.log, which all look like:
[2022-02-07 19:10:29.206069] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13a)[0x7f3514ff993a] (--> /usr/lib64/glusterfs/7.4/xlator/mount/fuse.so(+0x8221)[0x7f350c3a9221] (--> /usr/lib64/glusterfs/7.4/xlator/mount/fuse.so(+0x8b3a)[0x7f350c3a9b3a] (--> /lib64/libpthread.so.0(+0x7dd5)[0x7f3513e3add5] (--> /lib64/libc.so.6(clone+0x6d)[0x7f3513701ead] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
But I'm not sure what file it is trying to write to.
I have seen online that doing a stat on every file in the mounted filesystem may trigger a heal, although given the size of the volume that may take a while. Do you have any suggestion on how to speed up recovery of those last few thousand files?
By the way, one thing that I have noticed is that even though the number of entries needing healing on host_name21 has remained fairly stable at around 2475, the entries on host_name22 & host_name20 have varied quite a lot, despite the fact that nothing should currently be writing to the mounted volume. For example just a few minutes ago:
# gluster vol heal vol_name statistics heal-count
Gathering count of entries to be healed on volume vol_name has been successful
Brick host_name21:/shared/.brick
Number of entries: 2474
Brick host_name22:/shared/.brick
Number of entries: 395
Brick host_name20:/shared/.brick
Number of entries: 394
The glustershd log shows heals on all nodes, but the number of entries on host_name21 remains stable. For example:
[root@host_name20 glusterfs]# date
Tue Feb 8 00:16:27 UTC 2022
[root@host_name20 glusterfs]# tail -n3 glustershd.log
[2022-02-08 00:16:28.413337] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed metadata selfheal on 975cbb07-4584-444e-9ead-81d3405b209f. sources=[2] sinks=0 1
[2022-02-08 00:16:28.415346] I [MSGID: 108026] [afr-self-heal-entry.c:916:afr_selfheal_entry_do] 3-vol_name-replicate-0: performing entry selfheal on 975cbb07-4584-444e-9ead-81d3405b209f
[2022-02-08 00:16:28.435790] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed entry selfheal on 975cbb07-4584-444e-9ead-81d3405b209f. sources=[2] sinks=0 1
I tried running gluster volume heal vol_name full, but that didn't seem to have any effect.
This command just starts a full scan of the volume checking every file, which means that it may take some time to reach the problematic files. However, if the files you want to heal already appear in the heal info, a full scan is not really necessary. Self-heal will already try to heal them. If it can't, it means that something is wrong with those files and needs to be checked manually.
I am looking for errors in the logs using egrep '\]\sE\s\[' <file>. I've only found recent errors in /var/log/glusterfs/<mount_point>.log, which all look like:
[2022-02-07 19:10:29.206069] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13a)[0x7f3514ff993a] (--> /usr/lib64/glusterfs/7.4/xlator/mount/fuse.so(+0x8221)[0x7f350c3a9221] (--> /usr/lib64/glusterfs/7.4/xlator/mount/fuse.so(+0x8b3a)[0x7f350c3a9b3a] (--> /lib64/libpthread.so.0(+0x7dd5)[0x7f3513e3add5] (--> /lib64/libc.so.6(clone+0x6d)[0x7f3513701ead] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
That could happen sometimes when Gluster is trying to invalidate some file that the kernel has already removed from its cache. If there isn't anything else in the logs, I wouldn't care much about this message.
I have seen online that doing a stat on every file in the mounted filesystem may trigger a heal, although given the size of the volume that may take a while. Do you have any suggestion on how to speed up recovery of those last few thousand files?
You should run gluster volume heal <volname> info. It will show you the list of problematic files. Then take one of them and run these commands on it from each brick, not the mount point:
# stat <file>
# getfattr -m. -e hex -d <file>
This will provide the necessary information to fix that file. Most probably other files will have a similar problem, so we may be able to apply the same solution to the other files, but first let's see what happens on one of them.
By the way, one thing that I have noticed is that even though the number of entries needing healing on host_name21 has remained fairly stable at around 2475, the entries on host_name22 & host_name20 have varied quite a lot, despite the fact that nothing should currently be writing to the mounted volume. For example just a few minutes ago:
This is probably caused by the self-heal full command, that has found some additional issues. These could be related to the discrepancies you detected between the two remaining nodes at the beginning, or it could also be related to the problems caused when one of the bricks exhausted all its inodes.
Most of the entries shown on gluster vol heal vol_name info are gfid entries like <gfid:9d83ace4-e904-4868-be69-0032e05acc5f>, however a few are filenames, so I'll start with those. All of the non-gfid entries are in the logs.bak2 directory. Here's one that is listed for both host_name20 and host_name22:
host_name20# stat logs.bak2/TW8YX11205001/J184091T50529848
File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
Size: 27 Blocks: 0 IO Block: 4096 directory
Device: fd00h/64768d Inode: 4093747016 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-02-08 19:12:09.229701706 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.228701706 +0000
Birth: -
host_name20# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.afr.vol_name-client-0=0x000000000000000000000000
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4
host_name21# stat logs.bak2/TW8YX11205001/J184091T50529848
File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
Size: 27 Blocks: 0 IO Block: 4096 directory
Device: fd00h/64768d Inode: 38842218 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-05-13 15:18:03.976526580 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.228041551 +0000
Birth: -
host_name21# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4
host_name22# stat logs.bak2/TW8YX11205001/J184091T50529848
File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
Size: 27 Blocks: 0 IO Block: 4096 directory
Device: fd00h/64768d Inode: 4129139457 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-05-13 15:18:03.976526580 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.227005682 +0000
Birth: -
host_name22# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4
I am fine deleting the whole logs.bak2 directory to get rid of these entries pending heal. Would deleting the directory on the mounted volume be effective in solving the issue for these files?
Most of the entries shown on gluster vol heal vol_name info are gfid entries like <gfid:9d83ace4-e904-4868-be69-0032e05acc5f>, however a few are filenames, so I'll start with those. All of the non-gfid entries are in the logs.bak2 directory. Here's one that is listed for both host_name20 and host_name22:
host_name20# stat logs.bak2/TW8YX11205001/J184091T50529848
File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
Size: 27 Blocks: 0 IO Block: 4096 directory
Device: fd00h/64768d Inode: 4093747016 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-02-08 19:12:09.229701706 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.228701706 +0000
Birth: -
host_name20# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.afr.vol_name-client-0=0x000000000000000000000000
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4
host_name21# stat logs.bak2/TW8YX11205001/J184091T50529848
File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
Size: 27 Blocks: 0 IO Block: 4096 directory
Device: fd00h/64768d Inode: 38842218 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-05-13 15:18:03.976526580 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.228041551 +0000
Birth: -
host_name21# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4
host_name22# stat logs.bak2/TW8YX11205001/J184091T50529848
File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
Size: 27 Blocks: 0 IO Block: 4096 directory
Device: fd00h/64768d Inode: 4129139457 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-05-13 15:18:03.976526580 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.227005682 +0000
Birth: -
host_name22# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4
The data on all 3 bricks seems consistent. Since this is a directory, can you check whether the number of entries inside it and their names are the same on all three bricks ?
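A quick way could be to checksum the sorted listing on each brick and compare the counts, for example (using the brick path from your earlier outputs):
ls -a /shared/.brick/logs.bak2 | sort | md5sum   # should be identical on all bricks if the names match
ls /shared/.brick/logs.bak2 | wc -l              # entry count on this brick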
I am fine deleting the whole logs.bak2 directory to get rid of these entries pending heal. Would deleting the directory on the mounted volume be effective in solving the issue for these files?
Yes, that should work. If there's some inconsistency inside the directory itself maybe it's not possible to completely remove the entire directory at once, but only a few problematic entries should remain, which can be easily handled.
However, you could also use this case to "experiment" with healing files if you want.
The data on all 3 bricks seems consistent. Since this is a directory, can you check whether the number of entries inside it and their names are the same on all three bricks ?
It is not:
[hostname21 logs.bak2]$ ls|wc -l
340803
[hostname21 logs.bak2]$ find . | wc -l
18465612
[hostname22 logs.bak2]$ ls|wc -l
340803
[hostname22 logs.bak2]$ find . | wc -l
18462649
However, you could also use this case to "experiment" with healing files if you want.
I think it may be challenging to select files to heal from this directory, since the files in need of repair in it keep changing. For example, the file that I mentioned last time no longer appears in the output of gluster vol heal vol_name info. Could it be in an unstable state where Gluster could be creating and deleting files in this directory to try to align the bricks but the changes don't converge?
The other possibility is that there is a process actively writing to this directory, but I am not aware of any such process and cannot locate it. I've tried looking through the list of processes and listing the open files through lsof, which only shows gluster processes accessing the directory.
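The checks were roughly of this form (a sketch; the mount point is a placeholder, and lsof +D walks the whole tree, so it's slow on a directory this big):
lsof +D /mnt/<volname>/logs.bak2 | head   # open files under the directory
fuser -vm /mnt/<volname> 2>&1 | head      # processes using the mount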
However, you could also use this case to "experiment" with healing files if you want.
I think it may be challenging to select files to heal from this directory, since the files in need of repair in it keep changing. For example, the file that I mentioned last time no longer appears in the output of gluster vol heal vol_name info. Could it be in an unstable state where Gluster could be creating and deleting files in this directory to try to align the bricks but the changes don't converge?
That shouldn't happen if there isn't anyone accessing the volume (once a file is healed, nothing else inside Gluster touches it). Could it be that the heal full you triggered 4 days ago is still working and finding (and healing) files that were broken but not tracked internally (maybe because of the inode exhaustion that happened some time ago)? You can check the glustershd.log on each node to see if there's ongoing activity.
You can check the glusershd.log on each node to see if there's ongoing activity.
There continues to be ongoing activity. It may indeed be due to the full heal that I triggered, since the count of files in need of heal is not changing much, so I'm guessing it's just new files that keep being detected in need of heal. For example:
[hostname20 glusterfs]$ date; tail glustershd.log
Sat Feb 12 02:06:10 UTC 2022
[2022-02-12 02:06:09.018375] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed metadata selfheal on 51b71c79-7088-4521-b8a3-afcd5fd275e8. sources=[2] sinks=0 1
[2022-02-12 02:06:09.020454] I [MSGID: 108026] [afr-self-heal-entry.c:916:afr_selfheal_entry_do] 3-vol_name-replicate-0: performing entry selfheal on 51b71c79-7088-4521-b8a3-afcd5fd275e8
[2022-02-12 02:06:09.044517] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed entry selfheal on 51b71c79-7088-4521-b8a3-afcd5fd275e8. sources=[2] sinks=0 1
[2022-02-12 02:06:09.068995] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 3-vol_name-replicate-0: performing metadata selfheal on e1349b59-fe6b-4361-92ec-e6a9e3e7aeec
[2022-02-12 02:06:09.074776] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed metadata selfheal on e1349b59-fe6b-4361-92ec-e6a9e3e7aeec. sources=[2] sinks=0 1
[2022-02-12 02:06:09.076833] I [MSGID: 108026] [afr-self-heal-entry.c:916:afr_selfheal_entry_do] 3-vol_name-replicate-0: performing entry selfheal on e1349b59-fe6b-4361-92ec-e6a9e3e7aeec
[2022-02-12 02:06:09.096778] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed entry selfheal on e1349b59-fe6b-4361-92ec-e6a9e3e7aeec. sources=[2] sinks=0 1
[2022-02-12 02:06:09.109857] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 3-vol_name-replicate-0: performing metadata selfheal on 08376a6d-823c-4750-8f33-2a5782dfd84c
[2022-02-12 02:06:09.115797] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed metadata selfheal on 08376a6d-823c-4750-8f33-2a5782dfd84c. sources=[2] sinks=0 1
[2022-02-12 02:06:09.117873] I [MSGID: 108026] [afr-self-heal-entry.c:916:afr_selfheal_entry_do] 3-vol_name-replicate-0: performing entry selfheal on 08376a6d-823c-4750-8f33-2a5782dfd84c
I am fine deleting the whole logs.bak2 directory to get rid of these entries pending heal. Would deleting the directory on the mounted volume be effective in solving the issue for these files?
Yes, that should work. If there's some inconsistency inside the directory itself maybe it's not possible to completely remove the entire directory at once, but only a few problematic entries should remain, which can be easily handled.
All of the non-gfid entries in gluster vol heal vol_name info continue to be from the logs.bak2 directory, which I don't really need. Since the entries are varying all the time I don't think they will be useful to practice recovering files, so I'm going to delete this directory on the mounted filesystem and see what remains to be fixed.
Hi @xhernandez, I'm just providing a quick update.
Removal of logs.bak2 directory has been ongoing for the last few days. Most top level directories within that directory got wiped on the first run (which took a couple of days), but as you had warned there were a number of top level directories that did not get removed (~7000 of them). The rm -rf command showed errors like:
rm: cannot remove ‘logs.bak2/TW9YX12003182/J198476T56175213’: Transport endpoint is not connected
I have continued re-running rm -rf logs.bak2 from within the mounted volume and I'm now down to 1200 top level directories.
At the same time the heal-count has been reduced, which seems to indicate that a lot of the problematic files were in this directory. The exact number changes quickly, but it's now in the hundreds rather than the 2475 that I had before starting deletion of logs.bak2.
Hi @dalwise. Sorry, I couldn't answer earlier.
I would say that as self-heal is progressing, more directories can be removed. If all damaged files and directories are inside logs.bak2 and self-heal keeps progressing, you should be able to delete the remaining files soon. Let me know if something gets stuck.
Removal of logs.bak2 directory has been ongoing for the last few days. Most top level directories within that directory got wiped on the first run (which took a couple of days), but as you had warned there were a number of top level directories that did not get removed (~7000 of them). The rm -rf command showed errors like:
rm: cannot remove ‘logs.bak2/TW9YX12003182/J198476T56175213’: Transport endpoint is not connected
Does this error persist, or was it possible to remove it after some time ?
Hi @xhernandez,
The Transport endpoint is not connected error prevented me from deleting all entries within logs.bak2. I ended up running the following within the mounted directory. It reduced the number of entries very slowly, but never got below 11:
while ls -ld logs.bak2; do date; rm -rf logs.bak2; done
A few hours after stopping that command the entries in this directory of the mounted volume have increased again:
# ls logs.bak2/ | wc -l
4828
One thing that I have noted is that the number of entries in the brick of hostname21 and hostname22 matches that of the mounted volume. However the brick of hostname20 has many more entries:
# ls /shared/.brick/logs.bak2/ | wc -l
52105
Could Gluster be restoring entries from there? glustershd.log on hostname20 shows ongoing activity and heal-count shows hundreds of entries on hostname20 (none on the others).
I am currently trying the rm -rf command in a loop again, but it seems like the directory has entries being created. I still haven't found any process outside of Gluster that may be doing so.
Hi @xhernandez,
The Transport endpoint is not connected error prevented me from deleting all entries within logs.bak2. I ended up running the following within the mounted directory. It reduced the number of entries very slowly, but never got below 11:
while ls -ld logs.bak2; do date; rm -rf logs.bak2; done
A few hours after stopping that command the entries in this directory of the mounted volume have increased again:
# ls logs.bak2/ | wc -l
4828
That's weird. Once deleted even self-heal shouldn't create directories again.
Can you run gluster volume status <volname> clients and gluster volume status <volname> client-list. It will show who is connected to the bricks. Only the self-heal daemon should appear. If so, you may try to kill the self-heal process, remove everything you can, and start it again using gluster volume start <volname> force.
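A possible sequence on each node, assuming the shd pid file path from the ps output earlier in this thread (just a sketch; adapt the paths and volume name):
gluster volume status vol_name clients
gluster volume status vol_name client-list
kill "$(cat /var/run/gluster/shd/vol_name/vol_name-shd.pid)"   # stop this node's self-heal daemon
# ... remove what you can from the mount point ...
gluster volume start vol_name force                            # restarts the self-heal daemon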
One thing that I have noted is that the number of entries in the brick of hostname21 and hostname22 matches that of the mounted volume. However the brick of hostname20 has many more entries:
# ls /shared/.brick/logs.bak2/ | wc -l
52105
Could Gluster be restoring entries from there? glustershd.log on hostname20 shows ongoing activity and heal-count shows hundreds of entries on hostname20 (none on the others).
It's the only option I see. Though if self-heal considers that the contents of that brick are good to be replicated to the other bricks, it should also show those contents when you list the directory through the mount point. Can you provide the output of the following commands from all bricks ?
# stat /shared/.brick/logs.bak2
# getfattr -m. -e hex -d /shared/.brick/logs.bak2
@karthik-us @pranithk do you know why self-heal may be recovering data from a brick that is not used to show the contents from the mount point ?
Thanks for your help @xhernandez.
Before reading your message I kept trying to remove directories, but they have nonetheless increased. Currently all bricks have the same number of entries. Trying to delete them from within the mounted filesystem results in many Transport endpoint is not connected errors:
[hostname20 vol_name]$ ls logs.bak2/|wc -l
45568
[hostname20 ]$ ls /shared/.brick/logs.bak2/ | wc -l
45568
[hostname21 ~]$ ls /shared/.brick/logs.bak2/ | wc -l
45568
[hostname22 ~]$ ls /shared/.brick/logs.bak2/ | wc -l
45568
Can you run gluster volume status <volname> clients and gluster volume status <volname> client-list. It will show who is connected to the bricks. Only the self-heal daemon should appear. If so, you may try to kill the self-heal process, remove everything you can, and start it again using gluster volume start <volname> force.
Besides the self-heal daemons the fuse mount on each of the 3 nodes is also shown. hostname20 also had an additional fuse mount on another mount point that I have unmounted, but it didn't seem to be in use. I assume that I'll need a fuse mount (at least on one node) to be able to rm the directories. Or did you mean to remove them directly on the bricks? Is the start command only to restart self-heal or did you intend me to run gluster volume stop <volname> force at some earlier point?
[hostname20 ~]$ gluster volume status vol_name clients
Client connections for volume vol_name
----------------------------------------------
Brick : hostname21:/shared/.brick
Clients connected : 6
Hostname BytesRead BytesWritten OpVersion
-------- --------- ------------ ---------
10:32::21:49151 212263659412 248686626360 70200
10:32::22:49151 198986307580 205689698096 70200
10:32::1:49137 386254711468 428001593072 70200
10:32::1:49129 244098818470 1620181422056 70200
10:32::21:49145 12953535376 28427954548 70200
10:32::22:49144 67992 67436 70200
----------------------------------------------
Brick : hostname22:/shared/.brick
Clients connected : 6
Hostname BytesRead BytesWritten OpVersion
-------- --------- ------------ ---------
10:32::21:49143 195517413400 200070163428 70200
10:32::22:49150 215749549792 251727554036 70200
10:32::1:49141 386439663160 426006159192 70200
10:32::1:49127 236743313241 730263668496 70200
10:32::21:49142 63979330377 142047192344 70200
10:32::22:49141 67992 66864 70200
----------------------------------------------
Brick : hostname20:/shared/.brick
Clients connected : 6
Hostname BytesRead BytesWritten OpVersion
-------- --------- ------------ ---------
10:32::20:49141 348078654668 407153848100 70200
10:32::21:49144 69623738472 64403543140 70200
10:32::21:49141 84327676813 71605132816 70200
10:32::22:49143 69862449248 65162147332 70200
10:32::20:49133 111578892 96524416 70200
10:32::22:49140 65128 58184 70200
----------------------------------------------
[hostname20 ~]$ gluster volume status vol_name client-list
Client connections for volume vol_name
Name count
----- ------
glustershd 3
fuse 3
total clients for volume vol_name : 6
-----------------------------------------------------------------
[hostname20 ~]$
Can you provide the output of the following commands from all bricks ?
[hostname20 ~]$ stat /shared/.brick/logs.bak2
File: ‘/shared/.brick/logs.bak2’
Size: 11071488 Blocks: 4168 IO Block: 4096 directory
Device: fd00h/64768d Inode: 1879049437 Links: 45570
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-02-16 21:56:56.290330204 +0000
Modify: 2022-02-16 13:19:24.070714399 +0000
Change: 2022-02-16 13:35:35.601728222 +0000
Birth: -
[hostname20 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-0=0x000000000000000000000000
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.gfid=0x1ce9de988117431082a26b27b7d33a83
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000620cf9dc000000000433946000000000620cf9dc0000000004339460000000005e017d52000000001f6cc3e4
[hostname21 ~]$ stat /shared/.brick/logs.bak2
File: ‘/shared/.brick/logs.bak2’
Size: 1482752 Blocks: 4024 IO Block: 4096 directory
Device: fd00h/64768d Inode: 4913161408 Links: 45570
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-02-16 15:32:40.938630891 +0000
Modify: 2022-02-16 13:19:24.070489184 +0000
Change: 2022-02-16 13:35:35.599780013 +0000
Birth: -
[hostname21 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2
trusted.gfid=0x1ce9de988117431082a26b27b7d33a83
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000620cf9dc000000000433946000000000620cf9dc0000000004339460000000005e017d52000000001f6cc3e4
[hostname22 ~]$ stat /shared/.brick/logs.bak2
File: ‘/shared/.brick/logs.bak2’
Size: 1482752 Blocks: 4024 IO Block: 4096 directory
Device: fd00h/64768d Inode: 805307078 Links: 45570
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-02-16 15:33:15.305998830 +0000
Modify: 2022-02-16 13:19:24.070489184 +0000
Change: 2022-02-16 13:35:35.599523053 +0000
Birth: -
[hostname22 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2
trusted.gfid=0x1ce9de988117431082a26b27b7d33a83
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000620cf9dc000000000433946000000000620cf9dc0000000004339460000000005e017d52000000001f6cc3e4
Thanks for your help @xhernandez.
Before reading your message I kept trying to remove directories, but they have nonetheless increased. Currently all bricks have the same number of entries. Trying to delete them from within the mounted filesystem results in many Transport endpoint is not connected errors:
[hostname20 vol_name]$ ls logs.bak2/|wc -l
45568
[hostname20 ]$ ls /shared/.brick/logs.bak2/ | wc -l
45568
[hostname21 ~]$ ls /shared/.brick/logs.bak2/ | wc -l
45568
[hostname22 ~]$ ls /shared/.brick/logs.bak2/ | wc -l
45568
It seems that finally self-heal was able to synchronize all bricks. After that, the number of entries in heal info has stabilized ?
Now you are unable to delete any of those entries ?
One possible reason for getting Transport endpoint is not connected
is that replicate considers that the only good source for an entry is in the arbiter brick. In that case the request is denied with that error. Given that hostname20
is the arbiter and it had (apparently) many entries that were pending to be synchronized to the other bricks, that explanation seems reasonable.
If that's the case, the directories that cannot be removed because of this error most probably have discrepancies between the bricks in their contents, which means self-heal is probably still working and the number of files pending heal shouldn't be stable yet.
Can you run gluster volume status <volname> clients and gluster volume status <volname> client-list? They will show who is connected to the bricks. Only the self-heal daemon should appear. If so, you may try to kill the self-heal process, remove everything you can, and start it again using gluster volume start <volname> force.
Besides the self-heal daemons the fuse mount on each of the 3 nodes is also shown.
hostname20
also had an additional fuse mount on another mount point that I have unmounted, but it didn't seem to be in use. I assume that I'll need a fuse mount (at least on one node) to be able to rm the directories.
True. Forgot that "small" detail... :-/
Or did you mean to remove them directly on the bricks?
Unless it's strictly necessary, I prefer to do it through the mount point.
Is the start command only to restart self-heal or did you intend me to run
gluster volume stop <volname> force
at some earlier point?
No. start force checks the state of all related processes (bricks and the self-heal daemon in this case) and restarts any that may have been stopped, even if the volume is already started. However, given that the number of entries in all bricks has synchronized, it's obvious self-heal is working and progressing. Let's not stop it for now.
Can you provide the output of the following commands from all bricks ?
[hostname20 ~]$ stat /shared/.brick/logs.bak2 File: ‘/shared/.brick/logs.bak2’ Size: 11071488 Blocks: 4168 IO Block: 4096 directory Device: fd00h/64768d Inode: 1879049437 Links: 45570 Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2022-02-16 21:56:56.290330204 +0000 Modify: 2022-02-16 13:19:24.070714399 +0000 Change: 2022-02-16 13:35:35.601728222 +0000 Birth: - [hostname20 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2 getfattr: Removing leading '/' from absolute path names # file: shared/.brick/logs.bak2 trusted.afr.dirty=0x000000000000000000000000 trusted.afr.vol_name-client-0=0x000000000000000000000000 trusted.afr.vol_name-client-1=0x000000000000000000000000 trusted.gfid=0x1ce9de988117431082a26b27b7d33a83 trusted.glusterfs.dht=0x000000000000000000000000ffffffff trusted.glusterfs.mdata=0x01000000000000000000000000620cf9dc000000000433946000000000620cf9dc0000000004339460000000005e017d52000000001f6cc3e4
[hostname21 ~]$ stat /shared/.brick/logs.bak2 File: ‘/shared/.brick/logs.bak2’ Size: 1482752 Blocks: 4024 IO Block: 4096 directory Device: fd00h/64768d Inode: 4913161408 Links: 45570 Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2022-02-16 15:32:40.938630891 +0000 Modify: 2022-02-16 13:19:24.070489184 +0000 Change: 2022-02-16 13:35:35.599780013 +0000 Birth: - [hostname21 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2 getfattr: Removing leading '/' from absolute path names # file: shared/.brick/logs.bak2 trusted.gfid=0x1ce9de988117431082a26b27b7d33a83 trusted.glusterfs.dht=0x000000000000000000000000ffffffff trusted.glusterfs.mdata=0x01000000000000000000000000620cf9dc000000000433946000000000620cf9dc0000000004339460000000005e017d52000000001f6cc3e4
[hostname22 ~]$ stat /shared/.brick/logs.bak2 File: ‘/shared/.brick/logs.bak2’ Size: 1482752 Blocks: 4024 IO Block: 4096 directory Device: fd00h/64768d Inode: 805307078 Links: 45570 Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2022-02-16 15:33:15.305998830 +0000 Modify: 2022-02-16 13:19:24.070489184 +0000 Change: 2022-02-16 13:35:35.599523053 +0000 Birth: - [hostname22 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2 getfattr: Removing leading '/' from absolute path names # file: shared/.brick/logs.bak2 trusted.gfid=0x1ce9de988117431082a26b27b7d33a83 trusted.glusterfs.dht=0x000000000000000000000000ffffffff trusted.glusterfs.mdata=0x01000000000000000000000000620cf9dc000000000433946000000000620cf9dc0000000004339460000000005e017d52000000001f6cc3e4
Everything seems fine here.
Can you do the same on a directory that cannot be removed because of the Transport endpoint is not connected
?
It seems that finally self-heal was able to synchronize all bricks. After that, the number of entries in heal info has stabilized ?
The number of entries in the logs.bak2 directory has been stable at 45501 for the last hour (same number in all 3 bricks). It likely decreased slightly with the last rm -rf
cycle. But the entries pending heal have now ballooned back up and are quickly oscillating over time:
# gluster vol heal vol_name statistics heal-count
Gathering count of entries to be healed on volume vol_name has been successful
Brick host_name21:/shared/.brick
Number of entries: 0
Brick host_name22:/shared/.brick
Number of entries: 0
Brick host_name20:/shared/.brick
Number of entries: 1818300
Now you are unable to delete any of those entries ?
I've tried several cycles of rm -rf in a loop, which sometimes reduce the entry count somewhat, but it has never reached 0.
However, given that the number of entries in all bricks have synchronized, it's obvious self-heal is working and progressing. Let's not stop it for now.
Ok
Can you do the same on a directory that cannot be removed because of the Transport endpoint is not connected ?
# rm -rf logs.bak2/TWZYX13500552/J212763T62103841
rm: cannot remove ‘logs.bak2/TWZYX13500552/J212763T62103841’: Transport endpoint is not connected
[hostname20 ~]$ stat /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
File: ‘/shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841’
Size: 94 Blocks: 0 IO Block: 4096 directory
Device: fd00h/64768d Inode: 4162260697 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-02-11 06:16:25.594892002 +0000
Modify: 2021-11-13 11:12:55.333987394 +0000
Change: 2022-02-16 14:03:23.874468286 +0000
Birth: -
[hostname20 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
trusted.afr.vol_name-client-0=0x000000000000000100000001
trusted.afr.vol_name-client-1=0x000000000000000100000001
trusted.gfid=0xb13934b567d04e139edf7db16c60f7c9
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618f9db70000000013e83e4200000000618f9db70000000013e83e4200000000614b99ff000000001ac3baf7
[hostname21 ~]$ stat /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
File: ‘/shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841’
Size: 10 Blocks: 0 IO Block: 4096 directory
Device: fd00h/64768d Inode: 1342715961 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-02-16 14:03:23.873176705 +0000
Modify: 2022-02-16 14:03:23.873176705 +0000
Change: 2022-02-16 14:03:23.873176705 +0000
Birth: -
[hostname21 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
trusted.gfid=0xb13934b567d04e139edf7db16c60f7c9
[hostname22 ~]$ stat /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
File: ‘/shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841’
Size: 6 Blocks: 0 IO Block: 4096 directory
Device: fd00h/64768d Inode: 4263188881 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-02-16 14:03:23.876353676 +0000
Modify: 2022-02-16 14:03:23.876353676 +0000
Change: 2022-02-16 14:03:23.876353676 +0000
Birth: -
[hostname22 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
trusted.gfid=0xb13934b567d04e139edf7db16c60f7c9
Verified that I cannot remove it again:
[hostname20 vol_name]$ rm -rf logs.bak2/TWZYX13500552/J212763T62103841
rm: cannot remove ‘logs.bak2/TWZYX13500552/J212763T62103841’: Transport endpoint is not connected
On a related note I have also run into a Transport endpoint is not connected
issue on another stack. The file itself is not critical, but this other stack is currently in production so I can't experiment with it as much. I bring it up because this is similar to how the issue on the current thread started (although in that case it was with a critical directory, which led to the repair attempts that landed us where we are). So if we knew how to fix that case we may be able to avert all these troubles next time :-)
[hostname17 readiness]# cat host_name19-220217.yml
cat: host_name19-220217.yml: Transport endpoint is not connected
[hostname17 readiness]$ rm host_name19-220217.yml
rm: cannot remove ‘host_name19-220217.yml’: Transport endpoint is not connected
[hostname17 readiness]$ stat /shared/.brick/readiness/host_name19-220217.yml
File: ‘/shared/.brick/readiness/host_name19-220217.yml’
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: fd00h/64768d Inode: 872520704 Links: 2
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-02-17 20:55:49.980733852 +0000
Modify: 2022-02-17 20:55:49.980733852 +0000
Change: 2022-02-17 20:55:49.992733959 +0000
Birth: -
[hostname17 readiness]$ getfattr -m. -e hex -d /shared/.brick/readiness/host_name19-220217.yml
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/readiness/host_name19-220217.yml
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-0=0x000000020000000100000000
trusted.gfid=0xb6e057bdcfa140989428fe3c6ea86838
trusted.gfid2path.ecca5d8a1c40101f=0x30666661306366362d643664342d343035372d393930612d3663383330323265623433392f716d6e31392d3232303231372e796d6c
trusted.glusterfs.mdata=0x01000000000000000000000000620eb655000000003a75bcf300000000620eb655000000003a75bcf300000000620eb655000000003a75bcf3
[hostname18 ~]$ stat /shared/.brick/readiness/host_name19-220217.yml
File: ‘/shared/.brick/readiness/host_name19-220217.yml’
Size: 569 Blocks: 8 IO Block: 4096 regular file
Device: fd00h/64768d Inode: 872513587 Links: 2
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-02-17 20:50:13.184033290 +0000
Modify: 2022-02-17 20:50:07.110999135 +0000
Change: 2022-02-17 20:50:07.111999141 +0000
Birth: -
[hostname18 ~]$ getfattr -m. -e hex -d /shared/.brick/readiness/host_name19-220217.yml
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/readiness/host_name19-220217.yml
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0xa41a73ab1f094ea88098ee83547fcd88
trusted.gfid2path.ecca5d8a1c40101f=0x30666661306366362d643664342d343035372d393930612d3663383330323265623433392f716d6e31392d3232303231372e796d6c
trusted.glusterfs.mdata=0x01000000000000000000000000620eb4ff0000000006a449ee00000000620eb4ff0000000006a449ee00000000620eb4ff0000000005d9d6fd
[hostname19 ~]$ stat /shared/.brick/readiness/host_name19-220217.yml
File: ‘/shared/.brick/readiness/host_name19-220217.yml’
Size: 569 Blocks: 8 IO Block: 4096 regular file
Device: fd00h/64768d Inode: 872426423 Links: 2
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-02-17 20:56:02.729319314 +0000
Modify: 2022-02-17 20:55:49.992279498 +0000
Change: 2022-02-17 20:55:49.992279498 +0000
Birth: -
[hostname19 ~]$ getfattr -m. -e hex -d /shared/.brick/readiness/host_name19-220217.yml
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/readiness/host_name19-220217.yml
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-0=0x000000020000000100000000
trusted.gfid=0xb6e057bdcfa140989428fe3c6ea86838
trusted.gfid2path.ecca5d8a1c40101f=0x30666661306366362d643664342d343035372d393930612d3663383330323265623433392f716d6e31392d3232303231372e796d6c
trusted.glusterfs.mdata=0x01000000000000000000000000620eb655000000003b231a7500000000620eb655000000003b231a7500000000620eb655000000003a75bcf3
It seems that finally self-heal was able to synchronize all bricks. After that, the number of entries in heal info has stabilized ?
The number of entries in the logs.bak2 directory has been stable at 45501 for the last hour (same number in all 3 bricks). It likely decreased slightly with the last rm -rf cycle. But the entries pending heal have now ballooned back up and are quickly oscillating over time:
# gluster vol heal vol_name statistics heal-count
Gathering count of entries to be healed on volume vol_name has been successful
Brick host_name21:/shared/.brick
Number of entries: 0
Brick host_name22:/shared/.brick
Number of entries: 0
Brick host_name20:/shared/.brick
Number of entries: 1818300
That doesn't seem normal. I think that the rm
itself could be causing some race condition when self-heal is also trying to heal the same directory (see comments below).
Can you do the same on a directory that cannot be removed because of the Transport endpoint is not connected ?
# rm -rf logs.bak2/TWZYX13500552/J212763T62103841
rm: cannot remove ‘logs.bak2/TWZYX13500552/J212763T62103841’: Transport endpoint is not connected
[hostname20 ~]$ stat /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841 File: ‘/shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841’ Size: 94 Blocks: 0 IO Block: 4096 directory Device: fd00h/64768d Inode: 4162260697 Links: 2 Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2022-02-11 06:16:25.594892002 +0000 Modify: 2021-11-13 11:12:55.333987394 +0000 Change: 2022-02-16 14:03:23.874468286 +0000 Birth: - [hostname20 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841 getfattr: Removing leading '/' from absolute path names # file: shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841 trusted.afr.vol_name-client-0=0x000000000000000100000001 trusted.afr.vol_name-client-1=0x000000000000000100000001 trusted.gfid=0xb13934b567d04e139edf7db16c60f7c9 trusted.glusterfs.dht=0x000000000000000000000000ffffffff trusted.glusterfs.mdata=0x01000000000000000000000000618f9db70000000013e83e4200000000618f9db70000000013e83e4200000000614b99ff000000001ac3baf7
[hostname21 ~]$ stat /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841 File: ‘/shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841’ Size: 10 Blocks: 0 IO Block: 4096 directory Device: fd00h/64768d Inode: 1342715961 Links: 2 Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2022-02-16 14:03:23.873176705 +0000 Modify: 2022-02-16 14:03:23.873176705 +0000 Change: 2022-02-16 14:03:23.873176705 +0000 Birth: - [hostname21 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841 getfattr: Removing leading '/' from absolute path names # file: shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841 trusted.gfid=0xb13934b567d04e139edf7db16c60f7c9
[hostname22 ~]$ stat /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841 File: ‘/shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841’ Size: 6 Blocks: 0 IO Block: 4096 directory Device: fd00h/64768d Inode: 4263188881 Links: 2 Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2022-02-16 14:03:23.876353676 +0000 Modify: 2022-02-16 14:03:23.876353676 +0000 Change: 2022-02-16 14:03:23.876353676 +0000 Birth: - [hostname22 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841 getfattr: Removing leading '/' from absolute path names # file: shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841 trusted.gfid=0xb13934b567d04e139edf7db16c60f7c9
Verified that I cannot remove it again:
[hostname20 vol_name]$ rm -rf logs.bak2/TWZYX13500552/J212763T62103841
rm: cannot remove ‘logs.bak2/TWZYX13500552/J212763T62103841’: Transport endpoint is not connected
Based on this data, it seems that the directory on the arbiter is healthy and the directories on the other two bricks are bad. Their xattrs only contain the gfid, which is not correct. Also, the arbiter has non-zero trusted.afr.vol_name-client-* xattrs, which means that it has applied some change that is not present in the other bricks, thus requiring heal.
The Transport endpoint is not connected
is caused because Gluster doesn't trust the arbiter brick when it's the only good source of data (it should only be used to disambiguate in case of inconsistencies in the other 2 bricks).
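For reference, if I recall the AFR changelog layout correctly, each trusted.afr.<volname>-client-N value packs three big-endian 32-bit counters (pending data, metadata and entry operations), so the value seen on the arbiter can be decoded with a quick shell one-liner (a generic sketch, not a gluster-provided tool):
# x=000000000000000100000001; echo "data=$((16#${x:0:8})) metadata=$((16#${x:8:8})) entry=$((16#${x:16:8}))"
data=0 metadata=1 entry=1
Non-zero metadata and entry counters recorded against both data bricks are what make the arbiter the only "good" copy here.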
@pranithk @karthik-us will self-heal be able to heal this case ? could it be explained by a race between rm
and self-heal ?
On a related note I have also run into a
Transport endpoint is not connected
issue on another stack. The file itself is not critical, but this other stack is currently in production so I can't experiment with it as much. I bring it up because this is similar to how the issue on the current thread started (although in that case it was with a critical directory, which led to the repair attempts that landed us where we are). So if we knew how to fix that case we may be able to avert all these troubles next time :-)[hostname17 readiness]# cat host_name19-220217.yml cat: host_name19-220217.yml: Transport endpoint is not connected [hostname17 readiness]$ rm host_name19-220217.yml rm: cannot remove ‘host_name19-220217.yml’: Transport endpoint is not connected
[hostname17 readiness]$ stat /shared/.brick/readiness/host_name19-220217.yml File: ‘/shared/.brick/readiness/host_name19-220217.yml’ Size: 0 Blocks: 0 IO Block: 4096 regular empty file Device: fd00h/64768d Inode: 872520704 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2022-02-17 20:55:49.980733852 +0000 Modify: 2022-02-17 20:55:49.980733852 +0000 Change: 2022-02-17 20:55:49.992733959 +0000 Birth: - [hostname17 readiness]$ getfattr -m. -e hex -d /shared/.brick/readiness/host_name19-220217.yml getfattr: Removing leading '/' from absolute path names # file: shared/.brick/readiness/host_name19-220217.yml trusted.afr.dirty=0x000000000000000000000000 trusted.afr.vol_name-client-0=0x000000020000000100000000 trusted.gfid=0xb6e057bdcfa140989428fe3c6ea86838 trusted.gfid2path.ecca5d8a1c40101f=0x30666661306366362d643664342d343035372d393930612d3663383330323265623433392f716d6e31392d3232303231372e796d6c trusted.glusterfs.mdata=0x01000000000000000000000000620eb655000000003a75bcf300000000620eb655000000003a75bcf300000000620eb655000000003a75bcf3
[hostname18 ~]$ stat /shared/.brick/readiness/host_name19-220217.yml File: ‘/shared/.brick/readiness/host_name19-220217.yml’ Size: 569 Blocks: 8 IO Block: 4096 regular file Device: fd00h/64768d Inode: 872513587 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2022-02-17 20:50:13.184033290 +0000 Modify: 2022-02-17 20:50:07.110999135 +0000 Change: 2022-02-17 20:50:07.111999141 +0000 Birth: - [hostname18 ~]$ getfattr -m. -e hex -d /shared/.brick/readiness/host_name19-220217.yml getfattr: Removing leading '/' from absolute path names # file: shared/.brick/readiness/host_name19-220217.yml trusted.afr.dirty=0x000000000000000000000000 trusted.gfid=0xa41a73ab1f094ea88098ee83547fcd88 trusted.gfid2path.ecca5d8a1c40101f=0x30666661306366362d643664342d343035372d393930612d3663383330323265623433392f716d6e31392d3232303231372e796d6c trusted.glusterfs.mdata=0x01000000000000000000000000620eb4ff0000000006a449ee00000000620eb4ff0000000006a449ee00000000620eb4ff0000000005d9d6fd
[hostname19 ~]$ stat /shared/.brick/readiness/host_name19-220217.yml File: ‘/shared/.brick/readiness/host_name19-220217.yml’ Size: 569 Blocks: 8 IO Block: 4096 regular file Device: fd00h/64768d Inode: 872426423 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2022-02-17 20:56:02.729319314 +0000 Modify: 2022-02-17 20:55:49.992279498 +0000 Change: 2022-02-17 20:55:49.992279498 +0000 Birth: - [hostname19 ~]$ getfattr -m. -e hex -d /shared/.brick/readiness/host_name19-220217.yml getfattr: Removing leading '/' from absolute path names # file: shared/.brick/readiness/host_name19-220217.yml trusted.afr.dirty=0x000000000000000000000000 trusted.afr.vol_name-client-0=0x000000020000000100000000 trusted.gfid=0xb6e057bdcfa140989428fe3c6ea86838 trusted.gfid2path.ecca5d8a1c40101f=0x30666661306366362d643664342d343035372d393930612d3663383330323265623433392f716d6e31392d3232303231372e796d6c trusted.glusterfs.mdata=0x01000000000000000000000000620eb655000000003b231a7500000000620eb655000000003b231a7500000000620eb655000000003a75bcf3
There are 2 issues here. The first one is that trusted.gfid
is not equal in node hostname18
. This is like the inode number, so it must match in all bricks. This inconsistency shouldn't easily happen on a replica 3 with arbiter. Are you aware of any special issue that could have happened (most probably between 20:50 and 20:55 of 17th of February), like multiple disconnections ?
The second issue is that trusted.glusterfs.mdata
differs between hostname17
and hostname19
. This basically means that all 3 bricks have different metadata, so it's not possible to find a majority. I think this problem could be related to this: https://github.com/gluster/glusterfs/issues/3238. This is not critical as long as the other 2 bricks have consistent data, but in this case we have a second inconsistency, making automatic heal impossible.
The easiest way to fix it in this case is to make sure that the contents of the file in hostname19
are the most up to date (or at least equal to the contents in hostname18
). If so, the file and its gfid in hostname18
can be deleted (for higher safety, you can back it up before deleting, including the good copy in hostname19
).
# rm -f /shared/.brick/readiness/host_name19-220217.yml /shared/.brick/.glusterfs/a4/1a/a41a73ab1f094ea88098ee83547fcd88
Then reset the mismatching xattr in the arbiter brick (hostname17
):
# setfattr -n trusted.glusterfs.mdata -v 0x01000000000000000000000000620eb655000000003b231a7500000000620eb655000000003b231a7500000000620eb655000000003a75bcf3 /shared/.brick/readiness/host_name19-220217.yml
After that, self-heal should be able to successfully heal it and the file should be accessible again.
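As a side note on the second path in the rm above: the entry under .glusterfs is derived from the file's gfid, using the first two hex bytes as two nested directories followed by the full gfid (on disk the name is normally in the dashed UUID form). A hedged illustration with the gfid reported on hostname18:
# ls -li /shared/.brick/.glusterfs/a4/1a/a41a73ab-1f09-4ea8-8098-ee83547fcd88
For a regular file this .glusterfs entry is a hard link, so it should show the same inode number as the file and a link count of 2.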
Hi @xhernandez,
I had not done anything else on the logs.bak2
directory since last time I wrote. However the heal-count has now reached 0 on all nodes. The previous sample directory which I could not remove (/shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
) now seems consistent on all 3 nodes, which seems odd, as I don't know how it could have recovered from the arbiter.
Unfortunately I don't know what has made a difference for it other than waiting. Is there anything you'd like me to check to debug what the issue was? Otherwise I'll try to remove the whole logs.bak2
directory again.
Regarding the other stack:
Are you aware of any special issue that could have happened (most probably between 20:50 and 20:55 of 17th of February), like multiple disconnections ?
I do see issues in the logs a bit earlier:
The message "E [MSGID: 108008] [afr-self-heal-common.c:392:afr_gfid_split_brain_source] 0-vol_name-replicate-0: Gfid mismatch detected for <gfid:0ffa0cf6-d6d4-4057-990a-6c83022eb439>/host_name19-201023.yml>, 0f1703b1-b623-4203-84ab-788bb3311a12 on vol_name-client-2 and 5ae416c8-8ad2-4b6b-b1b9-cfd57d3d4c84 on vol_name-client-0." repeated 3 times between [2022-02-13 20:40:56.552787] and [2022-02-13 20:41:25.010350]
The message "E [MSGID: 108008] [afr-self-heal-common.c:392:afr_gfid_split_brain_source] 0-vol_name-replicate-0: Gfid mismatch detected for <gfid:0ffa0cf6-d6d4-4057-990a-6c83022eb439>/host_name17-210216.yml>, c0ba9832-a2b3-497a-8b99-f32f081a96f5 on vol_name-client-1 and 2aad63b1-f9c5-435f-bb0f-3ff1bd8aae26 on vol_name-client-0." repeated 5 times between [2022-02-13 20:40:57.521178] and [2022-02-13 20:41:25.653126]
The message "E [MSGID: 108008] [afr-self-heal-common.c:392:afr_gfid_split_brain_source] 0-vol_name-replicate-0: Gfid mismatch detected for <gfid:0ffa0cf6-d6d4-4057-990a-6c83022eb439>/host_name18-210216.yml>, e3ddfa99-c276-435a-9429-13e59f39ef66 on vol_name-client-1 and 75858077-4098-44f8-b830-c754ff117251 on vol_name-client-0." repeated 5 times between [2022-02-13 20:40:57.421407] and [2022-02-13 20:41:25.754143]
Every node generates a status report with a datestamp within the "readiness" directory of the Gluster every 5 minutes. They all go into the same directory, but each host generates the file with its own hostname. They all happen at about the same time, as they are controlled by the same cron entry. I wonder if having each host overwrite a different file within the same directory at the same time is creating some race condition.
# rm -f /shared/.brick/readiness/host_name19-220217.yml /shared/.brick/.glusterfs/a4/1a/a41a73ab1f094ea88098ee83547fcd88
I ran this and was then able to access the file again. I didn't need to reset the xattr. Note that I only deleted the first file: the file in .glusterfs
didn't exist.
The file on all three bricks was set by Gluster to:
trusted.glusterfs.mdata=0x01000000000000000000000000620eb655000000003b231a7500000000620eb655000000003b231a7500000000620eb655000000003a75bcf3
Hi @xhernandez,
I had not done anything else on the
logs.bak2
directory since last time I wrote. However the heal-count has now reached 0 on all nodes. The previous sample directory which I could not remove (/shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
) now seems consistent on all 3 nodes, which seems odd, as I don't know how it could have recovered from the arbiter.Unfortunately I don't know what has made a difference for it other than waiting. Is there anything you'd like me to check to debug what the issue was? Otherwise I'll try to remove the whole
logs.bak2
directory again.
I think self-heal was still working, and once it was able to fix the pending files, everything became stable again. Since the only issue was in metadata, the arbiter brick is good enough to be used as the source for the healing. The client side is not so intelligent though, and it doesn't allow access to the file in case of inconsistency, even if the issue is only in the metadata. This is something that we should probably improve.
Given that the problem is solved, I would say you can try to remove the directory. What I don't understand is what caused the issue, but self-heal fixed it, so we are good. I'll try to do some tests to try to reproduce this issue when I have some time.
Regarding the other stack:
Are you aware of any special issue that could have happened (most probably between 20:50 and 20:55 of 17th of February), like multiple disconnections ?
I do see issues in the logs a bit earlier:
The message "E [MSGID: 108008] [afr-self-heal-common.c:392:afr_gfid_split_brain_source] 0-vol_name-replicate-0: Gfid mismatch detected for <gfid:0ffa0cf6-d6d4-4057-990a-6c83022eb439>/host_name19-201023.yml>, 0f1703b1-b623-4203-84ab-788bb3311a12 on vol_name-client-2 and 5ae416c8-8ad2-4b6b-b1b9-cfd57d3d4c84 on vol_name-client-0." repeated 3 times between [2022-02-13 20:40:56.552787] and [2022-02-13 20:41:25.010350] The message "E [MSGID: 108008] [afr-self-heal-common.c:392:afr_gfid_split_brain_source] 0-vol_name-replicate-0: Gfid mismatch detected for <gfid:0ffa0cf6-d6d4-4057-990a-6c83022eb439>/host_name17-210216.yml>, c0ba9832-a2b3-497a-8b99-f32f081a96f5 on vol_name-client-1 and 2aad63b1-f9c5-435f-bb0f-3ff1bd8aae26 on vol_name-client-0." repeated 5 times between [2022-02-13 20:40:57.521178] and [2022-02-13 20:41:25.653126] The message "E [MSGID: 108008] [afr-self-heal-common.c:392:afr_gfid_split_brain_source] 0-vol_name-replicate-0: Gfid mismatch detected for <gfid:0ffa0cf6-d6d4-4057-990a-6c83022eb439>/host_name18-210216.yml>, e3ddfa99-c276-435a-9429-13e59f39ef66 on vol_name-client-1 and 75858077-4098-44f8-b830-c754ff117251 on vol_name-client-0." repeated 5 times between [2022-02-13 20:40:57.421407] and [2022-02-13 20:41:25.754143]
Every node generates a status report with a datestamp within the "readiness" directory of the Gluster every 5 minutes. They all go into the same directory, but each host generates the file with its own hostname. They all happen at about the same time, as they are controlled by the same cron entry. I wonder if having each host overwrite a different file within the same directory at the same time is creating some race condition.
It shouldn't happen, even if they create the file at the same time. This problem could appear after recreating the file when one of the bricks is down (or disconnected) and something doesn't work well when self-heal tries to repair it.
Did the problem appear once, or are there many files in this situation? You may need to fix them in the same way we fixed the previous one and see if it happens again.
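To gauge how widespread this is, the standard split-brain listing can also be checked (whether it surfaces these particular gfid mismatches may depend on their state and on the version):
# gluster volume heal <volname> info split-brain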
# rm -f /shared/.brick/readiness/host_name19-220217.yml /shared/.brick/.glusterfs/a4/1a/a41a73ab1f094ea88098ee83547fcd88
I ran this and was then able to access the file again. I didn't need to reset the xattr. Note that I only deleted the first file: the file in
.glusterfs
didn't exist.
Maybe that was the reason self-heal couldn't fix it. I'll try to do some tests about this also.
The file on all three bricks was set by Gluster to:
trusted.glusterfs.mdata=0x01000000000000000000000000620eb655000000003b231a7500000000620eb655000000003b231a7500000000620eb655000000003a75bcf3
That's good. In that case the file is healthy now. Nothing else needs to be done.
Hi @xhernandez,
Given that the problem is solved, I would say you can try to remove the directory. What I don't understand is what caused the issue, but self-heal fixed it, so we are good. I'll try to do some tests to try to reproduce this issue when I have some time.
Removal progressed quite slowly and took a few days, but it is now removed and heal count is at 0. At this point everything seems OK. I assume it should be OK to now restore the original Gluster volume settings and restore the stack to production. Would you suggest doing any other test before this? I'm not yet sure what caused the original problem and whether it can reoccur, but if it does thanks to you I am now better equipped to deal with it.
Did the problem appear once, or are there many files in this situation? You may need to fix them in the same way we fixed the previous one and see if it happens again.
Just one file that I'm aware of.
Maybe that was the reason self-heal couldn't fix it. I'll try to do some tests about this also.
Thanks, and thank you for all your help throughout the long recovery process!
Hi @xhernandez,
Given that the problem is solved, I would say you can try to remove the directory. What I don't understand is what caused the issue, but self-heal fixed it, so we are good. I'll try to do some tests to try to reproduce this issue when I have some time.
Removal progressed quite slowly and took a few days, but it is now removed and heal count is at 0. At this point everything seems OK. I assume it should be OK to now restore the original Gluster volume settings and restore the stack to production. Would you suggest doing any other test before this? I'm not yet sure what caused the original problem and whether it can reoccur, but if it does thanks to you I am now better equipped to deal with it.
That's great !
Yes, I think everything should be fine now. If you want to be 100% sure, you could run a find on all bricks (excluding .glusterfs) and you should see exactly the same files (assuming no one is creating or deleting files during the finds).
Did the problem appear once, or are there many files in this situation? You may need to fix them in the same way we fixed the previous one and see if it happens again.
Just one file that I'm aware of.
It's not a common case. Maybe it would be better to monitor the logs for some time in case the issue appears again and be able to react quickly and also see what happened (disconnections, self-heal in progress, ...) that could help explain it.
Maybe that was the reason self-heal couldn't fix it. I'll try to do some tests about this also.
Thanks, and thank you for all your help throughout the long recovery process!
You're welcome :)
Yes, I think everything should be fine now. If you want to be 100% sure, you could run a find on all bricks (excluding .glusterfs) and you should see exactly the same files (assuming no one is creating or deleting files during the finds).
I was taking a look and there continue to be oddities around the logs.bak2
directory. Heal count is showing 0, but:
[hostname20 .brick]$ ls logs.bak2
QMNF0ES214502F5
[hostname21 .brick]$ ls logs.bak2
ls: cannot access logs.bak2: No such file or directory
[hostname22 .brick]$ ls logs.bak2
TWZYX13301979 TWZYX13302913
[hostname21 .brick]$ ls /shared/vol_name/logs.bak2
TWZYX13301979 TWZYX13302913
After trying another rm -rf logs.bak2
on the mounted volume I again got:
[hostname21 ]$ ls /shared/vol_name/logs.bak2
ls: cannot access /shared/vol_name/logs.bak2: Transport endpoint is not connected
And waiting a few minutes resulted in:
[hostname21 .brick]$ ls /shared/vol_name/logs.bak2
QMNF0ES214502F5
I did another round of rm -rf
and this time the directory seems to be gone from the mount and all the bricks. Hopefully it doesn't come back yet again.
As for other differences on the bricks, I am currently dumping the list of files, but I can already see that the list is much smaller on the arbiter compared to the data bricks. Is this expected?
Yes, I think everything should be fine now. If you want to be 100% sure, you could run a find on all bricks (excluding .glusterfs) and you should see exactly the same files (assuming no one is creating or deleting files during the finds).
I was taking a look and there continue to be oddities around the logs.bak2 directory. Heal count is showing 0, but:
[hostname20 .brick]$ ls logs.bak2
QMNF0ES214502F5
[hostname21 .brick]$ ls logs.bak2
ls: cannot access logs.bak2: No such file or directory
[hostname22 .brick]$ ls logs.bak2
TWZYX13301979 TWZYX13302913
[hostname21 .brick]$ ls /shared/vol_name/logs.bak2
TWZYX13301979 TWZYX13302913
After trying another rm -rf logs.bak2 on the mounted volume I again got:
[hostname21 ]$ ls /shared/vol_name/logs.bak2
ls: cannot access /shared/vol_name/logs.bak2: Transport endpoint is not connected
And waiting a few minutes resulted in:
[hostname21 .brick]$ ls /shared/vol_name/logs.bak2
QMNF0ES214502F5
I did another round of rm -rf and this time the directory seems to be gone from the mount and all the bricks. Hopefully it doesn't come back yet again.
That's really weird. These entries should have appeared in heal info, at least.
As for other differences on the bricks, I am currently dumping the list of files, but I can already see that the list is much smaller on the arbiter compared to the data bricks. Is this expected?
The number of entries outside .glusterfs
in the arbiter brick should be equal to the other bricks. Have you checked what's missing ?
It seems clear that there's some issue here. I'm not sure what it could be yet, but we have fixed several things since version 7.4 that may be causing problems here. Would it be possible to upgrade at least to version 9 ?
The number of entries outside .glusterfs in the arbiter brick should be equal to the other bricks. Have you checked what's missing ?
I gathered the file list on all 3 nodes with:
cd /shared/.brick
find . -path ./.glusterfs -prune -o -print > ~/gluster_files.txt
cd
sort gluster_files.txt > gluster_files_ordered.txt
The count of files is:
[hostname20 ~]$ wc -l gluster_files_ordered.txt
649105 gluster_files_ordered.txt
[hostname21 ~]$ wc -l gluster_files_ordered.txt
112938782 gluster_files_ordered.txt
[hostname22 ~]$ wc -l gluster_files_ordered.txt
112938552 gluster_files_ordered.txt
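For reference, once the sorted lists are copied to a single host they can be compared with comm; the file names below are illustrative (one list per host), not the actual names used:
# comm -3 files_hostname21.txt files_hostname22.txt | head
# comm -23 files_hostname21.txt files_hostname20.txt | wc -l
comm -3 hides the common lines and shows only entries unique to either side, while comm -23 counts entries present on hostname21 but missing on the arbiter; both require the inputs to be sorted the same way.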
A diff between the hostname21 and hostname22 lists shows a few hundred log files with differences. For example this file:
[hostname20 .brick]$ stat logs/TW8YX10513067/J175402T47456720/results.json.br
stat: cannot stat ‘logs/TW8YX10513067/J175402T47456720/results.json.br’: No such file or directory
[hostname21 .brick]$ stat logs/TW8YX10513067/J175402T47456720/results.json.br
stat: cannot stat ‘logs/TW8YX10513067/J175402T47456720/results.json.br’: No such file or directory
[hostname22 .brick]$ stat logs/TW8YX10513067/J175402T47456720/results.json.br
File: ‘logs/TW8YX10513067/J175402T47456720/results.json.br’
Size: 299 Blocks: 8 IO Block: 4096 regular file
Device: fd00h/64768d Inode: 688421909 Links: 2
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-01-07 15:34:01.336866855 +0000
Modify: 2021-04-15 05:46:57.714252008 +0000
Change: 2021-04-15 05:46:57.715252012 +0000
Birth: -
[hostname22 .brick]$ getfattr -m. -e hex -d logs/TW8YX10513067/J175402T47456720/results.json.br
# file: logs/TW8YX10513067/J175402T47456720/results.json.br
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0xe3073009e811406fa860c926ea326028
trusted.gfid2path.7ebe7cf71d9c4f7f=0x34653061306466342d316234322d346330622d623565352d3432643935393861626362642f726573756c74732e6a736f6e2e6272
trusted.glusterfs.mdata=0x010000000000000000000000006077d351000000002a8fc746000000006077d351000000002a8fc746000000006077d351000000002a772110
Trying to access this file in the mounted volume results in:
$ ls logs/TW8YX10513067/J175402T47456720/results.json.br
ls: cannot access logs/TW8YX10513067/J175402T47456720/results.json.br: Transport endpoint is not connected
I checked that all these files are older ones and I'm ok removing them, since they were already backed up offline. Can I simply remove them from hostname21 & hostname22 bricks?
hostname20 however is missing many other files that do exist on the mounted volume and that I do need. For example:
[hostname20 .brick]$ stat logs/QMNF0ES213309C2/J224342T66211349/results.json.br
stat: cannot stat ‘logs/QMNF0ES213309C2/J224342T66211349/results.json.br’: No such file or directory
[hostname21 .brick]$ stat logs/QMNF0ES213309C2/J224342T66211349/results.json.br
File: ‘logs/QMNF0ES213309C2/J224342T66211349/results.json.br’
Size: 327 Blocks: 8 IO Block: 4096 regular file
Device: fd00h/64768d Inode: 1559889781 Links: 2
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-03-04 01:39:27.512929550 +0000
Modify: 2021-12-02 08:36:19.310082241 +0000
Change: 2022-02-24 23:34:26.003662633 +0000
Birth: -
[hostname21 .brick]$ getfattr -m. -e hex -d logs/QMNF0ES213309C2/J224342T66211349/results.json.br
# file: logs/QMNF0ES213309C2/J224342T66211349/results.json.br
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-2=0x000000000000000000000000
trusted.gfid=0x86651e4531644391902cc4c477380207
trusted.glusterfs.mdata=0x010000000000000000000000006218160200000000002ca0780000000061a8858300000000126800ca0000000061a8858300000000125518b8
[hostname22 .brick]$ stat logs/QMNF0ES213309C2/J224342T66211349/results.json.br
File: ‘logs/QMNF0ES213309C2/J224342T66211349/results.json.br’
Size: 327 Blocks: 8 IO Block: 4096 regular file
Device: fd00h/64768d Inode: 485767957 Links: 2
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-03-04 01:39:27.511800175 +0000
Modify: 2021-12-02 08:36:19.308272446 +0000
Change: 2022-02-24 23:34:26.003675585 +0000
Birth: -
[hostname22 .brick]$ getfattr -m. -e hex -d logs/QMNF0ES213309C2/J224342T66211349/results.json.br
# file: logs/QMNF0ES213309C2/J224342T66211349/results.json.br
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-2=0x000000000000000000000000
trusted.gfid=0x86651e4531644391902cc4c477380207
trusted.glusterfs.mdata=0x010000000000000000000000006218160200000000002ca0780000000061a8858300000000126800ca0000000061a8858300000000125518b8
Would it be possible to upgrade at least to version 9 ?
I have a large number of instances in use and cannot upgrade all currently or have a variety of versions at the same time. If absolutely needed I could upgrade this particular stack to version 9 to fix the issues, but I'd need to return it back to version 7 before putting it back into production. Would there be an issue with such a downgrade?
The number of entries outside .glusterfs in the arbiter brick should be equal to the other bricks. Have you checked what's missing ?
I gathered the file list on all 3 nodes with:
cd /shared/.brick
find . -path ./.glusterfs -prune -o -print > ~/gluster_files.txt
cd
sort gluster_files.txt > gluster_files_ordered.txt
The count of files is:
[hostname20 ~]$ wc -l gluster_files_ordered.txt
649105 gluster_files_ordered.txt
[hostname21 ~]$ wc -l gluster_files_ordered.txt
112938782 gluster_files_ordered.txt
[hostname22 ~]$ wc -l gluster_files_ordered.txt
112938552 gluster_files_ordered.txt
I don't understand what's happening here. Based on the monitoring you did of the number of inodes on hostname20, it's impossible that that brick has less than 700,000 entries. When it was working slowly it was creating 900 inodes per minute, and then it peaked at 50,000 per minute for some time. This means several million after a few days.
What removed those entries? Can you see some activity in the self-heal log after the heal was supposedly completed?
A diff between the hostname21 and hostname22 lists shows a few hundred log files with differences. For example this file:
[hostname20 .brick]$ stat logs/TW8YX10513067/J175402T47456720/results.json.br stat: cannot stat ‘logs/TW8YX10513067/J175402T47456720/results.json.br’: No such file or directory
[hostname21 .brick]$ stat logs/TW8YX10513067/J175402T47456720/results.json.br stat: cannot stat ‘logs/TW8YX10513067/J175402T47456720/results.json.br’: No such file or directory
[hostname22 .brick]$ stat logs/TW8YX10513067/J175402T47456720/results.json.br File: ‘logs/TW8YX10513067/J175402T47456720/results.json.br’ Size: 299 Blocks: 8 IO Block: 4096 regular file Device: fd00h/64768d Inode: 688421909 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2022-01-07 15:34:01.336866855 +0000 Modify: 2021-04-15 05:46:57.714252008 +0000 Change: 2021-04-15 05:46:57.715252012 +0000 Birth: - [hostname22 .brick]$ getfattr -m. -e hex -d logs/TW8YX10513067/J175402T47456720/results.json.br # file: logs/TW8YX10513067/J175402T47456720/results.json.br trusted.afr.dirty=0x000000000000000000000000 trusted.gfid=0xe3073009e811406fa860c926ea326028 trusted.gfid2path.7ebe7cf71d9c4f7f=0x34653061306466342d316234322d346330622d623565352d3432643935393861626362642f726573756c74732e6a736f6e2e6272 trusted.glusterfs.mdata=0x010000000000000000000000006077d351000000002a8fc746000000006077d351000000002a8fc746000000006077d351000000002a772110
Trying to access this file in the mounted volume results in:
$ ls logs/TW8YX10513067/J175402T47456720/results.json.br
ls: cannot access logs/TW8YX10513067/J175402T47456720/results.json.br: Transport endpoint is not connected
I checked that all these files are older ones and I'm ok removing them, since they were already backed up offline. Can I simply remove them from hostname21 & hostname22 bricks?
Yes, but you also need to remove the associated gfid from .glusterfs
. I'll attach a script to do that.
hostname20 however is missing many other files that do exist on the mounted volume and that I do need. For example:
[hostname20 .brick]$ stat logs/QMNF0ES213309C2/J224342T66211349/results.json.br stat: cannot stat ‘logs/QMNF0ES213309C2/J224342T66211349/results.json.br’: No such file or directory
[hostname21 .brick]$ stat logs/QMNF0ES213309C2/J224342T66211349/results.json.br File: ‘logs/QMNF0ES213309C2/J224342T66211349/results.json.br’ Size: 327 Blocks: 8 IO Block: 4096 regular file Device: fd00h/64768d Inode: 1559889781 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2022-03-04 01:39:27.512929550 +0000 Modify: 2021-12-02 08:36:19.310082241 +0000 Change: 2022-02-24 23:34:26.003662633 +0000 Birth: - [hostname21 .brick]$ getfattr -m. -e hex -d logs/QMNF0ES213309C2/J224342T66211349/results.json.br # file: logs/QMNF0ES213309C2/J224342T66211349/results.json.br trusted.afr.dirty=0x000000000000000000000000 trusted.afr.vol_name-client-2=0x000000000000000000000000 trusted.gfid=0x86651e4531644391902cc4c477380207 trusted.glusterfs.mdata=0x010000000000000000000000006218160200000000002ca0780000000061a8858300000000126800ca0000000061a8858300000000125518b8
[hostname22 .brick]$ stat logs/QMNF0ES213309C2/J224342T66211349/results.json.br File: ‘logs/QMNF0ES213309C2/J224342T66211349/results.json.br’ Size: 327 Blocks: 8 IO Block: 4096 regular file Device: fd00h/64768d Inode: 485767957 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2022-03-04 01:39:27.511800175 +0000 Modify: 2021-12-02 08:36:19.308272446 +0000 Change: 2022-02-24 23:34:26.003675585 +0000 Birth: - [hostname22 .brick]$ getfattr -m. -e hex -d logs/QMNF0ES213309C2/J224342T66211349/results.json.br # file: logs/QMNF0ES213309C2/J224342T66211349/results.json.br trusted.afr.dirty=0x000000000000000000000000 trusted.afr.vol_name-client-2=0x000000000000000000000000 trusted.gfid=0x86651e4531644391902cc4c477380207 trusted.glusterfs.mdata=0x010000000000000000000000006218160200000000002ca0780000000061a8858300000000126800ca0000000061a8858300000000125518b8
Can you provide a statedump of all clients (i.e. all mount points and glustershd processes) ?
The statedump is triggered by sending a USR1 signal to the gluster process (glusterfs):
# kill -USR1 <pid>
The statedump is a file created inside /var/run/gluster/ (if this directory doesn't exist, it needs to be created before sending the signal). For client processes, the file is named as glusterdump.<pid>.dump.<timestamp>
This will tell us if all clients are correctly connected to all bricks, among other things.
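A rough sketch of gathering the dumps on each node (the fuse mounts and glustershd all run as glusterfs processes by default, so matching the exact process name should catch them while skipping the brick and glusterd processes):
# mkdir -p /var/run/gluster
# kill -USR1 $(pgrep -x glusterfs)
# ls /var/run/gluster/glusterdump.*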
Would it be possible to upgrade at least to version 9 ?
I have a large number of instances in use and cannot upgrade all currently or have a variety of versions at the same time. If absolutely needed I could upgrade this particular stack to version 9 to fix the issues, but I'd need to return it back to version 7 before putting it back into production. Would there be an issue with such a downgrade?
Downgrading to older versions is not tested and I wouldn't do that if these servers need to return to production.
You can use this script to remove files directly from the brick:
The script needs the root directory of the brick as the first argument. Then you can pass paths relative to the root of the volume as additional arguments or through a pipe.
The script will do several sanity checks before actually trying to remove the entries. By default it runs in "dry mode", so instead of deleting anything, it only shows the commands that it would execute.
To run the script in "delete mode", use something like this:
# GFRM_DELETE=1 ./gfrm.sh /shared/.brick /logs/QMNF0ES213309C2/J224342T66211349/results.json.br
Be sure to do the same on all bricks to avoid adding more inconsistencies. Using this tool may cause significant damage if used incorrectly. In case of doubts, please ask.
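Assuming the script reads paths from standard input when no extra arguments are given, a dry run over a prepared list of the orphaned paths could look like this (the list file name is illustrative):
# ./gfrm.sh /shared/.brick < only_on_hostname22.txt
# GFRM_DELETE=1 ./gfrm.sh /shared/.brick < only_on_hostname22.txt
Reviewing the dry-run output before re-running with GFRM_DELETE=1 is the safer workflow.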
I don't understand what's happening here. Based on the monitoring you did of the number of inodes on hostname20, it's impossible that that brick has less than 700,000 entries. When it was working slowly it was creating 900 inodes per minute, and then it peaked at 50,000 per minute for some time. This means several million after a few days.
What removed those entries? Can you see some activity in the self-heal log after the heal was supposedly completed?
I never stopped the inode count collection, so I can plot what the creation rate continued to be:
The dips seem aligned with the efforts to remove logs.bak2. During that same time it's possible that additional files got removed on the hostname20 brick, but I'm not sure how that would have happened. As far as I know I've been operating on the mounted volume (so changes should have been equal on all bricks) and nobody else has been using these nodes. Just a wild guess here, but since the logs.bak2 files were copies of the files in logs, they likely would have had the same hash (and possibly the same GFID?). I don't know if, in the unstable recovery situation, that could have led to also removing files from logs in addition to logs.bak2, since they had the same hash values.
Can you provide a statedump of all clients (i.e. all mount points and glustershd processes) ?
Sure, I gathered them with kill -USR1 $(pgrep -f glusterfs)
, let me know if there are any other processes that I may need. I've done some minimal post-processing on them to change the volume and host names respectively to "vol_name" and "host_name"
gluster_statedumps.tar.gz
Be sure to do the same on all bricks to avoid adding more inconsistencies. Using this tool may cause significant damage if used incorrectly. In case of doubts, please ask.
The files to remove are only present on one brick (either on hostname21 or hostname22). Should I also run the tool on hostname20 since there shouldn't be anything to remove there? Should I run the same list of files when running on hostname21 and hostname22 or should I only run on each with the list of files that the host actually has?
Thank you very much for your continued support!
I don't understand what's happening here. Based on the monitoring you did of the number of inodes on hostname20, it's impossible that that brick has less than 700,000 entries. When it was working slowly it was creating 900 inodes per minute, and then it peaked at 50,000 per minute for some time. This means several million after a few days.
What removed those entries? Can you see some activity in the self-heal log after the heal was supposedly completed?
I never stopped the inode count collection, so I can plot what the creation rate continued to be:
The dips seem aligned with the efforts to remove logs.bak2. During that same time it's possible that additional files got removed on the hostname20 brick, but I'm not sure how that would have happened. As far as I know I've been operating on the mounted volume (so changes should have been equal on all bricks) and nobody else has been using these nodes. Just a wild guess here, but since the logs.bak2 files were copies of the files in logs, they likely would have had the same hash (and possibly the same GFID?). I don't know if, in the unstable recovery situation, that could have led to also removing files from logs in addition to logs.bak2, since they had the same hash values.
The GFID of each file is created randomly when the file or directory is created, and even if it's a copy, it will have a different gfid. Everything else depends on this GFID (it's like the inode number), so logs.bak2 and logs should be completely independent.
Did you do a regular copy, or were there directory renames involved? The only issue I can think of that could explain some kind of cross-linked directories needs to involve renames of directories.
Can you check the gfid (trusted.gfid xattr) of the logs.bak2
directory and the directory from where you copied it on all bricks? They should be consistent on all bricks and different from each other. Also check if the symlink corresponding to the gfid points back to the real directory name (the symlink is located at .glusterfs/xx/yy/
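A hedged sketch of those checks on one brick, using the logs.bak2 gfid seen earlier in this thread (the on-disk name under .glusterfs normally uses the dashed UUID form):
# getfattr -n trusted.gfid -e hex /shared/.brick/logs.bak2
# ls -l /shared/.brick/.glusterfs/1c/e9/1ce9de98-8117-4310-82a2-6b27b7d33a83
For a directory this entry should be a symlink resolving back to the parent's gfid directory plus the real directory name; a gfid or symlink that differs between bricks, or two directories sharing a gfid, would point to the kind of cross-linking suspected here.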
Can you provide a statedump of all clients (i.e. all mount points and glustershd processes) ?
Sure, I gathered them with
kill -USR1 $(pgrep -f glusterfs)
, let me know if there are any other processes that I may need. I've done some minimal post-processing on them to change the volume and host names respectively to "vol_name" and "host_name" gluster_statedumps.tar.gz
The only suspicious thing I see is that the client on hostname20 still keeps entries related to logs.bak2
in its cache (I guess you did the rm
from that client). Given that the entries have been deleted, they shouldn't be present in cache.
In your version io-cache is enabled by default, but on newer versions we have disabled it by default because it doesn't provide much benefit except for some very specific workloads. Additionally, caches are not immediately synchronized, which means that it could cause issues on workloads where multiple clients access the same files/directories.
I think you have some concurrent access to files and directories. If possible, test how it works by disabling io-cache and readdir-ahead, which cache some data that could lead to inconsistencies if other clients check the same data soon enough:
# gluster volume set <volname> io-cache off
# gluster volume set <volname> dirread-ahead off
To get rid of the cached info, I would also umount all clients, stop the volume (only the glusterd process should remain running on each server) and then start everything again.
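A sketch of that sequence with the names used in this thread (which server the clients mount from is an assumption; use whatever your mount normally points at):
# umount /shared/vol_name                                      (on every client, including the servers that mount the volume)
# gluster volume stop vol_name                                 (on one server; verify only glusterd keeps running)
# gluster volume start vol_name
# mount -t glusterfs host_name21:/vol_name /shared/vol_name    (remount on every client)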
Be sure to do the same on all bricks to avoid adding more inconsistencies. Using this tool may cause significant damage if used incorrectly. In case of doubts, please ask.
The files to remove are only present on one brick (either on hostname21 or hostname22). Should I also run the tool on hostname20 since there shouldn't be anything to remove there? Should I run the same list of files when running on hostname21 and hostname22 or should I only run on each with the list of files that the host actually has?
The script checks the real path of the file, so if it's not present in one brick, it won't do anything for that file. However, you are right that it's not necessary to run the script for files that you already know are not present. You can do as you prefer. In any case, if you run the script without GFRM_DELETE it will just show you the list of operations it would run. You can check whether there's anything unexpected in that list before running it in delete mode.
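To illustrate (gfrm.sh is the helper script being discussed in this thread, not a stock Gluster tool, and passing GFRM_DELETE as an environment variable is an assumption on my part; the path is only an example):
# dry run: only prints the operations it would perform
./gfrm.sh /shared/.brick logs/some/stale/path
# delete mode (GFRM_DELETE assumed here to be an environment toggle)
GFRM_DELETE=1 ./gfrm.sh /shared/.brick logs/some/stale/path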
I fear that we'll need to heal hostname20 again. Otherwise it may not be stable if another of the bricks is brought down. Once everything is fine, you should start a full heal again:
# gluster volume heal <volname> full
Let it run for some time (heal count should increase), and then, if the speed is slow, you can start additional self-heal processes on the servers where the heal count increases.
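One way to follow the progress (the statistics heal-count subcommand is part of the standard gluster CLI; the 60-second interval is arbitrary):
# gluster volume heal vol_name full
# watch -n 60 'gluster volume heal vol_name statistics heal-count'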
Did you do a regular copy, or were there directory renames involved ? The only issue I can think of that could explain some kind of cross-linked directories needs to involve renames of directories.
Yes, just a regular rsync copy and no renames as far as I recall.
Can you check the gfid (trusted.gfid xattr) of the logs.bak2 directory and the directory from where you copied it on all bricks ?
I can't, since on March 2nd I succeeded in deleting logs.bak2, so I no longer have the directory. Before starting the repair procedures I had backed up the brick directories on hostname21 & hostname22, so I tried to check there, but the directories don't have xattrs. I likely used rsync, which by default does not copy xattrs.
The only suspicious thing I see is that client on hostname20 still keeps entries related to logs.bak2 in its cache (I guess you did the rm from that client)
Thanks for looking into it! However, note that I was doing the rm from hostname21.
I think you have some concurrent access to files and directories. If possible, test how it works by disabling io-cache and readdir-ahead, which cache some data that could lead to inconsistencies if other clients check the same data soon enough:
# gluster volume set <volname> io-cache off
# gluster volume set <volname> dirread-ahead off
I was able to run the first one, but needed to correct the second one per the prompt:
# gluster volume set vol_name io-cache off
volume set: success
# gluster volume set vol_name dirread-ahead off
volume set: failed: option : dirread-ahead does not exist
Did you mean read-ahead?
# gluster volume set vol_name read-ahead off
volume set: success
To get rid of the cached info, I would also umount all clients, stop the volume (only the glusterd process should remain running on each server) and then start everything again.
I ran gluster volume stop vol_name. After that I was left with a glusterd and a glusterfs process running on each node. I killed the glusterfs process on each node manually. Before starting the volume again I'm running gfrm.sh so that I am sure that there are no changes in the mounted volume at the same time.
When running gfrm.sh I'm getting a number of "GFID doesn't exist" errors. For example:
[hostname21 ~]$ ./gfrm.sh /shared/.brick logs/TWHT110505056/J189127T52627671
logs/TWHT110505056/J189127T52627671: GFID doesn't exist
Indeed that directory doesn't seem to have an entry in .glusterfs:
[hostname21 ~]$ stat /shared/.brick/logs/TWHT110505056/J189127T52627671/
File: ‘/shared/.brick/logs/TWHT110505056/J189127T52627671/’
Size: 27 Blocks: 0 IO Block: 4096 directory
Device: fd00h/64768d Inode: 633971745 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-03-07 21:27:26.925164633 +0000
Modify: 2021-06-03 05:54:42.381314026 +0000
Change: 2021-12-17 16:51:28.823420402 +0000
Birth: -
[hostname21 ~]$ getfattr -m. -e hex -d /shared/.brick/logs/TWHT110505056/J189127T52627671/
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs/TWHT110505056/J189127T52627671/
trusted.afr.vol_name-client-2=0x000000000000000000000000
trusted.gfid=0x97e0b18ae31c40099792b3478824d552
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x0100000000000000000000000060b86ea20000000016822f580000000060b86ea20000000016822f580000000060b7ab9b00000000320c2ea2
[hostname21 ~]$ stat /shared/.brick/.glusterfs/97/e0/97e0b18a-e31c-4009-9792-b3478824d552
stat: cannot stat ‘/shared/.brick/.glusterfs/97/e0/97e0b18a-e31c-4009-9792-b3478824d552’: No such file or directory
These are always directories. A look at the script seems to show that in this case the directory will not get cleaned. In these cases can I just remove it from the brick with rmdir (i.e. add rmdir "${entry}" to the script right after the echo "${1}: GFID doesn't exist")?
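For what it's worth, a read-only sketch (not taken from gfrm.sh) that lists the directories on a brick whose gfid has no entry under .glusterfs, assuming bash and the layout shown above:
BRICK=/shared/.brick
find "$BRICK" -path "$BRICK/.glusterfs" -prune -o -type d -print | while read -r d; do
    hex=$(getfattr -n trusted.gfid -e hex "$d" 2>/dev/null | awk -F= '/trusted.gfid/{print substr($2,3)}')
    [ -z "$hex" ] && continue
    gfid=$(echo "$hex" | sed -E 's/(.{8})(.{4})(.{4})(.{4})(.{12})/\1-\2-\3-\4-\5/')
    link="$BRICK/.glusterfs/${gfid:0:2}/${gfid:2:2}/$gfid"
    # for directories the entry is a symlink; -L also catches a dangling one
    [ -e "$link" ] || [ -L "$link" ] || echo "missing gfid entry: $d ($gfid)"
done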
Did you do a regular copy, or were there directory renames involved ? The only issue I can think of that could explain some kind of cross-linked directories needs to involve renames of directories.
Yes, just a regular rsync copy and no renames as far as I recall.
Can you check the gfid (trusted.gfid xattr) of the logs.bak2 directory and the directory from where you copied it on all bricks ?
I can't, since on March 2nd I succeeded in deleting logs.bak2, so I no longer have the directory. Before starting the repair procedures I had backed up the brick directories on hostname21 & hostname22, so I tried to check there, but the directories don't have xattrs. I likely used rsync, which by default does not copy xattrs.
We have the xattrs from logs.bak2 from this comment: https://github.com/gluster/glusterfs/issues/3072#issuecomment-1042370864.
You should check if the gfid is the same as the source directory of the copy.
The only issue will be to see if the symlink was pointing to the right directory, but if you didn't rename that directory, I don't think there was an issue.
The only suspicious thing I see is that client on hostname20 still keeps entries related to logs.bak2 in its cache (I guess you did the rm from that client)
Thanks for looking into it! However, note that I was doing the rm from hostname21.
In that case the data on hostname20 was not good. Better to get rid of it.
I think you have some concurrent access to files and directories. If possible, test how it works by disabling io-cache and readdir-ahead, which cache some data that could lead to inconsistencies if other clients check the same data soon enough:
# gluster volume set <volname> io-cache off
# gluster volume set <volname> dirread-ahead off
I was able to run the first one, but needed to correct the second one per the prompt:
# gluster volume set vol_name io-cache off
volume set: success
# gluster volume set vol_name dirread-ahead off
volume set: failed: option : dirread-ahead does not exist
Did you mean read-ahead?
# gluster volume set vol_name read-ahead off
volume set: success
Sorry. It was a typo. The correct command is:
# gluster volume set <volname> readdir-ahead off
read-ahead is for data files and may or may not help in your case. For normal FUSE mounts the kernel already does some read-ahead, so using Gluster's read-ahead could cause big read amplifications in some cases, but as far as I know there's no bug related to it, so you can keep it enabled if it works better for you.
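In case it's useful, the effective values can be confirmed with gluster volume get (a standard CLI subcommand):
# gluster volume get vol_name readdir-ahead
# gluster volume get vol_name read-ahead
# gluster volume get vol_name io-cache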
To get rid of the cached info, I would also umount all clients, stop the volume (only the glusterd process should remain running on each server) and then start everything again.
I ran gluster volume stop vol_name. After that I was left with a glusterd and a glusterfs process running on each node. I killed the glusterfs process on each node manually. Before starting the volume again I'm running gfrm.sh so that I am sure that there are no changes in the mounted volume at the same time.
When running gfrm.sh I'm getting a number of "GFID doesn't exist" errors. For example:
[hostname21 ~]$ ./gfrm.sh /shared/.brick logs/TWHT110505056/J189127T52627671
logs/TWHT110505056/J189127T52627671: GFID doesn't exist
Indeed that directory doesn't seem to have an entry in .glusterfs:
[hostname21 ~]$ stat /shared/.brick/logs/TWHT110505056/J189127T52627671/
File: ‘/shared/.brick/logs/TWHT110505056/J189127T52627671/’
Size: 27 Blocks: 0 IO Block: 4096 directory
Device: fd00h/64768d Inode: 633971745 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-03-07 21:27:26.925164633 +0000
Modify: 2021-06-03 05:54:42.381314026 +0000
Change: 2021-12-17 16:51:28.823420402 +0000
Birth: -
[hostname21 ~]$ getfattr -m. -e hex -d /shared/.brick/logs/TWHT110505056/J189127T52627671/
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs/TWHT110505056/J189127T52627671/
trusted.afr.vol_name-client-2=0x000000000000000000000000
trusted.gfid=0x97e0b18ae31c40099792b3478824d552
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x0100000000000000000000000060b86ea20000000016822f580000000060b86ea20000000016822f580000000060b7ab9b00000000320c2ea2
[hostname21 ~]$ stat /shared/.brick/.glusterfs/97/e0/97e0b18a-e31c-4009-9792-b3478824d552
stat: cannot stat ‘/shared/.brick/.glusterfs/97/e0/97e0b18a-e31c-4009-9792-b3478824d552’: No such file or directory
These are always directories. A look at the script seems to show that in this case the directory will not get cleaned. In these cases can I just remove it from the brick with rmdir (i.e. add rmdir "${entry}" to the script right after the echo "${1}: GFID doesn't exist")?
Yes, you can. The script tries to be very strict, so unless everything is fine, it doesn't touch anything. If the gfid is not present, removing the directory is fine. However, if the directory is not empty, first you need to remove all its entries using the script to also remove the related GFIDs.
The missing gfid is suspicious. It's weird that it happens in several cases. If it only happens with directories, could it be related to the issue you had when you ran out of inodes during a backup ?
We have the xattrs from logs.bak2 from this comment: https://github.com/gluster/glusterfs/issues/3072#issuecomment-1042370864.
Great that you remembered that!
You should check if the gfid is the same as the source directory of the copy.
It is different:
[hostname20 ~]$ getfattr -m. -e hex -d /shared/.brick/logs
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs
trusted.gfid=0xb05a24a1dfcb4157b3ff61524561484c
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.dht.mds=0x00000000
trusted.glusterfs.mdata=0x0100000000000000000000000061b38940000000000a1cbbbe0000000000000000000000000000000000000000000000000000000000000000
[hostname21 ~]$ getfattr -m. -e hex -d /shared/.brick/logs
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.afr.vol_name-client-2=0x000000000000000000000000
trusted.gfid=0xb05a24a1dfcb4157b3ff61524561484c
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.dht.mds=0x00000000
trusted.glusterfs.mdata=0x0100000000000000000000000061b38940000000000a1cbbbe0000000000000000000000000000000000000000000000000000000000000000
[hostname22 ~]$ getfattr -m. -e hex -d /shared/.brick/logs
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-0=0x000000000000000000000000
trusted.afr.vol_name-client-2=0x000000000000000000000000
trusted.gfid=0xb05a24a1dfcb4157b3ff61524561484c
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.dht.mds=0x00000000
trusted.glusterfs.mdata=0x0100000000000000000000000061b38940000000000a1cbbbe0000000000000000000000000000000000000000000000000000000000000000
In that case the data on hostname20 was not good. Better to get rid of it.
By "get rid of it" you mean making the cache setting changes that we are discussing or something else?
Sorry. It was a typo. The correct command is:
# gluster volume set <volname> readdir-ahead off
Thanks for clarifying. I reverted read-ahead to on and disabled readdir-ahead.
Yes, you can. The script tries to be very strict, so unless everything is fine, it doesn't touch anything. If the gfid is not present, removing the directory is fine. However, if the directory is not empty, first you need to remove all its entries using the script to also remove the related GFIDs.
The script is indeed strict. I had to run it through the list of files a few times, since, as you said, initially it would complain that directories were not (yet) empty. But the important thing is that after running it a few times and then doing the final rmdir cleanup, the file lists on hostname21 and hostname22 are now identical :). For reference, currently:
[hostname21 ~]$ wc -l gluster_files_ordered.txt
112938316 gluster_files_ordered.txt
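For reference, a sketch of how such per-brick lists can be produced and compared (this assumes the same find used earlier in the thread, run from the brick root on each host, with one list copied over before the diff):
cd /shared/.brick
find . -path ./.glusterfs -prune -o -print | sort > /root/gluster_files_ordered.txt
# after copying the other host's list over (e.g. with scp):
diff /root/gluster_files_ordered.txt /root/gluster_files_ordered.hostname22.txt && echo identical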
I've now restarted the volume and launched the full heal operation. Hopefully that recreates the entries on hostname20. So far the heal count remains at 0. I'm also monitoring the inode creation rate. It's been running for a few hours and so far there has been a minimal increase in inode count.
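A sketch of a simple monitor for that (it assumes the brick lives on the /shared filesystem, as in the du output earlier; df -i counts used inodes for the whole filesystem, which is much cheaper than repeating the find):
while sleep 60; do
    # timestamped used-inode count, appended to a log for later plotting
    echo "$(date -Is) $(df -i /shared | awk 'NR==2 {print $3}')" >> /root/inode_count.log
done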
The missing gfid is suspicious. It's weird that it happens in several cases. If it only happens with directories, could it be related to the issue you had when you ran out of inodes during a backup ?
That may well be the case for hostname21, where there were hundreds of missing GFIDs. On hostname22 I had two missing GFIDs even though it never ran out of inodes.
In that case the data on hostname20 was not good. Better to get rid of it.
By "get rid of it" you mean making the cache setting changes that we are discussing or something else?
I meant the cached data, so that the client couldn't use the stale data for anything. Restarting the processes as you did is enough. However, I would keep those caching xlators disabled for now (io-cache and readdir-ahead).
Sorry. It was a typo. The correct command is:
# gluster volume set <volname> readdir-ahead off
Thanks for clarifying. I reverted read-ahead to on and disabled readdir-ahead.
Yes, you can. The script tries to be very strict, so unless everything is fine, it doesn't touch anything. If the gfid is not present, removing the directory is fine. However, if the directory is not empty, first you need to remove all its entries using the script to also remove the related GFIDs.
The script is indeed strict. I had to run it through the list of files a few times, since, as you said, initially it would complain that directories were not (yet) empty. But the important thing is that after running it a few times and then doing the final rmdir cleanup, the file lists on hostname21 and hostname22 are now identical :). For reference, currently:
[hostname21 ~]$ wc -l gluster_files_ordered.txt
112938316 gluster_files_ordered.txt
Great :)
I've now restarted the volume and launched the full heal operation. Hopefully that recreates the entries on hostname20. So far the heal count remains at 0. I'm also monitoring the inode creation rate. It's been running for a few hours and so far there has been a minimal increase in inode count.
It may take some time until it reaches the directories that have missing entries (it's doing a full scan of the entire volume in no particular order). If you want you can check the self-heal log. It won't report any activity if no issues are found, but you should see that the "full sweep" has started and whether any errors have happened.
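A quick way to check that in the self-heal daemon log (the exact message wording can vary between versions, so treat the patterns as approximate):
# grep -i "full sweep" /var/log/glusterfs/glustershd.log
# grep " E " /var/log/glusterfs/glustershd.log | tail -20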
The missing gfid is suspicious. It's weird that it happens in several cases. If it only happens with directories, could it be related to the issue you had when you ran out of inodes during a backup ?
That may well be the case for hostname21, where there were hundreds of missing GFIDs. On hostname22 I had two missing GFIDs even though it never ran out of inodes.
Let's see if self-heal completes successfully this time and data is correctly replicated.
Hi @xhernandez,
It may take some time until it reaches the directories that have missing entries (it's doing a full scan of the entire volume in no particular order). If you want you can check the self-heal log. It won't report any activity if no issues are found, but you should see that the "full sweep" has started and whether any errors have happened.
Sorry for the delay in getting back to you. I was waiting for the sweep to complete, and I expected it to take a long time as you had warned. However, I just noticed that it completed soon after it was launched and I had missed it earlier.
The log shows a good number of errors related to "client-common.c:2705:client_pre_getxattr_v2", and the file count on the hostname20 brick remains the same:
[hostname20 .brick]$ find . -path ./.glusterfs -prune -o -print | wc -l
649105
Here is the heal log for reference: glustershd.log-20220309.gz
Do you have any suggestion on how to unblock the heal?
Thanks!
Hi @xhernandez, just a quick ping in case you have any further suggestions on how to get the heal to proceed, as it seems blocked at this point.
hostname20 continues to have much fewer inodes used:
[root@hostname20 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
652148
Whereas hostname21 & hostname22 continue in sync at a much higher number:
[root@hostname21 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
112938373
[root@hostname22 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
112938373
I tried another full heal, but it showed the same type of errors as in my last message and did not change the used inodes on hostname20.
Thanks for all your help!
Description of problem: Heal count keeps increasing. Current status after 13 days of increases is:
The exact command to reproduce the issue: The issue happened after trying to recover from an inaccessible directory (/shared/vol_name/logs) within the volume mounted at /shared/vol_name. Trying to access this directory would return "Transport endpoint not connected" on all clients. Other directories in the mounted volume were not affected. gluster volume heal showed GFIDs in need of healing and the problematic directory in split brain. We were not able to get the files healed by using the gluster heal commands. We were then able to resolve problems with most GFIDs by removing the corresponding files in the bricks. However, one GFID remained in need of healing on host_name20 and we could not determine which file it corresponded to. Since host_name20 just had the arbiter brick we tried removing it:
That allowed us to access the directory that we could not see earlier. We then attempted to rejoin the arbiter with a clean brick:
The directory is still accessible, but the number of files in need of healing has been increasing for the last 13 days.
We can likely recover by simply backing up the files, destroying the current volume and then moving the files onto a newly created volume. However we are at a loss as to:
The full output of the command that failed: (heal count command above)
Expected results: Heal count to return to 0.
Mandatory info:
- The output of the gluster volume info command:
- The output of the gluster volume status command:
- The output of the gluster volume heal command: (shared heal count above, as heal contains tens of millions of entries)
- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/
https://www.dropbox.com/s/p93wyyztj5bmzk8/logs.tgz
This compressed log file contains the logs for all three server nodes. The server nodes have the volume mounted and so are acting as clients also.
The volume name is the name of an internal project and has been changed to "vol_name" in command outputs and logs. The hostnames are also internal and have been changed to host_name20, host_name21 & host_name22.
- Is there any crash? Provide the backtrace and coredump: No crash
Additional info:
- The operating system / glusterfs version: