gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Heal count has been increasing for days and currently at over 34 million #3072

Open dalwise opened 2 years ago

dalwise commented 2 years ago

Description of problem: Heal count keeps increasing. Current status after 13 days of increases is:

# gluster volume heal vol_name statistics heal-count
Gathering count of entries to be healed on volume vol_name has been successful

Brick host_name21:/shared/.brick
Number of entries: 34410002

Brick host_name22:/shared/.brick
Number of entries: 34363886

Brick host_name20:/shared/.brick
Number of entries: 0

The exact command to reproduce the issue: The issue happened after trying to recover from an inaccessible directory (/shared/vol_name/logs) within the volume mounted at /shared/vol_name. Trying to access this directory would return "Transport endpoint is not connected" on all clients. Other directories in the mounted volume were not affected.

gluster volume heal showed GFIDs in need of healing and the problematic directory in split brain. We were not able to get the files healed by using the gluster heal commands. We were then able to resolve problems with most GFIDs by removing the corresponding files in the bricks. However one GFID remained in need of healing on host_name20 and we could not determine what file it corresponded to. Since host_name20 just had the arbiter brick we tried removing it:

gluster volume remove-brick vol_name replica 2 host_name20:/shared/.brick force
gluster peer detach host_name20

That allowed us to access the directory that we could not see earlier. We then attempted to rejoin the arbiter with a clean brick:

rm -rf /shared/.brick/*  # Ran this on host_name20
gluster peer probe host_name20
gluster volume add-brick vol_name replica 3 arbiter 1 host_name20:/shared/.brick force

The directory is still accessible, but the number of files in need of healing has been increasing for the last 13 days.

We can likely recover by simply backing up the files, destroying the current volume and then moving the files onto a newly created volume. However we are at a loss as to why the heal count keeps growing and whether it will eventually converge on its own.

The full output of the command that failed: (heal count command above)

Expected results: Heal count to return to 0.

Mandatory info: - The output of the gluster volume info command:

Volume Name: vol_name
Type: Replicate
Volume ID: 4833820f-4518-4a2e-a3d8-63d6f31c4646
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: host_name21:/shared/.brick
Brick2: host_name22:/shared/.brick
Brick3: host_name20:/shared/.brick (arbiter)
Options Reconfigured:
cluster.self-heal-window-size: 2
cluster.shd-wait-qlength: 2048
disperse.shd-wait-qlength: 2048
cluster.shd-max-threads: 8
cluster.self-heal-daemon: enable
transport.address-family: inet6
nfs.disable: on
performance.client-io-threads: off

- The output of the gluster volume status command:

Status of volume: vol_name
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick host_name21:/shared/.brick                  49153     0          Y       55517
Brick host_name22:/shared/.brick                  49153     0          Y       60313
Brick host_name20:/shared/.brick                  49152     0          Y       59555
Self-heal Daemon on localhost               N/A       N/A        Y       58347
Self-heal Daemon on host_name22                   N/A       N/A        Y       89514
Self-heal Daemon on host_name21                   N/A       N/A        Y       57650

Task Status of Volume vol_name
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal command: (shared heal-count above, as the full heal info output contains tens of millions of entries)

- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/

https://www.dropbox.com/s/p93wyyztj5bmzk8/logs.tgz

This compressed log file contains the logs for all three server nodes. The server nodes have the volume mounted and so are acting as clients also.

The volume name is the name of an internal project and has been changed to "vol_name" in command outputs and logs. The hostnames are also internal and have been changed to host_name20, host_name21 & host_name22.

- Is there any crash? Provide the backtrace and coredump: No crash

Additional info:

- The operating system / glusterfs version:

Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration

xhernandez commented 2 years ago

I've done a quick check of the self-heal logs, and it seems there are no issues. Healing seems to be progressing. There have been around 8 million files healed since 12th of December.

How many files are there in the volume? Is it possible that there are more than 40 million files with many directories?

Can you check if the number of files in host_name20 is increasing? If so, it means that self-heal is doing its job, but it's running too slowly. If that's the case, you could probably use the recommendations from issue #3025. You should disable least priority:

# gluster volume set <volname> performance.enable-least-priority no

And you can also start additional self-heal processes as specified in that issue.

IMPORTANT: Issue #3025 is for Gluster 9.4. It should also work in your version, but I recommend that you don't try this in production without having done tests in a lab environment.

dalwise commented 2 years ago

Thanks for your help, Xavi.

How many files are there in the volume? Is it possible that there are more than 40 million files with many directories?

Yes, from the brick directory on each host:

[root@host_name20 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
46536502
[root@host_name20 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
27520629
[root@host_name21 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
113235113
[root@host_name21 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
30124557
[root@host_name22 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
113281980
[root@host_name22 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
30169389

I am not sure what would explain the discrepancy between host_name21 and host_name22, since there were no files in need of healing on them when host_name20 got removed and nothing outside of gluster should currently be writing to the volume.

Can you check if the number of files in host_name20 is increasing?

The number of files shown using the command above is slowly increasing, but after two weeks it is still less than half what the other two servers have.

One thing I have noticed and was not expecting is that the current brick size on host_name20 (the arbiter) is already larger than it was on the backup of the brick that we made before wiping it:

[root@host_name20 .brick]# du -sh /shared/dot_brick_backup/
4.0G    /shared/dot_brick_backup/
[root@host_name20 .brick]# du -sh /shared/.brick/
5.0G    /shared/.brick/

I'll try the suggestions from #3025. I need the data on the volume, but this stack of servers has been taken out of production until we recover it. So I'm thinking of first backing up the data and then trying out the improvements from #3025 on the affected stack itself. Worst case scenario I have the backed up data and can copy it back to a fresh volume. Would it be OK to back up the data directly from the host_name21 or host_name22 brick? I assume this would be faster than backing up from the mounted volume and there is nothing currently being written on the volume.

One thing that I am uncertain about when following the guidance in #3025 is what server I should be monitoring for IO stats: host_name20 is the one that re-joined the volume and is in need of repair, so I'm not quite sure why heal count shows no files needing healing on it and instead shows them on the remaining two servers.

xhernandez commented 2 years ago

Thanks for your help, Xavi.

How many files are there in the volume? Is it possible that there are more than 40 million files with many directories?

Yes, from the brick directory on each host:

[root@host_name20 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
46536502
[root@host_name20 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
27520629
[root@host_name21 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
113235113
[root@host_name21 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
30124557
[root@host_name22 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
113281980
[root@host_name22 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
30169389

I am not sure what would explain the discrepancy between host_name21 and host_name22, since there were no files in need of healing on them when host_name20 got removed and nothing outside of gluster should currently be writing to the volume.

If there's new data being added to the volume, it could be explained by the different times the find command was run. Otherwise it's something we will need to check once self-heal completes.

Can you check if the number of files in host_name20 is increasing?

The number of files shown using the command above is slowly increasing, but after two weeks it is still less than half what the other two servers have.

Gluster needs a significant number of metadata and locking operations per entry. Given the number of files and directories you have, it's likely that most of the time is spent doing those operations. I think parallelizing the self-heal should help here.

One thing I have noticed and was not expecting is that the current brick size on host_name20 (the arbiter) is already larger than it was on the backup of the brick that we made before wiping it:

[root@host_name20 .brick]# du -sh /shared/dot_brick_backup/
4.0G  /shared/dot_brick_backup/
[root@host_name20 .brick]# du -sh /shared/.brick/
5.0G  /shared/.brick/

Are you using XFS as the brick filesystem? We have seen that XFS preallocates disk blocks in some cases to optimize future writes. Since self-heal is currently creating a lot of new entries, this extra reservation of blocks could be causing the overhead. If that's the case, it should fix itself over time once the volume is healed.

I'll try the suggestions from #3025. I need the data on the volume, but this stack of servers has been taken out of production until we recover it. So I'm thinking of first backing up the data and then trying out the improvements from #3025 on the affected stack itself. Worst case scenario I have the backed up data and can copy it back to a fresh volume. Would it be OK to back up the data directly from the host_name21 or host_name22 brick? I assume this would be faster than backing up from the mounted volume and there is nothing currently being written on the volume.

If you plan to restore the data only to a new volume and you are sure that the data in those bricks is healthy, then yes, it should be enough to copy one brick. However, given the discrepancy seen in the outputs of find command, first I would try to understand why that happened to be sure we are not missing something by only copying one brick.

One thing that I am uncertain about when following the guidance in #3025 is what server I should be monitoring for IO stats: host_name20 is the one that re-joined the volume and is in need of repair, so I'm not quite sure why heal count shows no files needing healing on it and instead shows them on the remaining two servers.

The number that heal count shows is not the number of entries that need healing in each brick. It's the number of entries updated in each brick that failed to update in another brick (so needing heal in another brick). When a brick goes down, the heal count of the surviving bricks will start to increase when changes are done to entries that are also present in the missing brick. When the bad brick comes back again, the heal count of that brick will be 0, but the other bricks will have the list of entries that need to be healed in the bad brick. So most of the write activity will happen in the bad brick, even if it has heal count = 0.
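
For context, this bookkeeping lives in trusted.afr.<volname>-client-<N> extended attributes on the entries in each brick, so it can be inspected directly on a brick path. A hedged illustration (the path and the non-zero value below are invented; the client index follows the brick order shown by gluster volume info, so client-2 would be the arbiter here):

# getfattr -m. -e hex -d /shared/.brick/some/dir/some_file
trusted.afr.vol_name-client-2=0x000000020000000100000000

The value packs three 4-byte counters (pending data, metadata and entry operations) for that brick: non-zero counters mean the entry still needs healing there, while all-zero values (as in the getfattr outputs later in this thread) mean nothing is pending.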

dalwise commented 2 years ago

If there's new data being added to the volume, it could be explained by the different times the find command was run. Otherwise it's something we will need to check once self-heal completes.

The hosts were taken out of production soon after the initial problem, on December 10. There should be no data being written to the volume.

Are you using XFS as the brick filesystem? We have seen that XFS preallocates disk blocks in some cases to optimize future writes. Since self-heal is currently creating a lot of new entries, this extra reservation of blocks could be causing the overhead. If that's the case, it should fix itself over time once the volume is healed.

Yes, the brick's filesystem is indeed XFS.

If you plan to restore the data only to a new volume and you are sure that the data in those bricks is healthy, then yes, it should be enough to copy one brick. However, given the discrepancy seen in the outputs of find command, first I would try to understand why that happened to be sure we are not missing something by only copying one brick.

I have been doing an rsync from the mounted volume, but it is progressing at only ~50GB/day. Since the volume has 1.3T the backup would take almost a month. Perhaps I can try backing up both bricks rather than just one before trying the repairs from #3025?

The number that heal count shows is not the number of entries that need healing in each brick. It's the number of entries updated in each brick that failed to update in another brick (so needing heal in another brick). When a brick goes down, the heal count of the surviving bricks will start to increase when changes are done to entries that are also present in the missing brick. When the bad brick comes back again, the heal count of that brick will be 0, but the other bricks will have the list of entries that need to be healed in the bad brick. So most of the write activity will happen in the bad brick, even if it has heal count = 0.

Thank you for your very clear explanation. If most of the activity will happen on the brick of host_name20 then I think it may be challenging to evaluate the effect of changes by looking at I/O stats as in #3025: since host_name20 is an arbiter the amount of data written to it as far as I know is minimal. Do you think we may be better off monitoring the effect of changes with other metrics? Maybe the ones above like the number of entries in heal-count or the results of the find on the brick?

xhernandez commented 2 years ago

The hosts were taken out of production soon after the initial problem, on December 10. There should be no data being written to the volume.

In that case the difference in the number of files and directories seems suspicious. Maybe you could compare the output of the two finds to see where the differences are.

I have been doing an rsync from the mounted volume, but it is progressing at only ~50GB/day. Since the volume has 1.3T the backup would take almost a month. Perhaps I can try backing up both bricks rather than just one before trying the repairs from #3025?

I think that will be the safest way until we understand where the difference comes from.

Thank you for your very clear explanation. If most of the activity will happen on the brick of host_name20 then I think it may be challenging to evaluate the effect of changes by looking at I/O stats as in #3025: since host_name20 is an arbiter the amount of data written to it as far as I know is minimal. Do you think we may be better off monitoring the effect of changes with other metrics? Maybe the ones above like the number of entries in heal-count or the results of the find on the brick?

Probably the easiest way to monitor the progress of self-heal and be able to compare different configurations is to check the number of used inodes of the brick periodically (df -i <brick path>). It can't be used to know exactly how many files have been created because Gluster uses 2 inodes for directories and 1 inode for other entries, but it's enough to do a comparison, for example using the average number of inodes created per second.
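
A minimal sampling loop along those lines could look like this (a sketch; the brick path and the one-minute interval are assumptions, and it simply prints the per-minute delta of used inodes):

# Sample the used-inode count of the brick filesystem once per minute
prev=$(df --output=iused /shared/.brick | awk 'NR==2 {print $1}')
while sleep 60; do
    cur=$(df --output=iused /shared/.brick | awk 'NR==2 {print $1}')
    echo "$(date -u '+%F %T')  iused=$cur  delta_per_min=$((cur - prev))"
    prev=$cur
done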

dalwise commented 2 years ago

Thanks Xavi! I'll do the following then:

1) Look for differences in the find of bricks in host_name21 and host_name22
2) Back up the bricks on both hosts
3) Set up monitoring of df -i <brick path> so that we can have a baseline before making changes.

dalwise commented 2 years ago

Hi Xavi, I'm just providing a quick update. The backups are still ongoing, even when backing up directly from the bricks. I ran out of inodes for the backup, so I'm removing the .glusterfs directory from the backup.

xhernandez commented 2 years ago

Hi @dalwise. How is it going ?

Is the self-heal still progressing on the arbiter brick, or have you stopped it? If it's still running, it should probably have already identified all pending files to be healed, and the number of pending files should be near 80 or 90 million and slowly decreasing. Is that the case?

Is the self-heal log still clean of errors?

dalwise commented 2 years ago

Hi @xhernandez, thanks for checking in!

Self-heal is ongoing. It is currently at over 64M and increasing by about ~50k per hour.

I had some problems backing up the bricks. I first tried backing up to the same partition as the brick but I ran out of inodes. I'm not sure why this happened as previous to the backup only 44% of the inodes were being used, so even doubling it should not have been an issue. My only theory is that rsync creates temporary inodes and those pushed the inode count.

I have since completed a local backup excluding the .glusterfs directory, which worked ok. I am now in the process of transferring that directory to a different host for offline backup. That has currently backed up ~92M inodes out of ~113M (but I'm using rsync, so based on the previous paragraph I'm not sure whether it will go somewhat over 113M before completing).

As for errors I don't see any on /var/log/glusterfs/glfsheal-vol_name.log in the arbiter node and in fact the last message is old:

[2021-12-29 18:09:48.459408] I [MSGID: 104041] [glfs-resolve.c:954:__glfs_active_subvol] 0-vol_name: switched to graph 716d6e32-302d-3133-3539-39342d323032 (0)
"/var/log/glusterfs/glfsheal-vol_name.log" 952L, 45317C

The remaining two nodes have that file empty.

xhernandez commented 2 years ago

Self-heal is ongoing. It is currently at over 64M and increasing by about ~50k per hour.

I thought it would have discovered all directories by now. I was too optimistic...

I had some problems backing up the bricks. I first tried backing up to the same partition as the brick but I ran out of inodes. I'm not sure why this happened as previous to the backup only 44% of the inodes were being used, so even doubling it should not have been an issue. My only theory is that rsync creates temporary inodes and those pushed the inode count.

Many of the files inside .glusterfs are hardlinks to entries outside it, so unless you use the -H option in rsync, the number of inodes created will be much higher.
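
For reference, a brick-level copy that preserves those hardlinks (and the trusted.* extended attributes Gluster relies on) might look roughly like this; the destination path is a placeholder and the copy would need to run as root:

# rsync -aHAX --numeric-ids /shared/.brick/ /backup/dot_brick/

Here -H keeps the .glusterfs hardlinks from doubling the inode count and -X carries the extended attributes along.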

I have since completed a local backup excluding the .glusterfs directory, which worked ok. I am now in the process of transferring that directory to a different host for offline backup. That has currently backed up ~92M inodes out of ~113M (but I'm using rsync, so based on the previous paragraph I'm not sure whether it will go somewhat over 113M before completing).

Just for reference, directories use two inodes, one for the real directory and one for a special symlink created inside .glusterfs. All other entries use a single inode with two hardlinks.
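
As an aside, that .glusterfs layout is also what lets you map a GFID reported by heal info back to a real path on a brick (relevant to the GFID that could not be identified at the start of this issue). A rough sketch, assuming the brick path from this thread; the GFID value is only an example:

G=9d83ace4-e904-4868-be69-0032e05acc5f        # substitute the GFID reported by heal info
# For a directory, the .glusterfs entry is a symlink pointing into its parent:
ls -l /shared/.brick/.glusterfs/${G:0:2}/${G:2:2}/$G
# For a regular file it is a hardlink, so locate the real path by inode:
find /shared/.brick -samefile /shared/.brick/.glusterfs/${G:0:2}/${G:2:2}/$G -not -path '*/.glusterfs/*'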

As for errors I don't see any on /var/log/glusterfs/glfsheal-vol_name.log in the arbiter node and in fact the last message is old:

[2021-12-29 18:09:48.459408] I [MSGID: 104041] [glfs-resolve.c:954:__glfs_active_subvol] 0-vol_name: switched to graph 716d6e32-302d-3133-3539-39342d323032 (0)
"/var/log/glusterfs/glfsheal-vol_name.log" 952L, 45317C

The remaining two nodes have that file empty.

You need to look at glustershd.log. That's the log used by self-heal during reconstruction of the arbiter.

dalwise commented 2 years ago

Many of the files inside .glusterfs are hardlinks to entries outside it, so unless you use the -H option in rsync, the number of inodes created will be much higher.

Just for reference, directories use two inodes, one for the real directory and one for a special symlink created inside .glusterfs. All other entries use a single inode with two hardlinks.

Thanks for the tips!

You need to look at glustershd.log. That's the log used by self-heal during reconstruction of the arbiter.

The file on the arbiter is not showing any errors in the last few days. It seems to be healing fine. Here are the last few messages:

[2022-01-06 06:52:35.799445] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 5-vol_name-replicate-0: performing metadata selfheal on e64041e0-7d9b-4ef0-b094-072a52d55588
[2022-01-06 06:52:35.801009] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 5-vol_name-replicate-0: performing metadata selfheal on 0f9bd30f-2fe8-4ddd-90c0-c7eee2882f9f
[2022-01-06 06:52:35.840371] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 5-vol_name-replicate-0: performing metadata selfheal on 749d8c7c-5348-40a3-83ab-f8fd3d357462
[2022-01-06 06:52:35.841280] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on e13f9126-3cf6-45c9-8dc8-34791ff12917. sources=[2]  sinks=0 1
[2022-01-06 06:52:35.877509] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on dc89a327-a633-4be1-b8b1-363e7a9bfaf4. sources=[2]  sinks=0 1
[2022-01-06 06:52:35.883342] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on 6aa9108f-b5c0-44d7-a438-53b274522157. sources=[2]  sinks=0 1
[2022-01-06 06:52:35.884929] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on e64041e0-7d9b-4ef0-b094-072a52d55588. sources=[2]  sinks=0 1
[2022-01-06 06:52:35.885091] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on 0f9bd30f-2fe8-4ddd-90c0-c7eee2882f9f. sources=[2]  sinks=0 1
[2022-01-06 06:52:35.892500] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on 749d8c7c-5348-40a3-83ab-f8fd3d357462. sources=[2]  sinks=0 1
[2022-01-06 06:54:10.963909] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed entry selfheal on 5ed1e5c6-e8a6-4903-9525-da8ac0dc0467. sources= sinks=0 1 2
[2022-01-09 03:43:13.033691] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-09 03:43:13.033783] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-09 03:43:13.035032] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2022-01-09 03:43:13.035141] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2022-01-12 03:10:13.376457] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-12 03:10:13.376474] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-12 03:10:13.377881] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2022-01-12 03:10:13.378020] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2022-01-16 03:21:14.222491] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-16 03:21:14.222612] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-16 03:21:14.224039] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2022-01-16 03:21:14.224112] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing

The same file on the data nodes is much more chatty, but seems to be fine also. It contains entries like:

[2022-01-18 22:47:43.073863] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 5-vol_name-replicate-0: performing metadata selfheal on 812e4ac8-7ecd-4a29-8490-cfe3475839c1
[2022-01-18 22:47:43.154375] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on 812e4ac8-7ecd-4a29-8490-cfe3475839c1. sources=0 [1]  sinks=2

Both data nodes seem to have no errors:

# egrep -c '\]\sE\s\[' /var/log/glusterfs/glustershd.log
0

Whereas the arbiter just has some errors from January 6 when one of the data nodes was left without available inodes.

dalwise commented 2 years ago

Hi @xhernandez,

The backup completed and I have been applying the solutions that you suggested from #3025, while monitoring the increase in inodes (via df -i). Unfortunately I haven't seen much noticeable improvement:

[screenshot: graph of inodes used on the arbiter brick over time, with vertical lines marking when each change was applied]

The black vertical line is when I ran gluster volume set vol_name performance.enable-least-priority no. It made a small improvement (from ~900 to ~920 inodes/minute), but it disappeared over time.

The green vertical lines are for when I started each of 5 additional glusterfs processes.

The purple vertical line is when I ran gluster volume set vol_name cluster.data-self-heal-algorithm full

The dips every hour are from the other monitoring which runs gluster vol heal vol_name statistics heal-count.

ps aux | grep glustershd shows the following. I have followed the pattern from the original glustershd I had running, even though it is slightly different from the one in #3025, which I assume is due to the different Gluster version (mine has asterisks in the --xlator-option argument):

root       9478  0.0  0.0 675796  5848 ?        Ssl  Jan28   0:03 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra2.pid -l /var/log/glusterfs/glustershd-extra2.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root      59414  1.0  0.0 1034436 35664 ?       Ssl   2021 756:10 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/f438aa9983d7f1fd.socket --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root      65548  0.0  0.0 675796  7800 ?        Ssl  Jan28   0:03 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra1.pid -l /var/log/glusterfs/glustershd-extra1.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root      71572  0.0  0.0 675796  7788 ?        Ssl  Jan28   0:02 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra5.pid -l /var/log/glusterfs/glustershd-extra5.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root     128316  0.0  0.0 675796  5828 ?        Ssl  Jan28   0:02 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra4.pid -l /var/log/glusterfs/glustershd-extra4.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root     184165  0.0  0.0 675796  7872 ?        Ssl  Jan28   0:02 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra3.pid -l /var/log/glusterfs/glustershd-extra3.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root     216441  0.0  0.0 112712   976 pts/0    S+   04:28   0:00 grep --color=auto glustershd

xhernandez commented 2 years ago

@dalwise when you do configuration changes, you should also restart the self-heal processes started manually by you because they are not under the control of glusterd to apply the changes.
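
For example, restarting one of the manually started processes could look like this (a sketch based on the ps output above; <node-uuid> is a placeholder for the local node's UUID, the same value already present in those command lines):

# Stop the extra daemon via its pid file, then relaunch it with the same arguments
kill "$(cat /var/run/gluster/shd/vol_name/vol_name-extra1.pid)"
/usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name \
    -p /var/run/gluster/shd/vol_name/vol_name-extra1.pid \
    -l /var/log/glusterfs/glustershd-extra1.log \
    --xlator-option '*replicate*.node-uuid=<node-uuid>' \
    --process-name glustershd --client-pid=-6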

In any case, if that doesn't help, I would also try to modify these settings (restarting shd processes after that):

- Increase cluster.self-heal-window-size to 4 or 8
- Increase server.event-threads and client.event-threads to 4 or 6

Can you post the current configuration settings?

Are the logs from the extra self-heal processes showing activity?

dalwise commented 2 years ago

Thank you @xhernandez for your help and quick response!

@dalwise when you do configuration changes, you should also restart the self-heal processes started manually by you because they are not under the control of glusterd to apply the changes.

Thanks for the tip

In any case, if that doesn't help, I would also try to modify these settings (restarting shd processes after that):

- Increase cluster.self-heal-window-size to 4 or 8
- Increase server.event-threads and client.event-threads to 4 or 6.

I did all of these. Not much difference though:

[screenshot: graph of the inode creation rate after applying the additional settings, showing little change]

Can you post the current configuration settings ?

Sure. I assume you mean the output of gluster volume get vol_name all. Here it is: gluster-settings-220201.txt

Are the logs from the extra self-heal processes showing activity?

As far as I can tell they are not. Here are the logs from /var/log/glusterfs/glustershd-extra1.log:

glustershd-extra1.log

The process that is generating it is:

root     152776  0.0  0.0 757724  5724 ?        Ssl  03:18   0:00 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra1.pid -l /var/log/glusterfs/glustershd-extra1.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6

Just to double check that we are on the same page: I have been running the commands you suggested and starting the extra glusterfs processes on the arbiter node, which is the one in need of healing. I haven't made any changes on the existing data nodes. My understanding is that the settings propagate to all nodes since they apply to the volume, but I want to be sure that we are on the same page regarding the extra glusterfs processes.

xhernandez commented 2 years ago

Just to double check that we are on the same page: I have been running the commands you suggested and starting the extra glusterfs processes on the arbiter node, which is the one in need of healing. I haven't made any changes on the existing data nodes. My understanding is that the settings propagate to all nodes since they apply to the volume, but I want to be sure that we are on the same page regarding the extra glusterfs processes.

I missed commenting on this before... sorry.

Self-heal processes run on each cluster node, and each one takes care only of the list of pending files recorded on its own node. This means that you should start extra self-heal processes on the other nodes (the ones that have a long list of pending files in gluster volume heal info). Since the arbiter brick was empty initially, it doesn't have damaged files; it's the other nodes that have the damaged files pending to be healed into the arbiter brick.

I hope that by starting extra processes on the other nodes you will see some increase in throughput. At least you should see activity in the logs.

dalwise commented 2 years ago

Hi @xhernandez, I'm definitely glad that we clarified that and I'm sorry for not bringing it up earlier. Adding extra processes in the other two nodes caused a huge increase in rate of new inodes created on the arbiter for about two hours. After that it has come down again, but is still considerably more than it was before the new processes (~2200/minute vs ~900/minute):

[screenshot: graph of the inode creation rate on the arbiter, showing a large spike followed by a sustained higher rate after the extra self-heal processes were started on the data nodes]

Each green line is when a new process was started on both of the data nodes.

As you can see just adding one additional process improved the stats by over an order of magnitude. The second one also seemed to help, but it's unclear that additional ones made any difference.

I'm not sure why a second process would have such a great effect (much more than double). Is it possible that the settings that I had been applying were not being applied to the process that was already running on each node?

The other metric that I was tracking (files needing healing) peaked while starting the additional processes and is now decreasing and changing much faster than previously. If it continues at this rate it would come down to zero within a couple of days:

Wed Feb  2 15:45:01 UTC 2022    Number of entries: 77363403
Wed Feb  2 16:45:01 UTC 2022    Number of entries: 77395054
Wed Feb  2 17:45:01 UTC 2022    Number of entries: 77426850
Wed Feb  2 18:45:01 UTC 2022    Number of entries: 77458946
Wed Feb  2 19:45:01 UTC 2022    Number of entries: 77490137
Wed Feb  2 20:45:01 UTC 2022    Number of entries: 76807711
Wed Feb  2 21:45:01 UTC 2022    Number of entries: 74170119
Wed Feb  2 22:45:01 UTC 2022    Number of entries: 71708505
Wed Feb  2 23:45:01 UTC 2022    Number of entries: 69943704
Thu Feb  3 00:45:01 UTC 2022    Number of entries: 68211289

Thank you very much for your help in speeding up the healing process. I'll continue to monitor the progress. Would you have any guidance as to why the original problem happened and whether there was a better remediation available that may have saved us from the millions of entries in need of healing?

xhernandez commented 2 years ago

Hi @xhernandez, I'm definitely glad that we clarified that and I'm sorry for not bringing it up earlier. Adding extra processes in the other two nodes caused a huge increase in rate of new inodes created on the arbiter for about two hours. After that it has come down again, but is still considerably more than it was before the new processes (~2200/minute vs ~900/minute):

That's great :)

Each green line is when a new process was started on both of the data nodes.

As you can see just adding one additional process improved the stats by over an order of magnitude. The second one also seemed to help, but it's unclear that additional ones made any difference.

I'm not sure why a second process would have such a great effect (much more than double). Is it possible that the settings that I had been applying were not being applied to the process that was already running on each node?

Self-heal daemons controlled by glusterd should have applied the changes immediately, so all settings should have been already in effect. It might be related to the way self-heal works. When a directory is healed, all entries inside it are created, but not healed yet. This means that after healing a directory, many files may have been created and added to the pending heal queue, increasing the pending heal count. This can cause bursts of file creations that could explain why you saw that increment (though I can't tell for sure why the improvement was so huge at the beginning). Once the number of directories pending heal is reduced, file creation rate drops and healers process more files than directories.

The other metric that I was tracking (files needing healing) peaked while starting the additional processes and is now decreasing and changing much faster than previously. If it continues at this rate it would come down to zero within a couple of days:

Wed Feb  2 15:45:01 UTC 2022  Number of entries: 77363403
Wed Feb  2 16:45:01 UTC 2022  Number of entries: 77395054
Wed Feb  2 17:45:01 UTC 2022  Number of entries: 77426850
Wed Feb  2 18:45:01 UTC 2022  Number of entries: 77458946
Wed Feb  2 19:45:01 UTC 2022  Number of entries: 77490137
Wed Feb  2 20:45:01 UTC 2022  Number of entries: 76807711
Wed Feb  2 21:45:01 UTC 2022  Number of entries: 74170119
Wed Feb  2 22:45:01 UTC 2022  Number of entries: 71708505
Wed Feb  2 23:45:01 UTC 2022  Number of entries: 69943704
Thu Feb  3 00:45:01 UTC 2022  Number of entries: 68211289

The healing rate seems quite good now. Even though there are not so many file creations, most of the heals are just to set the metadata to the files created during directory heal. Since this is an arbiter brick, no data transfer is needed, so each heal is quite fast and files get removed from the pending list at a good pace.

Thank you very much for your help in speeding up the healing process. I'll continue to monitor the progress. Would you have any guidance as to why the original problem happened and whether there was a better remediation available that may have saved us from the millions of entries in need of healing?

When there's an issue with a specific directory, normally it's because some metadata or extended attribute doesn't match and self-heal is unable to find enough consistent bricks to decide how to fix it automatically (it requires more intelligence than what is implemented inside it). However, after manual inspection of the metadata from each brick, it's normally easy to identify what's wrong and fix it (especially for directories; for files it may be harder if the contents don't match).

In your particular case, I can't tell what happened because I don't have the data from before the brick was removed. However, in your first comment you said that the directory was in split brain. This means that there were inconsistent changes between bricks, without a clear "clean" version of that directory to use as the source for the other bricks, so self-heal wasn't able to progress and the directory wasn't accessible.

When you deal with such big volumes, with tens of millions of directories and files, volume-wide operations like a full heal or a rebalance are going to be slow, so if they can be avoided, they should be avoided. In this case, trying to fix whatever was wrong in that directory would most probably have been much faster. It should always be the first attempt.

Also, you are using a pretty old version. There have been several heal and consistency improvements over time, besides bug fixes, that make some problems less likely to happen and allow more cases to be healed. I would recommend keeping the version more up to date to avoid some of the known problems, if possible.

dalwise commented 2 years ago

Thanks for the feedback @xhernandez.

As we were expecting, the count of files in need of healing decreased quickly over the last couple of days. But it seems to have now stopped at around 2475, where it has remained for the last 7 hours:

# gluster vol heal vol_name statistics heal-count
Gathering count of entries to be healed on volume vol_name has been successful

Brick host_name21:/shared/.brick
Number of entries: 2475

Brick host_name22:/shared/.brick
Number of entries: 1

Brick host_name20:/shared/.brick
Number of entries: 1

I have now killed the extra glusterfs processes on all three nodes, since we no longer seem to have a performance issue and they could complicate checking for errors in logs.

I tried running gluster volume heal vol_name full, but that didn't seem to have any effect.

I am looking for errors in the logs using egrep '\]\sE\s\[' <file>. I've only found recent errors in /var/log/glusterfs/<mount_point>.log, which all look like:

[2022-02-07 19:10:29.206069] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13a)[0x7f3514ff993a] (--> /usr/lib64/glusterfs/7.4/xlator/mount/fuse.so(+0x8221)[0x7f350c3a9221] (--> /usr/lib64/glusterfs/7.4/xlator/mount/fuse.so(+0x8b3a)[0x7f350c3a9b3a] (--> /lib64/libpthread.so.0(+0x7dd5)[0x7f3513e3add5] (--> /lib64/libc.so.6(clone+0x6d)[0x7f3513701ead] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory

But I'm not sure what file it is trying to write to.

I have seen online that doing a stat on every file in the mounted filesystem may trigger a heal, although given the size of the volume that may take a while. Do you have any suggestion on how to speed up recovery of those last few thousand files?
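
For reference, the brute-force version of that idea is just a recursive stat from a client mount (mount path assumed); with tens of millions of entries it would take a long time, which is why narrowing it down to the reported entries is preferable:

# find /shared/vol_name -exec stat {} + > /dev/null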

dalwise commented 2 years ago

By the way, one thing that I have noticed is that even though the number of entries needing healing on host_name21 has remained fairly stable at around 2475, the entries on host_name22 & host_name20 have varied quite a lot, despite the fact that nothing should currently be writing to the mounted volume. For example just a few minutes ago:

# gluster vol heal vol_name statistics heal-count
Gathering count of entries to be healed on volume vol_name has been successful

Brick host_name21:/shared/.brick
Number of entries: 2474

Brick host_name22:/shared/.brick
Number of entries: 395

Brick host_name20:/shared/.brick
Number of entries: 394

The glustershd log shows heals on all nodes, but the number of entries on host_name21 remains stable. For example:

[root@host_name20 glusterfs]# date
Tue Feb  8 00:16:27 UTC 2022
[root@host_name20 glusterfs]# tail -n3 glustershd.log
[2022-02-08 00:16:28.413337] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed metadata selfheal on 975cbb07-4584-444e-9ead-81d3405b209f. sources=[2]  sinks=0 1
[2022-02-08 00:16:28.415346] I [MSGID: 108026] [afr-self-heal-entry.c:916:afr_selfheal_entry_do] 3-vol_name-replicate-0: performing entry selfheal on 975cbb07-4584-444e-9ead-81d3405b209f
[2022-02-08 00:16:28.435790] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed entry selfheal on 975cbb07-4584-444e-9ead-81d3405b209f. sources=[2]  sinks=0 1

xhernandez commented 2 years ago

I tried running gluster volume heal vol_name full, but that didn't seem to have any effect.

This command just starts a full scan of the volume checking every file, which means that it may take some time to reach the problematic files. However, if the files you want to heal already appear in the heal info, a full scan is not really necessary: self-heal will already try to heal them. If it can't, it means that something is wrong with those files and needs to be checked manually.

I am looking for errors in the logs using egrep '\]\sE\s\[' <file>. I've only found recent errors in /var/log/glusterfs/<mount_point>.log, which all look like:

[2022-02-07 19:10:29.206069] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13a)[0x7f3514ff993a] (--> /usr/lib64/glusterfs/7.4/xlator/mount/fuse.so(+0x8221)[0x7f350c3a9221] (--> /usr/lib64/glusterfs/7.4/xlator/mount/fuse.so(+0x8b3a)[0x7f350c3a9b3a] (--> /lib64/libpthread.so.0(+0x7dd5)[0x7f3513e3add5] (--> /lib64/libc.so.6(clone+0x6d)[0x7f3513701ead] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory

That could happen sometimes when Gluster is trying to invalidate some file that the kernel has already removed from its cache. If there isn't anything else in the logs, I wouldn't care much about this message.

I have seen online that doing a stat on every file in the mounted filesystem may trigger a heal, although given the size of the volume that may take a while. Do you have any suggestion on how to speed up recovery of those last few thousand files?

You should run gluster volume heal <volname> info. It will show you the list of problematic files. Then take one of them and run these commands on it from each brick, not the mount point:

# stat <file>
# getfattr -m. -e hex -d <file>

This will provide the necessary information to fix that file. Most probably other files will have a similar problem, so we may be able to apply the same solution to the other files, but first let's see what happens with one of them.

By the way, one thing that I have noticed is that even though the number of entries needing healing on host_name21 has remained fairly stable at around 2475, the entries on host_name22 & host_name20 have varied quite a lot, despite the fact that nothing should currently be writing to the mounted volume. For example just a few minutes ago:

This is probably caused by the self-heal full command, which has found some additional issues. These could be related to the discrepancies you detected between the two remaining nodes at the beginning, or they could also be related to the problems caused when one of the bricks exhausted all its inodes.

dalwise commented 2 years ago

Most of the entries shown by gluster vol heal vol_name info are gfid entries like <gfid:9d83ace4-e904-4868-be69-0032e05acc5f>; however, a few are filenames, so I'll start with those. All of the non-gfid entries are in the logs.bak2 directory. Here's one that is listed for both host_name20 and host_name22:

host_name20# stat logs.bak2/TW8YX11205001/J184091T50529848
  File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
  Size: 27          Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d    Inode: 4093747016  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-08 19:12:09.229701706 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.228701706 +0000
 Birth: -
host_name20# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.afr.vol_name-client-0=0x000000000000000000000000
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4
host_name21# stat logs.bak2/TW8YX11205001/J184091T50529848
  File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
  Size: 27          Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d    Inode: 38842218    Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-05-13 15:18:03.976526580 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.228041551 +0000
 Birth: -
host_name21# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4
host_name22# stat logs.bak2/TW8YX11205001/J184091T50529848
  File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
  Size: 27          Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d    Inode: 4129139457  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-05-13 15:18:03.976526580 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.227005682 +0000
 Birth: -
host_name22# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4

I am fine deleting the whole logs.bak2 directory to get rid of these entries pending heal. Would deleting the directory on the mounted volume be effective in solving the issue for these files?

xhernandez commented 2 years ago

Most of the entries shown by gluster vol heal vol_name info are gfid entries like <gfid:9d83ace4-e904-4868-be69-0032e05acc5f>; however, a few are filenames, so I'll start with those. All of the non-gfid entries are in the logs.bak2 directory. Here's one that is listed for both host_name20 and host_name22:

host_name20# stat logs.bak2/TW8YX11205001/J184091T50529848
  File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
  Size: 27            Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d  Inode: 4093747016  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-08 19:12:09.229701706 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.228701706 +0000
 Birth: -
host_name20# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.afr.vol_name-client-0=0x000000000000000000000000
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4
host_name21# stat logs.bak2/TW8YX11205001/J184091T50529848
  File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
  Size: 27            Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d  Inode: 38842218    Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-05-13 15:18:03.976526580 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.228041551 +0000
 Birth: -
host_name21# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4
host_name22# stat logs.bak2/TW8YX11205001/J184091T50529848
  File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
  Size: 27            Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d  Inode: 4129139457  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-05-13 15:18:03.976526580 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.227005682 +0000
 Birth: -
host_name22# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4

The data on all 3 bricks seems consistent. Since this is a directory, can you check whether the number of entries inside it and their names are the same on all three bricks?
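
One way to do that comparison (a sketch; the directory and brick paths come from this thread, the temporary file names are arbitrary) is to capture a sorted listing on each brick host and diff them on one machine:

# On each of the three brick hosts:
ls -A /shared/.brick/logs.bak2/TW8YX11205001/J184091T50529848 | sort > /tmp/entries.$(hostname -s)
# After copying the three listings to one host:
diff /tmp/entries.host_name21 /tmp/entries.host_name22
diff /tmp/entries.host_name21 /tmp/entries.host_name20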

I am fine deleting the whole logs.bak2 directory to get rid of these entries pending heal. Would deleting the directory on the mounted volume be effective in solving the issue for these files?

Yes, that should work. If there's some inconsistency inside the directory itself, maybe it's not possible to completely remove the entire directory at once, but only a few problematic entries should remain, which can be easily handled.

However, you could also use this case to "experiment" with healing files if you want.

dalwise commented 2 years ago

The data on all 3 bricks seems consistent. Since this is a directory, can you check whether the number of entries inside it and their names are the same on all three bricks?

It is not:

[hostname21 logs.bak2]$ ls|wc -l
340803
[hostname21 logs.bak2]$ find . | wc -l
18465612
[hostname22 logs.bak2]$ ls|wc -l
340803
[hostname22 logs.bak2]$ find . | wc -l
18462649

However, you could also use this case to "experiment" with healing files if you want.

I think it may be challenging to select files to heal from this directory, since the files in need of repair in it keep changing. For example, the file that I mentioned last time no longer appears in the output of gluster vol heal vol_name info. Could it be in an unstable state where Gluster could be creating and deleting files in this directory to try to align the bricks but the changes don't converge?

The other possibility is that there is a process actively writing to this directory, but I am not aware of any such process and cannot locate it. I've tried looking through the list of processes and listing the open files through lsof, which only shows gluster processes accessing the directory.

xhernandez commented 2 years ago

However, you could also use this case to "experiment" with healing files if you want.

I think it may be challenging to select files to heal from this directory, since the files in need of repair in it keep changing. For example, the file that I mentioned last time no longer appears in the output of gluster vol heal vol_name info. Could it be in an unstable state where Gluster could be creating and deleting files in this directory to try to align the bricks but the changes don't converge?

That shouldn't happen if there isn't anyone accessing the volume (once a file is healed, nothing else inside Gluster touches it). Could it be that the heal full you triggered 4 days ago is still running and finding (and healing) files that were broken but not tracked internally (maybe because of the inode exhaustion that happened some time ago)? You can check the glustershd.log on each node to see if there's ongoing activity.

dalwise commented 2 years ago

You can check the glustershd.log on each node to see if there's ongoing activity.

There continues to be ongoing activity. It may indeed be due to the full heal that I triggered, since the count of files in need of heal is not changing much, so I'm guessing it's just new files that keep being detected in need of heal. For example:

[hostname20 glusterfs]$ date; tail glustershd.log
Sat Feb 12 02:06:10 UTC 2022
[2022-02-12 02:06:09.018375] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed metadata selfheal on 51b71c79-7088-4521-b8a3-afcd5fd275e8. sources=[2]  sinks=0 1
[2022-02-12 02:06:09.020454] I [MSGID: 108026] [afr-self-heal-entry.c:916:afr_selfheal_entry_do] 3-vol_name-replicate-0: performing entry selfheal on 51b71c79-7088-4521-b8a3-afcd5fd275e8
[2022-02-12 02:06:09.044517] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed entry selfheal on 51b71c79-7088-4521-b8a3-afcd5fd275e8. sources=[2]  sinks=0 1
[2022-02-12 02:06:09.068995] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 3-vol_name-replicate-0: performing metadata selfheal on e1349b59-fe6b-4361-92ec-e6a9e3e7aeec
[2022-02-12 02:06:09.074776] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed metadata selfheal on e1349b59-fe6b-4361-92ec-e6a9e3e7aeec. sources=[2]  sinks=0 1
[2022-02-12 02:06:09.076833] I [MSGID: 108026] [afr-self-heal-entry.c:916:afr_selfheal_entry_do] 3-vol_name-replicate-0: performing entry selfheal on e1349b59-fe6b-4361-92ec-e6a9e3e7aeec
[2022-02-12 02:06:09.096778] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed entry selfheal on e1349b59-fe6b-4361-92ec-e6a9e3e7aeec. sources=[2]  sinks=0 1
[2022-02-12 02:06:09.109857] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 3-vol_name-replicate-0: performing metadata selfheal on 08376a6d-823c-4750-8f33-2a5782dfd84c
[2022-02-12 02:06:09.115797] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed metadata selfheal on 08376a6d-823c-4750-8f33-2a5782dfd84c. sources=[2]  sinks=0 1
[2022-02-12 02:06:09.117873] I [MSGID: 108026] [afr-self-heal-entry.c:916:afr_selfheal_entry_do] 3-vol_name-replicate-0: performing entry selfheal on 08376a6d-823c-4750-8f33-2a5782dfd84c

I am fine deleting the whole logs.bak2 directory to get rid of these entries pending heal. Would deleting the directory on the mounted volume be effective in solving the issue for these files?

Yes, that should work. If there's some inconsistency inside the directory itself, maybe it's not possible to completely remove the entire directory at once, but only a few problematic entries should remain, which can be easily handled.

All of the non-gfid entries in gluster vol heal vol_name info continue to be from the logs.bak2 directory, which I don't really need. Since the entries are varying all the time I don't think they will be useful to practice recovering files, so I'm going to delete this directory on the mounted filesystem and see what remains to be fixed.

dalwise commented 2 years ago

Hi @xhernandez, I'm just providing a quick update.

Removal of the logs.bak2 directory has been ongoing for the last few days. Most top-level directories within it were wiped on the first run (which took a couple of days), but as you had warned a number of top-level directories did not get removed (~7000 of them). The rm -rf command showed errors like:

rm: cannot remove ‘logs.bak2/TW9YX12003182/J198476T56175213’: Transport endpoint is not connected

I have continued re-running rm -rf logs.bak2 from within the mounted volume and I'm now down to 1200 top-level directories.

At the same time the heal-count has dropped, which seems to indicate that a lot of the problematic files were in this directory. The exact number changes quickly, but it's now in the hundreds rather than the 2475 I had before starting the deletion of logs.bak2.

xhernandez commented 2 years ago

Hi @dalwise. Sorry, I couldn't answer earlier.

I would say that as self-heal is progressing, more directories can be removed. If all damaged files and directories are inside logs.bak2 and self-heal keeps progressing, you should be able to delete the remaining files soon. Let me know if something gets stuck.

Removal of the logs.bak2 directory has been ongoing for the last few days. Most top-level directories within it were wiped on the first run (which took a couple of days), but as you had warned a number of top-level directories did not get removed (~7000 of them). The rm -rf command showed errors like:

rm: cannot remove ‘logs.bak2/TW9YX12003182/J198476T56175213’: Transport endpoint is not connected

Does this error persist, or was it possible to remove those entries after some time?

dalwise commented 2 years ago

Hi @xhernandez,

The Transport endpoint is not connected error prevented me from deleting all entries within logs.bak2. I ended up running the following within the mounted directory. It reduced the number of entries very slowly, but never got below 11:

while ls -ld logs.bak2; do date; rm -rf logs.bak2; done

A few hours after stopping that command the entries in this directory of the mounted volume have increased again:

# ls logs.bak2/ | wc -l
4828

One thing that I have noted is that the number of entries in the bricks of hostname21 and hostname22 matches that of the mounted volume. However, the brick of hostname20 has many more entries:

# ls /shared/.brick/logs.bak2/ | wc -l
52105

Could Gluster be restoring entries from there? glustershd.log on hostname20 shows ongoing activity and heal-count shows hundreds of entries on hostname20 (none on the others).

I am currently running the rm -rf loop again, but it seems like entries are being created in the directory. I still haven't found any process outside of Gluster that could be doing so.

xhernandez commented 2 years ago

Hi @xhernandez,

The Transport endpoint is not connected error prevented me from deleting all entries within logs.bak2. I ended up running the following within the mounted directory. It reduced the number of entries very slowly, but never got below 11:

while ls -ld logs.bak2; do date; rm -rf logs.bak2; done

A few hours after stopping that command the entries in this directory of the mounted volume have increased again:

# ls logs.bak2/ | wc -l
4828

That's weird. Once deleted, even self-heal shouldn't create the directories again.

Can you run gluster volume status <volname> clients and gluster volume status <volname> client-list? They will show who is connected to the bricks. Only the self-heal daemon should appear. If so, you may try to kill the self-heal process, remove everything you can, and then start it again using gluster volume start <volname> force.
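
As a rough sketch of that sequence (using the volume name from this thread; the self-heal daemon PID is shown in the volume status output):

gluster volume status vol_name | grep -i 'self-heal'   # PID is in the last column
kill <shd_pid>                                          # stop the self-heal daemon
# ... remove whatever can be removed through the mount point ...
gluster volume start vol_name force                     # restart any stopped processes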

One thing that I have noted is that the number of entries in the bricks of hostname21 and hostname22 matches that of the mounted volume. However, the brick of hostname20 has many more entries:

# ls /shared/.brick/logs.bak2/ | wc -l
52105

Could Gluster be restoring entries from there? glustershd.log on hostname20 shows ongoing activity and heal-count shows hundreds of entries on hostname20 (none on the others).

It's the only option I see. Though if self-heal considers the contents of that brick good enough to be replicated to the other bricks, those contents should also show up when you list the directory through the mount point. Can you provide the output of the following commands from all bricks ?

# stat /shared/.brick/logs.bak2
# getfattr -m. -e hex -d /shared/.brick/logs.bak2

@karthik-us @pranithk do you know why self-heal may be recovering data from a brick that is not used to show the contents from the mount point ?

dalwise commented 2 years ago

Thanks for your help @xhernandez.

Before reading your message I kept trying to remove directories, but the entry count has nonetheless increased. Currently all bricks have the same number of entries. Trying to delete them from within the mounted filesystem results in many Transport endpoint is not connected errors:

[hostname20 vol_name]$ ls logs.bak2/|wc -l
45568
[hostname20 ]$ ls /shared/.brick/logs.bak2/ | wc -l
45568
[hostname21 ~]$ ls /shared/.brick/logs.bak2/ | wc -l
45568
[hostname22 ~]$ ls /shared/.brick/logs.bak2/ | wc -l
45568

Can you run gluster volume status clients and gluster volume status client-list? They will show who is connected to the bricks. Only the self-heal daemon should appear. If so, you may try to kill the self-heal process, remove everything you can, and then start it again using gluster volume start force.

Besides the self-heal daemons the fuse mount on each of the 3 nodes is also shown. hostname20 also had an additional fuse mount on another mount point that I have unmounted, but it didn't seem to be in use. I assume that I'll need a fuse mount (at least on one node) to be able to rm the directories. Or did you mean to remove them directly on the bricks? Is the start command only to restart self-heal or did you intend me to run gluster volume stop <volname> force at some earlier point?

[hostname20 ~]$ gluster volume status vol_name clients
Client connections for volume vol_name
----------------------------------------------
Brick : hostname21:/shared/.brick
Clients connected : 6
Hostname                                               BytesRead    BytesWritten       OpVersion
--------                                               ---------    ------------       ---------
10:32::21:49151                                     212263659412    248686626360           70200
10:32::22:49151                                     198986307580    205689698096           70200
10:32::1:49137                                      386254711468    428001593072           70200
10:32::1:49129                                      244098818470   1620181422056           70200
10:32::21:49145                                      12953535376     28427954548           70200
10:32::22:49144                                            67992           67436           70200
----------------------------------------------
Brick : hostname22:/shared/.brick
Clients connected : 6
Hostname                                               BytesRead    BytesWritten       OpVersion
--------                                               ---------    ------------       ---------
10:32::21:49143                                     195517413400    200070163428           70200
10:32::22:49150                                     215749549792    251727554036           70200
10:32::1:49141                                      386439663160    426006159192           70200
10:32::1:49127                                      236743313241    730263668496           70200
10:32::21:49142                                      63979330377    142047192344           70200
10:32::22:49141                                            67992           66864           70200
----------------------------------------------
Brick : hostname20:/shared/.brick
Clients connected : 6
Hostname                                               BytesRead    BytesWritten       OpVersion
--------                                               ---------    ------------       ---------
10:32::20:49141                                     348078654668    407153848100           70200
10:32::21:49144                                      69623738472     64403543140           70200
10:32::21:49141                                      84327676813     71605132816           70200
10:32::22:49143                                      69862449248     65162147332           70200
10:32::20:49133                                        111578892        96524416           70200
10:32::22:49140                                            65128           58184           70200
----------------------------------------------

[hostname20 ~]$ gluster volume status vol_name client-list
Client connections for volume vol_name
Name                                                       count
-----                                                     ------
glustershd                                                     3
fuse                                                           3

total clients for volume vol_name : 6
-----------------------------------------------------------------

[hostname20 ~]$

Can you provide the output of the following commands from all bricks ?

[hostname20 ~]$ stat /shared/.brick/logs.bak2
  File: ‘/shared/.brick/logs.bak2’
  Size: 11071488    Blocks: 4168       IO Block: 4096   directory
Device: fd00h/64768d    Inode: 1879049437  Links: 45570
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-16 21:56:56.290330204 +0000
Modify: 2022-02-16 13:19:24.070714399 +0000
Change: 2022-02-16 13:35:35.601728222 +0000
 Birth: -
[hostname20 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-0=0x000000000000000000000000
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.gfid=0x1ce9de988117431082a26b27b7d33a83
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000620cf9dc000000000433946000000000620cf9dc0000000004339460000000005e017d52000000001f6cc3e4
[hostname21 ~]$ stat /shared/.brick/logs.bak2
  File: ‘/shared/.brick/logs.bak2’
  Size: 1482752     Blocks: 4024       IO Block: 4096   directory
Device: fd00h/64768d    Inode: 4913161408  Links: 45570
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-16 15:32:40.938630891 +0000
Modify: 2022-02-16 13:19:24.070489184 +0000
Change: 2022-02-16 13:35:35.599780013 +0000
 Birth: -
[hostname21 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2
trusted.gfid=0x1ce9de988117431082a26b27b7d33a83
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000620cf9dc000000000433946000000000620cf9dc0000000004339460000000005e017d52000000001f6cc3e4
[hostname22 ~]$ stat /shared/.brick/logs.bak2
  File: ‘/shared/.brick/logs.bak2’
  Size: 1482752     Blocks: 4024       IO Block: 4096   directory
Device: fd00h/64768d    Inode: 805307078   Links: 45570
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-16 15:33:15.305998830 +0000
Modify: 2022-02-16 13:19:24.070489184 +0000
Change: 2022-02-16 13:35:35.599523053 +0000
 Birth: -
[hostname22 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2
trusted.gfid=0x1ce9de988117431082a26b27b7d33a83
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000620cf9dc000000000433946000000000620cf9dc0000000004339460000000005e017d52000000001f6cc3e4
xhernandez commented 2 years ago

Thanks for your help @xhernandez.

Before reading your message I kept trying to remove directories, but the entry count has nonetheless increased. Currently all bricks have the same number of entries. Trying to delete them from within the mounted filesystem results in many Transport endpoint is not connected errors:

[hostname20 vol_name]$ ls logs.bak2/|wc -l
45568
[hostname20 ]$ ls /shared/.brick/logs.bak2/ | wc -l
45568
[hostname21 ~]$ ls /shared/.brick/logs.bak2/ | wc -l
45568
[hostname22 ~]$ ls /shared/.brick/logs.bak2/ | wc -l
45568

It seems that self-heal was finally able to synchronize all bricks. After that, has the number of entries in heal info stabilized?

Are you now unable to delete any of those entries?

One possible reason for getting Transport endpoint is not connected is that replicate considers that the only good source for an entry is the arbiter brick. In that case the request is denied with that error. Given that hostname20 is the arbiter and it (apparently) had many entries pending synchronization to the other bricks, that explanation seems reasonable.

If that's the case, the directories that cannot be removed because of this error most probably have discrepancies between the bricks in their contents, self-heal is probably still working on them, and the number of files pending heal shouldn't be stable yet.
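
One way to see whether self-heal is still making progress is to sample the heal count periodically, for example (a simple sketch built from the commands already used in this thread):

# Log the pending-heal count every 10 minutes
while true; do date; gluster volume heal vol_name statistics heal-count | grep -A1 '^Brick '; sleep 600; done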

Can you run gluster volume status clients and gluster volume status client-list? They will show who is connected to the bricks. Only the self-heal daemon should appear. If so, you may try to kill the self-heal process, remove everything you can, and then start it again using gluster volume start force.

Besides the self-heal daemons the fuse mount on each of the 3 nodes is also shown. hostname20 also had an additional fuse mount on another mount point that I have unmounted, but it didn't seem to be in use. I assume that I'll need a fuse mount (at least on one node) to be able to rm the directories.

True. Forgot that "small" detail... :-/

Or did you mean to remove them directly on the bricks?

Unless it's strictly necessary, I prefer to do it through the mount point.

Is the start command only to restart self-heal or did you intend me to run gluster volume stop <volname> force at some earlier point?

No. start force checks the state of all related processes (bricks and the self-heal daemon in this case) and restarts any that may have stopped, even if the volume is already started. However, given that the number of entries in all bricks has synchronized, it's obvious that self-heal is working and progressing. Let's not stop it for now.

Can you provide the output of the following commands from all bricks ?

[hostname20 ~]$ stat /shared/.brick/logs.bak2
  File: ‘/shared/.brick/logs.bak2’
  Size: 11071488      Blocks: 4168       IO Block: 4096   directory
Device: fd00h/64768d  Inode: 1879049437  Links: 45570
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-16 21:56:56.290330204 +0000
Modify: 2022-02-16 13:19:24.070714399 +0000
Change: 2022-02-16 13:35:35.601728222 +0000
 Birth: -
[hostname20 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-0=0x000000000000000000000000
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.gfid=0x1ce9de988117431082a26b27b7d33a83
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000620cf9dc000000000433946000000000620cf9dc0000000004339460000000005e017d52000000001f6cc3e4
[hostname21 ~]$ stat /shared/.brick/logs.bak2
  File: ‘/shared/.brick/logs.bak2’
  Size: 1482752       Blocks: 4024       IO Block: 4096   directory
Device: fd00h/64768d  Inode: 4913161408  Links: 45570
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-16 15:32:40.938630891 +0000
Modify: 2022-02-16 13:19:24.070489184 +0000
Change: 2022-02-16 13:35:35.599780013 +0000
 Birth: -
[hostname21 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2
trusted.gfid=0x1ce9de988117431082a26b27b7d33a83
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000620cf9dc000000000433946000000000620cf9dc0000000004339460000000005e017d52000000001f6cc3e4
[hostname22 ~]$ stat /shared/.brick/logs.bak2
  File: ‘/shared/.brick/logs.bak2’
  Size: 1482752       Blocks: 4024       IO Block: 4096   directory
Device: fd00h/64768d  Inode: 805307078   Links: 45570
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-16 15:33:15.305998830 +0000
Modify: 2022-02-16 13:19:24.070489184 +0000
Change: 2022-02-16 13:35:35.599523053 +0000
 Birth: -
[hostname22 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2
trusted.gfid=0x1ce9de988117431082a26b27b7d33a83
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000620cf9dc000000000433946000000000620cf9dc0000000004339460000000005e017d52000000001f6cc3e4

Everything seems fine here.

Can you do the same on a directory that cannot be removed because of the Transport endpoint is not connected error?

dalwise commented 2 years ago

It seems that self-heal was finally able to synchronize all bricks. After that, has the number of entries in heal info stabilized?

The number of entries in the logs.bak2 directory has been stable at 45501 for the last hour (same number in all 3 bricks). It likely decreased slightly with the last rm -rf cycle. But the entries pending heal have now ballooned back up and are quickly oscillating over time:

# gluster vol heal vol_name statistics heal-count
Gathering count of entries to be healed on volume vol_name has been successful

Brick host_name21:/shared/.brick
Number of entries: 0

Brick host_name22:/shared/.brick
Number of entries: 0

Brick host_name20:/shared/.brick
Number of entries: 1818300

Are you now unable to delete any of those entries?

I've tried several cycles of the rm -rf loop, which sometimes reduce the entry count somewhat, but it has never reached 0.

However, given that the number of entries in all bricks has synchronized, it's obvious that self-heal is working and progressing. Let's not stop it for now.

Ok

Can you do the same on a directory that cannot be removed because of the Transport endpoint is not connected error?

# rm -rf logs.bak2/TWZYX13500552/J212763T62103841
rm: cannot remove ‘logs.bak2/TWZYX13500552/J212763T62103841’: Transport endpoint is not connected
[hostname20 ~]$ stat /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
  File: ‘/shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841’
  Size: 94          Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d    Inode: 4162260697  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-11 06:16:25.594892002 +0000
Modify: 2021-11-13 11:12:55.333987394 +0000
Change: 2022-02-16 14:03:23.874468286 +0000
 Birth: -
[hostname20 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
trusted.afr.vol_name-client-0=0x000000000000000100000001
trusted.afr.vol_name-client-1=0x000000000000000100000001
trusted.gfid=0xb13934b567d04e139edf7db16c60f7c9
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618f9db70000000013e83e4200000000618f9db70000000013e83e4200000000614b99ff000000001ac3baf7
[hostname21 ~]$ stat /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
  File: ‘/shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841’
  Size: 10          Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d    Inode: 1342715961  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-16 14:03:23.873176705 +0000
Modify: 2022-02-16 14:03:23.873176705 +0000
Change: 2022-02-16 14:03:23.873176705 +0000
 Birth: -
[hostname21 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
trusted.gfid=0xb13934b567d04e139edf7db16c60f7c9
[hostname22 ~]$ stat /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
  File: ‘/shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841’
  Size: 6           Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d    Inode: 4263188881  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-16 14:03:23.876353676 +0000
Modify: 2022-02-16 14:03:23.876353676 +0000
Change: 2022-02-16 14:03:23.876353676 +0000
 Birth: -
[hostname22 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
trusted.gfid=0xb13934b567d04e139edf7db16c60f7c9

Verified that I cannot remove it again:

[hostname20 vol_name]$ rm -rf logs.bak2/TWZYX13500552/J212763T62103841
rm: cannot remove ‘logs.bak2/TWZYX13500552/J212763T62103841’: Transport endpoint is not connected

On a related note I have also run into a Transport endpoint is not connected issue on another stack. The file itself is not critical, but this other stack is currently in production so I can't experiment with it as much. I bring it up because this is similar to how the issue on the current thread started (although in that case it was with a critical directory, which led to the repair attempts that landed us where we are). So if we knew how to fix that case we may be able to avert all these troubles next time :-)

[hostname17 readiness]# cat host_name19-220217.yml
cat: host_name19-220217.yml: Transport endpoint is not connected
[hostname17 readiness]$ rm host_name19-220217.yml
rm: cannot remove ‘host_name19-220217.yml’: Transport endpoint is not connected
[hostname17 readiness]$ stat /shared/.brick/readiness/host_name19-220217.yml
  File: ‘/shared/.brick/readiness/host_name19-220217.yml’
  Size: 0           Blocks: 0          IO Block: 4096   regular empty file
Device: fd00h/64768d    Inode: 872520704   Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-17 20:55:49.980733852 +0000
Modify: 2022-02-17 20:55:49.980733852 +0000
Change: 2022-02-17 20:55:49.992733959 +0000
 Birth: -
[hostname17 readiness]$ getfattr -m. -e hex -d /shared/.brick/readiness/host_name19-220217.yml
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/readiness/host_name19-220217.yml
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-0=0x000000020000000100000000
trusted.gfid=0xb6e057bdcfa140989428fe3c6ea86838
trusted.gfid2path.ecca5d8a1c40101f=0x30666661306366362d643664342d343035372d393930612d3663383330323265623433392f716d6e31392d3232303231372e796d6c
trusted.glusterfs.mdata=0x01000000000000000000000000620eb655000000003a75bcf300000000620eb655000000003a75bcf300000000620eb655000000003a75bcf3
[hostname18 ~]$ stat /shared/.brick/readiness/host_name19-220217.yml
  File: ‘/shared/.brick/readiness/host_name19-220217.yml’
  Size: 569         Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 872513587   Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-17 20:50:13.184033290 +0000
Modify: 2022-02-17 20:50:07.110999135 +0000
Change: 2022-02-17 20:50:07.111999141 +0000
 Birth: -
[hostname18 ~]$ getfattr -m. -e hex -d /shared/.brick/readiness/host_name19-220217.yml
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/readiness/host_name19-220217.yml
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0xa41a73ab1f094ea88098ee83547fcd88
trusted.gfid2path.ecca5d8a1c40101f=0x30666661306366362d643664342d343035372d393930612d3663383330323265623433392f716d6e31392d3232303231372e796d6c
trusted.glusterfs.mdata=0x01000000000000000000000000620eb4ff0000000006a449ee00000000620eb4ff0000000006a449ee00000000620eb4ff0000000005d9d6fd
[hostname19 ~]$ stat /shared/.brick/readiness/host_name19-220217.yml
  File: ‘/shared/.brick/readiness/host_name19-220217.yml’
  Size: 569         Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 872426423   Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-17 20:56:02.729319314 +0000
Modify: 2022-02-17 20:55:49.992279498 +0000
Change: 2022-02-17 20:55:49.992279498 +0000
 Birth: -
[hostname19 ~]$ getfattr -m. -e hex -d /shared/.brick/readiness/host_name19-220217.yml
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/readiness/host_name19-220217.yml
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-0=0x000000020000000100000000
trusted.gfid=0xb6e057bdcfa140989428fe3c6ea86838
trusted.gfid2path.ecca5d8a1c40101f=0x30666661306366362d643664342d343035372d393930612d3663383330323265623433392f716d6e31392d3232303231372e796d6c
trusted.glusterfs.mdata=0x01000000000000000000000000620eb655000000003b231a7500000000620eb655000000003b231a7500000000620eb655000000003a75bcf3
xhernandez commented 2 years ago

It seems that self-heal was finally able to synchronize all bricks. After that, has the number of entries in heal info stabilized?

The number of entries in the logs.bak2 directory has been stable at 45501 for the last hour (same number in all 3 bricks). It likely decreased slightly with the last rm -rf cycle. But the entries pending heal have now ballooned back up and are quickly oscillating over time:

# gluster vol heal vol_name statistics heal-count
Gathering count of entries to be healed on volume vol_name has been successful

Brick host_name21:/shared/.brick
Number of entries: 0

Brick host_name22:/shared/.brick
Number of entries: 0

Brick host_name20:/shared/.brick
Number of entries: 1818300

That doesn't seem normal. I think that the rm itself could be causing some race condition when self-heal is also trying to heal the same directory (see comments below).

Can you do the same on a directory that cannot be removed because of the Transport endpoint is not connected error?

# rm -rf logs.bak2/TWZYX13500552/J212763T62103841
rm: cannot remove ‘logs.bak2/TWZYX13500552/J212763T62103841’: Transport endpoint is not connected
[hostname20 ~]$ stat /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
  File: ‘/shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841’
  Size: 94            Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d  Inode: 4162260697  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-11 06:16:25.594892002 +0000
Modify: 2021-11-13 11:12:55.333987394 +0000
Change: 2022-02-16 14:03:23.874468286 +0000
 Birth: -
[hostname20 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
trusted.afr.vol_name-client-0=0x000000000000000100000001
trusted.afr.vol_name-client-1=0x000000000000000100000001
trusted.gfid=0xb13934b567d04e139edf7db16c60f7c9
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618f9db70000000013e83e4200000000618f9db70000000013e83e4200000000614b99ff000000001ac3baf7
[hostname21 ~]$ stat /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
  File: ‘/shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841’
  Size: 10            Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d  Inode: 1342715961  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-16 14:03:23.873176705 +0000
Modify: 2022-02-16 14:03:23.873176705 +0000
Change: 2022-02-16 14:03:23.873176705 +0000
 Birth: -
[hostname21 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
trusted.gfid=0xb13934b567d04e139edf7db16c60f7c9
[hostname22 ~]$ stat /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
  File: ‘/shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841’
  Size: 6             Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d  Inode: 4263188881  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-16 14:03:23.876353676 +0000
Modify: 2022-02-16 14:03:23.876353676 +0000
Change: 2022-02-16 14:03:23.876353676 +0000
 Birth: -
[hostname22 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841
trusted.gfid=0xb13934b567d04e139edf7db16c60f7c9

Verified that I cannot remove it again:

[hostname20 vol_name]$ rm -rf logs.bak2/TWZYX13500552/J212763T62103841
rm: cannot remove ‘logs.bak2/TWZYX13500552/J212763T62103841’: Transport endpoint is not connected

Based on this data, it seems that the directory on the arbiter is healthy and the copies on the other two bricks are bad: they only contain the gfid xattr, which is not correct. Also, the arbiter has non-zero trusted.afr.vol_name-client-* xattrs, which means it has applied some change that is not present on the other bricks and therefore requires heal.

The Transport endpoint is not connected error is caused by Gluster not trusting the arbiter brick when it's the only good source of data (the arbiter should only be used to disambiguate inconsistencies between the other 2 bricks).
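
As a side note, the trusted.afr.* values pack three 32-bit big-endian counters (pending data, metadata and entry operations, in that order). A minimal bash sketch to decode the value seen on the arbiter above:

xattr=000000000000000100000001   # trusted.afr.vol_name-client-0 without the 0x prefix
echo "data=$((16#${xattr:0:8})) metadata=$((16#${xattr:8:8})) entry=$((16#${xattr:16:8}))"
# prints: data=0 metadata=1 entry=1, i.e. metadata and entry heals are pending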

@pranithk @karthik-us will self-heal be able to heal this case? Could it be explained by a race between rm and self-heal?

On a related note I have also run into a Transport endpoint is not connected issue on another stack. The file itself is not critical, but this other stack is currently in production so I can't experiment with it as much. I bring it up because this is similar to how the issue on the current thread started (although in that case it was with a critical directory, which led to the repair attempts that landed us where we are). So if we knew how to fix that case we may be able to avert all these troubles next time :-)

[hostname17 readiness]# cat host_name19-220217.yml
cat: host_name19-220217.yml: Transport endpoint is not connected
[hostname17 readiness]$ rm host_name19-220217.yml
rm: cannot remove ‘host_name19-220217.yml’: Transport endpoint is not connected
[hostname17 readiness]$ stat /shared/.brick/readiness/host_name19-220217.yml
  File: ‘/shared/.brick/readiness/host_name19-220217.yml’
  Size: 0             Blocks: 0          IO Block: 4096   regular empty file
Device: fd00h/64768d  Inode: 872520704   Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-17 20:55:49.980733852 +0000
Modify: 2022-02-17 20:55:49.980733852 +0000
Change: 2022-02-17 20:55:49.992733959 +0000
 Birth: -
[hostname17 readiness]$ getfattr -m. -e hex -d /shared/.brick/readiness/host_name19-220217.yml
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/readiness/host_name19-220217.yml
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-0=0x000000020000000100000000
trusted.gfid=0xb6e057bdcfa140989428fe3c6ea86838
trusted.gfid2path.ecca5d8a1c40101f=0x30666661306366362d643664342d343035372d393930612d3663383330323265623433392f716d6e31392d3232303231372e796d6c
trusted.glusterfs.mdata=0x01000000000000000000000000620eb655000000003a75bcf300000000620eb655000000003a75bcf300000000620eb655000000003a75bcf3
[hostname18 ~]$ stat /shared/.brick/readiness/host_name19-220217.yml
  File: ‘/shared/.brick/readiness/host_name19-220217.yml’
  Size: 569           Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d  Inode: 872513587   Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-17 20:50:13.184033290 +0000
Modify: 2022-02-17 20:50:07.110999135 +0000
Change: 2022-02-17 20:50:07.111999141 +0000
 Birth: -
[hostname18 ~]$ getfattr -m. -e hex -d /shared/.brick/readiness/host_name19-220217.yml
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/readiness/host_name19-220217.yml
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0xa41a73ab1f094ea88098ee83547fcd88
trusted.gfid2path.ecca5d8a1c40101f=0x30666661306366362d643664342d343035372d393930612d3663383330323265623433392f716d6e31392d3232303231372e796d6c
trusted.glusterfs.mdata=0x01000000000000000000000000620eb4ff0000000006a449ee00000000620eb4ff0000000006a449ee00000000620eb4ff0000000005d9d6fd
[hostname19 ~]$ stat /shared/.brick/readiness/host_name19-220217.yml
  File: ‘/shared/.brick/readiness/host_name19-220217.yml’
  Size: 569           Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d  Inode: 872426423   Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-17 20:56:02.729319314 +0000
Modify: 2022-02-17 20:55:49.992279498 +0000
Change: 2022-02-17 20:55:49.992279498 +0000
 Birth: -
[hostname19 ~]$ getfattr -m. -e hex -d /shared/.brick/readiness/host_name19-220217.yml
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/readiness/host_name19-220217.yml
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-0=0x000000020000000100000000
trusted.gfid=0xb6e057bdcfa140989428fe3c6ea86838
trusted.gfid2path.ecca5d8a1c40101f=0x30666661306366362d643664342d343035372d393930612d3663383330323265623433392f716d6e31392d3232303231372e796d6c
trusted.glusterfs.mdata=0x01000000000000000000000000620eb655000000003b231a7500000000620eb655000000003b231a7500000000620eb655000000003a75bcf3

There are 2 issues here. The first one is that trusted.gfid is different on node hostname18. The gfid is like the inode number, so it must match on all bricks. This inconsistency shouldn't easily happen on a replica 3 with arbiter. Are you aware of any special issue that could have happened (most probably between 20:50 and 20:55 on the 17th of February), like multiple disconnections?

The second issue is that trusted.glusterfs.mdata differs between hostname17 and hostname19. This basically means that all 3 bricks have different metadata, so it's not possible to find a majority. I think this problem could be related to this: https://github.com/gluster/glusterfs/issues/3238. This is not critical as long as the other 2 bricks have consistent data, but in this case we have a second inconsistency, making automatic heal impossible.

The easiest way to fix it in this case is to make sure that the contents of the file on hostname19 are the most up to date (or at least equal to the contents on hostname18). If so, the file and its gfid link on hostname18 can be deleted (to be safe, you can back it up before deleting, including the good copy on hostname19).

# rm -f /shared/.brick/readiness/host_name19-220217.yml /shared/.brick/.glusterfs/a4/1a/a41a73ab1f094ea88098ee83547fcd88

Then reset the mismatching xattr in the arbiter brick (hostname17):

# setfattr -n trusted.glusterfs.mdata -v 0x01000000000000000000000000620eb655000000003b231a7500000000620eb655000000003b231a7500000000620eb655000000003a75bcf3 /shared/.brick/readiness/host_name19-220217.yml

After that, self-heal should be able to successfully heal it and the file should be accessible again.
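
To nudge and verify the heal afterwards, the standard commands can be used (a small sketch with the names from this thread):

gluster volume heal vol_name          # trigger a heal of the pending entries
gluster volume heal vol_name info     # the yml file should disappear from this list
getfattr -m. -e hex -d /shared/.brick/readiness/host_name19-220217.yml   # xattrs should converge on all bricks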

dalwise commented 2 years ago

Hi @xhernandez,

I had not done anything else on the logs.bak2 directory since last time I wrote. However the heal-count has now reached 0 on all nodes. The previous sample directory which I could not remove (/shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841) now seems consistent on all 3 nodes, which seems odd, as I don't know how it could have recovered from the arbiter.

Unfortunately I don't know what has made a difference for it other than waiting. Is there anything you'd like me to check to debug what the issue was? Otherwise I'll try to remove the whole logs.bak2 directory again.

Regarding the other stack:

Are you aware of any special issue that could have happened (most probably between 20:50 and 20:55 on the 17th of February), like multiple disconnections?

I do see issues in the logs a bit earlier:

The message "E [MSGID: 108008] [afr-self-heal-common.c:392:afr_gfid_split_brain_source] 0-vol_name-replicate-0: Gfid mismatch detected for <gfid:0ffa0cf6-d6d4-4057-990a-6c83022eb439>/host_name19-201023.yml>, 0f1703b1-b623-4203-84ab-788bb3311a12 on vol_name-client-2 and 5ae416c8-8ad2-4b6b-b1b9-cfd57d3d4c84 on vol_name-client-0." repeated 3 times between [2022-02-13 20:40:56.552787] and [2022-02-13 20:41:25.010350]
The message "E [MSGID: 108008] [afr-self-heal-common.c:392:afr_gfid_split_brain_source] 0-vol_name-replicate-0: Gfid mismatch detected for <gfid:0ffa0cf6-d6d4-4057-990a-6c83022eb439>/host_name17-210216.yml>, c0ba9832-a2b3-497a-8b99-f32f081a96f5 on vol_name-client-1 and 2aad63b1-f9c5-435f-bb0f-3ff1bd8aae26 on vol_name-client-0." repeated 5 times between [2022-02-13 20:40:57.521178] and [2022-02-13 20:41:25.653126]
The message "E [MSGID: 108008] [afr-self-heal-common.c:392:afr_gfid_split_brain_source] 0-vol_name-replicate-0: Gfid mismatch detected for <gfid:0ffa0cf6-d6d4-4057-990a-6c83022eb439>/host_name18-210216.yml>, e3ddfa99-c276-435a-9429-13e59f39ef66 on vol_name-client-1 and 75858077-4098-44f8-b830-c754ff117251 on vol_name-client-0." repeated 5 times between [2022-02-13 20:40:57.421407] and [2022-02-13 20:41:25.754143]

Every node writes a datestamped status report into the "readiness" directory of the Gluster volume every 5 minutes. They all go into the same directory, but each host names its file with its own hostname. The writes all happen at about the same time, since they are driven by the same cron entry. I wonder if having each host overwrite a different file within the same directory at the same time is creating some race condition.

# rm -f /shared/.brick/readiness/host_name19-220217.yml /shared/.brick/.glusterfs/a4/1a/a41a73ab1f094ea88098ee83547fcd88

I ran this and was then able to access the file again. I didn't need to reset the xattr. Note that I only deleted the first file: the file in .glusterfs didn't exist.

The mdata xattr of the file on all three bricks was set by Gluster to:

trusted.glusterfs.mdata=0x01000000000000000000000000620eb655000000003b231a7500000000620eb655000000003b231a7500000000620eb655000000003a75bcf3
xhernandez commented 2 years ago

Hi @xhernandez,

I had not done anything else on the logs.bak2 directory since last time I wrote. However the heal-count has now reached 0 on all nodes. The previous sample directory which I could not remove (/shared/.brick/logs.bak2/TWZYX13500552/J212763T62103841) now seems consistent on all 3 nodes, which seems odd, as I don't know how it could have recovered from the arbiter.

Unfortunately I don't know what has made a difference for it other than waiting. Is there anything you'd like me to check to debug what the issue was? Otherwise I'll try to remove the whole logs.bak2 directory again.

I think self-heal was still working, and once it was able to fix the pending files, everything became stable again. Since the only issue was in metadata, the arbiter brick is good enough to be used as the source for the healing. The client side is not so intelligent, though, and it doesn't allow access to the file in case of inconsistency, even if the issue is only in the metadata. This is something we should probably improve.

Given that the problem is solved, I would say you can try to remove the directory. What I don't understand is what caused the issue, but self-heal fixed it, so we are good. I'll run some tests to try to reproduce this issue when I have some time.

Regarding the other stack:

Are you aware of any special issue that could have happened (most probably between 20:50 and 20:55 on the 17th of February), like multiple disconnections?

I do see issues in the logs a bit earlier:

The message "E [MSGID: 108008] [afr-self-heal-common.c:392:afr_gfid_split_brain_source] 0-vol_name-replicate-0: Gfid mismatch detected for <gfid:0ffa0cf6-d6d4-4057-990a-6c83022eb439>/host_name19-201023.yml>, 0f1703b1-b623-4203-84ab-788bb3311a12 on vol_name-client-2 and 5ae416c8-8ad2-4b6b-b1b9-cfd57d3d4c84 on vol_name-client-0." repeated 3 times between [2022-02-13 20:40:56.552787] and [2022-02-13 20:41:25.010350]
The message "E [MSGID: 108008] [afr-self-heal-common.c:392:afr_gfid_split_brain_source] 0-vol_name-replicate-0: Gfid mismatch detected for <gfid:0ffa0cf6-d6d4-4057-990a-6c83022eb439>/host_name17-210216.yml>, c0ba9832-a2b3-497a-8b99-f32f081a96f5 on vol_name-client-1 and 2aad63b1-f9c5-435f-bb0f-3ff1bd8aae26 on vol_name-client-0." repeated 5 times between [2022-02-13 20:40:57.521178] and [2022-02-13 20:41:25.653126]
The message "E [MSGID: 108008] [afr-self-heal-common.c:392:afr_gfid_split_brain_source] 0-vol_name-replicate-0: Gfid mismatch detected for <gfid:0ffa0cf6-d6d4-4057-990a-6c83022eb439>/host_name18-210216.yml>, e3ddfa99-c276-435a-9429-13e59f39ef66 on vol_name-client-1 and 75858077-4098-44f8-b830-c754ff117251 on vol_name-client-0." repeated 5 times between [2022-02-13 20:40:57.421407] and [2022-02-13 20:41:25.754143]

Every node writes a datestamped status report into the "readiness" directory of the Gluster volume every 5 minutes. They all go into the same directory, but each host names its file with its own hostname. The writes all happen at about the same time, since they are driven by the same cron entry. I wonder if having each host overwrite a different file within the same directory at the same time is creating some race condition.

It shouldn't happen, even if they create the files at the same time. This problem could appear if the file was recreated while one of the bricks was down (or disconnected) and something then went wrong when self-heal tried to repair it.

Did the problem appear once, or are there many files in this situation? You may need to fix them in the same way we fixed the previous one and see if it happens again.

# rm -f /shared/.brick/readiness/host_name19-220217.yml /shared/.brick/.glusterfs/a4/1a/a41a73ab1f094ea88098ee83547fcd88

I ran this and was then able to access the file again. I didn't need to reset the xattr. Note that I only deleted the first file: the file in .glusterfs didn't exist.

Maybe that was the reason self-heal couldn't fix it. I'll try to do some tests about this also.

The mdata xattr of the file on all three bricks was set by Gluster to:

trusted.glusterfs.mdata=0x01000000000000000000000000620eb655000000003b231a7500000000620eb655000000003b231a7500000000620eb655000000003a75bcf3

That's good. In that case the file is healthy now. Nothing else needs to be done.

dalwise commented 2 years ago

Hi @xhernandez,

Given that the problem is solved, I would say you can try to remove the directory. What I don't understand is what caused the issue, but self-heal fixed it, so we are good. I'll run some tests to try to reproduce this issue when I have some time.

Removal progressed quite slowly and took a few days, but the directory is now gone and the heal count is at 0. At this point everything seems OK. I assume it should be safe to restore the original Gluster volume settings and return the stack to production. Would you suggest doing any other test before this? I'm not yet sure what caused the original problem or whether it can recur, but if it does, thanks to you I am now better equipped to deal with it.

Did the problem appear once, or are there many files in this situation? You may need to fix them in the same way we fixed the previous one and see if it happens again.

Just one file that I'm aware of.

Maybe that was the reason self-heal couldn't fix it. I'll try to do some tests about this also.

Thank you for all your help throughout the long recovery process!

xhernandez commented 2 years ago

Hi @xhernandez,

Given that the problem is solved, I would say you can try to remove the directory. What I don't understand is what caused the issue, but self-heal fixed it, so we are good. I'll run some tests to try to reproduce this issue when I have some time.

Removal progressed quite slowly and took a few days, but the directory is now gone and the heal count is at 0. At this point everything seems OK. I assume it should be safe to restore the original Gluster volume settings and return the stack to production. Would you suggest doing any other test before this? I'm not yet sure what caused the original problem or whether it can recur, but if it does, thanks to you I am now better equipped to deal with it.

That's great !

Yes, I think everything should be fine now. If you want to be 100% sure, you could run a find on all bricks (excluding .glusterfs) and you should see exactly the same files (assuming no one is creating or deleting files during the finds).
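
A minimal sketch of that check (paths as used in this thread; the copied file names are just placeholders):

# On each node, list everything on the brick except .glusterfs, sorted
cd /shared/.brick && find . -path ./.glusterfs -prune -o -print | sort > /tmp/brick_files.txt
# After copying the lists to one node (scp, rsync, ...), show entries unique to either brick
comm -3 /tmp/brick21_files.txt /tmp/brick22_files.txt | head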

Did the problem appear once, or are there many files in this situation? You may need to fix them in the same way we fixed the previous one and see if it happens again.

Just one file that I'm aware of.

It's not a common case. It may be better to monitor the logs for some time in case the issue appears again, so you can react quickly and also see what happened around it (disconnections, self-heal in progress, ...) that could help explain it.
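
A simple way to keep an eye on it could be something like the following (log path assumed to be the default self-heal daemon log; the volume's mount log can be watched the same way):

# Follow the log and flag new gfid-mismatch / split-brain messages
tail -F /var/log/glusterfs/glustershd.log | grep --line-buffered -E 'Gfid mismatch|split-brain'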

Maybe that was the reason self-heal couldn't fix it. I'll try to do some tests about this also.

Thank you for all your help throughout the long recovery process!

You're welcome :)

dalwise commented 2 years ago

Yes, I think everything should be fine now. If you want to be 100% sure, you could run a find on all bricks (excluding .glusterfs) and you should see exactly the same files (assuming no one is creating or deleting files during the finds).

I was taking a look and there continue to be oddities around the logs.bak2 directory. Heal count is showing 0, but:

[hostname20 .brick]$ ls logs.bak2
QMNF0ES214502F5
[hostname21 .brick]$ ls logs.bak2
ls: cannot access logs.bak2: No such file or directory
[hostname22 .brick]$ ls logs.bak2
TWZYX13301979  TWZYX13302913
[hostname21 .brick]$ ls /shared/vol_name/logs.bak2
TWZYX13301979  TWZYX13302913

After trying another rm -rf logs.bak2 on the mounted volume I again got:

[hostname21 ]$ ls /shared/vol_name/logs.bak2
ls: cannot access /shared/vol_name/logs.bak2: Transport endpoint is not connected

And waiting a few minutes resulted in:

[hostname21 .brick]$ ls /shared/vol_name/logs.bak2
QMNF0ES214502F5

I did another round of rm -rf and this time the directory seems to be gone from the mount and all the bricks. Hopefully it doesn't come back yet again.

As for other differences between the bricks, I am currently dumping the list of files, but I can already see that the list is much smaller on the arbiter than on the data bricks. Is this expected?

xhernandez commented 2 years ago

Yes, I think everything should be fine now. If you want to be 100% sure, you could run a find on all bricks (excluding .glusterfs) and you should see exactly the same files (assuming no one is creating or deleting files during the finds).

I was taking a look and there continue to be oddities around the logs.bak2 directory. Heal count is showing 0, but:

[hostname20 .brick]$ ls logs.bak2
QMNF0ES214502F5
[hostname21 .brick]$ ls logs.bak2
ls: cannot access logs.bak2: No such file or directory
[hostname22 .brick]$ ls logs.bak2
TWZYX13301979  TWZYX13302913
[hostname21 .brick]$ ls /shared/vol_name/logs.bak2
TWZYX13301979  TWZYX13302913

After trying another rm -rf logs.bak2 on the mounted volume I again got:

[hostname21 ]$ ls /shared/vol_name/logs.bak2
ls: cannot access /shared/vol_name/logs.bak2: Transport endpoint is not connected

And waiting a few minutes resulted in:

[hostname21 .brick]$ ls /shared/vol_name/logs.bak2
QMNF0ES214502F5

I did another round of rm -rf and this time the directory seems to be gone from the mount and all the bricks. Hopefully it doesn't come back yet again.

That's really weird. These entries should have appeared in heal info, at least.

As for other differences between the bricks, I am currently dumping the list of files, but I can already see that the list is much smaller on the arbiter than on the data bricks. Is this expected?

The number of entries outside .glusterfs in the arbiter brick should be equal to the other bricks. Have you checked what's missing ?

It seems clear that there's some issue here. I'm not sure what it could be yet, but we have fixed several things since version 7.4 that may be causing problems here. Would it be possible to upgrade at least to version 9 ?

dalwise commented 2 years ago

The number of entries outside .glusterfs in the arbiter brick should be equal to the other bricks. Have you checked what's missing ?

I gathered the file list on all 3 nodes with:

cd /shared/.brick
find . -path ./.glusterfs -prune -o -print > ~/gluster_files.txt
cd
sort gluster_files.txt > gluster_files_ordered.txt

The count of files is:

[hostname20 ~]$ wc -l gluster_files_ordered.txt
649105 gluster_files_ordered.txt
[hostname21 ~]$ wc -l gluster_files_ordered.txt
112938782 gluster_files_ordered.txt
[hostname22 ~]$ wc -l gluster_files_ordered.txt
112938552 gluster_files_ordered.txt

A diff between the hostname21 and hostname22 lists shows a few hundred log files with differences. For example this file:

[hostname20 .brick]$ stat logs/TW8YX10513067/J175402T47456720/results.json.br
stat: cannot stat ‘logs/TW8YX10513067/J175402T47456720/results.json.br’: No such file or directory
[hostname21 .brick]$ stat logs/TW8YX10513067/J175402T47456720/results.json.br
stat: cannot stat ‘logs/TW8YX10513067/J175402T47456720/results.json.br’: No such file or directory
[hostname22 .brick]$ stat logs/TW8YX10513067/J175402T47456720/results.json.br
  File: ‘logs/TW8YX10513067/J175402T47456720/results.json.br’
  Size: 299         Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 688421909   Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-01-07 15:34:01.336866855 +0000
Modify: 2021-04-15 05:46:57.714252008 +0000
Change: 2021-04-15 05:46:57.715252012 +0000
 Birth: -
[hostname22 .brick]$ getfattr -m. -e hex -d logs/TW8YX10513067/J175402T47456720/results.json.br
# file: logs/TW8YX10513067/J175402T47456720/results.json.br
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0xe3073009e811406fa860c926ea326028
trusted.gfid2path.7ebe7cf71d9c4f7f=0x34653061306466342d316234322d346330622d623565352d3432643935393861626362642f726573756c74732e6a736f6e2e6272
trusted.glusterfs.mdata=0x010000000000000000000000006077d351000000002a8fc746000000006077d351000000002a8fc746000000006077d351000000002a772110

Trying to access this file in the mounted volume results in:

$ ls logs/TW8YX10513067/J175402T47456720/results.json.br
ls: cannot access logs/TW8YX10513067/J175402T47456720/results.json.br: Transport endpoint is not connected

I checked that all these files are older ones and I'm ok removing them, since they were already backed up offline. Can I simply remove them from hostname21 & hostname22 bricks?

hostname20 however is missing many other files that do exist on the mounted volume and that I do need. For example:

[hostname20 .brick]$ stat logs/QMNF0ES213309C2/J224342T66211349/results.json.br
stat: cannot stat ‘logs/QMNF0ES213309C2/J224342T66211349/results.json.br’: No such file or directory
[hostname21 .brick]$ stat logs/QMNF0ES213309C2/J224342T66211349/results.json.br
  File: ‘logs/QMNF0ES213309C2/J224342T66211349/results.json.br’
  Size: 327             Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 1559889781  Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-03-04 01:39:27.512929550 +0000
Modify: 2021-12-02 08:36:19.310082241 +0000
Change: 2022-02-24 23:34:26.003662633 +0000
 Birth: -
[hostname21 .brick]$ getfattr -m. -e hex -d logs/QMNF0ES213309C2/J224342T66211349/results.json.br
# file: logs/QMNF0ES213309C2/J224342T66211349/results.json.br
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-2=0x000000000000000000000000
trusted.gfid=0x86651e4531644391902cc4c477380207
trusted.glusterfs.mdata=0x010000000000000000000000006218160200000000002ca0780000000061a8858300000000126800ca0000000061a8858300000000125518b8
[hostname22 .brick]$ stat logs/QMNF0ES213309C2/J224342T66211349/results.json.br
  File: ‘logs/QMNF0ES213309C2/J224342T66211349/results.json.br’
  Size: 327             Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 485767957   Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-03-04 01:39:27.511800175 +0000
Modify: 2021-12-02 08:36:19.308272446 +0000
Change: 2022-02-24 23:34:26.003675585 +0000
 Birth: -
[hostname22 .brick]$ getfattr -m. -e hex -d logs/QMNF0ES213309C2/J224342T66211349/results.json.br
# file: logs/QMNF0ES213309C2/J224342T66211349/results.json.br
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-2=0x000000000000000000000000
trusted.gfid=0x86651e4531644391902cc4c477380207
trusted.glusterfs.mdata=0x010000000000000000000000006218160200000000002ca0780000000061a8858300000000126800ca0000000061a8858300000000125518b8

Would it be possible to upgrade at least to version 9 ?

I have a large number of instances in use and cannot upgrade them all right now, nor run a variety of versions at the same time. If absolutely needed I could upgrade this particular stack to version 9 to fix the issues, but I'd need to return it to version 7 before putting it back into production. Would there be an issue with such a downgrade?

xhernandez commented 2 years ago

The number of entries outside .glusterfs in the arbiter brick should be equal to the other bricks. Have you checked what's missing ?

I gathered the file list on all 3 nodes with:

cd /shared/.brick
find . -path ./.glusterfs -prune -o -print > ~/gluster_files.txt
cd
sort gluster_files.txt > gluster_files_ordered.txt

The count of files is:

[hostname20 ~]$ wc -l gluster_files_ordered.txt
649105 gluster_files_ordered.txt
[hostname21 ~]$ wc -l gluster_files_ordered.txt
112938782 gluster_files_ordered.txt
[hostname22 ~]$ wc -l gluster_files_ordered.txt
112938552 gluster_files_ordered.txt

I don't understand what's happening here. Based on the monitoring you did of the number of inodes on hostname20, it's impossible that that brick has fewer than 700,000 entries. When it was working slowly it was creating 900 inodes per minute (roughly 1.3 million per day), and then it peaked at 50,000 per minute for some time. That means several million entries after a few days.

What removed those entries? Can you see any activity in the self-heal log after the heal was supposedly completed?

A diff between the hostname21 and hostname22 lists shows a few hundred log files with differences. For example this file:

[hostname20 .brick]$ stat logs/TW8YX10513067/J175402T47456720/results.json.br
stat: cannot stat ‘logs/TW8YX10513067/J175402T47456720/results.json.br’: No such file or directory
[hostname21 .brick]$ stat logs/TW8YX10513067/J175402T47456720/results.json.br
stat: cannot stat ‘logs/TW8YX10513067/J175402T47456720/results.json.br’: No such file or directory
[hostname22 .brick]$ stat logs/TW8YX10513067/J175402T47456720/results.json.br
  File: ‘logs/TW8YX10513067/J175402T47456720/results.json.br’
  Size: 299           Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d  Inode: 688421909   Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-01-07 15:34:01.336866855 +0000
Modify: 2021-04-15 05:46:57.714252008 +0000
Change: 2021-04-15 05:46:57.715252012 +0000
 Birth: -
[hostname22 .brick]$ getfattr -m. -e hex -d logs/TW8YX10513067/J175402T47456720/results.json.br
# file: logs/TW8YX10513067/J175402T47456720/results.json.br
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0xe3073009e811406fa860c926ea326028
trusted.gfid2path.7ebe7cf71d9c4f7f=0x34653061306466342d316234322d346330622d623565352d3432643935393861626362642f726573756c74732e6a736f6e2e6272
trusted.glusterfs.mdata=0x010000000000000000000000006077d351000000002a8fc746000000006077d351000000002a8fc746000000006077d351000000002a772110

Trying to access this file in the mounted volume results in:

$ ls logs/TW8YX10513067/J175402T47456720/results.json.br
ls: cannot access logs/TW8YX10513067/J175402T47456720/results.json.br: Transport endpoint is not connected

I checked that all these files are older ones and I'm ok removing them, since they were already backed up offline. Can I simply remove them from hostname21 & hostname22 bricks?

Yes, but you also need to remove the associated gfid from .glusterfs. I'll attach a script to do that.
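
For context, a regular file on a brick also has a hard link under .glusterfs named after its gfid, with the first two pairs of hex digits used as subdirectories. A sketch using the trusted.gfid value shown above for the hostname22 copy; it should resolve to the same inode, which is why the stat output reports Links: 2:

[hostname22 .brick]$ stat .glusterfs/e3/07/e3073009-e811-406f-a860-c926ea326028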

hostname20, however, is missing many other files that do exist on the mounted volume and that I do need. For example:

[hostname20 .brick]$ stat logs/QMNF0ES213309C2/J224342T66211349/results.json.br
stat: cannot stat ‘logs/QMNF0ES213309C2/J224342T66211349/results.json.br’: No such file or directory
[hostname21 .brick]$ stat logs/QMNF0ES213309C2/J224342T66211349/results.json.br
  File: ‘logs/QMNF0ES213309C2/J224342T66211349/results.json.br’
  Size: 327             Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 1559889781  Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-03-04 01:39:27.512929550 +0000
Modify: 2021-12-02 08:36:19.310082241 +0000
Change: 2022-02-24 23:34:26.003662633 +0000
 Birth: -
[hostname21 .brick]$ getfattr -m. -e hex -d logs/QMNF0ES213309C2/J224342T66211349/results.json.br
# file: logs/QMNF0ES213309C2/J224342T66211349/results.json.br
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-2=0x000000000000000000000000
trusted.gfid=0x86651e4531644391902cc4c477380207
trusted.glusterfs.mdata=0x010000000000000000000000006218160200000000002ca0780000000061a8858300000000126800ca0000000061a8858300000000125518b8
[hostname22 .brick]$ stat logs/QMNF0ES213309C2/J224342T66211349/results.json.br
  File: ‘logs/QMNF0ES213309C2/J224342T66211349/results.json.br’
  Size: 327             Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 485767957   Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-03-04 01:39:27.511800175 +0000
Modify: 2021-12-02 08:36:19.308272446 +0000
Change: 2022-02-24 23:34:26.003675585 +0000
 Birth: -
[hostname22 .brick]$ getfattr -m. -e hex -d logs/QMNF0ES213309C2/J224342T66211349/results.json.br
# file: logs/QMNF0ES213309C2/J224342T66211349/results.json.br
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-2=0x000000000000000000000000
trusted.gfid=0x86651e4531644391902cc4c477380207
trusted.glusterfs.mdata=0x010000000000000000000000006218160200000000002ca0780000000061a8858300000000126800ca0000000061a8858300000000125518b8

Can you provide a statedump of all clients (i.e. all mount points and glustershd processes) ?

The statedump is triggered by sending a USR1 signal to the gluster process (glusterfs):

# kill -USR1 <pid>

The statedump is a file created inside /var/run/gluster/ (if this directory doesn't exist, it needs to be created before sending the signal). For client processes, the file is named glusterdump.<pid>.dump.<timestamp>

This will tell us if all clients are correctly connected to all bricks, among other things.
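
A minimal sketch of collecting them on one node (the process selection is an assumption; glusterd itself doesn't need the signal, and brick processes run as glusterfsd):

# mkdir -p /var/run/gluster
# for pid in $(pgrep -x glusterfs); do kill -USR1 "$pid"; done
# ls /var/run/gluster/glusterdump.*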

Would it be possible to upgrade at least to version 9?

I have a large number of instances in use and cannot upgrade them all right now, nor run a mix of versions at the same time. If absolutely needed I could upgrade this particular stack to version 9 to fix the issues, but I'd need to return it to version 7 before putting it back into production. Would there be an issue with such a downgrade?

Downgrading to older versions is not tested and I wouldn't do that if these servers need to return to production.

xhernandez commented 2 years ago

You can use this script to remove files directly from the brick:

gfrm.sh.gz

The script needs the root directory of the brick as the first argument. Then you can pass paths relative to the root of the volume as additional arguments or through a pipe.

The script will do several sanity checks before actually trying to remove the entries. By default it runs in "dry mode", so instead of deleting anything, it only shows the commands that it would execute.

To run the script in "delete mode", use something like this:

# GFRM_DELETE=1 ./gfrm.sh /shared/.brick /logs/QMNF0ES213309C2/J224342T66211349/results.json.br

Be sure to do the same on all bricks to avoid adding more inconsistencies. Using this tool may cause significant damage if used incorrectly. If in doubt, please ask.
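
For example, with a file containing one volume-relative path per line (the list file name is just an assumption):

# ./gfrm.sh /shared/.brick < paths_to_remove.txt                 # dry run: prints the commands it would execute
# GFRM_DELETE=1 ./gfrm.sh /shared/.brick < paths_to_remove.txt   # actually removes the entries and their gfids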

dalwise commented 2 years ago

I don't understand what's happening here. Based on the monitoring you did of the number of inodes on hostname20, it's impossible that that brick has fewer than 700,000 entries. When it was working slowly it was creating 900 inodes per minute, then it peaked at 50,000 per minute for some time. This means several million after a few days.

What removed those entries? Can you see any activity in the self-heal log after the heal was supposedly completed?

I never stopped the inode count collection, so I can plot what the creation rate continued to be:

[image: plot of the inode creation rate over time]

The dips seem aligned with the efforts to remove logs.bak2. During that same time it's possible that additional files got removed on the hostname20 brick, but I'm not sure how that would have happened. As far as I know I've been operating on the mounted volume (so changes should have been equal on all bricks) and nobody else has been using these nodes. Just a wild guess here, but since the logs.bak2 files were copies of the files in logs, they likely would have had the same hash (and possibly the same GFID?). I don't know whether, in the unstable recovery situation, that could have led to also removing files from logs in addition to logs.bak2, since they had the same hash values.

Can you provide a statedump of all clients (i.e. all mount points and glustershd processes) ?

Sure, I gathered them with kill -USR1 $(pgrep -f glusterfs); let me know if there are any other processes that I may need. I've done some minimal post-processing on them to change the volume and host names to "vol_name" and "host_name" respectively: gluster_statedumps.tar.gz

Be sure to do the same on all bricks to avoid adding more inconsistencies. Using this tool may cause significant damage if used incorrectly. If in doubt, please ask.

The files to remove are only present on one brick (either hostname21 or hostname22). Should I also run the tool on hostname20, since there shouldn't be anything to remove there? And should I run the same list of files on both hostname21 and hostname22, or should I run each host with only the list of files that it actually has?

Thank you very much for your continued support!

xhernandez commented 2 years ago

I don't understand what's happening here. Based on the monitoring you did of the number of inodes on hostname20, it's impossible that that brick has fewer than 700,000 entries. When it was working slowly it was creating 900 inodes per minute, then it peaked at 50,000 per minute for some time. This means several million after a few days.

What removed those entries? Can you see any activity in the self-heal log after the heal was supposedly completed?

I never stopped the inode count collection, so I can plot what the creation rate continued to be:

[image: plot of the inode creation rate over time]

The dips seem aligned with the efforts to remove logs.bak2. During that same time it's possible that additional files got removed on the hostname20 brick, but I'm not sure how that would have happened. As far as I know I've been operating on the mounted volume (so changes should have been equal on all bricks) and nobody else has been using these nodes. Just a wild guess here, but since the logs.bak2 files were copies of the files in logs, they likely would have had the same hash (and possibly the same GFID?). I don't know whether, in the unstable recovery situation, that could have led to also removing files from logs in addition to logs.bak2, since they had the same hash values.

The GFID of each file is created randomly when the file or directory is created, and even if it's a copy, it will have a different GFID. Everything else depends on this GFID (it's like the inode number), so logs.bak2 and logs should be completely independent.

Did you do a regular copy, or were there directory renames involved? The only issue I can think of that could explain some kind of cross-linked directories would need to involve renames of directories.

Can you check the gfid (trusted.gfid xattr) of the logs.bak2 directory and the directory from which you copied it, on all bricks? They should be consistent across bricks and different from each other. Also check whether the symlink corresponding to the gfid points back to the real directory name (the symlink is located at .glusterfs/xx/yy/, where xx and yy are the initial 4 digits of the gfid).
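
Concretely, something like this on each brick (a sketch; xx/yy and the full gfid come from whatever trusted.gfid returns):

# getfattr -n trusted.gfid -e hex /shared/.brick/logs.bak2
# getfattr -n trusted.gfid -e hex /shared/.brick/logs
# ls -l /shared/.brick/.glusterfs/xx/yy/<full-gfid>   # for a directory this should be a symlink back to the parent gfid and the real name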

Can you provide a statedump of all clients (i.e. all mount points and glustershd processes) ?

Sure, I gathered them with kill -USR1 $(pgrep -f glusterfs); let me know if there are any other processes that I may need. I've done some minimal post-processing on them to change the volume and host names to "vol_name" and "host_name" respectively: gluster_statedumps.tar.gz

The only suspicious thing I see is that the client on hostname20 still keeps entries related to logs.bak2 in its cache (I guess you did the rm from that client). Given that the entries have been deleted, they shouldn't be present in the cache.

In your version io-cache is enabled by default, but in newer versions we have disabled it by default because it doesn't provide much benefit except for some very specific workloads. Additionally, caches are not immediately synchronized, which means they could cause issues on workloads where multiple clients access the same files/directories.

I think you have some concurrent access to files and directories. If possible, test how it works by disabling io-cache and readdir-ahead, which cache some data that could lead to inconsistencies if other clients check the same data soon enough:

# gluster volume set <volname> io-cache off
# gluster volume set <volname> dirread-ahead off

To get rid of the cached info, I would also umount all clients, stop the volume (only the glusterd process should remain running on each server), and then restart everything.
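
A possible sequence (mount point and server name are placeholders):

# umount <mountpoint>                                  # on every client
# gluster volume stop <volname>
# pgrep -a gluster                                     # on each server, only glusterd should remain
# gluster volume start <volname>
# mount -t glusterfs <server>:/<volname> <mountpoint>  # remount the clients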

Be sure to do the same on all bricks to avoid adding more inconsistencies. Using this tool may cause significant damage if used incorrectly. If in doubt, please ask.

The files to remove are only present on one brick (either hostname21 or hostname22). Should I also run the tool on hostname20, since there shouldn't be anything to remove there? And should I run the same list of files on both hostname21 and hostname22, or should I run each host with only the list of files that it actually has?

The script checks the real path of the file, so if it's not present on a brick, it won't do anything for that file. However, you are right that it's not necessary to run the script for files that you already know are not present. You can do as you prefer. In any case, if you run the script without GFRM_DELETE it will just show you the list of operations it would run. You can check whether there's anything unexpected in that list before running it in delete mode.

I fear that we'll need to heal hostname20 again. Otherwise it may not be stable if another of the bricks is brought down. Once everything is fine, you should start a full heal again:

# gluster volume heal <volname> full

Let it run for some time (the heal count should increase), and then, if the speed is slow, you can start additional self-heal processes on the servers where the heal count increases.
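
One way to watch progress is simply to poll the heal count, e.g.:

# watch -n 60 'gluster volume heal <volname> statistics heal-count'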

dalwise commented 2 years ago

Did you do a regular copy, or were there directory renames involved? The only issue I can think of that could explain some kind of cross-linked directories would need to involve renames of directories.

Yes, just a regular rsync copy and no renames as far as I recall.

Can you check the gfid (trusted.gfid xattr) of the logs.bak2 directory and the directory from which you copied it, on all bricks?

I can't, since on March 2nd I succeeded in deleting logs.bak2, so I no longer have the directory. Before starting the repair procedures I had backed up the brick directories on hostname21 & hostname22, so I tried to check there, but the directories don't have xattrs. I likely used rsync, which by default does not copy xattrs.

The only suspicious thing I see is that the client on hostname20 still keeps entries related to logs.bak2 in its cache (I guess you did the rm from that client)

Thanks for looking into it! However, note that I was doing the rm from hostname21.

I think you have some concurrent access to files and directories. If possible, test how it works by disabling io-cache and readdir-ahead, which cache some data that could lead to inconsistencies if other clients check the same data soon enough:

# gluster volume set <volname> io-cache off
# gluster volume set <volname> dirread-ahead off

I was able to run the first one, but needed to correct the second one per the prompt:

# gluster volume set vol_name io-cache off
volume set: success
# gluster volume set vol_name dirread-ahead off
volume set: failed: option : dirread-ahead does not exist
Did you mean read-ahead?
# gluster volume set vol_name read-ahead off
volume set: success

To get rid of the cached info, I would also umount all clients, stop the volume (only the glusterd process should remain running on each server), and then restart everything.

I ran gluster volume stop vol_name. After that I was left with a glusterd and a glusterfs process running on each node, so I killed the glusterfs process on each node manually. Before starting the volume again, I'm running gfrm.sh so that I can be sure there are no concurrent changes in the mounted volume.

When running gfrm.sh I'm getting a number of "GFID doesn't exist" errors. For example:

[hostname21 ~]$ ./gfrm.sh /shared/.brick logs/TWHT110505056/J189127T52627671
logs/TWHT110505056/J189127T52627671: GFID doesn't exist

Indeed that directory doesn't seem to have an entry in .glusterfs:

[hostname21 ~]$ stat /shared/.brick/logs/TWHT110505056/J189127T52627671/
  File: ‘/shared/.brick/logs/TWHT110505056/J189127T52627671/’
  Size: 27              Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d    Inode: 633971745   Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-03-07 21:27:26.925164633 +0000
Modify: 2021-06-03 05:54:42.381314026 +0000
Change: 2021-12-17 16:51:28.823420402 +0000
 Birth: -
[hostname21 ~]$ getfattr -m. -e hex -d /shared/.brick/logs/TWHT110505056/J189127T52627671/
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs/TWHT110505056/J189127T52627671/
trusted.afr.vol_name-client-2=0x000000000000000000000000
trusted.gfid=0x97e0b18ae31c40099792b3478824d552
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x0100000000000000000000000060b86ea20000000016822f580000000060b86ea20000000016822f580000000060b7ab9b00000000320c2ea2
[hostname21 ~]$ stat /shared/.brick/.glusterfs/97/e0/97e0b18a-e31c-4009-9792-b3478824d552
stat: cannot stat ‘/shared/.brick/.glusterfs/97/e0/97e0b18a-e31c-4009-9792-b3478824d552’: No such file or directory

These are always directories. A look at the script seems to show that in this case the directory will not get cleaned up. In these cases, can I just remove it from the brick with rmdir (i.e., add rmdir "${entry}" to the script right after the echo "${1}: GFID doesn't exist")?

xhernandez commented 2 years ago

Did you do a regular copy, or were there directory renames involved? The only issue I can think of that could explain some kind of cross-linked directories would need to involve renames of directories.

Yes, just a regular rsync copy and no renames as far as I recall.

Can you check the gfid (trusted.gfid xattr) of the logs.bak2 directory and the directory from which you copied it, on all bricks?

I can't, since on March 2nd I succeeded in deleting logs.bak2, so I no longer have the directory. Before starting the repair procedures I had backed up the brick directories on hostname21 & hostname22, so I tried to check there, but the directories don't have xattrs. I likely used rsync, which by default does not copy xattrs.

We have the xattrs from logs.bak2 from this comment: https://github.com/gluster/glusterfs/issues/3072#issuecomment-1042370864.

You should check if the gfid is the same as the source directory of the copy.

The only issue would be checking whether the symlink was pointing to the right directory, but if you didn't rename that directory, I don't think there was an issue.

The only suspicious thing I see is that the client on hostname20 still keeps entries related to logs.bak2 in its cache (I guess you did the rm from that client)

Thanks for looking into it! However, note that I was doing the rm from hostname21.

In that case the data on hostname20 was not good. Better to get rid of it.

I think you have some concurrent access to files and directories. If possible, test how it works by disabling io-cache and readdir-ahead, which cache some data that could lead to inconsistencies if other clients check the same data soon enough:

# gluster volume set <volname> io-cache off
# gluster volume set <volname> dirread-ahead off

I was able to run the first one, but needed to correct the second one per the prompt:

# gluster volume set vol_name io-cache off
volume set: success
# gluster volume set vol_name dirread-ahead off
volume set: failed: option : dirread-ahead does not exist
Did you mean read-ahead?
# gluster volume set vol_name read-ahead off
volume set: success

Sorry. It was a typo. The correct command is:

# gluster volume set <volname> readdir-ahead off

read-ahead is for data files and may or may not help in your case. For normal FUSE mounts the kernel already does some read-ahead, so using gluster's read-ahead could cause big read amplification in some cases, but as far as I know there's no bug related to it, so you can keep it enabled if it works better for you.

To get rid of the cached info, I would also umount all clients, stop the volume (only the glusterd process should remain running on each server), and then restart everything.

I ran gluster volume stop vol_name. After that I was left with a glusterd and a glusterfs process running on each node, so I killed the glusterfs process on each node manually. Before starting the volume again, I'm running gfrm.sh so that I am sure that there are no concurrent changes in the mounted volume.

When running gfrm.sh I'm getting a number of "GFID doesn't exist" errors. For example:

[hostname21 ~]$ ./gfrm.sh /shared/.brick logs/TWHT110505056/J189127T52627671
logs/TWHT110505056/J189127T52627671: GFID doesn't exist

Indeed that directory doesn't seem to have an entry in .glusterfs:

[hostname21 ~]$ stat /shared/.brick/logs/TWHT110505056/J189127T52627671/
  File: ‘/shared/.brick/logs/TWHT110505056/J189127T52627671/’
  Size: 27              Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d    Inode: 633971745   Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-03-07 21:27:26.925164633 +0000
Modify: 2021-06-03 05:54:42.381314026 +0000
Change: 2021-12-17 16:51:28.823420402 +0000
 Birth: -
[hostname21 ~]$ getfattr -m. -e hex -d /shared/.brick/logs/TWHT110505056/J189127T52627671/
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs/TWHT110505056/J189127T52627671/
trusted.afr.vol_name-client-2=0x000000000000000000000000
trusted.gfid=0x97e0b18ae31c40099792b3478824d552
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x0100000000000000000000000060b86ea20000000016822f580000000060b86ea20000000016822f580000000060b7ab9b00000000320c2ea2
[hostname21 ~]$ stat /shared/.brick/.glusterfs/97/e0/97e0b18a-e31c-4009-9792-b3478824d552
stat: cannot stat ‘/shared/.brick/.glusterfs/97/e0/97e0b18a-e31c-4009-9792-b3478824d552’: No such file or directory

These are always directories. A look at the script seems to show that in this case the directory will not get cleaned up. In these cases, can I just remove it from the brick with rmdir (i.e., add rmdir "${entry}" to the script right after the echo "${1}: GFID doesn't exist")?

Yes, you can. The script tries to be very strict, so unless everything is fine, it doesn't touch anything. If the gfid is not present, removing the directory is fine. However, if the directory is not empty, you first need to remove all its entries using the script so that the related gfids are also removed.
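
A sketch of that tweak, using the variable names quoted above (the rest of gfrm.sh isn't reproduced here, so treat this as illustrative):

echo "${1}: GFID doesn't exist"
rmdir "${entry}" 2>/dev/null || echo "${1}: directory not empty, clean its contents first"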

The missing gfid is suspicious. It's weird that it happens in several cases. If it only happens with directories, could it be related to the issue you had when you ran out of inodes during a backup?

dalwise commented 2 years ago

We have the xattrs from logs.bak2 from this comment: https://github.com/gluster/glusterfs/issues/3072#issuecomment-1042370864.

Great that you remembered that!

You should check if the gfid is the same as the source directory of the copy.

It is different:

[hostname20 ~]$ getfattr -m. -e hex -d /shared/.brick/logs
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs
trusted.gfid=0xb05a24a1dfcb4157b3ff61524561484c
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.dht.mds=0x00000000
trusted.glusterfs.mdata=0x0100000000000000000000000061b38940000000000a1cbbbe0000000000000000000000000000000000000000000000000000000000000000
[hostname21 ~]$ getfattr -m. -e hex -d /shared/.brick/logs
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.afr.vol_name-client-2=0x000000000000000000000000
trusted.gfid=0xb05a24a1dfcb4157b3ff61524561484c
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.dht.mds=0x00000000
trusted.glusterfs.mdata=0x0100000000000000000000000061b38940000000000a1cbbbe0000000000000000000000000000000000000000000000000000000000000000
[hostname22 ~]$ getfattr -m. -e hex -d /shared/.brick/logs
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-0=0x000000000000000000000000
trusted.afr.vol_name-client-2=0x000000000000000000000000
trusted.gfid=0xb05a24a1dfcb4157b3ff61524561484c
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.dht.mds=0x00000000
trusted.glusterfs.mdata=0x0100000000000000000000000061b38940000000000a1cbbbe0000000000000000000000000000000000000000000000000000000000000000

In that case the data on hostname20 was not good. Better to get rid of it.

By "get rid of it" you mean making the cache setting changes that we are discussing or something else?

Sorry. It was a typo. The correct command is:

# gluster volume set <volname> readdir-ahead off

Thanks for clarifying. I reverted read-ahead to on and disabled readdir-ahead.

Yes, you can. The script tries to be very strict, so unless everything is fine, it doesn't touch anything. If the gfid is not present, removing the directory is fine. However, if the directory is not empty, you first need to remove all its entries using the script so that the related gfids are also removed.

The script is indeed strict. I had to run it through the list of files a few times, since, as you said, initially it would complain that directories were not (yet) empty. But the important thing is that after running it a few times and then doing the final rmdir cleanup, the file lists on hostname21 and hostname22 are now identical :). For reference, currently:

[hostname21 ~]$ wc -l gluster_files_ordered.txt
112938316 gluster_files_ordered.txt

I've now restarted the volume and launched the full heal operation. Hopefully that recreates the entries on hostname20. So far the heal count remains at 0. I'm also monitoring the inode creation rate. It's been running for a few hours and so far there has been a minimal increase in inode count.
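
For reference, a minimal way to sample the inode usage on the brick filesystem (an assumption about the method; the original collection may differ):

$ while sleep 60; do echo "$(date +%s) $(df -i /shared/.brick | tail -1)"; done >> inode_count.log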

The missing gfid is suspicious. It's weird that it happens in several cases. If it only happens with directories, could it be related to the issue you had when you ran out of inodes during a backup?

That may well be the case for hostname21, where there were hundreds of missing GFIDs. On hostname22 I had two missing GFIDs even though it never ran out of inodes.

xhernandez commented 2 years ago

In that case the data on hostname20 was not good. Better to get rid of it.

By "get rid of it" you mean making the cache setting changes that we are discussing or something else?

I meant the cached data, so that the client couldn't use the stale data for anything. Restarting the processes as you did is enough. However, I would keep those caching xlators disabled for now (io-cache and readdir-ahead).

Sorry. It was a typo. The correct command is:

# gluster volume set <volname> readdir-ahead off

Thanks for clarifying. I reverted read-ahead to on and disabled readdir-ahead.

Yes, you can. The script tries to be very strict, so unless everything is fine, it doesn't touch anything. If the gfid is not present, removing the directory is fine. However, if the directory is not empty, you first need to remove all its entries using the script so that the related gfids are also removed.

The script is indeed strict. I had to run it through the list of files a few times, since, as you said, initially it would complain that directories were not (yet) empty. But the important thing is that after running it a few times and then doing the final rmdir cleanup, the file lists on hostname21 and hostname22 are now identical :). For reference, currently:

[hostname21 ~]$ wc -l gluster_files_ordered.txt
112938316 gluster_files_ordered.txt

Great :)

I've now restarted the volume and launched the full heal operation. Hopefully that recreates the entries on hostname20. So far the heal count remains at 0. I'm also monitoring the inode creation rate. It's been running for a few hours and so far there has been a minimal increase in inode count.

It may take some time until it reaches the directories that have missing entries (it's doing a full scan of the entire volume in no particular order). If you want, you can check the self-heal log. It won't report any activity if no issues are found, but you should see that the "full sweep" has started and whether any errors have happened.
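
For example (log location and message wording may vary slightly between versions):

# grep -E 'full sweep| E \[' /var/log/glusterfs/glustershd.log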

The missing gfid is suspicious. It's weird that it happens in several cases. If it only happens with directories, could it be related to the issue you had when you ran out of inodes during a backup?

That may well be the case for hostname21, where there were hundreds of missing GFIDs. On hostname22 I had two missing GFIDs even though it never ran out of inodes.

Let's see if self-heal completes successfully this time and data is correctly replicated.

dalwise commented 2 years ago

Hi @xhernandez,

It may take some time until it reaches the directories that have missing entries (it's doing a full scan of the entire volume in no particular order). If you want, you can check the self-heal log. It won't report any activity if no issues are found, but you should see that the "full sweep" has started and whether any errors have happened.

Sorry for the delay in getting back to you. I was waiting for the sweep to complete and expected it to take a long time, as you had warned. However, I just noticed that it actually completed soon after being launched and I had missed it earlier.

The log shows a good number of errors related to "client-common.c:2705:client_pre_getxattr_v2", and the number of files on the hostname20 brick remains unchanged:

[hostname20 .brick]$ find . -path ./.glusterfs -prune -o -print | wc -l
649105

Here is the heal log for reference: glustershd.log-20220309.gz

Do you have any suggestion on how to unblock the heal?

Thanks!

dalwise commented 2 years ago

Hi @xhernandez, just a quick ping in case you have any further suggestions on how to get the heal to proceed, as it seems blocked at this point.

hostname20 continues to have far fewer inodes in use:

[root@hostname20 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
652148

Whereas hostname21 & hostname22 continue in sync at a much higher number:

[root@hostname21 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
112938373
[root@hostname22 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
112938373

I tried another full heal, but it showed the same type of errors as in my last message and did not change the used inodes on hostname20.

Thanks for all your help!