gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Heal count has been increasing for days and currently at over 34 million #3072

Open dalwise opened 2 years ago

dalwise commented 2 years ago

Description of problem: Heal count keeps increasing. Current status after 13 days of increases is:

# gluster volume heal vol_name statistics heal-count
Gathering count of entries to be healed on volume vol_name has been successful

Brick host_name21:/shared/.brick
Number of entries: 34410002

Brick host_name22:/shared/.brick
Number of entries: 34363886

Brick host_name20:/shared/.brick
Number of entries: 0

The exact command to reproduce the issue: The issue happened after trying to recover from an inaccessible directory (/shared/vol_name/logs) within the volume mounted at /shared/vol_name. Trying to access this directory would return "Transport endpoint not connected" on all clients. Other directories in the mounted volume were not affected.

gluster volume heal showed GFIDs in need of healing and the problematic directory in split brain. We were not able to get the files healed by using the gluster heal commands. We were then able to resolve problems with most GFIDs by removing the corresponding files in the bricks. However, one GFID remained in need of healing on host_name20 and we could not determine what file it corresponded to. Since host_name20 just had the arbiter brick, we tried removing it:

gluster volume remove-brick vol_name replica 2 host_name20:/shared/.brick force
gluster peer detach host_name20

That allowed us to access the directory that we could not see earlier. We then attempted to rejoin the arbiter with a clean brick:

rm -rf /shared/.brick/*  # Ran this on host_name20
gluster peer probe host_name20
gluster volume add-brick vol_name replica 3 arbiter 1 host_name20:/shared/.brick force

The directory is still accessible, but the number of files in need of healing has been increasing for the last 13 days.
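
(Aside, for anyone hitting the same question of which file a given GFID corresponds to: when the GFID's entry under .glusterfs still exists on a brick, it can usually be resolved there directly. A rough sketch, run from the brick root, with a placeholder GFID:)

# Placeholder GFID; substitute the one reported by "gluster volume heal <volname> info"
GFID=01234567-89ab-cdef-0123-456789abcdef
ENTRY=".glusterfs/${GFID:0:2}/${GFID:2:2}/${GFID}"

if [ -L "${ENTRY}" ]; then
    # Directories: the entry is a symlink encoding the parent GFID and the directory name
    readlink "${ENTRY}"
elif [ -e "${ENTRY}" ]; then
    # Regular files: the entry is an extra hardlink, so find the real path sharing its inode
    find . -path ./.glusterfs -prune -o -samefile "${ENTRY}" -print
else
    echo "no .glusterfs entry for ${GFID} on this brick"
fi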

We can likely recover by simply backing up the files, destroying the current volume and then moving the files onto a newly created volume. However, we are at a loss as to why the heal count keeps increasing and how to bring it back down without rebuilding the volume from scratch.

The full output of the command that failed: (heal count command above)

Expected results: Heal count to return to 0.

Mandatory info:

- The output of the gluster volume info command:

Volume Name: vol_name
Type: Replicate
Volume ID: 4833820f-4518-4a2e-a3d8-63d6f31c4646
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: host_name21:/shared/.brick
Brick2: host_name22:/shared/.brick
Brick3: host_name20:/shared/.brick (arbiter)
Options Reconfigured:
cluster.self-heal-window-size: 2
cluster.shd-wait-qlength: 2048
disperse.shd-wait-qlength: 2048
cluster.shd-max-threads: 8
cluster.self-heal-daemon: enable
transport.address-family: inet6
nfs.disable: on
performance.client-io-threads: off

- The output of the gluster volume status command:

Status of volume: vol_name
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick host_name21:/shared/.brick                  49153     0          Y       55517
Brick host_name22:/shared/.brick                  49153     0          Y       60313
Brick host_name20:/shared/.brick                  49152     0          Y       59555
Self-heal Daemon on localhost               N/A       N/A        Y       58347
Self-heal Daemon on host_name22                   N/A       N/A        Y       89514
Self-heal Daemon on host_name21                   N/A       N/A        Y       57650

Task Status of Volume vol_name
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal command: (shared the heal-count output above, as the full heal info output contains tens of millions of entries)

- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/

https://www.dropbox.com/s/p93wyyztj5bmzk8/logs.tgz

This compressed log file contains the logs for all three server nodes. The server nodes have the volume mounted and so are acting as clients also.

The volume name is the name of an internal project and has been changed to "vol_name" in command outputs and logs. The hostnames are also internal and have been changed to host_name20, host_name21 & host_name22.

- Is there any crash ? Provide the backtrace and coredump: No crash

Additional info:

- The operating system / glusterfs version:


xhernandez commented 2 years ago

Hi @dalwise. I completely missed your previous update. I'm sorry.

Can you run this script on hostname20 ?

#!/bin/bash

set -eEu

BRICK="${1:?You must pass the path to the root of the brick}"
BRICK="$(realpath "${BRICK}")"

if [[ ! -d "${BRICK}/.glusterfs" ]]; then
    echo "'${BRICK}' doesn't seem to contain a brick" >&2
    exit 1
fi

# Map: GFID -> resolved path, filled from the .glusterfs directory symlinks below
declare -A GFIDS

# Recursively resolve a directory GFID to its full path by following parent GFIDs;
# a symlink target of "../../.." marks the brick root.
function resolve() {
    local gfid="${1}"
    local link ref

    link="${GFIDS[${gfid}]-}"
    if [[ -z "${link}" ]]; then
        echo "${gfid} doesn't exist" >&2
        GFIDS[${gfid}]="<missing>/"
    elif [[ "${link}" == "../../.." ]]; then
        GFIDS[${gfid}]="/"
    elif [[ "${link:0:6}" == "../../" ]]; then
        ref="${link:12:36}"
        resolve "${ref}"
        GFIDS[${gfid}]="${GFIDS[${ref}]}${link:49}/"
    fi
}

# Collect every directory GFID entry under .glusterfs: symlinks named after the GFID,
# whose target encodes the parent GFID and the directory name
while read gfid link; do
    GFIDS[${gfid}]="${link}"
done < <(find "${BRICK}/.glusterfs" -type l -links 1 -printf "%f %l\n")

for gfid in "${!GFIDS[@]}"; do
    resolve "${gfid}"
done

len="${#BRICK}"

# Walk the brick (skipping .glusterfs), read each directory's trusted.gfid xattr
# and compare it against the symlink map built above
while read gfid path; do
    gfid="${gfid:0:8}-${gfid:8:4}-${gfid:12:4}-${gfid:16:4}-${gfid:20}"
    path="${path:${len}}"
    if [[ -z "${GFIDS[${gfid}]-}" ]]; then
        echo "Directory without GFID (${gfid}): '${path}'" >&2
    else
        if [[ "${GFIDS[${gfid}]}" != "${path}" ]]; then
            echo "Mismatching directory (${gfid}): '${path}' <-> '${GFIDS[${gfid}]}'" >&2
        fi
        unset GFIDS[${gfid}]
    fi
done < <(find "${BRICK}" -path "${BRICK}/.glusterfs" -prune -o -type d -exec getfattr -e hex -n trusted.gfid --absolute-names {} \; |
             sed -n '/^#\s*file\s*:/{N;s/^#\s*file\s*:\s*\(.*\)\ntrusted\.gfid\s*=\s*0x\(.*\)/\2 \1\//p}')

# Anything still left in the map has a .glusterfs entry but no matching directory on the brick
for gfid in "${!GFIDS[@]}"; do
    echo "Orphan GFID (${gfid}): '${GFIDS[${gfid}]}'" >&2
done

To run it, just pass the root directory of the brick. It will check whether all directories are correctly defined. The assertion could be caused by a directory without its corresponding GFID entry.
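
For example, saving the script as check_gluster_dirs.sh (name assumed here) and capturing its findings, which are written to stderr:

bash check_gluster_dirs.sh /shared/.brick 2> check_gluster_dirs.log
wc -l check_gluster_dirs.log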

dalwise commented 2 years ago

Thanks for the feedback @xhernandez!

The script did find many issues needing correction. I stored the output in check_gluster_dirs.log, which has 345235 lines:

[hostname20 ~]$ wc -l check_gluster_dirs.log
345235 check_gluster_dirs.log

They fall into the following categories:

[hostname20 ~]$ grep -c "doesn't exist" check_gluster_dirs.log
1550
[hostname20 ~]$ grep -c "Directory without GFID" check_gluster_dirs.log
339250
[hostname20 ~]$ grep -c "Mismatching directory" check_gluster_dirs.log
4435
[hostname20 ~]$

Is there any automated way to fix these?

Best regards, Daniel

xhernandez commented 2 years ago

I wasn't expecting so many errors. Can you run the script on another brick that should be ok, to be sure that there isn't any bug in the script ?

You can also select some of the errors and manually verify that they are correct. If data is fine, then this is what you should do:

First of all, you should disable self-heal to prevent unexpected interference while you are touching the backend contents, especially with so many errors:

# gluster volume set <volname> self-heal-daemon off

Then fix the errors; for a directory whose GFID symlink is missing, recreate it:

ln -s ../../${parent_gfid:0:2}/${parent_gfid:2:2}/${parent_gfid}/${dir_name} .glusterfs/${gfid:0:2}/${gfid:2:2}/${gfid}

"parent_gfid" is the GFID of the parent directory. "dir_name" is the base name of the directory, and "gfid" is the GFID of the directory.

Both options can be automated with a script if necessary.
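
As an illustration of that automation, a rough, untested sketch that rebuilds the missing symlinks from the "Directory without GFID" reports (the log file name and the line parsing are assumptions based on the checker's output; verify a few entries by hand first and keep self-heal disabled):

#!/bin/bash
# Sketch only: recreate missing GFID symlinks for "Directory without GFID" reports.
set -eu

BRICK="${1:?path to the brick root}"
LOG="${2:?log produced by the checker script}"

grep '^Directory without GFID' "${LOG}" | while read -r line; do
    # Lines look like: Directory without GFID (<gfid>): '<path>/'
    gfid="$(echo "${line}" | sed -n "s/.*(\(.*\)): .*/\1/p")"
    path="$(echo "${line}" | sed -n "s/.*: '\(.*\)'.*/\1/p")"
    path="${path%/}"
    dir_name="$(basename "${path}")"
    parent="$(dirname "${path}")"

    # The parent directory's GFID comes from its trusted.gfid xattr (hex string after 0x)
    hex="$(getfattr -e hex -n trusted.gfid --absolute-names "${BRICK}${parent}" 2>/dev/null |
           sed -n 's/^trusted\.gfid=0x//p')"
    if [ -z "${hex}" ]; then
        echo "could not read parent GFID for '${path}', skipping" >&2
        continue
    fi
    parent_gfid="${hex:0:8}-${hex:8:4}-${hex:12:4}-${hex:16:4}-${hex:20:12}"

    # Recreate the directory's GFID symlink exactly as described above
    mkdir -p "${BRICK}/.glusterfs/${gfid:0:2}/${gfid:2:2}"
    ln -s "../../${parent_gfid:0:2}/${parent_gfid:2:2}/${parent_gfid}/${dir_name}" \
          "${BRICK}/.glusterfs/${gfid:0:2}/${gfid:2:2}/${gfid}"
done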

If you fix all the issues, run the script again to verify that everything is correct before restarting self-heal.

There's another possibility: given that the number of missing entries is huge compared to the existing ones, and that half of the existing ones are already damaged, it may be easier to just remove everything from the arbiter brick and do a full heal. If I'm not wrong, this is basically what you already did at the beginning, so the arbiter brick should have come out healthy. Since that's not the case, before doing anything verify that the other two bricks don't have any issue with directories (running the script) that could cause issues with self-heal.

If you decide to go this way, let me know to tell you exactly what to remove and how to start self-heal.
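
(For readers following along: once the brick contents have been cleared per that guidance, restarting the heal typically amounts to re-enabling the self-heal daemon and triggering a full heal; shown here only as a hedged sketch, not a substitute for the exact steps offered above.)

# gluster volume set <volname> self-heal-daemon on
# gluster volume heal <volname> full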

dalwise commented 2 years ago

Hi @xhernandez ,

There's another possibility [...] before doing anything verify that the other two bricks don't have any issue with directories (running the script) that could cause issues with self-heal.

I have been running the script on hostname21 & hostname22 to evaluate the extent of the problem. Running on hostname21 took over a day and yielded:

[hostname21 ~]$ wc -l check_gluster_dirs.log
29479698 check_gluster_dirs.log
[hostname21 ~]$ grep -c "Directory without GFID" check_gluster_dirs.log
29479692
[hostname21 ~]$ grep -c "Orphan GFID" check_gluster_dirs.log
6

The run on hostname22 is still ongoing and has 26354180 lines so far. I'll report back when it completes, but it does look like there are considerable inconsistencies on all bricks.

Best regards

xhernandez commented 2 years ago

Can you provide some examples of these inconsistencies ? Also provide stat and getfattr -m. -e hex -d -h output for them, if possible.

dalwise commented 2 years ago

Sure!

hostname22 has finished running the script you provided and shows the same kinds of issues as hostname21.

Here are some samples of directories without GFID in hostname21 & hostname22:

Directory without GFID (daa23852-c221-4c5f-803b-3e3d5678046e): '/logs/TWHT192909365/J2882T706383/'
Directory without GFID (1b8ce690-2eed-4a60-a49b-eb9a5f98cf35): '/logs/TWHT192907399/J195T22180/'
Directory without GFID (672e7b8d-d227-4198-9965-378faafd291c): '/logs/TWHT192907399/J195T30029/'
Directory without GFID (c585c9b7-b551-4135-b537-e32bc9ec9a94): '/logs/TWHT192907399/J1292T258174/'
Directory without GFID (2c12b41f-9b88-4d4c-b2c6-d19b3fd2b703): '/logs/TWHT192805307/J172T30005/'
Directory without GFID (7e59d86d-e004-44c7-b923-3fe888517bb8): '/logs/TWHT192801241/J178T22865/'
Directory without GFID (4acfb407-d5fc-4ae4-bb73-f37eee3d26d4): '/logs/TWHT192801241/J178T22953/'
Directory without GFID (98cfad02-8d42-4cfd-b504-a6c59e128eb4): '/logs/TWHT192803296/J262T33613/'
Directory without GFID (74c751f8-e8f4-41e7-aea8-039d276a7f32): '/logs/TWHT192803296/J262T35029/'
Directory without GFID (38e1167a-154b-4a72-aa7f-8675a8f95d52): '/logs/TWHT192812148/J3689T881252/'

For the first item in the list above these are the stat and getfattr results on each node:

[hostname20 .brick]# stat logs/TWHT192909365/J2882T706383/
stat: cannot stat ‘logs/TWHT192909365/J2882T706383/’: No such file or directory
[hostname21 .brick]$ stat logs/TWHT192909365/J2882T706383/
  File: ‘logs/TWHT192909365/J2882T706383/’
  Size: 130           Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d    Inode: 677676032   Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-05-10 18:58:19.266364954 +0000
Modify: 2019-11-17 02:23:14.332844902 +0000
Change: 2022-02-02 19:57:17.643823391 +0000
 Birth: -
[hostname21 .brick]$ getfattr -m. -e hex -d -h logs/TWHT192909365/J2882T706383/
# file: logs/TWHT192909365/J2882T706383/
trusted.afr.vol_name-client-2=0x000000000000000000000000
trusted.gfid=0xdaa23852c2214c5f803b3e3d5678046e
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.dht.mds=0x00000000
trusted.glusterfs.mdata=0x010000000000000000000000005dd0af120000000013d6cf66000000005dd0af120000000013d6cf66000000005eebc45c00000000286deaaf
[hostname22 .brick]$ stat logs/TWHT192909365/J2882T706383/
  File: ‘logs/TWHT192909365/J2882T706383/’
  Size: 130           Blocks: 8          IO Block: 4096   directory
Device: fd00h/64768d    Inode: 677681199   Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-05-11 01:47:35.786147664 +0000
Modify: 2019-11-17 02:23:14.330110441 +0000
Change: 2022-02-07 20:16:36.321719075 +0000
 Birth: -
[hostname22 .brick]$ getfattr -m. -e hex -d -h logs/TWHT192909365/J2882T706383/
# file: logs/TWHT192909365/J2882T706383/
trusted.afr.vol_name-client-0=0x000000000000000100000001
trusted.afr.vol_name-client-2=0x000000000000000000000000
trusted.gfid=0xdaa23852c2214c5f803b3e3d5678046e
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.dht.mds=0x00000000
trusted.glusterfs.mdata=0x010000000000000000000000005dd0af120000000013d6cf66000000005dd0af120000000013d6cf66000000005eebc45c00000000286deaaf

[hostname22 .brick]$

xhernandez commented 2 years ago

@dalwise can you check if .glusterfs/da/a2/daa23852-c221-4c5f-803b-3e3d5678046e exists on any brick and run stat on it if so ?

The entry on hostname22 indicates pending changes on another brick; is there nothing in gluster volume heal <volname> info ?

Can you also check the contents of .glusterfs/indices/* (there should be 3 subdirectories) on all bricks ? If there's something, is daa23852-c221-4c5f-803b-3e3d5678046e there ?

dalwise commented 2 years ago

@dalwise can you check if .glusterfs/da/a2/daa23852-c221-4c5f-803b-3e3d5678046e exists on any brick and run stat on it if so ?

It doesn't exist on any of the 3 bricks:

[hostname20 .brick]$ ls /shared/.brick/.glusterfs/da/a2/daa23852-c221-4c5f-803b-3e3d5678046e
ls: cannot access /shared/.brick/.glusterfs/da/a2/daa23852-c221-4c5f-803b-3e3d5678046e: No such file or directory
[hostname21 .brick]$ ls /shared/.brick/.glusterfs/da/a2/daa23852-c221-4c5f-803b-3e3d5678046e
ls: cannot access /shared/.brick/.glusterfs/da/a2/daa23852-c221-4c5f-803b-3e3d5678046e: No such file or directory
[hostname22 .brick]$ ls /shared/.brick/.glusterfs/da/a2/daa23852-c221-4c5f-803b-3e3d5678046e
ls: cannot access /shared/.brick/.glusterfs/da/a2/daa23852-c221-4c5f-803b-3e3d5678046e: No such file or directory

The entry on hostname22 indicates pending changes on another brick, is there nothing in gluster volume heal info ?

There isn't:

[hostname20 .brick]$ gluster volume heal vol_name info
Brick hostname21:/shared/.brick
Status: Connected
Number of entries: 0

Brick hostname22:/shared/.brick
Status: Connected
Number of entries: 0

Brick hostname20:/shared/.brick
Status: Connected
Number of entries: 0

Can you also check the contents of .glusterfs/indices/* (there should be 3 subdirectories) on all bricks ? if there's something, is daa23852-c221-4c5f-803b-3e3d5678046e there ?

It's not there:

[hostname20 .brick]$ ls .glusterfs/indices/*
.glusterfs/indices/dirty:
dirty-07fcd31e-4e46-48af-ac73-7609ea647fde

.glusterfs/indices/entry-changes:

.glusterfs/indices/xattrop:
xattrop-07fcd31e-4e46-48af-ac73-7609ea647fde
[hostname21 .brick]$ ls .glusterfs/indices/*
.glusterfs/indices/dirty:
dirty-f739c46e-c0b8-4dd4-9a28-315c63aa7b81

.glusterfs/indices/entry-changes:

.glusterfs/indices/xattrop:
xattrop-f739c46e-c0b8-4dd4-9a28-315c63aa7b81
[hostname21 .brick]$
[hostname22 .brick]$ ls .glusterfs/indices/*
.glusterfs/indices/dirty:
dirty-4f6e0bfc-c48a-47cf-a9eb-eae609802bd7

.glusterfs/indices/entry-changes:

.glusterfs/indices/xattrop:
xattrop-4f6e0bfc-c48a-47cf-a9eb-eae609802bd7
[hostname22 .brick]$

Thank you very much for all your help.

xhernandez commented 2 years ago

I'm preparing a tool to do a full check of the bricks; I'll need some more time...

dalwise commented 2 years ago

Thank you again for your help on this!

xhernandez commented 2 years ago

Hi @dalwise. I'm very sorry for the delay. I was trying to create a generic tool that could read the data as fast as possible, but its complexity and the other work I have make it hard. I'll provide a simpler tool soon.

stale[bot] commented 1 year ago

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

dalwise commented 1 year ago

Hi @xhernandez, we still have not been able to put the systems that had this issue back into production. Do you have any updates on the tool you had mentioned to do a full check of the bricks?

Thank you very much!