Not sure why the entry count is 0 on node A but 366 on B & C.
Could anyone tell me how to clean them up? It seems to be slowing my servers down.
This means heals need to happen from B and C onto A for at least 366 files/directories. Until the heals complete, the cluster will be in a degraded state. Is the number not decreasing?
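(For reference, a minimal way to keep an eye on the pending-heal counts; the volume name here is an assumption based on the afr xattrs that appear later in the thread, and "info summary" needs a reasonably recent release, which Gluster 6.x is:)

VOLNAME=Gluster_v0          # assumption: substitute your actual volume name

# Per-brick list of pending entries (paths or gfids):
gluster volume heal "$VOLNAME" info

# Just the per-brick counts, refreshed every 60s, to see whether they shrink:
watch -n 60 "gluster volume heal $VOLNAME info summary"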
Hi @pranithk
I'm afraid not, it's not decreasing. It's still 366, which seems strange to me.
Is there any way to decrease them manually?
@NanyinK Can you post the output of the following command on each of the bricks?
# getfattr -d -m. -e hex <brick-path>/.glusterfs/fe/2d/fe2ddd63-bbbf-4bac-a719-e0998047c059
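(Background on how that path is built, in case it helps others reading this: an entry that heal info reports as a bare gfid lives under .glusterfs/<first-two-hex-chars>/<next-two>/<full-gfid> inside the brick. A rough sketch, with the brick root as an assumption:)

BRICK=/data/brick1/gv0                          # assumed brick root
GFID=fe2ddd63-bbbf-4bac-a719-e0998047c059       # gfid from heal info
GFID_PATH="$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
getfattr -d -m. -e hex "$GFID_PATH"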
@pranithk
sure.
A:
getfattr -d -m. -e hex /data/brick1/gv0/.glusterfs/fe/2d/fe2ddd63-bbbf-4bac-a719-e0998047c059
getfattr: /data/brick1/gv0/.glusterfs/fe/2d/fe2ddd63-bbbf-4bac-a719-e0998047c059: No such file or directory

B:
getfattr -d -m. -e hex /data/brick1/gv0/.glusterfs/fe/2d/fe2ddd63-bbbf-4bac-a719-e0998047c059
getfattr: /data/brick1/gv0/.glusterfs/fe/2d/fe2ddd63-bbbf-4bac-a719-e0998047c059: Too many levels of symbolic links

C:
getfattr -d -m. -e hex /data/brick1/gv0/.glusterfs/fe/2d/fe2ddd63-bbbf-4bac-a719-e0998047c059
getfattr: /data/brick1/gv0/.glusterfs/fe/2d/fe2ddd63-bbbf-4bac-a719-e0998047c059: Too many levels of symbolic links
Could you do readlink -f on the file paths where it gives "Too many levels of symbolic links" and paste the output?
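(Why readlink helps, as far as I understand the .glusterfs layout: for a directory the gfid entry is a symlink into its parent's gfid entry, so a deeply nested directory can exceed the kernel's symlink-nesting limit and getfattr fails with "Too many levels of symbolic links", while readlink -f still resolves it to the real brick path. A small sketch:)

GFID_PATH=/data/brick1/gv0/.glusterfs/fe/2d/fe2ddd63-bbbf-4bac-a719-e0998047c059
readlink -f "$GFID_PATH"     # the real path on the brick, if the entry is a directory
ls -l "$GFID_PATH"           # shows whether the entry is a symlink or a regular file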
Hi @pranithk
A:
readlink -f /data/brick1/gv0/.glusterfs/fe/2d/fe2ddd63-bbbf-4bac-a719-e0998047c059
/data/brick1/gv0/.glusterfs/fe/2d/fe2ddd63-bbbf-4bac-a719-e0998047c059

B:
readlink -f /data/brick1/gv0/.glusterfs/fe/2d/fe2ddd63-bbbf-4bac-a719-e0998047c059
/data/brick1/gv0/banking/imb/udiscover/data/appdata/OUTPUT_TRAINING/20200908141935/1279/tmp_var.parquet/CALC_TYPE_SUBFOLDER=PERC_CHG/_temporary/0/task_20200908154047_24936_m_000004

C:
readlink -f /data/brick1/gv0/.glusterfs/fe/2d/fe2ddd63-bbbf-4bac-a719-e0998047c059
/data/brick1/gv0/banking/imb/udiscover/data/appdata/OUTPUT_TRAINING/20200908141935/1279/tmp_var.parquet/CALC_TYPE_SUBFOLDER=PERC_CHG/_temporary/0/task_20200908154047_24936_m_000004
@NanyinK Can you give the output of:
getfattr -d -m. -e hex <brick-path>/banking/imb/udiscover/data/appdata/OUTPUT_TRAINING/20200908141935/1___2___79/tmp_var.parquet/CALC_TYPE_SUBFOLDER=PERC_CHG/_temporary/0/task_20200908154047_24936_m_000004
and
getfattr -d -m. -e hex <brick-path>/banking/imb/udiscover/data/appdata/OUTPUT_TRAINING/20200908141935/1___2___79/tmp_var.parquet/CALC_TYPE_SUBFOLDER=PERC_CHG/_temporary/0
on all the bricks?
Hi @pranithk
Still nothing on A.
B:
getfattr -d -m. -e hex /data/brick1/gv0/banking/imb/udiscover/data/appdata/OUTPUT_TRAINING/20200908141935/1279/tmp_var.parquet/CALC_TYPE_SUBFOLDER=PERC_CHG/_temporary/0/task_20200908154047_24936_m_000004
getfattr: Removing leading '/' from absolute path names
# file: data/brick1/gv0/banking/imb/udiscover/data/appdata/OUTPUT_TRAINING/20200908141935/1279/tmp_var.parquet/CALC_TYPE_SUBFOLDER=PERC_CHG/_temporary/0/task_20200908154047_24936_m_000004
system.posix_acl_access=0x0200000001000700ffffffff04000700ffffffff080007001027000010000700ffffffff20000500ffffffff
system.posix_acl_default=0x0200000001000700ffffffff04000700ffffffff080007001027000010000700ffffffff20000500ffffffff
trusted.SGI_ACL_DEFAULT=0x0000000500000001ffffffff0007000000000004ffffffff0007000000000008000027100007000000000010ffffffff0007000000000020ffffffff00050000
trusted.SGI_ACL_FILE=0x0000000500000001ffffffff0007000000000004ffffffff0007000000000008000027100007000000000010ffffffff0007000000000020ffffffff00050000
trusted.afr.Gluster_v0-client-0=0x000000000000000100000003
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0xfe2ddd63bbbf4baca719e0998047c059
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x010000000000000000000000005f57195f000000002eb30d49000000005f57195f0000000023833d49000000005f57195f00000000185bd632

C: the same command on C returns exactly the same output as on B.
And for the _temporary/0 directory:
B:
getfattr -d -m. -e hex /data/brick1/gv0/banking/imb/udiscover/data/appdata/OUTPUT_TRAINING/20200908141935/1279/tmp_var.parquet/CALC_TYPE_SUBFOLDER=PERC_CHG/_temporary/0
getfattr: Removing leading '/' from absolute path names
# file: data/brick1/gv0/banking/imb/udiscover/data/appdata/OUTPUT_TRAINING/20200908141935/1279/tmp_var.parquet/CALC_TYPE_SUBFOLDER=PERC_CHG/_temporary/0
system.posix_acl_access=0x0200000001000700ffffffff04000700ffffffff080007001027000010000700ffffffff20000500ffffffff
system.posix_acl_default=0x0200000001000700ffffffff04000700ffffffff080007001027000010000700ffffffff20000500ffffffff
trusted.SGI_ACL_DEFAULT=0x0000000500000001ffffffff0007000000000004ffffffff0007000000000008000027100007000000000010ffffffff0007000000000020ffffffff00050000
trusted.SGI_ACL_FILE=0x0000000500000001ffffffff0007000000000004ffffffff0007000000000008000027100007000000000010ffffffff0007000000000020ffffffff00050000
trusted.afr.Gluster_v0-client-0=0x00000000000000000000005e
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x94d78463631e4626bb71e6187f2ca6d2
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x010000000000000000000000005f57195f0000000030f565af000000005f57195f0000000030f565af000000005f5718da000000002a115abd

C: the same command on C returns exactly the same output as on B.
@NanyinK Can you do
stat <gluster-mount-point>/banking/imb/udiscover/data/appdata/OUTPUT_TRAINING/20200908141935/1___2___79/tmp_var.parquet/CALC_TYPE_SUBFOLDER=PERC_CHG/_temporary/0/task_20200908154047_24936_m_000004
Wait for some time and check whether the directories get created on A.
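(A hedged sketch of that step; the mount point and missing-directory path are placeholders, and the volume name is an assumption taken from the afr xattrs above:)

# On any gluster client, through the FUSE mount (NOT on the brick directly):
stat <gluster-mount-point>/<path-to-the-missing-directory>

# Then, on a server, watch whether the pending count starts dropping:
gluster volume heal Gluster_v0 info summary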
Hi @pranithk, which one should I do this on, A or all of them?
On any of the clients with gluster-mount-point as mentioned in the previous comment.
Hi @pranithk
15 minutes have passed but nothing has happened on A.
The file is still not there.
@NanyinK The directory is not created?
@pranithk yep
stat /data/brick1/gv0/banking/imb/udiscover/data/appdata/OUTPUT_TRAINING/20200908141935/1279/tmp_var.parquet/CALC_TYPE_SUBFOLDER=PERC_CHG/_temporary/0/task_20200908154047_24936_m_000004
stat: cannot stat ‘/data/brick1/gv0/banking/imb/udiscover/data/appdata/OUTPUT_TRAINING/20200908141935/1279/tmp_var.parquet/CALC_TYPE_SUBFOLDER=PERC_CHG/_temporary/0/task_20200908154047_24936_m_000004’: No such file or directory
@NanyinK Could you find the last directory in the hierarchy that still exists on A, and get the getfattr output for that directory from all the bricks?
I mean: if a/b/c/d is the path and, on A, 'a' exists but b/c/d don't, let me know the getfattr output for 'a' on all the bricks.
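(A throwaway loop for that, to be run on brick A; the full path is the one from the stat output above:)

BRICK=/data/brick1/gv0
P="$BRICK/banking/imb/udiscover/data/appdata/OUTPUT_TRAINING/20200908141935/1279/tmp_var.parquet/CALC_TYPE_SUBFOLDER=PERC_CHG/_temporary/0/task_20200908154047_24936_m_000004"

# Walk up until a component exists on this brick.
while [ ! -e "$P" ] && [ "$P" != "$BRICK" ]; do
    P=$(dirname "$P")
done
echo "deepest existing component: $P"
getfattr -d -m. -e hex "$P"    # repeat on B and C for comparison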
Hi @pranithk
they are all the same
getfattr -d -m. -e hex /data/brick1/gv0
getfattr: Removing leading '/' from absolute path names
# file: data/brick1/gv0
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.mdata=0x01000000000000000000000000607659b90000000024e34cfb00000000607659b90000000024e34cfb000000005cb9ce490000000037f5b669
trusted.glusterfs.volume-id=0x82372ae7ccef4ed39f9e3e192f41598b
Hi @pranithk
I found where it starts to be different.
On all the bricks, the output of this command is the same:
getfattr -d -m. -e hex /data/brick1/gv0/banking/imb/udiscover/data/appdata/OUTPUT_TRAINING/20200908141935
getfattr: Removing leading '/' from absolute path names
# file: data/brick1/gv0/banking/imb/udiscover/data/appdata/OUTPUT_TRAINING/20200908141935
system.posix_acl_access=0x0200000001000700ffffffff04000700ffffffff080007001027000010000700ffffffff20000500ffffffff
system.posix_acl_default=0x0200000001000700ffffffff04000700ffffffff080007001027000010000700ffffffff20000500ffffffff
trusted.SGI_ACL_DEFAULT=0x0000000500000001ffffffff0007000000000004ffffffff0007000000000008000027100007000000000010ffffffff0007000000000020ffffffff00050000
trusted.SGI_ACL_FILE=0x0000000500000001ffffffff0007000000000004ffffffff0007000000000008000027100007000000000010ffffffff0007000000000020ffffffff00050000
trusted.gfid=0xb1ba24b05f5a4dd4aa98933e8d138a98
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x010000000000000000000000005f7d3b99000000002b3b5a41000000005f7d3b99000000002b3b5a41000000005f57065700000000337db547
But the contents differ: there is a "1279" folder on B & C, yet not on A.
@NanyinK Can you do a fresh mount on any of the clients and do a stat on the missing directory on the mount? I think the directories are cached, so the healing is not getting triggered.
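(A sketch of that idea; the server name and scratch mount point are assumptions, and the volume name is taken from the afr xattrs above:)

# On any client: a throwaway FUSE mount just to force a fresh lookup.
mkdir -p /mnt/heal-check
mount -t glusterfs serverA:/Gluster_v0 /mnt/heal-check

stat /mnt/heal-check/<path-to-the-missing-directory>

# Unmount once the heal count starts moving.
umount /mnt/heal-check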
Hi @pranithk
I'm afraid I can't do that, since it's the PROD environment and it's in use.
By the way, are we trying to use the good file on A to recover the bad ones on B & C?
@NanyinK As per the xattrs, B & C are good, so A should get the contents.
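(For anyone following along: as I understand the AFR changelog format, a trusted.afr.<volume>-client-N value is three big-endian 32-bit counters — data, metadata and entry operations pending against brick N, where client-0 presumably corresponds to the first brick, A. The 0x000000000000000100000003 seen on B and C for client-0 would therefore mean metadata and entry heals are pending towards A. A quick decode:)

val=000000000000000100000003          # the xattr value with the leading 0x stripped
echo "data pending:     $((16#${val:0:8}))"
echo "metadata pending: $((16#${val:8:8}))"
echo "entry pending:    $((16#${val:16:8}))"
# prints 0 / 1 / 3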
Are you saying you can't do a fresh mount? If yes, try doing stat again on the old mount in case it can trigger the heal.
Hi @pranithk
I noticed one thing:
my client mounts A, which doesn't have the file, but the file is visible on the client.
I mean,
for example, say we are talking about the file /a/1.txt:
on A there is no /a/1.txt, yet on B & C there is. My client mounts A, but /a/1.txt is there on the mount.
Oh, the file is on A now, after I did the stat on the client mount.
But the number of entries shown in gluster heal info went up by one:
Number of entries: 367
Amazing... the number has dropped to 87:
Number of entries: 87
It has stopped at 87; how can we continue?
I think there is probably one more directory in a similar state. You need to follow the same process to figure it out.
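(Roughly the same process, scripted, in case it saves some typing; the volume name and brick root are assumptions, and it should be run on B or C:)

VOLNAME=Gluster_v0
BRICK=/data/brick1/gv0

# Resolve every gfid still listed by heal info to a brick path (directories
# resolve through their symlink; plain files just echo the gfid path back).
gluster volume heal "$VOLNAME" info \
  | grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' \
  | sort -u \
  | while read -r gfid; do
        p="$BRICK/.glusterfs/${gfid:0:2}/${gfid:2:2}/$gfid"
        printf '%s -> %s\n' "$gfid" "$(readlink -f "$p")"
    done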
Gluster will pick the correct copy to show the file. I am curious to know how you landed up in this state. What happened on A that it required heals?
@pranithk
Yep, I think so too. Yet when I follow the same process on the current entries, it doesn't return a real file path to me :(
readlink -f /data/brick1/gv0/.glusterfs/08/4c/084c9b92-e32c-4b83-8302-8d788e463d5e
/data/brick1/gv0/.glusterfs/08/4c/084c9b92-e32c-4b83-8302-8d788e463d5e
I am curious as well, haha.
Previously I upgraded this environment from Gluster 6.0 to Gluster 6.10, and then the entries grew for a while, from around 200 to 366.
I think we are just using it normally, but it did show us a few kinds of issues.
You need to find one that is a directory. Do ls -l on that path and see if it is a softlink pointing to some directory.
How did you do the upgrade?
Back to the gfid entries: the number has reduced to 12, and I cannot proceed now.
ll /data/brick1/gv0/.glusterfs/08/4c/084c9b92-e32c-4b83-8302-8d788e463d5e
-rw-r--r--+ 1 10000 10000 0 Jun 17 2020 /data/brick1/gv0/.glusterfs/08/4c/084c9b92-e32c-4b83-8302-8d788e463d5e
The link count is "1", and it's not a soft link.
Does that mean I can simply remove it?
About the upgrade: I followed the official Gluster doc.
Unmount the volume on the clients, stop the cluster, kill the gluster processes, yum update, start, and then mount again.
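(The same sequence as a rough shell sketch, for anyone comparing notes; the package name, service names, mount point and volume name are all assumptions to adapt to your environment and the official upgrade guide:)

# On every client:
umount /mnt/glusterfs

# On every server:
gluster volume stop Gluster_v0
systemctl stop glusterd
pkill glusterfs; pkill glusterfsd        # make sure brick / self-heal processes are gone
yum update glusterfs-server              # 6.0 -> 6.10
systemctl start glusterd
gluster volume start Gluster_v0

# Back on the clients:
mount -t glusterfs serverA:/Gluster_v0 /mnt/glusterfs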
I think I/O was probably in progress when the clients were unmounted.
About the gfid file above: if the link count is '1' and it is not a softlink, you can remove it.
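(A slightly more careful version of that check before deleting anything; the path is the gfid entry from the ll output above:)

f=/data/brick1/gv0/.glusterfs/08/4c/084c9b92-e32c-4b83-8302-8d788e463d5e

# Remove only if it is not a symlink and nothing else hard-links to it.
if [ ! -L "$f" ] && [ "$(stat -c %h "$f")" -eq 1 ]; then
    rm -v "$f"
else
    echo "keeping $f (symlink or link count > 1)"
fi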
What do you think? Could I/O have been in progress when you did this step?
Yep, and after removing it the number has now reduced to 0.
Hey @pranithk, what do you mean by I/O in progress?
I can see the load has reduced a little as well (from 15 to around 10), though that could be because few things are running on my env right now.
Oh, it has become high again: one of my bricks is at a CPU load of 13 now, another is at 2,
and my monitoring tool tells me the server I/O wait is flapping.
You said you did an upgrade on A and that led to these heal-info entries, right? For that, you said you did an unmount. I was wondering: while doing the unmount and killing the bricks, was there any I/O from the clients, or was all I/O stopped?
It seems that the sum of the CPU load on both bricks is always about the same value (around 16), but I don't want the stress concentrated on one of the bricks. Is there any way to spread it?
Description of problem: I ran gluster volume heal info and it showed a long list of gfids, such as:

Brick A:/data/brick1/gv0
Status: Connected
Number of entries: 0

Brick B:/data/brick1/gv0