gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

File changes not updated to final volume in cascading geo-replication setup #4200

Open ccy opened 1 year ago

ccy commented 1 year ago

Description of problem:

I have three Gluster zones and created a volume named xfs in each zone with the following layout:

Zone A: 2 nodes, replica 2
Zone B: 2 nodes, replica 2
Zone C: 1 node, distribute

I then set up cascading geo-replication from Zone A -> Zone B -> Zone C.
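For completeness, the volumes and the cascade were created roughly as follows (reconstructed from memory, so treat it as a sketch; hostnames and brick paths match the volume info further below):

# Zone A volume (Zone B is analogous; Zone C is a single-brick distribute volume)
gluster volume create xfs replica 2 gfs-a1:/mnt/brick/abc gfs-a2:/mnt/brick/abc
gluster volume start xfs

# Cascade: Zone A -> Zone B (run on a Zone A node), then Zone B -> Zone C (run on a Zone B node)
gluster volume geo-replication xfs gfs-b1::xfs create push-pem
gluster volume geo-replication xfs gfs-b1::xfs start
gluster volume geo-replication xfs gfs-c1::xfs create push-pem
gluster volume geo-replication xfs gfs-c1::xfs start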

I mount the xfs volume in Zone A and copy a 5.5 GB Windows 22H2 ISO file (sha1sum: a3f1a71bb41517e2ef07f09093bbcdca4f5c313c) to the mount point. After a while the geo-replication sessions sync the file from Zone A to Zone B, and then from Zone B to Zone C.

Once geo-replication has caught up, I run sha1sum against the file in all three zones' volumes and get the same hash. So far so good.
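Roughly, the test looked like this (mount points and the ISO file name are illustrative):

# mount the Zone A volume on a client and copy the ISO in
mount -t glusterfs gfs-a1:/xfs /mnt/xfs
cp Win11_22H2.iso /mnt/xfs/
# once both geo-replication hops have caught up, compare checksums in each zone
sha1sum /mnt/xfs/Win11_22H2.iso        # Zone A
sha1sum /mnt/xfs-b/Win11_22H2.iso      # Zone B volume, mounted the same way from gfs-b1
sha1sum /mnt/xfs-c/Win11_22H2.iso      # Zone C volume, mounted the same way from gfs-c1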

Next, I modify the last byte of the file with hexedit. Geo-replication from Zone A to Zone B always succeeds, but Zone B to Zone C mostly fails to apply the change. I waited overnight hoping it would still propagate to Zone C, but it did not; I verified the file contents in all three zones with sha1sum.

If I update the 1st or 2nd byte of the file instead, geo-replication works and the change propagates through zones A, B and C.

This log message keeps showing up in Zone B every minute (return_code=23 is rsync's "partial transfer due to error" exit status):

==> /var/log/glusterfs/geo-replication/xfs_gfs-c1_xfs/gsyncd.log <==
[2023-07-13 06:29:42.456403] I [primary(worker /mnt/brick/abc):2009:syncjob] Syncer: Sync Time Taken [{job=3}, {num_files=1}, {return_code=23}, {duration=0.1733}]
[2023-07-13 06:29:42.456600] W [primary(worker /mnt/brick/abc):1410:process] <top>: incomplete sync, retrying changelogs [{files=['CHANGELOG.1689229770']}]
[2023-07-13 06:29:43.466236] I [primary(worker /mnt/brick/abc):2009:syncjob] Syncer: Sync Time Taken [{job=1}, {num_files=1}, {return_code=23}, {duration=0.1731}]
[2023-07-13 06:29:43.466490] E [primary(worker /mnt/brick/abc):1388:process] _GPrimary: changelogs could not be processed completely - moving on... [{files=['CHANGELOG.1689229770']}]
[2023-07-13 06:29:43.478993] I [primary(worker /mnt/brick/abc):1429:process] _GPrimary: Entry Time Taken [{UNL=0}, {RMD=0}, {CRE=0}, {MKN=0}, {MKD=0}, {REN=0}, {LIN=0}, {SYM=0}, {duration=0.0000}]
[2023-07-13 06:29:43.479052] I [primary(worker /mnt/brick/abc):1441:process] _GPrimary: Data/Metadata Time Taken [{SETA=0}, {meta_duration=0.0000}, {SETX=0}, {XATT=0}, {DATA=0}, {data_duration=1689229783.4790}]
[2023-07-13 06:29:43.479136] I [primary(worker /mnt/brick/abc):1451:process] _GPrimary: Batch Completed [{mode=live_changelog}, {duration=9.9611}, {changelog_start=1689229770}, {changelog_end=1689229770}, {num_changelogs=1}, {stime=(1689229769, 0)}, {entry_stime=(1689212054, 0)}]

and the directory /mnt/brick/abc/.glusterfs/changelogs/2023/07/13 keeps gaining new CHANGELOG files.
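The changelogs named in the warnings can be seen piling up on the Zone B brick itself:

# on a Zone B brick node; the date path matches the log timestamps above
ls -lt /mnt/brick/abc/.glusterfs/changelogs/2023/07/13 | head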

The exact command to reproduce the issue:

Use hexedit to modify the last byte of the file.
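An equivalent non-interactive way to change only the last byte (the mount point and file name are just placeholders):

ISO=/mnt/xfs/Win11_22H2.iso
# overwrite the final byte in place: the size stays the same, only content and mtime change
printf '\xff' | dd of="$ISO" bs=1 seek=$(( $(stat -c%s "$ISO") - 1 )) count=1 conv=notrunc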

The full output of the command that failed:

Expected results: Cascading geo-replication should propagate even a small change in the source volume all the way to the final volume.

Mandatory info: - The output of the gluster volume info command:

# Zone A
Volume Name: xfs
Type: Replicate
Volume ID: 674c77bc-7332-4067-982c-6c3642fe0be5
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gfs-a1:/mnt/brick/abc
Brick2: gfs-a2:/mnt/brick/abc
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

# Zone B
Volume Name: xfs
Type: Replicate
Volume ID: 5e3737f1-00bf-4328-afc5-584fe0aedc05
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gfs-b1:/mnt/brick/abc
Brick2: gfs-b2:/mnt/brick/abc
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
features.read-only: on
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
cluster.granular-entry-heal: on

# Zone C
Volume Name: xfs
Type: Distribute
Volume ID: aa6f13bd-b067-4d18-b726-ea02435439f1
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gfs-c1:/mnt/brick/abc
Options Reconfigured:
cluster.self-heal-daemon: enable
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
storage.fips-mode-rchecksum: on

- The output of the gluster volume status command:

# Zone A
Status of volume: xfs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gfs-a1:/mnt/brick/abc                 59424     0          Y       2863494
Brick gfs-a2:/mnt/brick/abc                 53895     0          Y       2433056
Self-heal Daemon on localhost               N/A       N/A        Y       2843715
Self-heal Daemon on gfs-a2                  N/A       N/A        Y       481613
Self-heal Daemon on gfs-a3                  N/A       N/A        Y       1635

Task Status of Volume xfs
------------------------------------------------------------------------------
There are no active volume tasks

# Zone B
Status of volume: xfs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gfs-b1:/mnt/brick/abc                 58327     0          Y       3316172
Brick gfs-b2:/mnt/brick/abc                 50112     0          Y       3817308
Self-heal Daemon on localhost               N/A       N/A        Y       3820796
Self-heal Daemon on gfs-b1                  N/A       N/A        Y       3318985
Self-heal Daemon on gfs-b3                  N/A       N/A        Y       51016

Task Status of Volume xfs
------------------------------------------------------------------------------
There are no active volume tasks

# Zone C
Status of volume: xfs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gfs-c1:/mnt/brick/abc                 52939     0          Y       800521

Task Status of Volume xfs
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal command:

# Zone A
Brick gfs-a1:/mnt/brick/abc
Status: Connected
Number of entries: 0

Brick gfs-a2:/mnt/brick/abc
Status: Connected
Number of entries: 0

# Zone B
Brick gfs-b1:/mnt/brick/abc
Status: Connected
Number of entries: 0

Brick gfs-b2:/mnt/brick/abc
Status: Connected
Number of entries: 0

# Zone C (distributed volume)
This command is supported for only volumes of replicate/disperse type. Volume xfs is not of type replicate/disperse

- Provide logs present on the following locations of client and server nodes:

==> /var/log/glusterfs/geo-replication/xfs_gfs-c1_xfs/gsyncd.log <==
[2023-07-13 04:27:19.947517] I [primary(worker /mnt/brick/abc):1524:crawl] _GPrimary: secondary's time [{stime=(1689222374, 0)}]

==> /var/log/glusterfs/geo-replication/xfs_gfs-c1_xfs/mnt-mnt-brick-abc.log <==
[2023-07-13 04:27:20.201782 +0000] I [MSGID: 108031] [afr-common.c:3201:afr_local_discovery_cbk] 0-xfs-replicate-0: selecting local read_child xfs-client-0

==> /var/log/glusterfs/geo-replication/xfs_gfs-c1_xfs/gsyncd.log <==
[2023-07-13 04:27:20.327038] I [primary(worker /mnt/brick/abc):2009:syncjob] Syncer: Sync Time Taken [{job=3}, {num_files=1}, {return_code=23}, {duration=0.1541}]
[2023-07-13 04:27:20.327319] W [primary(worker /mnt/brick/abc):1410:process] <top>: incomplete sync, retrying changelogs [{files=['CHANGELOG.1689222435']}]
[2023-07-13 04:27:21.406110] I [primary(worker /mnt/brick/abc):2009:syncjob] Syncer: Sync Time Taken [{job=1}, {num_files=1}, {return_code=23}, {duration=0.1736}]
[2023-07-13 04:27:21.406388] W [primary(worker /mnt/brick/abc):1410:process] <top>: incomplete sync, retrying changelogs [{files=['CHANGELOG.1689222435']}]
[2023-07-13 04:27:22.465287] I [primary(worker /mnt/brick/abc):2009:syncjob] Syncer: Sync Time Taken [{job=2}, {num_files=1}, {return_code=23}, {duration=0.1745}]
[2023-07-13 04:27:22.465494] W [primary(worker /mnt/brick/abc):1410:process] <top>: incomplete sync, retrying changelogs [{files=['CHANGELOG.1689222435']}]
[2023-07-13 04:27:23.503807] I [primary(worker /mnt/brick/abc):2009:syncjob] Syncer: Sync Time Taken [{job=3}, {num_files=1}, {return_code=23}, {duration=0.1731}]
[2023-07-13 04:27:23.504027] W [primary(worker /mnt/brick/abc):1410:process] <top>: incomplete sync, retrying changelogs [{files=['CHANGELOG.1689222435']}]
[2023-07-13 04:27:24.584944] I [primary(worker /mnt/brick/abc):2009:syncjob] Syncer: Sync Time Taken [{job=1}, {num_files=1}, {return_code=23}, {duration=0.1752}]
[2023-07-13 04:27:24.585249] W [primary(worker /mnt/brick/abc):1410:process] <top>: incomplete sync, retrying changelogs [{files=['CHANGELOG.1689222435']}]
[2023-07-13 04:27:25.643880] I [primary(worker /mnt/brick/abc):2009:syncjob] Syncer: Sync Time Taken [{job=2}, {num_files=1}, {return_code=23}, {duration=0.1752}]
[2023-07-13 04:27:25.644174] W [primary(worker /mnt/brick/abc):1410:process] <top>: incomplete sync, retrying changelogs [{files=['CHANGELOG.1689222435']}]
[2023-07-13 04:27:26.680072] I [primary(worker /mnt/brick/abc):2009:syncjob] Syncer: Sync Time Taken [{job=3}, {num_files=1}, {return_code=23}, {duration=0.1727}]
[2023-07-13 04:27:26.680316] W [primary(worker /mnt/brick/abc):1410:process] <top>: incomplete sync, retrying changelogs [{files=['CHANGELOG.1689222435']}]
[2023-07-13 04:27:27.762776] I [primary(worker /mnt/brick/abc):2009:syncjob] Syncer: Sync Time Taken [{job=1}, {num_files=1}, {return_code=23}, {duration=0.1741}]
[2023-07-13 04:27:27.762966] W [primary(worker /mnt/brick/abc):1410:process] <top>: incomplete sync, retrying changelogs [{files=['CHANGELOG.1689222435']}]
[2023-07-13 04:27:28.820977] I [primary(worker /mnt/brick/abc):2009:syncjob] Syncer: Sync Time Taken [{job=2}, {num_files=1}, {return_code=23}, {duration=0.1735}]
[2023-07-13 04:27:28.821195] W [primary(worker /mnt/brick/abc):1410:process] <top>: incomplete sync, retrying changelogs [{files=['CHANGELOG.1689222435']}]
[2023-07-13 04:27:29.836832] I [primary(worker /mnt/brick/abc):2009:syncjob] Syncer: Sync Time Taken [{job=3}, {num_files=1}, {return_code=23}, {duration=0.1531}]
[2023-07-13 04:27:29.837025] E [primary(worker /mnt/brick/abc):1388:process] _GPrimary: changelogs could not be processed completely - moving on... [{files=['CHANGELOG.1689222435']}]
[2023-07-13 04:27:29.864885] I [primary(worker /mnt/brick/abc):1429:process] _GPrimary: Entry Time Taken [{UNL=0}, {RMD=0}, {CRE=0}, {MKN=0}, {MKD=0}, {REN=0}, {LIN=0}, {SYM=0}, {duration=0.0000}]
[2023-07-13 04:27:29.864932] I [primary(worker /mnt/brick/abc):1441:process] _GPrimary: Data/Metadata Time Taken [{SETA=0}, {meta_duration=0.0000}, {SETX=0}, {XATT=0}, {DATA=0}, {data_duration=1689222449.8649}]
[2023-07-13 04:27:29.865018] I [primary(worker /mnt/brick/abc):1451:process] _GPrimary: Batch Completed [{mode=live_changelog}, {duration=9.9060}, {changelog_start=1689222435}, {changelog_end=1689222435}, {num_changelogs=1}, {stime=(1689222434, 0)}, {entry_stime=(1689212054, 0)}]
[2023-07-13 04:27:34.884063] I [primary(worker /mnt/brick/abc):1524:crawl] _GPrimary: secondary's time [{stime=(1689222434, 0)}]

- Is there any crash? Provide the backtrace and coredump: No crash happened.

Additional info:

- The operating system / glusterfs version:

Rocky Linux 9.2 (kernel 5.14.0-284.18.1.el9_2.x86_64)
glusterfs 10.4
ccy commented 1 year ago

I found a workaround. I suspect this is a bug that affects the intermediate zone when it performs the rsync step in a cascading geo-replication setup.

When I update the last byte of the file in Zone A and the geo-replication task starts:

# Show .processing directory
$ ls -l /var/lib/misc/gluster/gsyncd/..././.processing
total 4
-rw-r--r-- 1 root root 39 Jul 27 21:42 CHANGELOG.1690465326

# Show content of CHANGELOG file
$ cat CHANGELOG.1690465326
D c8b7999b-78ef-4894-b35a-4b413070194c

This is the corresponding log output, showing rsync transferring 1 file (see "Number of regular files transferred"):

[2023-07-27 13:42:10.323244] D [primary(worker /mnt/brick/b):317:a_syncdata] _GPrimary: files [{files={'.gfid/c8b7999b-78ef-4894-b35a-4b413070194c'}}]
[2023-07-27 13:42:10.323301] D [primary(worker /mnt/brick/b):320:a_syncdata] _GPrimary: candidate for syncing [{file=.gfid/c8b7999b-78ef-4894-b35a-4b413070194c}]
[2023-07-27 13:42:10.353568] D [resource(worker /mnt/brick/b):1442:rsync] SSH: files: .gfid/c8b7999b-78ef-4894-b35a-4b413070194c
[2023-07-27 13:42:39.128166] I [resource(worker /mnt/brick/b):1530:rsync] SSH: rsync verbose [{
  data='building file list ... done
  .gfid/c8b7999b-78ef-4894-b35a-4b413070194c

  Number of files: 2 (reg: 1, dir: 1)
  Number of created files: 0
  Number of deleted files: 0
  Number of regular files transferred: 1
  Total file size: 5,842,186,240 bytes
  Total transferred file size: 5,842,186,240 bytes
  Literal data: 29,888 bytes
  Matched data: 5,842,156,352 bytes
  File list size: 0
  File list generation time: 0.001 seconds
  File list transfer time: 0.000 seconds
  Total bytes sent: 335,908
  Total bytes received: 611,565

  sent 335,908 bytes  received 611,565 bytes  32,117.73 bytes/sec
  total size is 5,842,186,240  speedup is 6,166.07
  '}]
[2023-07-27 13:42:39.128545] D [primary(worker /mnt/brick/b):326:regjob] _GPrimary: synced [{file=.gfid/c8b7999b-78ef-4894-b35a-4b413070194c}]

Next, on the intermediate node in Zone B:

# Show .processing directory
$ ls -l /var/lib/misc/gluster/gsyncd/..././.processing
total 4
-rw-r--r-- 1 root root 48 Jul 27 21:50 CHANGELOG.1690465836

# Show content of CHANGELOG file
$ cat CHANGELOG.1690465836
D c8b7999b-78ef-4894-b35a-4b413070194c
M c8b7999b-78ef-4894-b35a-4b413070194c SETATTR
M 00000000-0000-0000-0000-000000000001 SETXATTR

This is the corresponding log output on Zone B, showing rsync transferring nothing ("Number of regular files transferred: 0"), yet the file's timestamp on Zone C does get updated:

[{batch=['/var/lib/misc/gluster/gsyncd/b_gfs-c1_b/mnt-brick-b/.processing/CHANGELOG.1690465836']}]
[2023-07-27 13:50:38.544077] D [primary(worker /mnt/brick/b):1343:process] _GPrimary: processing change [{changelog=/var/lib/misc/gluster/gsyncd/b_gfs-c1_b/mnt-brick-b/.processing/CHANGELOG.1690465836}]
[2023-07-27 13:50:38.544209] D [primary(worker /mnt/brick/b):1207:process_change] _GPrimary: entries: []

[2023-07-27 13:50:38.545924] D [repce(worker /mnt/brick/b):195:push] RepceClient: call 2018616:139904444790592:1690465838.5458996 meta_ops([{'op': 'META', 'skip_entry': False, 'go': '.gfid/c8b7999b-78ef-4894-b35a-4b413070194c', 'stat': {'uid': 0, 'gid': 0, 'mode': 33188, 'atime': 1690355048.1178703, 'mtime': 1690465789.6717978}}],) ...

[2023-07-27 13:50:38.547595] D [repce(worker /mnt/brick/b):215:__call__] RepceClient: call 2018616:139904444790592:1690465838.5458996 meta_ops -> []
[2023-07-27 13:50:38.548894] D [primary(worker /mnt/brick/b):317:a_syncdata] _GPrimary: files [{files={'.gfid/c8b7999b-78ef-4894-b35a-4b413070194c'}}]
[2023-07-27 13:50:38.548938] D [primary(worker /mnt/brick/b):320:a_syncdata] _GPrimary: candidate for syncing [{file=.gfid/c8b7999b-78ef-4894-b35a-4b413070194c}]
[2023-07-27 13:50:38.842242] D [resource(worker /mnt/brick/b):1442:rsync] SSH: files: .gfid/c8b7999b-78ef-4894-b35a-4b413070194c
[2023-07-27 13:50:38.910505] I [resource(worker /mnt/brick/b):1530:rsync] SSH: rsync verbose [{
  data='building file list ... done

  Number of files: 2 (reg: 1, dir: 1)
  Number of created files: 0
  Number of deleted files: 0
  Number of regular files transferred: 0
  Total file size: 5,842,186,240 bytes
  Total transferred file size: 0 bytes
  Literal data: 0 bytes
  Matched data: 0 bytes
  File list size: 0
  File list generation time: 0.001 seconds
  File list transfer time: 0.000 seconds
  Total bytes sent: 216
  Total bytes received: 17

  sent 216 bytes  received 17 bytes  466.00 bytes/sec
  total size is 5,842,186,240  speedup is 25,073,760.69
'}]
[2023-07-27 13:50:38.910670] I [primary(worker /mnt/brick/b):2009:syncjob] Syncer: Sync Time Taken [{job=2}, {num_files=1}, {return_code=0}, {duration=0.0684}]

In summary:

# Zone A CHANGELOG
D c8b7999b-78ef-4894-b35a-4b413070194c

# Zone B CHANGELOG
D c8b7999b-78ef-4894-b35a-4b413070194c
M c8b7999b-78ef-4894-b35a-4b413070194c SETATTR
M 00000000-0000-0000-0000-000000000001 SETXATTR

On Zone B, both the file's timestamp and its content are updated. On Zone C, the timestamp is updated but the content is not.

This is related to a weakness in rsync's change detection: Zone C's file timestamp has already been updated because of this changelog record:

M c8b7999b-78ef-4894-b35a-4b413070194c SETATTR

before the rsync operation runs:

rsync -v -aR0 --inplace --files-from=- --super --stats --numeric-ids --no-implied-dirs --xattrs --acls --existing ./ gfs-c1:/mnt/brick/b
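Without -I (or -c), rsync's quick check compares only size and mtime, and because the SETATTR above has already stamped the destination file with the new mtime, the B -> C rsync decides there is nothing to transfer. A minimal local reproduction of that skip (temporary paths, purely illustrative):

# two identical files, then change the source content but keep the mtimes equal
dd if=/dev/zero of=/tmp/src bs=1M count=10 status=none
cp -p /tmp/src /tmp/dst
printf 'X' | dd of=/tmp/src bs=1 seek=0 count=1 conv=notrunc status=none
touch -r /tmp/dst /tmp/src
rsync -a --inplace /tmp/src /tmp/dst    # quick check sees equal size+mtime -> file skipped
cmp -s /tmp/src /tmp/dst || echo "still different: rsync skipped the changed file"
rsync -aI --inplace /tmp/src /tmp/dst   # -I / --ignore-times disables the quick check
cmp -s /tmp/src /tmp/dst && echo "in sync after forcing the transfer"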

My workaround is to add rsync's ignore-times flag (-I) to the session's rsync_options in the geo-replication config:

sudo gluster volume geo-replication $VOL $GEO_USER@$H2::$VOL config rsync_options '-I'
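If the config getter mirrors the setter syntax (as I believe it does), the effective value can be checked with:

sudo gluster volume geo-replication $VOL $GEO_USER@$H2::$VOL config rsync_options

The trade-off is that with -I rsync re-examines every file named in a changelog instead of skipping apparently unchanged ones, which costs extra read I/O on both ends, but the delta-transfer algorithm still keeps the amount of data actually sent small.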