gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

[BUG] Geo-rep transfers large files multiple times while using Tar+ssh #4321

Open harikrishnannm opened 1 month ago

harikrishnannm commented 1 month ago

Description of problem:

I am using GlusterFS version 6.6 with geo-replication enabled. Geo-rep uses tar+ssh to transfer files from the primary site to the secondary site.

When a large file is uploaded to the primary site, it is sent multiple times.

For example, when a 92 GB file was uploaded, the sync took more than 20 hours to complete. On investigation, we found that the same file was recorded in multiple changelog files, 127 changelogs to be exact, which caused the file to be sent 127 times to the secondary site, inflating the data transferred to roughly 11 TB (92 GB x 127 = 11684 GB ~ 11 TB).
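
A back-of-the-envelope check of the write amplification described above:

```python
# Write amplification from syncing the same file once per changelog:
# a 92 GB file recorded in 127 changelogs is transferred 127 times.
file_size_gb = 92
changelog_count = 127
total_gb = file_size_gb * changelog_count
print(total_gb)          # 11684 GB
print(total_gb / 1024)   # ~11.4 TB
```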

What if we make a change in geo-replication/syncdaemon/master.py along these lines:

diff --git a/geo-replication/syncdaemon/master.py b/geo-replication/syncdaemon/master.py
index f02cdb4c7f..430f8fff29 100644
--- a/geo-replication/syncdaemon/master.py
+++ b/geo-replication/syncdaemon/master.py
@@ -1281,6 +1281,7 @@ class GMasterChangelogMixin(GMasterCommon):
             self.batch_stats["DATA_START_TIME"] = time.time()

         # sync data
+        datas = datas - self.datas_in_batch
         if datas:
             self.a_syncdata(datas)
             self.datas_in_batch.update(datas)
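
A minimal sketch of what the one-line patch does within a batch: entries already queued in `datas_in_batch` are subtracted before scheduling sync, so a file recorded in several changelogs of the same batch is only sent once (`sync_batch` and the `gfid-*` names are illustrative, not the real geo-rep API):

```python
# Sketch of the within-batch dedup the proposed patch adds.
# sync_batch stands in for one changelog's processing; the real code
# calls a_syncdata(datas) instead of extending a list.
def sync_batch(changelog_entries, datas_in_batch, synced):
    """changelog_entries: GFIDs recorded in one changelog of the batch."""
    datas = set(changelog_entries)
    datas = datas - datas_in_batch      # the one-line fix from the diff
    if datas:
        synced.extend(sorted(datas))    # stand-in for a_syncdata(datas)
        datas_in_batch.update(datas)

synced = []
seen = set()
# the same GFID shows up in three changelogs of one batch
sync_batch(["gfid-1", "gfid-2"], seen, synced)
sync_batch(["gfid-1"], seen, synced)
sync_batch(["gfid-1", "gfid-3"], seen, synced)
print(synced)  # ['gfid-1', 'gfid-2', 'gfid-3'] -- each synced once
```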

The exact command to reproduce the issue:

Create a file on the primary site using the command below:

dd if=/dev/zero of=/path/to/file bs=32k count=2949120

The file takes some time to be created, so its writes get recorded across multiple changelogs.

Additional info:

Commands used in geo-replication (snippet from the geo-replication/syncdaemon/resource.py file):

At the primary side:

tar_cmd = ["tar"] + \
-    ["--sparse", "-cf", "-", "--files-from", "-"]
+    ["--record-size=128K", "-cf", "-", "--files-from", "-"]

At the secondary side:

ssh_cmd = gconf.get("ssh-command").split() + \
    gconf.get("ssh-options-tar").split() + \
    ["-p", str(gconf.get("ssh-port"))] + \
    [host, "tar"] + \
-    ["--overwrite", "-xf", "-", "-C", rdir]
+    ["--overwrite", "--record-size=128K", "-xf", "-", "-C", rdir]
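
For context on why the record size matters: tar records are multiples of 512-byte blocks, and GNU tar's default record size is 20 blocks (10240 bytes). Raising it to 128K means each read/write across the ssh pipe moves far more data per call. A quick arithmetic sketch:

```python
# tar I/O is done in records, each a multiple of 512-byte blocks.
BLOCK = 512
default_record = 20 * BLOCK           # 10240 bytes, GNU tar's default
tuned_record = 128 * 1024             # 131072 bytes via --record-size=128K
print(tuned_record // BLOCK)          # 256 blocks per record
print(tuned_record / default_record)  # 12.8x larger records per syscall
```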

Other info:

Volume Name: xxxx
Type: Distributed-Replicate
Volume ID: xxxxx
Status: Started
Snapshot Count: 0
Number of Bricks: 30 x 3 = 90
Transport-type: tcp

For geo-replication we are using a cert-based unprivileged user (geo-rep-user).

sync_method:tarssh

- The operating system / glusterfs version:

OS: Ubuntu Focal
GlusterFS version: 6.6

pranithk commented 1 month ago

@aravindavk Do you see any issue in the fix suggested? We are also going to upstream the changes he mentioned for tar. We achieved a 15X speed-up using --record-size=128K.

I also verified that the issue is present in the latest release.

aravindavk commented 1 month ago

Removing duplicate data entries looks good. But it will not solve the duplication issue completely. Geo-rep processes the changelog files in a loop. If the changes are recorded in different batches, there is a chance that the same file gets picked up again in the next batch's processing. I still think it is a good patch since it reduces the chances.
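
A small sketch of why the dedup is only partial, under the assumption stated above that the dedup set is reset for each batch (`process_batch` and the `gfid-big` name are illustrative, not the real geo-rep code):

```python
# The dedup set (datas_in_batch) lives only for one batch, so a GFID
# recorded in changelogs that land in two different batches is still
# synced once per batch.
def process_batch(changelogs):
    datas_in_batch = set()          # reset at the start of every batch
    synced = []
    for entries in changelogs:
        datas = set(entries) - datas_in_batch
        synced.extend(sorted(datas))
        datas_in_batch.update(datas)
    return synced

# the same large file ("gfid-big") spans changelogs in two batches
batch1 = process_batch([["gfid-big"], ["gfid-big"]])   # deduped here...
batch2 = process_batch([["gfid-big"]])                 # ...but not here
print(batch1 + batch2)  # ['gfid-big', 'gfid-big'] -- still sent twice
```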

--record-size=128K looks good.

The --sparse option is required to handle sparse files separately. If it is removed only for testing with the file created using dd, that is fine (the patch mentioned in the issue removed this option).
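
To illustrate the distinction behind --sparse: a file created by truncating (seeking) has holes that tar --sparse can skip, whereas dd from /dev/zero writes real zero-filled blocks. A minimal sketch; whether holes are actually allocated lazily depends on the filesystem, so only the apparent size is relied on:

```python
import os
import tempfile

# Create a 1 MiB file without writing any data: apparent size is 1 MiB,
# but on filesystems with hole support far fewer blocks are allocated.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
os.truncate(path, 1024 * 1024)   # sparse on most filesystems
st = os.stat(path)
print(st.st_size)                # 1048576 -- apparent size
print(st.st_blocks * 512)        # allocated bytes, typically much less
os.unlink(path)
```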

pranithk commented 1 month ago

Removing duplicate data entries looks good. But it will not solve the duplication issue completely. Geo-rep processes the changelog files in a loop. If the changes are recorded in different batches, there is a chance that the same file gets picked up again in the next batch's processing. I still think it is a good patch since it reduces the chances.

Correct. This is only an optimisation.

--record-size=128K looks good.

We will send a PR for this as well.

The --sparse option is required to handle sparse files separately. If it is removed only for testing with the file created using dd, that is fine (the patch mentioned in the issue removed this option).

Our application doesn't produce sparse files. The complete fix requires an upgrade, so we went with this approach until we can upgrade.