trapexit / mergerfs-tools

Optional tools to help manage data in a mergerfs pool
ISC License

Optimization of mergerfs.dup for already duplicated datasets #140

Open exp625 opened 1 year ago

exp625 commented 1 year ago

I have observed that the mergerfs.dup command takes a significant amount of time to execute on a dataset that has already been duplicated. Currently, I have the following setup:

/mnt/disk1:/mnt/disk2 /mnt/pool

Within the /mnt/pool directory, there is a folder called /mnt/pool/data containing approximately 273GB of data with 147,148 files. My objective is to maintain duplicate copies of this folder on both drives. To achieve this, I am using the command /usr/local/bin/mergerfs.dup -d newest -c 2 -e /mnt/pool/data.

The execution of this command takes approximately 45 minutes even when no actual copying is required, because the script performs an rsync overwrite for each file regardless of whether the copies already match.
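To make the cost concrete: 45 minutes over 147,148 files works out to roughly 18 ms per file, which is plausible pure process-spawn overhead. Conceptually the script does something like the following for every file (a sketch; build_copy_file and execute_cmd exist in mergerfs.dup, but the rsync flags shown here are illustrative, not the upstream set):

    import os
    import subprocess

    # Sketch only: the exact rsync invocation in mergerfs.dup may differ.
    def build_copy_file(srcpath, tgtpath, relpath):
        # 'src/./rel' combined with --relative makes rsync recreate
        # the 'rel' subtree under tgtpath
        return ['rsync', '-a', '--relative',
                os.path.join(srcpath, '.', relpath),
                tgtpath + '/']

    def execute_cmd(args):
        # One rsync process is spawned per file, even when rsync
        # ultimately transfers nothing
        return subprocess.call(args)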

To optimize the performance, I propose modifying the *_dupfun functions to also return whether an overwrite is necessary:

  def newest_dupfun(default_basepath,relpath,basepaths):
      sts = dict([(f,os.lstat(os.path.join(f,relpath))) for f in basepaths])

      # mtimes differ: the newest copy wins and must be re-synced
      mtime = sts[basepaths[0]].st_mtime
      if not all([st.st_mtime == mtime for st in sts.values()]):
          return sorted(sts,key=lambda x: sts.get(x).st_mtime,reverse=True)[0], True

      # ctimes differ: pick the newest as source, but skip the overwrite
      ctime = sts[basepaths[0]].st_ctime
      if not all([st.st_ctime == ctime for st in sts.values()]):
          return sorted(sts,key=lambda x: sts.get(x).st_ctime,reverse=True)[0], False

      # all copies agree: nothing to do
      return default_basepath, False
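The ctime branch presumably returns False because rsync cannot set a target's ctime: an overwrite would never bring the ctimes into agreement, so re-copying on every run gains nothing.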

The call sites of the *_dupfun functions are then modified to match:

srcpath, overwrite = dupfun(basepath,relpath,existing)

Then, a simple check can be added to determine whether an overwrite is necessary before executing the rsync command:

  for tgtpath in existing:
      if prune and i >= count:
          break
      copies.append(tgtpath)
      if overwrite:
          args = build_copy_file(srcpath,tgtpath,relpath)
          print('# overwrite')
          print_args(args)
          if execute:
              execute_cmd(args)
      i += 1

These changes significantly improve performance, reducing the script's execution time from roughly 45 minutes to just 1 minute. Furthermore, the output log now only shows actual changes made to the file system. The rsync overwrites never actually changed anything, since the files were already duplicated.
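A quick way to sanity-check the short-circuit is a minimal self-contained harness (hypothetical, not part of mergerfs.dup) that inlines the proposed newest_dupfun and feeds it two identical copies:

    import os
    import tempfile

    # Inlined copy of the proposed function, for demonstration only
    def newest_dupfun(default_basepath, relpath, basepaths):
        sts = dict([(f, os.lstat(os.path.join(f, relpath))) for f in basepaths])
        mtime = sts[basepaths[0]].st_mtime
        if not all([st.st_mtime == mtime for st in sts.values()]):
            return sorted(sts, key=lambda x: sts.get(x).st_mtime, reverse=True)[0], True
        ctime = sts[basepaths[0]].st_ctime
        if not all([st.st_ctime == ctime for st in sts.values()]):
            return sorted(sts, key=lambda x: sts.get(x).st_ctime, reverse=True)[0], False
        return default_basepath, False

    branches = [tempfile.mkdtemp(), tempfile.mkdtemp()]
    for b in branches:
        with open(os.path.join(b, 'file.bin'), 'wb') as f:
            f.write(b'payload')
        # Equalize mtimes, as rsync -a leaves them after a successful dup run
        os.utime(os.path.join(b, 'file.bin'), (1000000000, 1000000000))

    src, overwrite = newest_dupfun(branches[0], 'file.bin', branches)
    print(src, overwrite)  # overwrite is False; no rsync would be spawned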

This change should improve the performance of all the *_dupfun functions except for mergerfs_dupfun, where

  def mergerfs_dupfun(default_basepath,relpath,basepaths):
      return default_basepath, True

would trigger an overwrite every time, because no other check is possible.
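For the other strategies the pattern carries over directly. For example, a size-based dup function would only request an overwrite when the sizes actually differ (a sketch: upstream mergerfs.dup structures its strategies similarly, but the two-value return is this proposal's addition):

    def largest_dupfun(default_basepath,relpath,basepaths):
        sts = dict([(f,os.lstat(os.path.join(f,relpath))) for f in basepaths])

        # sizes differ: the largest copy wins and must be re-synced
        size = sts[basepaths[0]].st_size
        if not all([st.st_size == size for st in sts.values()]):
            return sorted(sts,key=lambda x: sts.get(x).st_size,reverse=True)[0], True

        # all copies are the same size: nothing to do
        return default_basepath, False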

Please let me know if you can spot any issues. If you'd like, I can create a pull request with these changes for you to review.

sjtuross commented 9 months ago

I found this issue while looking for a way to disable overwrites.
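Building on the overwrite flag proposed above, such a switch could be a thin gate in front of the copy (hypothetical sketch; mergerfs.dup has no such option today, and the flag name is made up):

    import argparse

    # Hypothetical: '--no-overwrite' is not an existing mergerfs.dup option
    parser = argparse.ArgumentParser()
    parser.add_argument('--no-overwrite', action='store_true',
                        help='never rsync over an existing copy')
    opts, _ = parser.parse_known_args()

    # ...inside the copy loop shown earlier...
    # if overwrite and not opts.no_overwrite:
    #     args = build_copy_file(srcpath, tgtpath, relpath)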