jbrodriguez / unbalance

Go/React/Tailwind app to move folders/files between Unraid disks
MIT License

Moving breaks hardlinks #104

Open sanderai opened 2 months ago

sanderai commented 2 months ago

On disk1 I have two folders filled with files; one is a hardlink of the other and takes up no extra space. If I use the plugin to move both folders to another drive (I selected them both when doing the move action), they take up twice as much space, because the hardlinks break and each folder then occupies actual disk space. After running jdupes in a separate console on those folders, disk usage is back to where it was, taking up half the amount, but this is a lengthy process because jdupes has to calculate hashes for all the files and has no prior knowledge of the move.

Unraid's own mover also takes up double the space during the move, but it reinstates the hardlinks after it finishes (without recalculating hashes, afaik), so no space is lost at the end of a move.

Is there a way to make this plugin also respect hardlinks?
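
(For reference, a quick way to confirm whether two paths still share storage is to compare their inode numbers; the file names below are placeholders:)

stat -c '%i %h %n' /mnt/disk1/folder1/file.mkv /mnt/disk1/folder2/file.mkv
# same inode and a link count of 2 -> still hardlinked
# different inodes and a link count of 1 -> the link is broken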

jbrodriguez commented 2 months ago

you can add a custom flag for the underlying rsync command, check the settings page

sanderai commented 2 months ago

I did a test run, and I guess it's harder than just adding a flag, because the plugin runs a separate rsync command for each item. When I selected only the two folders, each containing one file hardlinked to the other, the files were copied separately, leaving rsync no way to link them back later.

One option could be to check the files for hardlinks before each run and, if there are any, see whether any of the other files selected for transfer share the same inode (ls -i shows this, for example); if they do, move those files together in the same rsync command (rough sketch below). But then the problem is that they could have different destination folders on the new drive, and I don't know if that can be set up as one rsync command.
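
A minimal sketch of that pre-scan, assuming GNU find and file names without newlines; the disk path is an example:

find /mnt/disk1 -type f -links +1 -printf '%i\t%p\n' | sort -n
# lines sharing the same leading inode number are hardlinks of one another
# and would need to travel in the same rsync invocation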

One more hacky option was described in a level1techs forum post: they do something similar, but first move the hardlinked files into one temp folder, then move that whole folder in a single command with the -H flag, and afterward move the files back to their respective positions on the new drive. Also kind of a hassle, though.

I wonder how the original Unraid mover does this; it seems to copy everything over and then add the hardlinks back as the last step of the move (if you somehow stop the mover before it finishes, the hardlinks stay broken).

So there might not be a simple answer to this; it would require quite a bit of extra work, both when setting up the commands pre-transfer and in post-processing after the transfers.

undaunt commented 3 weeks ago

The issue with adding -H is that, e.g., I have a share 'tv' with subfolders 'media' and 'downloads', and unbalanced runs each subfolder separately. So even though -H is passed through, the hardlinks are killed, because the two names for the same inode are moved in separate rsync executions. I previously mentioned this as an issue on the Unraid forums.

The rsync commands would need to be run at the root of a given share on a given disk to prevent this, instead of the current implementation, which seems to dig down into each subfolder and run its own rsync process.
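
This is easy to reproduce; a minimal demonstration (the temp paths are made up):

mkdir -p /tmp/src/a /tmp/src/b /tmp/dst1 /tmp/dst2
echo data > /tmp/src/a/file
ln /tmp/src/a/file /tmp/src/b/file

rsync -aH /tmp/src/ /tmp/dst1/        # one run from the parent folder
stat -c %h /tmp/dst1/a/file           # prints 2: link preserved

rsync -aH /tmp/src/a /tmp/dst2/       # two runs, one per subfolder
rsync -aH /tmp/src/b /tmp/dst2/
stat -c %h /tmp/dst2/a/file           # prints 1: link broken despite -H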

jbrodriguez commented 3 weeks ago

@undaunt it actually runs from the root of the source folder

i figured out the need for hardlink support once i saw an article (or video?) about optimizing unraid performance, i also remember the *arr apps have that optimization suggestion as well

as @sanderai mentioned though, it does seem like additional logic is needed

don't have the bandwidth to take this on atm, but if someone has ideas or, even better, a PR, i will consider merging

having said that, i don't think this is an easy task

Snake883 commented 2 weeks ago

I have run into this problem just now. I wish I had known about it before I started the move/transfer/scatter, so perhaps I could have avoided the issue.

I am moving data off of a 6TB drive onto a new 12TB drive, and so far I have transferred 8TB from the 6TB drive! And the transfer is not complete yet.

I have three questions now: 1) Is my 12TB drive going to be large enough? 2) How am I going to restore all of my hardlinks and recover the drive space? 3) How do I avoid this situation in the future?

Thank you.

sanderai commented 2 weeks ago
  • Is my 12TB drive going to be large enough?

If each file only existed in two locations, you should be fine. A file can have more than two hardlinks, in which case the duplicated data grows with every extra link, but I doubt you had more than one copy.

  • How am I going to restore all of my hardlinks and recover the drive space?

Install the Nerdtools plugin and, from it, install/activate jdupes. Then, in your console, run something like:

jdupes --recurse --link-hard /mnt/user/media/downloads/ /mnt/user/media/movies/

(or whatever libraries/files you have), or the shorthand:

jdupes -rL /mnt/user/media/downloads/ /mnt/user/media/movies/

This command will take a long time, because it calculates hashes for all the files, but it will find all matching entries and relink them; after it completes, your disk usage should halve again.

  • How do I avoid this situation in the future?

I haven't found a good solution for hardlinked files yet; it's better not to move them manually and to let the Unraid mover handle them (the mover keeps hardlinks).

Snake883 commented 2 weeks ago

Great! Thank you for the path forward!

As a potential "solution": perhaps jdupes could be integrated into Unbalanced. Or perhaps Unbalance could do a hardlink scan, display a warning, and provide additional information/guidance.

Wish the Unraid mover let me choose the drive to move to.

undaunt commented 2 weeks ago

@jbrodriguez I'm not sure what you mean by it running at the root of the source folder. If I have a mount point at, for example, /mnt/disk1/movies, with subfolders /mnt/disk1/movies/downloads and /mnt/disk1/movies/plex, two separate rsync jobs appear to be kicked off, one at the downloads and one at the plex subfolder level. This is what breaks the hardlinks and creates extra data usage on the disk.

To @sanderai's point, yes, jdupes, fdupes, rdfind, czkawka, etc. can be used after the fact, but avoiding the breakage in the first place would speed up processing and avoid temporarily inflated disk usage.

I wound up manually running a custom bash function wrapping rsync to speed this process up from the command line. I called it 'rsafe' for no specific reason. It requires screen to be installed (via Nerdtools, I think, if it's not already included) and takes an input path of "diskx/sharename" and an output path of "disky". It names a screen session after the data being moved, to track things easily later with screen -ls.

E.g., to copy /mnt/disk1/movies to /mnt/disk6, you would run rsafe disk1/movies disk6 or similar. It will ensure a trailing slash exists on the destination, and none on the source, such that things land (in this example) in /mnt/disk6/movies.

rsafe() {
    local source="$1"
    local destination="$2"

    # Store original arguments for messaging
    local original_source="$source"
    local original_destination="$destination"

    # Remove trailing slash from source if it exists
    source="${source%/}"

    # Ensure trailing slash on destination if it's not present
    [[ "$destination" != */ ]] && destination="$destination/"

    # Get absolute paths
    local abs_source
    abs_source="$(readlink -f "$source")"
    local abs_destination
    abs_destination="$(readlink -f "$destination")"

    # Extract disk and folder names from the source and destination paths
    # Adjust field numbers if your path structure is different
    local source_disk
    source_disk="$(echo "$abs_source" | awk -F'/' '{print $3}')"
    local source_folder
    source_folder="$(echo "$abs_source" | awk -F'/' '{print $4}')"
    local destination_disk
    destination_disk="$(echo "$abs_destination" | awk -F'/' '{print $3}')"

    # Create a session name
    local session_name="rsafe_${source_disk}_${source_folder}_to_${destination_disk}"

    # Replace any spaces with underscores in session name
    session_name="${session_name// /_}"

    # Display the session name
    echo "Starting transfer from '$abs_source' to '$abs_destination' in screen session '$session_name'."

    # File to store exit status
    local exit_status_file="/tmp/rsafe_exit_status_${session_name}"

    # Run rsync and deletion inside screen session
    screen -dmS "$session_name" bash -c "
        rsync -aHAXvpP --info=progress2 '$abs_source' '$abs_destination'
        rsync_exit_status=\$?
        if [ \$rsync_exit_status -eq 0 ]; then
            find '$abs_source' -mindepth 1 -delete
            deletion_exit_status=\$?
            exit_status=\$((rsync_exit_status + deletion_exit_status))
            echo \$exit_status > '$exit_status_file'
            if [ \$deletion_exit_status -eq 0 ]; then
                echo 'Transfer from \"$original_source\" to \"$original_destination\" completed successfully. Source files deleted.'
            else
                echo 'Transfer completed, but failed to delete source files.' >&2
            fi
        else
            echo \$rsync_exit_status > '$exit_status_file'
            echo 'rsync from \"$original_source\" to \"$original_destination\" failed. Not deleting source files.' >&2
        fi
    "

    # Attach to the screen session (optional)
    screen -r "$session_name"

    # After detaching, wait for the screen session to finish
    while screen -list | grep -q "$session_name"; do
        sleep 10
    done

    # Read the exit status
    if [ -f "$exit_status_file" ]; then
        exit_status=$(cat "$exit_status_file")
        rm -f "$exit_status_file"
    else
        exit_status=1  # Assume failure if exit status file not found
    fi

    # Final message based on exit status
    if [ "$exit_status" -eq 0 ]; then
        echo "Process completed successfully."
    else
        echo "Process failed or was terminated."
    fi
}

However, for users who just want to run a simple command manually and watch the transfer:

rsync -aHAXvpP --info=progress2 /mnt/diskx/sharename /mnt/disky/

jbrodriguez commented 2 weeks ago

3. How do I avoid this situation in the future?

not sure how to do that, but thankfully there are "solutions", as shown by @sanderai (nice stuff!)

@jbrodriguez I'm not sure what you mean by it runs at the root of the source folder

using your example rsync -aHAXvpP --info=progress2 /mnt/diskx/sharename /mnt/disky/, unbalanced does:

cd /mnt/diskx/
rsync -avPR -X "sharename" "/mnt/disky"

don't exactly remember why i implemented it like this, but this was WAYYYYY back, talking about ~2017

it's interesting that mover just works, iirc mover is a shell script

@undaunt's script also just uses an rsync command; haven't read through the script code, but the fact that it handles hardlinks properly means i could potentially incorporate it into unbalanced, thanks for sharing it!
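
for reference, since unbalanced already cds into the disk root, one possible direction is passing every selected folder to a single rsync invocation and adding -H, which preserves links between any files included in that one run. a sketch only, with made-up share and disk names:

cd /mnt/diskx/
rsync -avPR -X -H "tv/downloads" "tv/media" "/mnt/disky"
# -R recreates the relative paths under /mnt/disky, and -H relinks any
# files that share an inode within this single transfer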

sanderai commented 2 weeks ago

it's interesting that mover just works, iirc mover is a shell script

I don't know exactly how the mover does this, since I haven't looked into its code, but it's not creating the hardlinks on the fly per file. It's probably storing the hardlink info beforehand (the ls -i command shows that info, for example) and then relinking them all after the mover job completes. If you stop the process before it can finish, all the hardlinks are broken too. This is also evident from the destination drive usage: it grows to 2x the size until the end of the job, when it drops back to the "normal" hardlinked size.

But that last part is very fast, and they definitely don't do recalculations after the job, just a simple reapply, since they already know which files were linked to which before the whole move started. jdupes should only be used if you have truly lost that information and want to recalculate it (with a few extra parameters it can actually keep a hash log for faster subsequent runs on the same file tree).
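
A rough sketch of that record-then-relink idea, assuming tab-free paths, GNU find/awk, and that the share keeps the same relative layout on the destination; all paths are examples:

# 1) before the move: record every file that shares an inode with another
find /mnt/disk1/share -type f -links +1 -printf '%i\t%P\n' | sort -n > /tmp/linkmap

# 2) after the move: hardlink each group member back to the group's first file
awk -F'\t' '$1 == prev { print first "\t" $2; next } { prev = $1; first = $2 }' /tmp/linkmap |
while IFS=$'\t' read -r keep dup; do
    ln -f "/mnt/disk2/share/$keep" "/mnt/disk2/share/$dup"
done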

Snake883 commented 2 weeks ago

Why not add -H to rsync by default? It may not always work, but it might at least cover some scenarios and save time recreating the hardlinks.

Any reason not to use -H?

sanderai commented 2 weeks ago

The -H flag only works if you transfer the linked files together in one rsync command. If you run separate rsync commands for the two names in their separate locations, the link is broken regardless of the flag. And currently unBalanced seems to create lots of small rsync transfers based on the selected folders and files, rather than transferring them all in one big command. This would need some bigger rewrites, I guess.

I unfortunately don't have more time to test this or work towards a PR right now.

undaunt commented 2 weeks ago

@sanderai The mover enhanced plugin actually moves files and their hardlinks side by side, unlike the native mover, which does all the hardlinks at the end. So, for heavy hardlink workloads, the plugin may add value.