martymac / fpart

Sort files and pack them into partitions
https://www.fpart.org/
BSD 2-Clause "Simplified" License

Better support for hard links across large directory structures #8

Closed: survient closed this issue 5 years ago

survient commented 5 years ago

I'm intending to use fpsync, or fpart with a wrapper, to transfer a large number of small files (~9 TB total) where some directories contain a fair number of hard links (think Linux repositories) to save on disk space. I tried this over the past weekend with fpsync, passing it the -o "-lptgoDH" argument, and it ran really well with high throughput across multiple worker nodes compared to vanilla rsync. However, I noted that the size on disk quickly exceeded 10 TB after letting it run for an extended duration. A follow-up vanilla rsync (-avH) wound up cleaning up the "duplicate" files, restoring the hard links in the process. I think the individual rsyncs are respecting the -H flag, but only for the data they are tasked to replicate.
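For reference, the invocation was roughly the following (worker count and paths here are illustrative, not the exact ones I used):

```sh
# Parallel transfer: fpart splits the tree into partitions and fpsync
# hands each one to its own rsync worker; -H is passed down to every
# worker, but each rsync only sees its own partition's file list
fpsync -n 4 -o "-lptgoDH" /mnt/src/ /mnt/dst/
```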

I may need to do this again down the road so I'd like to leverage fpsync or fpart while still accounting for hard links across the filesystem.

martymac commented 5 years ago

Hi survient,

As you noticed, rsync can detect and replicate hard links with option -H but that will only work on a per-rsync-run basis, not with fpsync.

To be able to propagate hard links with fpsync, fpart would have to guarantee that all related links belong to the same partition. Fpart cannot do that because in live mode (the mode fpsync uses to start synchronization as soon as possible), it crawls the filesystem as it comes: there is no way to know whether a hard link to a file already written to a partition (and probably already synchronized through an independent rsync process) will appear later or not.
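To illustrate (options here are arbitrary), live mode flushes each partition as soon as it is full, so partitions are handed off before the crawl has finished:

```sh
# -L: live mode; partitions of at most 1000 files are emitted (and, via
# fpsync, handed to an rsync worker) while the crawl is still running, so
# a second link to an already-shipped inode may land in a later partition
fpart -L -f 1000 -o /tmp/part /data
```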

In non-live mode, trying to group related hard links into the same partitions would probably lead to unbalanced partitions as well as complicate the code. You can read my "Notes about cpio tool" section in the main README regarding directory metadata handling, because we are dealing with the same kind of problem here.

So I am afraid this is something fpsync cannot (and probably will not) cope with. As a workaround, and as you wrote, you can always run a final, monolithic rsync that will be able to re-create your hard links, because only a monolithic rsync process has an overall view of the filesystem being transferred.

Best regards, Ganael.

survient commented 5 years ago

Thanks Ganael, glad to get that clarification. I think what I'll do then is run a preliminary find with the -type f -links +1 arguments, then do an initial copy of the listed files, ensuring that the hard links get put in place. From there I should be able to run fpsync normally, and since rsync will detect that the files are already present, it won't overwrite the hard-linked files. I will say I was very impressed with the performance I was seeing and am looking forward to using it again in future data migrations (I have 4 more to go).
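Something like this, as a sketch (paths and worker count are placeholders):

```sh
# Pass 1: collect every regular file with more than one link and copy
# them all in a single rsync run, so -H can re-create the links
cd /mnt/src
find . -type f -links +1 > /tmp/hardlinked.list
rsync -avH --files-from=/tmp/hardlinked.list /mnt/src/ /mnt/dst/

# Pass 2: run fpsync normally; files copied above are up to date, so
# the per-partition rsyncs should leave them (and their links) alone
fpsync -n 4 -o "-lptgoD" /mnt/src/ /mnt/dst/
```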

martymac commented 5 years ago

Not sure your solution will work: if a file has changed between your first, manual pass and the following fpsync pass(es), rsync will probably break the hard link and re-create the (modified) file as a regular one; this has to be tested.

Anyway, I am glad to read that fpsync performed well during your data migration :)

survient commented 5 years ago

Definitely a valid concern, and likely to happen with some files, but with the content we are replicating I would only expect this to affect a very small percentage of the total. Nothing a final rsync pass won't fix!
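For completeness, that final pass would be a single, monolithic run along these lines (same illustrative paths as above):

```sh
# One rsync sees the whole tree, so -H can re-link any files whose
# hard links were broken by the per-partition workers
rsync -avH /mnt/src/ /mnt/dst/
```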

martymac commented 5 years ago

OK, that's good news (and indeed, a final rsync pass will fix the problem).

I'll add a note about hard links in the README, thanks for pointing that out!