pkolaczk / fclones

Efficient Duplicate File Finder
MIT License
1.96k stars 75 forks source link

Implement `fclones dedupe` command (reflink) #74

Closed th1000s closed 3 years ago

th1000s commented 3 years ago

This adds a --reflink option to fclones link (and for now does not update the help text). Maybe it should get its own subcommand because it sits between linking and copying files. And I think it definitely should not fall back to cross-device symlinks.

If the filesystem does not support reflinks it does not abort the entire link process but tries again, maybe the first failure should be more than a warning when reflinking, or the fs is tested first?

Because previously created reflinks are undetectable the space_to_reclaim() output is unreliable, maybe add an "up to" qualifier in that case?

To verify (manually) on btrfs run btrfs filesystem du testdir/ and use files at least a few kilobytes long.

pkolaczk commented 3 years ago

Thank you for contributing to fclones! Generally the PR looks good.

Reflink semantics is different than link / soft link so it definitely deserves a separate command. Let's call it dedupe. The word "link" adds IMHO confusion here, because at the logical level the files are not linked - you can modify each of them without affecting another.

th1000s commented 3 years ago

Added a dedupe command and now only use the reflink crate for macOS (and, theoretically, some Windows server operating systems, but that is super untested. Maybe disable it explicitly or require an --eat-my-data flag).

Since the direct Linux ioctl can overwrite existing files it is easy to keep all the metadata in place to make the deduplication truly transparent.

I saw the filetime crate was already included and used it to restore some metadata when the reflink crate is used, but see the TODO (and others) in the code. On macOS Sierra (10.12 from 2016) the stdlib fs::copy already uses clonefile so the reflink dependency could be dropped if windows support is not that important.

pkolaczk commented 3 years ago

Added a dedupe command and now only use the reflink crate for macOS (and, theoretically, some Windows server operating systems, but that is super untested. Maybe disable it explicitly or require an --eat-my-data flag).

Not sure if I understood, but reflink create is used on all systems except Linux Android, not except all Linux flavors. I can see there is target_os = "android" qualifier.

Honestly if the reflink crate doesn't support some usecase, try to:

There are some clippy warnings failing the CI, but I understand this is a draft. BTW: Which filesystems on Linux I can test it with?

pkolaczk commented 3 years ago

As for Windows support, if deduplication functionality is buggy and could cause data loss, better disable it by default, and enable only under --enable-experimental-options (need to add such flag). But I'm also fine just disabling it for now.

th1000s commented 3 years ago

On Linux (and Android, which Rust treats as a different Unix) the ioctl is called directly, this works without removing the destination, so it is easy to keep all metadata. Only macOS (and Windows) currently use the reflink crate. However when testing on Linux it is also used to test the the logic which manually restores the metadata.

Regarding contributing to reflink, I currently have a PR open for something else, and at the moment the crate doesn't see that many commits. I'll place all the deduplication and metadata-keeping code in a separate module for now and see about upstreaming the changes later.

On Linux btrfs and XFS support ioctl_ficlone, an mkfs should have this enabled without extra flags.

th1000s commented 3 years ago

The reflink functions are now in a separate file, and because in the case of deduplication is is known that the file are of equal size the destination is no longer truncated.

pkolaczk commented 3 years ago

Thank you, this PR is still marked as a draft. Is there any work remaining? I'd like to do a final review, merge it and release it. :)

th1000s commented 3 years ago

I added a warning regarding the --priority and --rf-over options to the dedupe --help output. The windows code path (if someone builds it there) should still be hidden behind an --experimental flag and somewhere the metadata loss there and on macOS needs to be mentioned or enabled via --lossy or so.

(Build fails? Works here on 1.48 and 1.55, but I can't see details, maybe try github actions? I think binaries can be built there as well, once you find the right YAML incantation.)

pkolaczk commented 3 years ago

Code formatting check failed. The build and tests passed. I'm fixing circleci definition so the next time it would be more clear.

pkolaczk commented 3 years ago
[2021-10-20 10:51:36.517] fclones:  info: Started deduplicating
[2021-10-20 10:51:36.618] fclones:  warn: Failed to reflink /home/pkolaczk/Projekty/fclones/test/foo.3 -> /home/pkolaczk/Projekty/fclones/test/foo.1: Failed to deduplicate /home/pkolaczk/Projekty/fclones/test/foo.3: Operation not supported (os error 95)
[2021-10-20 10:51:36.618] fclones:  warn: Failed to reflink /home/pkolaczk/Projekty/fclones/test/foo.2 -> /home/pkolaczk/Projekty/fclones/test/foo.1: Failed to deduplicate /home/pkolaczk/Projekty/fclones/test/foo.2: Operation not supported (os error 95)

Nit: These warn messages are a bit too verbose. Can you make them less redundant? Let's not report the same path twice.

This should be enough:

warn: Failed to reflink /home/pkolaczk/Projekty/fclones/test/foo.3 -> /home/pkolaczk/Projekty/fclones/test/foo.1: Operation not supported (os error 95)```
pkolaczk commented 3 years ago

Merged in #79

th1000s commented 3 years ago

This PR was still marked as a draft on purpose, but now it is done, see #80