markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0
805 stars 80 forks source link

Many readlink syscalls #325

Closed the8472 closed 10 months ago

the8472 commented 11 months ago

The "Gathering file list..." phase seems to take a lot of time and peeking at it with strace indicates that it's doing a ton of readlink syscalls for each file. I'm not sure what the purpose is but that can probably be optimized?

duperemove 0.13

stat("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._addcarry_u64.html", {st_mode=S_IFREG|0644, st_size=4809, ...}) = 0
readlink("/.snapshots", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._addcarryx_u64.html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
stat("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._addcarryx_u64.html", {st_mode=S_IFREG|0644, st_size=4874, ...}) = 0
readlink("/.snapshots", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._mm512_madd52hi_epu64.html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
stat("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._mm512_madd52hi_epu64.html", {st_mode=S_IFREG|0644, st_size=5791, ...}) = 0
readlink("/.snapshots", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
readlink("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._mm_madd52hi_epu64.html", 0x7fff734b12b0, 1023) = -1 EINVAL (Invalid argument)
stat("/.snapshots/36871/snapshot/home/the8472/.rustup/toolchains/nightly-2021-07-15-x86_64-unknown-linux-gnu/share/doc/rust/html/core/arch/x86_64/fn._mm_madd52hi_epu64.html", {st_mode=S_IFREG|0644, st_size=5729, ...}) = 0
the8472 commented 10 months ago

I think this is caused by this code since realpath uses readlink internally:

https://github.com/markfasheh/duperemove/blob/38675a90329f89218c36b6516315c5733201532e/file_scan.c#L315-L329

I don't quite understand why the sanitizing is necessary on every file. Instead only canonicalizing the starting points should do the job. Once that's an absolute, symlink-free path adding the descendants will remain so since directory traversal isn't following symlinks either.

JackSlateur commented 10 months ago

I think this is caused by this code since realpath uses readlink internally:

https://github.com/markfasheh/duperemove/blob/38675a90329f89218c36b6516315c5733201532e/file_scan.c#L315-L329

I don't quite understand why the sanitizing is necessary on every file. Instead only canonicalizing the starting points should do the job. Once that's an absolute, symlink-free path adding the descendants will remain so since directory traversal isn't following symlinks either.

You are correct: call from walk_dir() should not use realpath(3)

I am currently reworking the whole scan phase to fix many of its issues:

0.27 [jack:~/git/duperemove] git diff --stat origin/master | tail -1
 21 files changed, 1103 insertions(+), 1285 deletions(-)

I will work on this issue after that WIP is merged

the8472 commented 10 months ago

Cool, will the refactoring also reduce the memory usage (previous versions didn't use as much) or should I file a separate issue for that?

JackSlateur commented 10 months ago

Cool, will the refactoring also reduce the memory usage (previous versions didn't use as much) or should I file a separate issue for that?

Yes it does Last time I checked, on some extreme cases (for instance, the "large number of identical small files"), I had up to 90% memory reduction for the scan phase

JackSlateur commented 10 months ago

Hello @the8472 I believe your issue has been fixed in the latest release

Feel free to reopen if you still face the issue

Thank you for your contribution !