sharkdp / fd

A simple, fast and user-friendly alternative to 'find'
Apache License 2.0

Way to find "all files not matching uid:gid" #1087

Open rmartine-ias opened 2 years ago

rmartine-ias commented 2 years ago

My use case is speeding up: chown -R uid:gid dir/ by replacing it with something like: find dir/ \( ! -uid uid -o ! -gid gid \) | xargs chown uid:gid.

I do not think this is possible with fd, because the --owner flag combines the user and group conditions with AND logic. So, if I say:

fd --owner "!uid:!gid" . dir/

I will match only the files where BOTH uid and gid are wrong. I want to match the files where EITHER the uid or the gid is wrong, so I can change them. I can run it twice:

fd --owner "!uid" . dir/ | xargs chown uid:gid
fd --owner ":!gid" . dir/ | xargs chown uid:gid

but this seems suboptimal.
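
For comparison, the single-pass find version of the use case above can be made safe for unusual file names by letting find batch the chown calls itself instead of piping to xargs. This is only a minimal sketch, not an fd feature: the function names list_mismatched and fix_owner are made up for illustration, and -uid/-gid are assumed to be available (GNU and BSD find have them, but they are not POSIX).

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of fd.

# list_mismatched DIR UID GID
# Print every entry under DIR whose numeric uid OR gid differs.
list_mismatched() {
    find "$1" \( ! -uid "$2" -o ! -gid "$3" \) -print
}

# fix_owner DIR UID GID
# chown only the mismatched entries. "{} +" packs many paths into each
# chown process, much like xargs does.
fix_owner() {
    find "$1" \( ! -uid "$2" -o ! -gid "$3" \) -exec chown "$2:$3" {} +
}
```

Something like fix_owner dir/ 100 100 (typically as root) would then leave entries that already have the right owner untouched.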

sharkdp commented 2 years ago

Thank you for the feedback.

I agree, it's probably not possible right now. Do you think there would be a (reasonably simple) update to the supported syntax for this option?

> but this seems suboptimal.

It probably is. But here's one suggestion for how you can speed up those commands significantly: do not pipe to xargs, but use fd's --exec/-x option instead. This is not just shorter to type; it also runs those chown commands in parallel:

fd --owner "!uid" . dir/ -x chown uid:gid

If you really want to run everything sequentially in one single big chown command, you can still use --exec-batch/-X:

fd --owner "!uid" . dir/ -X chown uid:gid

rmartine-ias commented 2 years ago

Thank you for the reply, and the great tool!

> Do you think there would be a (reasonably simple) update to the supported syntax for this option?

Adding an -or operator seems like overkill. Adding an "invert the entire find operation" operator seems like it would be a lot of work (though maybe useful in other cases?). Something like !(uid:gid) seems clear to me though. Then you can match:

Want              Syntax
UID               uid
GID               :gid
!UID              !uid
!GID              :!gid
UID && GID        uid:gid
UID && !GID       uid:!gid
!UID && GID       !uid:gid
!UID && !GID      !uid:!gid
UID || GID        !(!uid:!gid)
UID || !GID       !(!uid:gid)
!UID || GID       !(uid:!gid)
!UID || !GID      !(uid:gid)

This leaves three terrible-looking ones (the OR rows with a negation inside the parentheses), unfortunately, but I can't think of a use case for them offhand, so they could maybe go unsupported. An alternative syntax could be !uid|!gid, replacing the colon with a pipe to denote an "or", which would make all of the variants simple to express. The last idea I had was to allow multiple --owner arguments, with an implicit OR between them. I tried this before looking at the source and hoped it would work, so that's a point in favor of its intuitiveness.
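
The OR rows in the table above all rest on De Morgan's law: !(A && B) is the same as !A || !B, so negating an AND of the two conditions yields an OR of their negations. Here is a quick sanity check of that identity in shell arithmetic, where u and g are stand-ins for "uid matches" and "gid matches" (demorgan_table is a made-up name, and this says nothing about how fd would parse such a syntax):

```shell
#!/bin/sh
# demorgan_table: for every combination of u ("uid matches") and
# g ("gid matches"), print the value of !(u && g) next to !u || !g.
demorgan_table() {
    for u in 0 1; do
        for g in 0 1; do
            not_and=$(( ! ($u && $g) ))
            or_of_nots=$(( ! $u || ! $g ))
            echo "u=$u g=$g not_and=$not_and or_of_nots=$or_of_nots"
        done
    done
}

demorgan_table
```

The two columns agree in all four rows, and the only row where both are 0 is the one where uid and gid both already match, which is exactly the set of files that !(uid:gid) is meant to skip.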

> Do not pipe to xargs but use fds --exec/-x option instead.

I did test -x, and for some reason it was slower than piping to xargs, by about 24x. Something may be wrong with my testing setup; I'm not fully confident in it.

fd_xargs.sh:

#!/usr/bin/env bash
fd -0 --owner ":!100" . "$1" | xargs -0 chown 100:100
fd -0 --owner "!100" . "$1" | xargs -0 chown 100:100

fd_x.sh:

#!/usr/bin/env bash
fd --owner ":!100" . "$1" -x chown 100:100
fd --owner "!100" . "$1" -x chown 100:100

hyperfine --warmup 3 --prepare 'sudo rm -rf wd && cp -r mostly-right wd' 'sudo ./fd_x.sh wd' 'sudo ./fd_xargs.sh wd':

Benchmark 1: sudo ./fd_x.sh wd

  Time (mean ± σ):      1.993 s ±  0.060 s    [User: 1.628 s, System: 6.504 s]
  Range (min … max):    1.895 s …  2.101 s    10 runs

Benchmark 2: sudo ./fd_xargs.sh wd
  Time (mean ± σ):      82.5 ms ±   1.8 ms    [User: 43.7 ms, System: 170.1 ms]
  Range (min … max):    79.5 ms …  85.7 ms    10 runs

Summary
  'sudo ./fd_xargs.sh wd' ran
   24.17 ± 0.90 times faster than 'sudo ./fd_x.sh wd'

macOS Monterey 12.5.1, M1 chip, fd 8.4.0. mostly-right is a directory of about 2000 files, of which about 20 are not 100:100. I also tested with a directory in which none of the files had the correct uid:gid, and got the same results.

Running batched (-X) provides identical performance to xargs.

Animeshz commented 1 year ago

I believe a uid1|uid2|uid3:gid1|gid2 syntax would be best. Everything before the colon would be a list of users and everything after it a list of groups, each combined with OR, since AND can never apply (a file can't be owned by two users xD).

A !uid1|!uid2 would be combined with AND, since OR here would simply return the full set. I.e., any ! expressions will be ANDed.

We wouldn't accept uid1|!uid2|uid3, as that would read as "uid1 or uid3 but not uid2", which doesn't make any sense: uid2 can't overlap with uid1 or uid3...

Final grammar, one of:

uid1|uid2...:gid1|gid2...
!uid1|!uid2...:gid1|gid2...
uid1|uid2...:!gid1|!gid2...
!uid1|!uid2...:!gid1|!gid2...

LMK what you think of this...

horst5000 commented 1 year ago

Cool, I have EXACTLY the same use case as @rmartine-ias, and also thought about using fd for speeding up chown'ing lots of files. Happy to see that this is already being discussed.

tavianator commented 1 year ago

I'm not convinced the !uid check is actually helpful even in the best case. For short-lived tasks like this, in order from most to least expensive, we have process creation, syscalls, then everything else. So fd -x is likely to lose as it creates way too many processes. fd -X should be better, since it does a few syscalls per file and only creates one process. But chown -R does pretty much the same thing. With fd -X you get parallel traversal but then chown is serialized so it probably doesn't help much.
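
The process-count argument can be made concrete without fd. In this sketch, xargs -n 1 stands in for fd -x (one command per path) and plain xargs stands in for fd -X (one command for many paths). That mapping is only an approximation for illustration, and count_calls is a made-up helper.

```shell
#!/bin/sh
# count_calls MODE: feed 100 dummy paths to a stand-in command and print
# how many times that command was started.
#   per_file -> one process per path (roughly what fd -x does)
#   batch    -> many paths per process (roughly what fd -X or xargs does)
count_calls() {
    paths=$(i=1; while [ "$i" -le 100 ]; do echo "file$i"; i=$((i + 1)); done)
    case $1 in
        per_file) printf '%s\n' "$paths" | xargs -n 1 sh -c 'echo started' _ ;;
        batch)    printf '%s\n' "$paths" | xargs sh -c 'echo started' _ ;;
    esac | wc -l | tr -d ' '
}

count_calls per_file
count_calls batch
```

With chown in place of the stand-in command, the per-file mode pays the process-creation cost once for every path, which fits the gap seen in the benchmark earlier in the thread.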

I think the most efficient thing is likely to be something like

# Do all the files below depth 2 at once
fd --max-depth 1 -X chown user:group
# Do all the subtrees at depth 2 in parallel
fd --exact-depth 2 -x chown -R user:group

Also, to get almost the semantics of !uid || !gid you can just run fd twice:

fd --owner '!uid' ...
fd --owner ':!gid' ...

You can get duplicates this way, but for running chown that doesn't matter.
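
If the duplicates do matter, the two result lists can be merged and de-duplicated before anything is run on them. This is a minimal sketch, assuming a sort that supports -z for NUL-separated records (GNU and recent BSD sort do; it is not POSIX). merge_unique and chown_mismatched are made-up names, and 100:100 is just the example owner from the benchmark scripts earlier in the thread.

```shell
#!/bin/sh
# merge_unique: read NUL-separated paths on stdin, write each one once.
merge_unique() {
    sort -z -u
}

# chown_mismatched DIR: run both fd queries, de-duplicate the combined
# list, then hand it to chown in batches. Defined here but not called.
# With GNU xargs, adding -r skips chown when nothing matched.
chown_mismatched() {
    { fd -0 --owner '!100' . "$1"; fd -0 --owner ':!100' . "$1"; } |
        merge_unique | xargs -0 chown 100:100
}
```

Each mismatched path then reaches chown exactly once, at the cost of buffering the whole list in sort before chown starts.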

gibzer commented 1 year ago

You could do it like this:

fd --owner '!root:!root' . /dir | perl -lne 'chown(0, 0, $_)'

This avoids spawning a new chown process for each file. In chown(0, 0, $_), the first argument is the numeric UID and the second is the numeric GID.

MagicMuscleMan commented 1 week ago

> You can get duplicates this way, but for running chown that doesn't matter.

If you do not want to touch the ctime of the involved files, it does matter. I would even go so far as to say that a solution is only correct if it preserves the ctime of every file whose uid:gid pair is identical before and after the fd call.

The proposal of @rmartine-ias would be correct by this definition.