Open intelfx opened 2 weeks ago
This is an interesting idea, and the kind of thing that's come up before (e.g. https://github.com/tavianator/bfs/issues/92), where you want to match directories or not based on their contents.
There are a couple tricks that currently work:
You can sometimes match the child itself and then print the parent path. E.g. to match directories that do contain CACHEDIR.TAG:
tavianator@graphene $ bfs -name CACHEDIR.TAG -printf '%h\n'
./.cargo/registry
./.cache/fontconfig
./.cache/pipx
To process these directories further, you could use -printf '%h\0' | xargs -0 ... as the next stage. However, this doesn't help find directories that do not contain a matching file[^1].
Use -exec bfs ... -exit 1 as a filter. This is super inefficient, but it works:
tavianator@graphene $ bfs -type d -exec bfs -f {} -mindepth 1 -maxdepth 1 -name CACHEDIR.TAG -exit 1 \; -print
.
./Desktop
./Downloads
...
You could use other commands as filters too, e.g.
tavianator@graphene $ bfs -type d -exec sh -c '! test -e "$1/CACHEDIR.TAG"' sh {} \; -print
...
This kinda reminds me of the :has() selector in CSS. It would be theoretically possible to implement
$ bfs -not -has \( -maxdepth 1 -name CACHEDIR.TAG \)
via a recursive bftw() call, and it could even share resources (ioq thread pool, open fd cache) with the parent call to be more efficient. But I'm not sure the complexity is worth it.
[^1]: You could use something like comm -z -23 <(bfs ... -print0 | sort -z) <(bfs ... -printf '%h\0' | sort -z), I guess, but that's pretty gross and still doesn't let you prune directories easily.
You can sometimes match the child itself and then print the parent path. E.g. to match directories that do contain CACHEDIR.TAG:
tavianator@graphene $ bfs -name CACHEDIR.TAG -printf '%h\n'
./.cargo/registry
./.cache/fontconfig
./.cache/pipx
That's precisely what I'm doing now; however, it doesn't let me -prune these directories, unless I'm mistaken?
You could use other commands as filters too, e.g.
tavianator@graphene $ bfs -type d -exec sh -c '! test -e "$1/CACHEDIR.TAG"' sh {} \; -print
Yes, -exec test ... is also something I tried, but the fork/exec overhead becomes pretty prohibitive.
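One way to reduce the fork/exec cost with stock tools is to batch the check. The sketch below is only an illustration under assumptions: it relies on the POSIX `-exec ... {} +` form, the helper name `dirs_without_tag` and the `FIND` variable are hypothetical, and it defaults to `find` so it runs without bfs installed. A single `sh` receives many directories at once and uses its builtin `[`, so the per-directory cost is no longer a new process.

```sh
# Hypothetical helper: list directories under "$1" that do not directly
# contain an entry named CACHEDIR.TAG.
# Set FIND=bfs to run the same expression through bfs (assumption: both
# tools accept the POSIX "-exec ... {} +" form used here).
dirs_without_tag() {
    "${FIND:-find}" "$1" -type d -exec sh -c '
        # "$@" holds a whole batch of directories; test each one in-process.
        for dir do
            [ -e "$dir/CACHEDIR.TAG" ] || printf "%s\n" "$dir"
        done
    ' sh {} +
}
```

The trade-off is that the `+` form always evaluates to true, so it cannot act as a filter inside the expression, and it still does not allow -prune.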
It would be theoretically possible to implement
$ bfs -not -has \( -maxdepth 1 -name CACHEDIR.TAG \)
via a recursive bftw() call, and it could even share resources (ioq thread pool, open fd cache) with the parent call to be more efficient
Nice! Yes, this is exactly what I was proposing. Thanks for the hint, I might actually try doing this, because the lengths I have to go to in order to work around the lack of this feature are not really pleasant.
But I'm not sure the complexity is worth it.
If that's too complex, perhaps the second, more limited form of this proposal (one that basically lets you do a -exec test ... in-process) would be acceptable?
$ bfs -not -has CACHEDIR.TAG \( -type f \)
I think a nicer middle-ground might be
$ bfs -exclude -has-child \( -type f -name CACHEDIR.TAG \)
which would behave semantically like
$ bfs -exclude -has \( -mindepth 1 -maxdepth 1 -type f -name CACHEDIR.TAG \)
-has-child avoids the exponential complexity of unrestricted -has and can be implemented without patching bftw() at all, I believe. The cost of the extra flexibility is an extra readdir(), but I think it's probably worth it. (We could even have the optimizer detect a non-wildcard -name and convert the readdir() into stat(".../CACHEDIR.TAG") if it makes a big difference.)
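Until something like -has-child exists, the intended behavior can be approximated with an -exec filter combined with -prune. This is a rough sketch, not the proposed implementation: the helper name `walk_excluding_tagged` and the `FIND` variable are hypothetical, and it still spawns one `sh` per directory, which is exactly the overhead an in-process primary would avoid.

```sh
# Hypothetical helper: walk "$1", skip every directory that directly contains
# a regular file named CACHEDIR.TAG (and everything below it), print the rest.
# Set FIND=bfs to run the same expression through bfs.
walk_excluding_tagged() {
    "${FIND:-find}" "$1" \
        -type d \
        -exec sh -c '[ -f "$1/CACHEDIR.TAG" ] && [ ! -L "$1/CACHEDIR.TAG" ]' sh {} \; \
        -prune \
        -o -print
}
```

The extra `[ ! -L ... ]` is there because `[ -f ... ]` follows symlinks, whereas -type f in the proposed expression would not match a symlink by default.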
Btw, for correct CACHEDIR.TAG detection you should also be checking that the contents of the file start with Signature: 8a477f597d28d172789f06886806bc55, according to https://bford.info/cachedir/, but that may not be worth it in practice.
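A minimal sketch of that signature check, assuming `head -c` is available (it is common but not required by POSIX); the helper name `is_valid_cachedir_tag` is hypothetical:

```sh
# Hypothetical helper: succeed only if "$1" is a regular file (symlinks are
# followed by [ -f ]) whose first bytes are exactly the signature line
# mentioned above.
is_valid_cachedir_tag() {
    sig='Signature: 8a477f597d28d172789f06886806bc55'
    # ${#sig} is the signature length in characters (43, all ASCII).
    [ -f "$1" ] && [ "$(head -c "${#sig}" "$1" 2>/dev/null)" = "$sig" ]
}
```

It could be used as a stricter filter, e.g. `is_valid_cachedir_tag "$dir/CACHEDIR.TAG"` inside an `-exec sh -c` script, at the cost of reading each tag file.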
It would be an interesting addition to the GNU find syntax to have some way to evaluate parts of the find expression in the context of a child/parent file.
If that's too confusing, a few examples in pseudo-find syntax with the proposed extension:
Exclude all directories containing a CACHEDIR.TAG file:
Find all directories that look like a Borg repository:
If this syntax is infeasible to implement efficiently, due to the requirement to perform nested iterations in the general case, I can imagine another variant of this syntax:
In this case, the -child operator has two operands: (1) a string representing a specific child file name to examine, and (2) a subexpression that is evaluated at most one time against the specific file named by the first operand (or not at all if there is no such file).
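The two-operand semantics can be illustrated for a single directory with a small shell function. This is only a sketch of the described behavior, not the proposed find syntax: the name `child_matches` and its calling convention are hypothetical, and the "subexpression" is stood in for by an arbitrary command.

```sh
# Hypothetical emulation of "-child NAME ( EXPR )" for one directory:
#   child_matches DIR NAME COMMAND [ARGS...]
# COMMAND ARGS... DIR/NAME runs at most once, and only if DIR/NAME exists.
child_matches() {
    child="$1/$2"
    shift 2
    # No such child: the subexpression is never evaluated and the match fails.
    # The -L test also accepts a dangling symlink as an existing entry.
    [ -e "$child" ] || [ -L "$child" ] || return 1
    "$@" "$child"
}
```

For example, `child_matches "$dir" CACHEDIR.TAG test -f` succeeds only when `$dir/CACHEDIR.TAG` exists and is a regular file, which corresponds to a single lookup of one known name rather than a nested traversal.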