tavianator / bfs

A breadth-first version of the UNIX find command
https://tavianator.com/projects/bfs.html
BSD Zero Clause License

RFE: child/parent operators #141

Open intelfx opened 2 weeks ago

intelfx commented 2 weeks ago

It would be an interesting addition to the GNU find syntax to be able to evaluate parts of the find expression in the context of a child or parent file.

If that's too abstract, here are a few examples in pseudo-find syntax with the proposed extension:

  1. Exclude all directories containing a CACHEDIR.TAG file:

    find . -type d -child \( -type f -name CACHEDIR.TAG \) -prune -or ...

  2. Find all directories that look like a Borg repository:

    find . -type d -child \( -type f -name config -execdir grep -q -Fx '[repository]' {} \; \) -child \( -type d -name data \)
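
For comparison, both examples can be approximated with stock find today by running a small shell test per directory. This is only a sketch: the directory layout below is invented for illustration, and it pays one fork/exec per directory, which is the cost the proposed -child operator would avoid.

```shell
#!/bin/sh
# Throwaway tree for illustration; every name below is made up.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
mkdir -p "$tmp/cache/sub" "$tmp/src" "$tmp/repo/data"
: > "$tmp/cache/CACHEDIR.TAG"
printf '[repository]\nversion = 1\n' > "$tmp/repo/config"

# Example 1: prune directories containing CACHEDIR.TAG, print the others.
# "-exec ... \;" acts as a test that is true when the command exits 0,
# so it can gate -prune.
pruned=$(cd "$tmp" && find . -type d \
    -exec sh -c 'test -f "$1/CACHEDIR.TAG"' sh {} \; -prune \
    -o -type d -print | LC_ALL=C sort)

# Example 2: directories with a "data" subdirectory and a "config" file
# that contains the exact line "[repository]".
borg=$(cd "$tmp" && find . -type d \
    -exec sh -c 'test -d "$1/data" &&
        grep -q -Fx "[repository]" "$1/config" 2>/dev/null' sh {} \; \
    -print | LC_ALL=C sort)

printf 'not pruned:\n%s\n' "$pruned"
printf 'borg-like:\n%s\n' "$borg"
```

Here ./cache and ./cache/sub are skipped by the first command, and only ./repo is reported by the second.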

If this syntax is infeasible to implement efficiently because it requires nested iteration in the general case, I can imagine another variant:

find . -type d -child config \( -type f -execdir grep -q -Fx '[repository]' {} \; \) -child data \( -type d \)

In this case, the -child operator has two operands: (1) a string representing a specific child file name to examine, and (2) a subexpression that is evaluated at most one time against the specific file named by the first operand (or not at all if there is no such file).

tavianator commented 2 weeks ago

This is an interesting idea, and the kind of thing that's come up before (e.g. https://github.com/tavianator/bfs/issues/92), where you want to match (or not match) directories based on their contents.

There are a couple tricks that currently work:

This kinda reminds me of the :has() selector in CSS. It would be theoretically possible to implement

$ bfs -not -has \( -maxdepth 1 -name CACHEDIR.TAG \)

via a recursive bftw() call, and it could even share resources (ioq thread pool, open fd cache) with the parent call to be more efficient. But I'm not sure the complexity is worth it.

[^1]: You could use something like comm -z -23 <(bfs ... -print0 | sort -z) <(bfs ... -printf '%h\0' | sort -z) I guess, but that's pretty gross and still doesn't let you prune directories easily.
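
A runnable variant of the footnote's comm trick might look like the following. It is a sketch that assumes GNU find, sort and comm (for -printf, -print0 and -z), uses plain find where the footnote elides the bfs arguments, and swaps the process substitutions for temporary files so it also runs under plain sh. The sample tree is invented.

```shell
#!/bin/sh
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
mkdir -p "$tmp/tree/cache" "$tmp/tree/src"
: > "$tmp/tree/cache/CACHEDIR.TAG"

# Every directory, NUL-delimited and sorted.
(cd "$tmp/tree" && find . -type d -print0) | LC_ALL=C sort -z > "$tmp/all"

# Parent directory of every CACHEDIR.TAG (%h is the path minus its last component).
(cd "$tmp/tree" && find . -name CACHEDIR.TAG -printf '%h\0') |
    LC_ALL=C sort -z > "$tmp/tagged"

# Entries only in the first list: directories without a CACHEDIR.TAG child.
untagged=$(LC_ALL=C comm -z -23 "$tmp/all" "$tmp/tagged" | tr '\0' '\n')
printf '%s\n' "$untagged"
```

As the footnote says, this only filters the output. Subdirectories of a tagged directory are still traversed and still listed.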

intelfx commented 2 weeks ago

  • You can sometimes match the child itself and then print the parent path. E.g. to match directories that do contain CACHEDIR.TAG:

    tavianator@graphene $ bfs -name CACHEDIR.TAG -printf '%h\n'
    ./.cargo/registry
    ./.cache/fontconfig
    ./.cache/pipx

That's precisely what I'm doing now; however, it doesn't let me -prune these directories, unless I'm mistaken?

  • You could use other commands as filters too, e.g.

    tavianator@graphene $ bfs -type d -exec sh -c '! test -e "$1/CACHEDIR.TAG"' sh {} \; -print

Yes, -exec test ... is also something I tried, but the fork/exec overhead becomes pretty prohibitive.
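
One way to cut the fork/exec cost with existing tools is to batch directories into a single shell via `-exec ... {} +`. The sketch below uses plain find on an invented tree. Note the limitation: `-exec ... +` always evaluates as true, so it can filter what gets printed but cannot drive -prune.

```shell
#!/bin/sh
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
mkdir -p "$tmp/cache" "$tmp/src/lib"
: > "$tmp/cache/CACHEDIR.TAG"

# One sh process handles a whole batch of directories ("{} +")
# instead of one fork/exec per directory ("{} \;").
untagged=$(cd "$tmp" && find . -type d -exec sh -c '
    for dir in "$@"; do
        [ -e "$dir/CACHEDIR.TAG" ] || printf "%s\n" "$dir"
    done' sh {} + | LC_ALL=C sort)

printf '%s\n' "$untagged"
```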


It would be theoretically possible to implement

$ bfs -not -has \( -maxdepth 1 -name CACHEDIR.TAG \)

via a recursive bftw() call, and it could even share resources (ioq thread pool, open fd cache) with the parent call to be more efficient

Nice! Yes, this is exactly what I was proposing. Thanks for the hint, I might actually try doing this, because the lengths I have to go to in order to work around the lack of this feature are not really pleasant.

But I'm not sure the complexity is worth it.

If that's too complex, perhaps the second, more limited form of this proposal (one that basically lets you do a -exec test ... in-process) would be acceptable?

$ bfs -not -has CACHEDIR.TAG \( -type f \)

tavianator commented 2 weeks ago

I think a nicer middle-ground might be

$ bfs -exclude -has-child \( -type f -name CACHEDIR.TAG \)

which would behave semantically like

$ bfs -exclude -has \( -mindepth 1 -maxdepth 1 -type f -name CACHEDIR.TAG \)

-has-child avoids the exponential complexity of unrestricted -has and can be implemented without patching bftw() at all, I believe. The cost of the extra flexibility is an extra readdir(), but I think it's probably worth it. (We could even have the optimizer detect a non-wildcard -name and convert the readdir() into stat(".../CACHEDIR.TAG") if it makes a big difference.)
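
To illustrate the trade-off, the two lookup strategies can be imitated in shell. This is a rough analogy, not how bfs would implement it: has_child_scan and has_child_stat are hypothetical helper names. The first enumerates the directory, like the readdir() path, and the second probes one fixed name, like the stat() shortcut.

```shell
#!/bin/sh
# Enumerate the directory's immediate children and filter them.
has_child_scan() {
    [ -n "$(find "$1" -mindepth 1 -maxdepth 1 -type f -name CACHEDIR.TAG)" ]
}

# Probe a single known name. "test -f" follows symlinks while
# "find -type f" does not, so symlinks are rejected explicitly.
has_child_stat() {
    [ -f "$1/CACHEDIR.TAG" ] && [ ! -L "$1/CACHEDIR.TAG" ]
}

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
mkdir -p "$tmp/tagged" "$tmp/plain" "$tmp/linked"
: > "$tmp/tagged/CACHEDIR.TAG"
ln -s ../tagged/CACHEDIR.TAG "$tmp/linked/CACHEDIR.TAG"

report=$(for dir in tagged plain linked; do
    if has_child_scan "$tmp/$dir"; then scan_result=yes; else scan_result=no; fi
    if has_child_stat "$tmp/$dir"; then stat_result=yes; else stat_result=no; fi
    printf '%s: scan=%s stat=%s\n' "$dir" "$scan_result" "$stat_result"
done)
printf '%s\n' "$report"
```

The shortcut is only valid because the name has no wildcards. A pattern such as `-name '*.TAG'` would still need the directory listing.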

Btw, for correct CACHEDIR.TAG detection you should also check that the contents of the file start with Signature: 8a477f597d28d172789f06886806bc55, according to https://bford.info/cachedir/, but that may not be worth it in practice.
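
A minimal sketch of that stricter check in shell follows. is_cachedir_tag is a hypothetical helper name, and `head -c` is assumed to be available (it is in GNU and BSD userlands, but is not required by POSIX).

```shell
#!/bin/sh
# True if $1 is a regular file whose first 43 bytes are the signature:
# "Signature: " (11 bytes) followed by 32 hex digits.
is_cachedir_tag() {
    [ -f "$1" ] || return 1
    [ "$(head -c 43 "$1")" = "Signature: 8a477f597d28d172789f06886806bc55" ]
}

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'Signature: 8a477f597d28d172789f06886806bc55\n# sample tag file\n' > "$tmp/good"
: > "$tmp/bad"   # an empty file with the right role but no signature

if is_cachedir_tag "$tmp/good"; then good=valid; else good=invalid; fi
if is_cachedir_tag "$tmp/bad"; then bad=valid; else bad=invalid; fi
printf 'good: %s\nbad: %s\n' "$good" "$bad"
```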