olson-sean-k / wax

Opinionated and portable globs that can be matched against paths and directory trees.
https://glob.guide
MIT License
112 stars 10 forks source link

Suboptimal performance when using Walk::not #49

Closed arlyon closed 10 months ago

arlyon commented 11 months ago
let glob = make_glob("crates/**/*.rs");
let walk = glob.walk("/path/to/dir").not(vec![make_glob("crates/**/target")]);

Consider the code below

https://github.com/olson-sean-k/wax/blob/1afcda8318201afc04ebed06fea907d54fc1bf8c/src/walk.rs#L749-L767

Given the globs above, the walker will end up traversing into every single target folder and never trigger a skip. The 'skip tree' optimisation will never trigger because it is only able to skip a folder after it is yielded to the negation via self.input.next(), and most target folders will a) not have rust files to yield and b) if they do, will not match the exclude case because they are a file and not the folder itself. That means that we will not correctly detect when we can TreeIterator::skip_tree. Seems like we need to invert the order here so that the FilterTree is wrapped by the Walk so that it may reject items first. Alternatively we could modify the Walk pattern to include the excluded globs, so that they are yielded to the FilterTree for exclusion but that seems a little inelegant...

I believe that this will have no impact on the behaviour, but it will make walking with negations much more efficient.

arlyon commented 11 months ago

Note, it is possible to work around this by modifying the walk glob manually to match the behaviour I am describing, however globs are user supplied and so I cannot reasonably advise them to modify their globs.

- let glob = make_glob("crates/**/*.rs");
+ let glob = make_glob("crates/{target,**/*.rs}");
olson-sean-k commented 11 months ago

Patterns are never implicitly exhaustive. Just as crates/**/target will never positively match paths with names or components beyond target (for example, the path crates/wax/target/debug/ does not match), a negation will not negatively match paths with names or components beyond target either. As a negation, crates/**/target only discards paths that terminate in a component matching target anywhere beneath a component matching crates. The negation crates/**/target/** is probably what you want here if I understand correctly. I would expect that pattern to avoid the spurious reads.

I think the most common kind of negations that folks want must terminate in a tree wildcard ** like this. I've considered making this implicit, but that would force all negations to behave this way and I can imagine a need to avoid more specific paths, among other issues.

arlyon commented 11 months ago

Thanks for the quick reply! Using doublestar on the negation (crates/**/target/**) doesn't make a difference in my tests. I have attached some chrome traces of the various combinations. (not to scale, the importance is that big gap in the middle where we are neither yielding nor skipping folders).

crates/**/*.rs - crates/**/target/**

image

crates/**/*.rs - crates/**/target

image

crates/*/{target,**/*.rs} - crates/**/target/**

image

crates/*/{target,**/*.rs} - crates/**/target

image

The negation crates/**/target/** is probably what you want here if I understand correctly. I would expect that pattern to avoid the spurious reads.

You are partially correct, that does prevent rs files in target from being yielded (yay, part of the issue solved) but, in terms of performance, the issue is that if the Walk doesn't yield the target folder to the FilterTree then we don't get the chance to skip it and so we end up recursively walking the folder. If we end up finding a .rs file, the FilterTree will correctly filter it out but we still waste time walkin a folder we don't need to.

olson-sean-k commented 11 months ago

the issue is that if the Walk doesn't yield the target folder to the FilterTree then we don't get the chance to skip it

You're absolutely right! Filtering is not composed correctly and FilterTree only operates on the subset of items it receives from Walk. I'm working on a fix now.

olson-sean-k commented 10 months ago

After a bit of manual testing, I believe that I have a preliminary fix (currently on the filter branch). It introduces SeparatingFilters, which allow monotonic (one-way) filtering of a separation with special support for hierarchical tree iterators (like those reading directory trees from a file system) where all combinators can apply filter-mapping functions to the entire feed (both its filtrate and its residue).

This has introduced a significant amount of code, so I've reorganized the walk module and now expose it in the public API rather than re-exporting some of its items into the crate root. The walk module is now more substantial, but it preserves the current combinator API and its good ergonomics. 🎉

arlyon commented 10 months ago

Hello! Thank you for taking this on, you saved me a decent chunk of work :) I will pull down the branch and have a look.

olson-sean-k commented 10 months ago

I've landed a fix in 76afaa1 on master. :tada:

There are definitely some gaps in testing here, but this unit test should detect the underlying problem that caused this issue.