Open bstrie opened 7 years ago
OK, I'm just going throw a bunch of stuff against the wall and see what sticks.
The primary motivation behind globset
is in the name itself: I need a way to match many (possibly hundreds) of globs against a single file path very quickly. The naive solution of iterating over each glob and checking if it matches simply doesn't fly (the time complexity itself is not good enough). For that reason, and because RegexSet
is a thing, it made a lot of sense to do the "convert the glob to a regex" dance. However, it doesn't stop there. It turns out that the regex engine doesn't optimize common globs as well as it could, so I ended up developing a set of matching strategies that make glob matching even faster. I dare say that most globs don't touch the regex engine at all. But when they do, they're still reasonably fast.
For that reason alone, you might consider globset
to be relatively niche. It's important and the performance matters, but perhaps it's not common to need that kind of performance at that kind of scale.
With that said, it turns out that if you're implementing glob matching for a set of globs, it's also pretty easy to implement straight-forward single-glob matching as well. So globset
provides that. I also made sure to use most (all?) of the test suite from the glob
crate inside of globset
(and then some), so the functionality should be roughly equivalent.
Tangentially, the glob
crate can't match file paths that don't fit into a &str
, and this turns out to be a real problem that people face in practice (there are some issues on the ripgrep repo for it, IIRC). I'm not familiar enough with the implementation in glob
to know how easy or hard it is to fix that, but from a brief glance, it looks like the matcher pretty much assumes valid UTF-8 in that it's based on char
. So it would probably require some sizeable attention to fix that.
I personally don't have the bandwidth to fix the glob
crate any time soon. The globset
crate is something that must exist for ripgrep to be as fast as it is, so regardless of what we do with glob
, I think it's somewhat safe to say that globset
will continue to work. However, here are some caveats:
globset
uses regex
, as @bstrie points out, using it requires pulling in regex
. regex
has (perhaps deservedly) gotten a reputation for being a big dep because it takes a long time to compile.regex
, there is some overhead associated with compiling the glob since it requires compiling a regex. It's probably more expensive to compile a regex than it is to compile a glob in the glob
crate, although I haven't benchmarked it.globset
for a single glob is roughly on par with the glob
crate, and tends to get better as the paths get longer.globset
doesn't (yet) have an analogous glob
function that returns an iterator over paths that match.globset
doesn't support the require_literal_leading_dot
option, mostly because it seemed a little tricky to implement, but I honestly haven't given it much thought.4 and 5 seem like things that can be fixed. 1 and 2 are things that are probably just facts of life for the globset
crate. 3 doesn't seem like a blocker.
So I think there's a key trade off here... The globset
crate brings significantly more implementation complexity and larger compile times, but also fixes some of shortcomings of the glob
crate as it is.
@BurntSushi, Thanks for the feedback!
I personally don't have the bandwidth to fix the glob crate any time soon.
Indeed, I absolutely did not want to give the impression that I wanted to saddle you with any more work. :P Ideally I'd like our crate-elevation process to distribute the work among interested contributors by promising long-term relevance and the chance to contribute to a strategic dependency of the Rust ecosystem. It's just that to bootstrap that process we need somebody who gives a care, and I don't know that we have any such person for glob (and no document arguing why this might be strategically important to begin with, to inspire someone to step up).
I would personally be on board transitioning community mindshare to globset
. The size of the dependency is a downside, but it's not unique to globset
(applies to many other crates as well). I think we've got various mitigation strategies here as well such as global caching, incremental compilation, etc.
If we'd like to retain the name glob
it seems like we could "merge repos and bump major version" or somehow just transfer ownership of the name as well. I also don't mind just "deprecating" glob
, in favor of globset
.
Overall I don't have too many opinions, unfortunately. There's been very little activity on this repo over time, although I don't know if that's an indicator of "it's good enough" or "it's not used".
I was just looking for how to do *.{foo, bar}
with glob
so I would personally welcome using globset
.
I don't mind the compilation time/extra cost if I can get a more powerful globbing mechanism. If it is an issue, the previous versions are still on crates.io.
I think the glob
name is better than globset
.
Any update on that? Should a post be created on the internals/users forum?
@Keats I think everything I said above is still true. Notably, the globset
crate is still missing a couple of pieces of functionality that exist in the glob
crate.
There's been some activity in the last few months, https://github.com/BurntSushi/ripgrep/pull/765 has lead to the emergence of the globwalk
crate to provide a glob
function over the globset
API.
Perhaps once that's had time to bake we could look at deprecating glob
in favour of globset
+ globwalk
?
cc @Gilnaa
It looks like globwalk
, just after v0.2.0
, moved away from globset
to using ignore
in mid-April 2018 (see Gilnaa/globwalk@551ca14da218c5e507a90303e8ab6009b1b564d5). That, unfortunately, makes the discussion even more confusing.
Unfortunately, by globwalk
moving to using ignore
, windows users are generally being left out. The new combination doesn't support case-INsensitivity (which is the windows default).
It'd be nice to have a full featured, portable, well-tested library for globbing. I'm currently using wild
, which depends on the older, semi-deprecated glob
, to add globbing for windows CLI apps (see uutils/coreutils#1281). Unfortunately, there are some rough edges. And, informed by this discussion, I can't find a clear path forward to contribute to fixing the issues.
So anyway, my current vote would be to move forward in the globset
/ globwalk
direction.
If someone starts pushing forward in a single direction, I'm happy to help with Windows testing (my coding skills in the rust
environment are too naive for anything else right now).
@rivy The ignore crate supports case insensitive globbing: https://docs.rs/ignore/0.4.3/ignore/overrides/struct.OverrideBuilder.html#method.case_insensitive
I'll make sure this option is available through the GlobWalk API. Is it reasonable for this to be on by default on windows, or is it too surprising.
@Gilnaa I'm not sure about the default on Windows. I think you'll probably want to look at what other glob walkers do. Worst case, start conservative (case sensitive with opt-in case insensitive), and move to case insensitive by default if you get a lot of feedback asking for it.
@Gilnaa , from what I've seen / used, most libraries default to case sensitive matching.
However, when compiling applications for Windows targets, to minimize surprise for Windows users, those applications should default to utilizing globwalk
matching in a case-INsensitive manner. I do think it's fine to leave the onus of setting that up onto the application using globwalk
.
So, IMO, as long as there's an easy way to include/use the library with case-sensitivity set OFF, and an easy way to toggle case-sensitivity, I'd leave the basic/normal default to case-sensitivity being ON.
Thanks for working on it!
@rivy Sorry for keeping quiet, I've added support a few weeks ago https://github.com/Gilnaa/globwalk/commit/53fafd9442d76db16e0568cd71244f104603b1f6
However, when compiling applications for Windows targets, to minimize surprise for Windows users, those applications should default to utilizing globwalk matching in a case-INsensitive manner. I do think it's fine to leave the onus of setting that up onto the application using globwalk.
That would mean every person using globwalk
needs to be aware of that, which isn't going to be the case. This default should be in globwalk directly and applications should be able to override it if necessary imo.
As one example, I believe .NET's Directory.EnumerateFiles
API will use a platform-specific default. So it's case-insensitive on Windows and case-sensitive on Linux.
In my opinion, the current always case insensitive by default everywhere is totally fine. Different usecases can call for different case sensitivity irrespective of the platform, so I usually find myself explicitly needing one way or the other anyway.
OK, so my plan at the moment---assuming all goes according to plan---is to rewrite globset
. In particular, it will continue to support the use case of matching a set of globs, but I plan to make the dependency on regex
optional. As in, a default "slow" matcher will be provided, but one can enable a perf
feature (or similar) to bring regex
in for faster glob matching. I also plan to:
globset
currently lacks). I think this will also be a crate feature, but one that is enabled by default, as it will likely pull in a dependency on walkdir
. My intent is that this will be the only dependency incurred by default.glob
and globset
.glob
and globset
today, as well as probably all of the features made available by libc's glob
API (and also look at fnmatch
).globset
its own repository separate from ripgrep as a proper stand-alone repository, along with a complete API refresh.globset
, and lift some restrictions. i.e., Support nested brace syntax.One possible alternative to the above approach that I'd be willing to do is to take over glob
itself, do what I said above, but for glob
, not globset
, and then deprecate globset
. This way, we'd have one globbing library that everyone can centralize around. However, it would mean a total and complete break from the current glob
crate as it exists today. My sense, though, is that folks will be fine with this because the crate is pretty much limping along right now. (And that's only thanks to @KodrAus's generous maintenance effort. <3)
cc'ing some of the library team for their thoughts @alexcrichton @KodrAus @dtolnay @SimonSapin @sfackler
That all sounds fantastic to me, and I think it's fine to move glob
to 1.0 with a breaking change
I'd probably do a 0.4
release first if that's okay, in order to make sure whatever API I come up with is tested a bit before committing.
One possible alternative to the above approach that I'd be willing to do is to take over glob itself, do what I said above, but for glob, not globset, and then deprecate globset.
Personally, I like the idea of making the most of the glob
name that we've already got here so an a fan of this path :+1: If we went this way would we want to work in this repository? If so maybe we could create a 0.3.x
branch for the current source and revamp master
? I'd be keen to get on board with stuff like supporting docs for migrating from the current glob
to wherever we land if you think it'd be helpful!
I think that sounds reasonableish. My plan is to just do what I said above, and then dump it on to master in this repo and put out a release shortly thereafter. At that point, we can create a branch for the old 0.3.x
release, although I'm not sure how much utility that serves. I'm not sure it's worth the maintenance time and effort to continue supporting 0.3.x
.
I'd be keen to get on board with stuff like supporting docs for migrating from the current glob to wherever we land if you think it'd be helpful!
Sounds good, but hopefully we shouldn't need too much here. It's going to be a complete break from the current API, so folks will probably just want to treat it as if it's a new crate and re-familiarize themselves with the new API. We can of course provide some common examples of how to translate old code. Although I mostly expect the top-level glob
function to remain!
Hey @BurntSushi, thanks a lot for the globset
/walkdir
crates and laying out a plan to move forward with the glob
crate.
I wonder if you had time work on the rewrite of globset
and could give an update on the progress? If you are already at a point where more incremental changes can be made, I would be happy to pick up outstanding tasks.
For Vector, we would like to add a glob-based file walker and contribute to upstream projects where suitable/necessary. Since the situation seems to be spread out between the glob
, globset
and globwalk
crates, it would be nice to know where it would be best to direct our efforts.
Judging from https://github.com/BurntSushi/ripgrep/pull/765#issuecomment-361419059, https://github.com/Gilnaa/globwalk/issues/28#issuecomment-742099252 and this very issue it seems like converging all functionality into globset
and replacing glob
with that would still be the optimal course of action.
Per the original rust-lang-nursery RFC ( https://github.com/rust-lang/rfcs/blob/master/text/1242-rust-lang-crates.md ):
I'm not sure who the "original crate author" of glob is, as I'm guessing this crate was split out from the stdlib during the great pre-1.0 purge, but it doesn't appear that there's any real champion trying to advance this crate through the stabilization process.
I'm interested in starting a discussion as to the future of this crate. Sadly I can't find the original petition arguing why this crate belongs in the nursery in the first place, so I'm not even sure how to begin evaluating the aspirations of this crate that keep it from being 1.0 today.
Let's focus on answering three concrete questions:
I don't consider myself an expert in either crate, but the biggest potential pitfall that I see when comparing glob to globset is that the latter relies on the complete regex crate, and transitively all of regex's dependencies. In contrast, glob is small self-contained, but appears to be less featureful than globset, and I'm unsure if it handles both Unicode and non-Unicode as well as the regex crate. Some microbenchmarks would be interesting to see as well.
Tagging @burntsushi, as the author of globset and who's familiar with the rust-lang-nursery process. And tagging @alexcrichton, who I'm guessing doesn't particularly care about this crate but who's the only active point of contact listed for this crate on crates.io. :P