rust-lang / glob

Support for matching file paths against Unix shell style patterns.
http://doc.rust-lang.org/glob
Apache License 2.0
447 stars 76 forks source link

Considering a roadmap for 1.0, or alternatively considering deprecating in favor of the "globset" crate #59

Open bstrie opened 7 years ago

bstrie commented 7 years ago

Per the original rust-lang-nursery RFC ( https://github.com/rust-lang/rfcs/blob/master/text/1242-rust-lang-crates.md ):

If the library team accepts a crate into the nursery, they are indicating an interest in ultimately advertising the crate as "a core part of Rust", and in maintaining the crate permanently. During the 0.X series in the nursery, the original crate author maintains control of the crate, approving PRs and so on, but the library subteam and broader community is expected to participate. Eventually, a nursery crate will either fail (and move to rust-lang-deprecated) or reach a point where a 1.0 release would be appropriate. If, on the other hand, a library reaches the 1.0 point, it is ready to be promoted into rust-lang proper. To do so, an RFC must be written outlining the motivation for the crate, the reasons that community ownership are important, and delving into the API design and its rationale design.

I'm not sure who the "original crate author" of glob is, as I'm guessing this crate was split out from the stdlib during the great pre-1.0 purge, but it doesn't appear that there's any real champion trying to advance this crate through the stabilization process.

I'm interested in starting a discussion as to the future of this crate. Sadly I can't find the original petition arguing why this crate belongs in the nursery in the first place, so I'm not even sure how to begin evaluating the aspirations of this crate that keep it from being 1.0 today.

Let's focus on answering three concrete questions:

  1. Does glob matching deserve to be "a core part of Rust", per RFC 1242? If not, then this repository should be moved to rust-lang-deprecated. But if so, then:
  2. Is this implementation of glob matching superior to other existing solutions that could be promoted to "official" crates (in particular, globset ( https://crates.io/crates/globset ) appears to be promising). If not, we should deprecate glob and consider promoting the alternative solution to rust-lang-nursery. But if so, then:
  3. Does anyone care enough about this crate to push it over the finish line? If so, then they should produce a prospective 1.0 RFC. But if not, then I think it sets a bad precedent to allow a crate to languish in the nursery forever, and this crate should be deprecated until such time as it finds a champion.

I don't consider myself an expert in either crate, but the biggest potential pitfall that I see when comparing glob to globset is that the latter relies on the complete regex crate, and transitively all of regex's dependencies. In contrast, glob is small self-contained, but appears to be less featureful than globset, and I'm unsure if it handles both Unicode and non-Unicode as well as the regex crate. Some microbenchmarks would be interesting to see as well.

Tagging @burntsushi, as the author of globset and who's familiar with the rust-lang-nursery process. And tagging @alexcrichton, who I'm guessing doesn't particularly care about this crate but who's the only active point of contact listed for this crate on crates.io. :P

BurntSushi commented 7 years ago

OK, I'm just going throw a bunch of stuff against the wall and see what sticks.

The primary motivation behind globset is in the name itself: I need a way to match many (possibly hundreds) of globs against a single file path very quickly. The naive solution of iterating over each glob and checking if it matches simply doesn't fly (the time complexity itself is not good enough). For that reason, and because RegexSet is a thing, it made a lot of sense to do the "convert the glob to a regex" dance. However, it doesn't stop there. It turns out that the regex engine doesn't optimize common globs as well as it could, so I ended up developing a set of matching strategies that make glob matching even faster. I dare say that most globs don't touch the regex engine at all. But when they do, they're still reasonably fast.

For that reason alone, you might consider globset to be relatively niche. It's important and the performance matters, but perhaps it's not common to need that kind of performance at that kind of scale.

With that said, it turns out that if you're implementing glob matching for a set of globs, it's also pretty easy to implement straight-forward single-glob matching as well. So globset provides that. I also made sure to use most (all?) of the test suite from the glob crate inside of globset (and then some), so the functionality should be roughly equivalent.

Tangentially, the glob crate can't match file paths that don't fit into a &str, and this turns out to be a real problem that people face in practice (there are some issues on the ripgrep repo for it, IIRC). I'm not familiar enough with the implementation in glob to know how easy or hard it is to fix that, but from a brief glance, it looks like the matcher pretty much assumes valid UTF-8 in that it's based on char. So it would probably require some sizeable attention to fix that.

I personally don't have the bandwidth to fix the glob crate any time soon. The globset crate is something that must exist for ripgrep to be as fast as it is, so regardless of what we do with glob, I think it's somewhat safe to say that globset will continue to work. However, here are some caveats:

  1. Since globset uses regex, as @bstrie points out, using it requires pulling in regex. regex has (perhaps deservedly) gotten a reputation for being a big dep because it takes a long time to compile.
  2. Since it uses regex, there is some overhead associated with compiling the glob since it requires compiling a regex. It's probably more expensive to compile a regex than it is to compile a glob in the glob crate, although I haven't benchmarked it.
  3. In my incomplete experiments, the matching performance of globset for a single glob is roughly on par with the glob crate, and tends to get better as the paths get longer.
  4. globset doesn't (yet) have an analogous glob function that returns an iterator over paths that match.
  5. globset doesn't support the require_literal_leading_dot option, mostly because it seemed a little tricky to implement, but I honestly haven't given it much thought.

4 and 5 seem like things that can be fixed. 1 and 2 are things that are probably just facts of life for the globset crate. 3 doesn't seem like a blocker.

So I think there's a key trade off here... The globset crate brings significantly more implementation complexity and larger compile times, but also fixes some of shortcomings of the glob crate as it is.

bstrie commented 7 years ago

@BurntSushi, Thanks for the feedback!

I personally don't have the bandwidth to fix the glob crate any time soon.

Indeed, I absolutely did not want to give the impression that I wanted to saddle you with any more work. :P Ideally I'd like our crate-elevation process to distribute the work among interested contributors by promising long-term relevance and the chance to contribute to a strategic dependency of the Rust ecosystem. It's just that to bootstrap that process we need somebody who gives a care, and I don't know that we have any such person for glob (and no document arguing why this might be strategically important to begin with, to inspire someone to step up).

alexcrichton commented 7 years ago

I would personally be on board transitioning community mindshare to globset. The size of the dependency is a downside, but it's not unique to globset (applies to many other crates as well). I think we've got various mitigation strategies here as well such as global caching, incremental compilation, etc.

If we'd like to retain the name glob it seems like we could "merge repos and bump major version" or somehow just transfer ownership of the name as well. I also don't mind just "deprecating" glob, in favor of globset.

Overall I don't have too many opinions, unfortunately. There's been very little activity on this repo over time, although I don't know if that's an indicator of "it's good enough" or "it's not used".

Keats commented 7 years ago

I was just looking for how to do *.{foo, bar} with glob so I would personally welcome using globset. I don't mind the compilation time/extra cost if I can get a more powerful globbing mechanism. If it is an issue, the previous versions are still on crates.io.

I think the glob name is better than globset.

Keats commented 6 years ago

Any update on that? Should a post be created on the internals/users forum?

BurntSushi commented 6 years ago

@Keats I think everything I said above is still true. Notably, the globset crate is still missing a couple of pieces of functionality that exist in the glob crate.

KodrAus commented 6 years ago

There's been some activity in the last few months, https://github.com/BurntSushi/ripgrep/pull/765 has lead to the emergence of the globwalk crate to provide a glob function over the globset API.

Perhaps once that's had time to bake we could look at deprecating glob in favour of globset + globwalk?

cc @Gilnaa

rivy commented 5 years ago

It looks like globwalk, just after v0.2.0, moved away from globset to using ignore in mid-April 2018 (see Gilnaa/globwalk@551ca14da218c5e507a90303e8ab6009b1b564d5). That, unfortunately, makes the discussion even more confusing.

Unfortunately, by globwalk moving to using ignore, windows users are generally being left out. The new combination doesn't support case-INsensitivity (which is the windows default).

It'd be nice to have a full featured, portable, well-tested library for globbing. I'm currently using wild, which depends on the older, semi-deprecated glob, to add globbing for windows CLI apps (see uutils/coreutils#1281). Unfortunately, there are some rough edges. And, informed by this discussion, I can't find a clear path forward to contribute to fixing the issues.

So anyway, my current vote would be to move forward in the globset / globwalk direction.

If someone starts pushing forward in a single direction, I'm happy to help with Windows testing (my coding skills in the rust environment are too naive for anything else right now).

BurntSushi commented 5 years ago

@rivy The ignore crate supports case insensitive globbing: https://docs.rs/ignore/0.4.3/ignore/overrides/struct.OverrideBuilder.html#method.case_insensitive

Gilnaa commented 5 years ago

I'll make sure this option is available through the GlobWalk API. Is it reasonable for this to be on by default on windows, or is it too surprising.

BurntSushi commented 5 years ago

@Gilnaa I'm not sure about the default on Windows. I think you'll probably want to look at what other glob walkers do. Worst case, start conservative (case sensitive with opt-in case insensitive), and move to case insensitive by default if you get a lot of feedback asking for it.

rivy commented 5 years ago

@Gilnaa , from what I've seen / used, most libraries default to case sensitive matching.

However, when compiling applications for Windows targets, to minimize surprise for Windows users, those applications should default to utilizing globwalk matching in a case-INsensitive manner. I do think it's fine to leave the onus of setting that up onto the application using globwalk.

So, IMO, as long as there's an easy way to include/use the library with case-sensitivity set OFF, and an easy way to toggle case-sensitivity, I'd leave the basic/normal default to case-sensitivity being ON.

Thanks for working on it!

Gilnaa commented 5 years ago

@rivy Sorry for keeping quiet, I've added support a few weeks ago https://github.com/Gilnaa/globwalk/commit/53fafd9442d76db16e0568cd71244f104603b1f6

Keats commented 5 years ago

However, when compiling applications for Windows targets, to minimize surprise for Windows users, those applications should default to utilizing globwalk matching in a case-INsensitive manner. I do think it's fine to leave the onus of setting that up onto the application using globwalk.

That would mean every person using globwalk needs to be aware of that, which isn't going to be the case. This default should be in globwalk directly and applications should be able to override it if necessary imo.

KodrAus commented 5 years ago

As one example, I believe .NET's Directory.EnumerateFiles API will use a platform-specific default. So it's case-insensitive on Windows and case-sensitive on Linux.

In my opinion, the current always case insensitive by default everywhere is totally fine. Different usecases can call for different case sensitivity irrespective of the platform, so I usually find myself explicitly needing one way or the other anyway.

BurntSushi commented 4 years ago

OK, so my plan at the moment---assuming all goes according to plan---is to rewrite globset. In particular, it will continue to support the use case of matching a set of globs, but I plan to make the dependency on regex optional. As in, a default "slow" matcher will be provided, but one can enable a perf feature (or similar) to bring regex in for faster glob matching. I also plan to:

One possible alternative to the above approach that I'd be willing to do is to take over glob itself, do what I said above, but for glob, not globset, and then deprecate globset. This way, we'd have one globbing library that everyone can centralize around. However, it would mean a total and complete break from the current glob crate as it exists today. My sense, though, is that folks will be fine with this because the crate is pretty much limping along right now. (And that's only thanks to @KodrAus's generous maintenance effort. <3)

cc'ing some of the library team for their thoughts @alexcrichton @KodrAus @dtolnay @SimonSapin @sfackler

alexcrichton commented 4 years ago

That all sounds fantastic to me, and I think it's fine to move glob to 1.0 with a breaking change

BurntSushi commented 4 years ago

I'd probably do a 0.4 release first if that's okay, in order to make sure whatever API I come up with is tested a bit before committing.

KodrAus commented 4 years ago

One possible alternative to the above approach that I'd be willing to do is to take over glob itself, do what I said above, but for glob, not globset, and then deprecate globset.

Personally, I like the idea of making the most of the glob name that we've already got here so an a fan of this path :+1: If we went this way would we want to work in this repository? If so maybe we could create a 0.3.x branch for the current source and revamp master? I'd be keen to get on board with stuff like supporting docs for migrating from the current glob to wherever we land if you think it'd be helpful!

BurntSushi commented 4 years ago

I think that sounds reasonableish. My plan is to just do what I said above, and then dump it on to master in this repo and put out a release shortly thereafter. At that point, we can create a branch for the old 0.3.x release, although I'm not sure how much utility that serves. I'm not sure it's worth the maintenance time and effort to continue supporting 0.3.x.

I'd be keen to get on board with stuff like supporting docs for migrating from the current glob to wherever we land if you think it'd be helpful!

Sounds good, but hopefully we shouldn't need too much here. It's going to be a complete break from the current API, so folks will probably just want to treat it as if it's a new crate and re-familiarize themselves with the new API. We can of course provide some common examples of how to translate old code. Although I mostly expect the top-level glob function to remain!

pablosichert commented 3 years ago

Hey @BurntSushi, thanks a lot for the globset/walkdir crates and laying out a plan to move forward with the glob crate.

I wonder if you had time work on the rewrite of globset and could give an update on the progress? If you are already at a point where more incremental changes can be made, I would be happy to pick up outstanding tasks.

For Vector, we would like to add a glob-based file walker and contribute to upstream projects where suitable/necessary. Since the situation seems to be spread out between the glob, globset and globwalk crates, it would be nice to know where it would be best to direct our efforts.

Judging from https://github.com/BurntSushi/ripgrep/pull/765#issuecomment-361419059, https://github.com/Gilnaa/globwalk/issues/28#issuecomment-742099252 and this very issue it seems like converging all functionality into globset and replacing glob with that would still be the optimal course of action.