rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0

add no_std regex, which depends on alloc #476

Closed BurntSushi closed 1 year ago

BurntSushi commented 6 years ago

There has been some interest in putting out a version of regex that doesn't depend on std itself, but instead depends just on alloc. This is within reach because regex already doesn't rely too much on platform specific details, and mostly just depends on dynamic memory allocation. There are however some parts of the regex API that will need to be tweaked. For example, regex uses std::io::Error, which isn't available in alloc. This is why regex 1.0 got a use_std/std feature. Namely, compiling regex without that feature fails today. This will allow us to change the semantics of that compilation mode without breaking backwards compatibility.
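To make the point about std-only API surface concrete, here is a minimal sketch of an error type that builds from core alone and only implements the std error trait when a std feature is enabled. The type PatternError, its variants, the check function, and the feature name "std" are all hypothetical and chosen for illustration; this is not regex's actual error type.

```rust
use core::fmt;

/// Hypothetical error type; the variants are illustrative, not regex's own.
#[derive(Debug, PartialEq)]
pub enum PatternError {
    Empty,
    TooLong(usize),
}

// `Display` lives in `core::fmt`, so this impl needs neither std nor alloc.
impl fmt::Display for PatternError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PatternError::Empty => write!(f, "pattern is empty"),
            PatternError::TooLong(n) => {
                write!(f, "pattern is {} bytes, over the limit", n)
            }
        }
    }
}

// Only the std-specific trait impl is gated (assumed feature name: "std").
#[cfg(feature = "std")]
impl std::error::Error for PatternError {}

/// Hypothetical validation routine that works identically in both modes.
pub fn check(pattern: &str, limit: usize) -> Result<(), PatternError> {
    if pattern.is_empty() {
        Err(PatternError::Empty)
    } else if pattern.len() > limit {
        Err(PatternError::TooLong(pattern.len()))
    } else {
        Ok(())
    }
}
```

The idea is that callers see the same type either way, and only the std-only trait impl disappears when the feature is off.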

@ZackPierce has been diligently adding support for this by starting with regex's dependencies. So far:

I thought it would be good to track this issue at a higher level so we can discuss a game plan. I'd also like to share some of my thoughts/constraints on the process.

I basically think that we should do this. What I am unsure of is the timeline. For the most part, my own personal maintenance bandwidth is very limited, and to this end, I've generally avoided nightly-only support of things. (I've made some exceptions. Support for things like Pattern happened before I knew better, and support for SIMD happened because I am very excited about it and got involved with the SIMD stabilization effort.) Namely, I cannot and will not be beholden to nightly breakages because I simply can't keep up. To that end, I would like to know more about what the plan is for no_std environments. Is the alloc crate setup generally where we think we're going? If so, and if it remains relatively stable, I think I could get on board with this relatively soon.

It's also worth saying that some changes are simpler than others. For example, making utf8-ranges compatible with no_std is pretty reasonable, but the aho-corasick changes are quite a bit broader. I have concerns over peppering conditional compilation everywhere, and I think those sorts of things are very hard to maintain. I'm hopeful we can find a better way. This gets worse when the public API is impacted. The complexity and maintenance burden goes way up. This is apparently so bad that it was worth adding a new dependency (cfg-if) that must be paid for by everyone just to support the no_std users. I'm not especially excited about that, particularly if the trend continues.

For the most part, I really wasn't intending on tackling this feature until more stuff stabilized. But I wanted to get my thoughts out there so that there are no surprises.

I welcome other thoughts on the matter!

ZackPierce commented 6 years ago

My rough prediction is that the pattern of an alloc-like facade crate is not likely going to leave the mainstream of Rust no_std + heap development for general-purpose libraries. Why? Because even though it presently requires conditional imports, it doesn't mandate threading allocator type signatures or otherwise significantly impacting API design.
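As a rough sketch of what that facade pattern tends to look like in a library: the imports are conditional, but the public signature never mentions an allocator. The function split_words and the feature name "std" are assumptions made for illustration.

```rust
// Without the (assumed) "std" feature, heap types come from the alloc crate.
#[cfg(not(feature = "std"))]
extern crate alloc;

// The same two names are imported from whichever crate is available.
#[cfg(feature = "std")]
use std::{string::String, vec::Vec};
#[cfg(not(feature = "std"))]
use alloc::{string::String, vec::Vec};

/// Hypothetical helper: note that no allocator type appears in the signature.
pub fn split_words(input: &str) -> Vec<String> {
    input.split_whitespace().map(String::from).collect()
}
```

The duplicated use lines are the cost being discussed; the API itself is unchanged between the two modes.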

The details of implementing a global allocator that can be used with the global facade pattern are definitely in flux, but I've not seen dramatic change in how general-purpose libraries consume them recently.

As @BurntSushi rightly pointed out, the primary hassle here is maintaining all of those conditional-compilation flags.

I'll explore some more to see if there are any other tricks available for reducing their proliferation, and tear out cfg-if while I'm at it, if adding that macro is indeed too burdensome. Perhaps making heavier use of extern crate std as core could help. I'm definitely open to more ideas.

ZackPierce commented 6 years ago

Some good news is that the approach speculated about in my prior comment seems to have paid some dividends, as seen by comparing the first and second commits of the no_std PR to regex-syntax. The original approach added ~230 lines; the latest approach adds only ~100.

Removing the cfg-if dependency and adding the declaration extern crate std as core; allowed unconditional imports of items found in both std and core (e.g. use core::mem). Thus, the number of duplicate imports differing only in crate prefix went way down.
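A small sketch of that aliasing trick, under the assumption that the feature is named "std" and that the declaration sits in the crate root. The function normalize is hypothetical. With the feature off, core is the real core crate; with it on, core:: paths would resolve to std, so the use line is written once either way.

```rust
// Assumed to live in the crate root: in a std build, `core` becomes an
// alias for `std`, so `core::` paths below resolve in both configurations.
#[cfg(feature = "std")]
extern crate std as core;

// A single unconditional import instead of one per configuration.
use core::mem;

/// Hypothetical helper: puts a byte range in order and returns its length.
pub fn normalize(start: &mut u8, end: &mut u8) -> u32 {
    if *start > *end {
        mem::swap(start, end);
    }
    u32::from(*end - *start) + 1
}
```

Items that exist only in alloc or std (Vec, String, and so on) would still need a conditional import; the alias only helps for the overlap between std and core.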

BurntSushi commented 6 years ago

Nice. :) To clarify, I could probably stomach cfg-if itself. It is widely used and I trust its caretaker. But I try hard to be conservative here, especially with regex since it is so widely used. In my experience maintaining things, dependencies have generally become a liability. The only reason regex has as many dependencies as it does is because most of them would need to exist internally anyway, and it makes sense to expose them for others to benefit from.

ZackPierce commented 6 years ago

For completeness, I asked around the portability working group about the state of viable alternatives, and was pointed in the direction of the following documents:

The described end-state of thorough and flexible capability-aware portability is extremely appealing. It would encourage a consistent approach to configuration across libraries and probably reduce the redundantly-retargeted-use clauses to near zero.

That said, the overall approach of using human-applied cfg flags to tag and track the suitability of various portions of the codebase for targeted use cases and manage imports seems like it would be largely similar.

The working group seems to be in "design and ground-work" phase, clearing out various obstacles but not yet tackling the primary implementation. Thus, at present, I find it difficult to estimate the timeline involved before the vision approaches realization.

ZackPierce commented 6 years ago

After looking around a bit more, it seems like in some ways the community is voting with its commits.

Two other relatively high-profile projects in the ecosystem, rand and nom, appear to be moving forward with the "std"-as-default-feature, "alloc"-as-optional-feature approach.

This gives me some confidence that either the strategy has momentum, or the projected maintenance burden for that pattern is tolerable.

Perhaps @Geal or @pitdicker or @dhardy wouldn't mind commenting?

pitdicker commented 6 years ago

Not sure what the exact issue is to reply to, but I'll try writing something.

I think supporting no_std certainly added quite some trouble for Rand. And it is not great yet in my opinion, as important functionality is simply not available.

One thing that does help is that Rand doesn't really require allocations. Adding the alloc feature was easy, compared to error handling, having no thread-local storage (still unsolved), no easy OS interface, and no floating point math functions.

dhardy commented 6 years ago

Thanks for all the references @ZackPierce; Aaron's "portability vision" article is why I suggested importing from std where possible in #477.

The size of #477 could probably be smaller still in my opinion, except some more changes may be needed in tests.

As @pitdicker mentions, as a result of this you end up with two test suites to run, i.e. cargo test and cargo test --no-default-features (plus cargo test --benches and maybe more).

CI doesn't need to be as complex as we have in Rand, though if you care about testing several different platforms you may want to take notes from Rand's Travis configuration.

Centril commented 6 years ago

@ZackPierce recently added no_std + alloc support in https://github.com/AltSysrq/proptest/pull/48. However, the crate depends on regex-syntax for some of its nice features. But since regex-syntax doesn't work without std, those features can't be used with no_std + alloc for proptest.

I'd like to replicate Zack's work in regex-syntax specifically. A preliminary analysis tells me that most of that crate's imports only use core, and so the changes won't need to be extensive.

I'll start working on a PR to this end. :)

ZackPierce commented 6 years ago

@Centril Thanks for the interest. There is already a PR doing this for regex-syntax. https://github.com/rust-lang/regex/pull/477

I've been hoping to get the chance to incorporate the latest recommendations for improvement of that PR. With luck, that'll happen tonight.

ZackPierce commented 6 years ago

Thank you for chiming in with your experiences and suggestions, @dhardy, @pitdicker, and @Centril.

As of the latest updates based on those ideas, the exemplar PR to the regex-syntax crate now changes roughly 25% as much of the extant, operational portion of the codebase as the first attempt did. That seems like a pretty respectable reduction in the cognitive load of maintenance to me.

BurntSushi commented 6 years ago

One other thing I thought about here is the regex crate's use of the thread_local crate for cheap, synchronized, dynamic thread locals for caching regex match data. Making thread_local support a no_std mode (that is perhaps slower) would probably be the best path, but that looks non-trivial to me, and also fairly subtle. Another approach might be to figure out a simple alternative in the regex crate itself, even if it is not as optimal as what thread_local does.
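To illustrate what a "simple alternative" could look like, here is a minimal sketch of a pool of reusable scratch values guarded by a spin lock, using only core and alloc. This is an assumption-laden illustration, not what regex or thread_local does: the Pool type and its methods are hypothetical, and a spin lock can behave badly under contention or in interrupt contexts.

```rust
extern crate alloc;

use alloc::vec::Vec;
use core::cell::UnsafeCell;
use core::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical pool of reusable scratch values, guarded by a spin lock.
pub struct Pool<T> {
    locked: AtomicBool,
    stack: UnsafeCell<Vec<T>>,
    create: fn() -> T,
}

// Safety: `stack` is only accessed while `locked` is held (see `with_stack`).
unsafe impl<T: Send> Sync for Pool<T> {}

impl<T> Pool<T> {
    pub fn new(create: fn() -> T) -> Pool<T> {
        Pool {
            locked: AtomicBool::new(false),
            stack: UnsafeCell::new(Vec::new()),
            create,
        }
    }

    /// Runs `f` with exclusive access to the stack of idle values.
    fn with_stack<R>(&self, f: impl FnOnce(&mut Vec<T>) -> R) -> R {
        // Spin until the flag flips from false to true.
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            core::hint::spin_loop();
        }
        // Safety: the lock is held, so no other reference to the Vec exists.
        let result = f(unsafe { &mut *self.stack.get() });
        self.locked.store(false, Ordering::Release);
        result
    }

    /// Takes an idle value, or creates a fresh one if none is available.
    pub fn get(&self) -> T {
        self.with_stack(|stack| stack.pop())
            .unwrap_or_else(|| (self.create)())
    }

    /// Returns a value to the pool so a later `get` can reuse it.
    pub fn put(&self, value: T) {
        self.with_stack(|stack| stack.push(value));
    }
}
```

A caller would get a cache before a search and put it back afterwards. Unlike a thread local, every get and put contends on one lock, which is the "not as optimal" part of the trade-off.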

BurntSushi commented 6 years ago

If anyone has specific application oriented use cases for regex in no_std w/ alloc, now would be a good time to elaborate on them here: https://github.com/rust-lang/rfcs/pull/2480#issuecomment-401930667

(I don't have any application oriented use cases myself. My goal here was to service others.)

Centril commented 6 years ago

(Idk if it counts as an application oriented use case or not, but proptest could benefit from making regex-syntax dependent functionality available to no_std + alloc users)

BurntSushi commented 6 years ago

@Centril I think I would just ask you to push the question forward, since proptest is itself a library. Who are the people specifically benefiting from proptest in no_std + alloc environments? What are their use cases? Constraints?

Centril commented 6 years ago

@BurntSushi I redirect to @ZackPierce since they introduced the no_std + alloc support to proptest :)

Some relevant discussion: https://github.com/AltSysrq/proptest/issues/47

harryfei commented 6 years ago

@BurntSushi We use Rust to write Windows kernel drivers in our product. We want to use regex to build our string-matching rules. RegexSet is what we want to use.

BurntSushi commented 6 years ago

Sorry, but that seems off topic for this issue? I don't understand what question, if any, you're asking me. If you need help with something, then please open a new issue and provide as much detail as possible about the problem you're trying to solve and your constraints.

harryfei commented 6 years ago

Sorry, I ought to have made it clearer.

I just described our use case for using regex in no_std. :smile:

BurntSushi commented 6 years ago

@harryfei Sorry, but I'm going to need a lot more details than that. I'm not a Windows programmer, and certainly have no experience with Windows kernel driver development, so I don't understand what your constraints are.

dhardy commented 6 years ago

Presumably the only relevant constraint is whether or not you have alloc?

harryfei commented 6 years ago

In kernel driver development, we must use no_std. Because the OS syscalls available in user mode don't exist there, many std functions can't be used (just as on embedded systems). We can use the regex crate only if it supports no_std.

BurntSushi commented 6 years ago

@harryfei Thanks for elaborating. I think you'll want to monitor https://github.com/rust-lang/rfcs/pull/2480. Once it stabilizes, then this is something I'd be willing to more aggressively pursue.

hargoniX commented 5 years ago

@harryfei @BurntSushi The alloc crate was stabilized in the latest rustc release, so it might be worth pursuing this further, as you mentioned before.

BurntSushi commented 5 years ago

Yes I know. It will likely be a while before I look into it. regex has a conservative MSRV.

Centril commented 5 years ago

> Yes I know. It will likely be a while before I look into it. regex has a conservative MSRV.

Would it not be possible to use build.rs to conditionally depend on extern crate alloc;?

BurntSushi commented 5 years ago

Yes. regex already does that for things like SIMD. The key concern there is how complex it will make the code. If it's not crazy, then I'd definitely be up for conditionally enabling it.
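For readers unfamiliar with the build.rs approach, the following is a minimal sketch of version detection that could drive such a conditional. The function names and the cfg name regex_has_alloc are hypothetical; the only facts relied on are that Cargo sets the RUSTC environment variable for build scripts, that printing cargo:rustc-cfg=NAME enables a cfg, and that the alloc crate became stable in Rust 1.36.

```rust
use std::env;
use std::process::Command;

/// Extracts the minor version from `rustc --version` output,
/// e.g. "rustc 1.36.0 ..." yields Some(36).
fn parse_minor(version_output: &str) -> Option<u32> {
    let version = version_output.split_whitespace().nth(1)?;
    let mut parts = version.split('.');
    if parts.next()? != "1" {
        return None;
    }
    parts.next()?.parse().ok()
}

/// The alloc crate is usable on stable starting with Rust 1.36.
fn has_stable_alloc(minor: u32) -> bool {
    minor >= 36
}

/// Would be called from `fn main()` in a hypothetical build.rs.
fn emit_alloc_cfg() {
    // Cargo sets RUSTC for build scripts; fall back to plain "rustc".
    let rustc = env::var("RUSTC").unwrap_or_else(|_| "rustc".to_string());
    let output = match Command::new(rustc).arg("--version").output() {
        Ok(output) => output,
        Err(_) => return, // Unknown compiler: leave the cfg off.
    };
    let text = String::from_utf8_lossy(&output.stdout);
    if parse_minor(&text).map_or(false, has_stable_alloc) {
        // Hypothetical cfg name, checked in the library with
        // `#[cfg(regex_has_alloc)] extern crate alloc;`.
        println!("cargo:rustc-cfg=regex_has_alloc");
    }
}
```

On older compilers the cfg is simply never set, so the alloc-only code path is compiled out and the MSRV is unaffected. The complexity concern raised above applies to how many places in the library then need to check that cfg.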