Open fenollp opened 5 years ago
After time
and json
,
time https://github.com/google/starlark-go/tree/wip-skylark-time json https://github.com/google/starlark-go/tree/json
regexp
is the most requested package for the Starlark "standard library". No-one has sketched a prototype yet, but it shouldn't be hard to come up with a sound Starlark API for regular expressions that can be backed by the Go, Java, and C++ implementations of RE2, which is linear time (non-backtracking; see github.com/google/re2), though we will need to disallow \C
.
I imagine the basic API would look something like the Go regexp
API:
load("regexp")
rx = regexp.compile("b(an)*a")
rx.matches(str src) # returns bool
rx.find(str src) # returns text of match
rx.find_all(str src, int max) # returns list of matches
rx.replace_all(str src , str or function) # returns replaced text
rx.find_submatches(str) # returns list of complete match and each subpattern match
We might also want variants that return the indices (not text) of each match.
Any update on this, yet?
Sorry, no.
For what it's worth, I stumbled on https://godoc.org/github.com/qri-io/starlib/re
@b5 What about this one? Ready for it? :-)
I think this would be great to get to attack next given the roll we're on with the time
and math
packages, but the design & implementation of any regex package or language level support requires more of upfront design discussion:
regexp
or python's re
?My answers to these questions:
re
style, but I think sitting down & working through a minimal design draft is in order.regexp
would be a mistake. But with the choice to use regexp
it's important to document any differences for users coming from python.That said, I'd love to get input from others. Would a re
package suit your needs? Any reason we should instead be dedicating time to writing a regex proposal for language inclusion instead?
@alandonovan, would you accept a package if we could get one to pass muster? What's your bandwidth like for either review and/or design input?
🤦 I literally hadn't read the comments above where @adonovan laying out answers to all of my questions. Ok @essobedo I think we can take a stab at implementing the prototype laid out above & get a PR going from there.
I actually implemented this for the Java version using re2. It works really well for our use case. You can see it here:
It exposed a bug in re2j implementation (PR is coming soon-ish).
I actually implemented this for the Java version using re2. It works really well for our use case. You can see it here: ...
Interesting, thanks.
It's funny, I wrote RE2/J a decade back for use in the regexp package of a Google in-house embedded configuration language that had implementations in multiple languages and needed a consistent regexp interface. That description now applies equally to Starlark, and it's using RE2/J for the same reason.
BTW, although RE2/J is asymptotically faster than j.u.regex, in practice it's usually significantly slower. Not that that really matters here.
It appear that you are using Starlark as a "secure" embedded language within your VGS product. By design, Starlark is hermetic in many ways, but we've never claimed that it is secure for running untrusted code. Scripts can easily cause denial of service by exhausting all memory, or by hash flooding. Also, RE2/J can be induced to run out of stack when parsing a deep expression such as "(((...(((". (The Go version is somewhat less susceptible to such attacks because Go stacks can grow 100x larger than JVM threads.)
Scripts can easily cause denial of service by exhausting all memory, or by hash flooding
Yes, I'm aware of this (actually noticed this discussion re: hash flooding in the Fnv32 hash function discussions in the bytes PR :D). We've enlisted Trail of Bits for a code audit (and we'll be happy to contribute whatever findings upstream) and what you've mentioned so far was on our roadmap. This isn't in production yet, not until we've checked all the bells and whistles, but it's actually a pretty neat idea if it works for what we want to use it for.
Also, RE2/J can be induced to run out of stack when parsing a deep expression such as "(((...(((". (The Go version is somewhat less susceptible to such attacks because Go stacks can grow 100x larger than JVM threads.)
I was not aware of this problem, that's good to know, thank you!
@b5 please, let me know if you cannot work on this feature because if so, no worries, I can do it alone this time 😄
@essobedo sounds great! I'd be delighted to jump on peer review if you can write the implementation! I think it'll get done much quicker that way 😊
@b5 If so, may I rely on what you have done for https://github.com/qri-io/starlib/commits/master/re?
Some preliminary questions:
\C
) are not supported, since these are infeasible to implement in Java or other languages whose strings are encoded as UTF-16.@b5 If so, may I rely on what you have done for https://github.com/qri-io/starlib/commits/master/re?
Absolutely I see two ways to get started:
Either works for me, just let me know which you'd prefer @essobedo.
@adonovan:
What regexp syntax are implementations to use? I think the only sensible answer, for reasons of portability and asymptotic efficiency, is RE2. We should link to the RE2 Syntax page.
Very much agreed here. The PR we file should use RE2 syntax and link to current spec in package documentation.
We should also make clear that byte-oriented features (i.e. just \C) are not supported, since these are infeasible to implement in Java or other languages whose strings are encoded as UTF-16.
Noted. Bonus points if we can provide a clear error when a user attempts to construct a regex with \C
.
Should we offer separate compile/match operations?
I think we should go the interpreted language route, at least at first. It's a non-breaking change to add regex compilation later. Python's re
package offers compile
.
My question for you @adonovan: can we compile expressions on .Freeze()
?
Do we need to offer variants of the match function that return indices, not substrings? What does Python do here?
I need to do a little more digging on this & report back. re.Match
objects (the result of calling re.match
) can be addressed by index. Your initial package API from above in this thread does not spec any kind of Match
return type, which may allow us to sidestep the issue at first?
We should offer a function to quote RE metacharacters so that quote(s) is a regexp matching exactly and only s.
Noted!
Another thing to consider long-term is a cache of compiled regular expressions, my question here is more about where that should be stored, and establishing a policy on how packages should handle something like a cache. I'm assuming this should be stored on the starlark thread. I don't think this is needed for an initial implementation, but would be great to get some initial guidance
Absolutely I see two ways to get started:
- If you'd like to file a PR into github.com/qri-io/starlib I can promise a through review there before we open a PR into this repo.
- We can use the same workflow we've been using, I'll open an initial PR that you can take over.
The way #2 would be great if it is possible for you
can we compile expressions on .Freeze()?
I'm not sure I understand what you're getting at. Patterns are immutable, so their freeze method is a no-op. And compilation can fail, so you need to report the error immediately. Also, you can compile and run a pattern in the same module, before freeze comes into play.
Your confusion answers my question 😄. I was having trouble building a mental model of the point where compilation is happening
Before moving forward, I think we should double check that this is the API we'd like to target for an initial package. This is @adonovan's suggested API, and the one I think we should go with:
load("regexp")
rx = regexp.compile("b(an)*a")
rx.matches(str src) # returns bool
rx.find(str src) # returns text of match
rx.find_all(str src, int max) # returns list of matches
rx.replace_all(str src , str or function) # returns replaced text
rx.find_submatches(str) # returns list of complete match and each subpattern match
I made a mistake in my earlier comment regarding compile/match
. I should have said that I think se can slip the match operations at first, and just use this compile-only API to start with.
This has many parts:
One can definitely implement the first 2 with minimum friction, but as I find Starlark's strength to be in its Turing-incompleteness guarantees I am wondering if this was / is being discussed as an addition to the language.