google / starlark-go

Starlark in Go: the Starlark configuration language, implemented in Go
BSD 3-Clause "New" or "Revised" License
2.35k stars 212 forks source link

Any plan on adding some support for regular expressions to the language? #241

Open fenollp opened 5 years ago

fenollp commented 5 years ago

This has many parts:

One can definitely implement the first 2 with minimum friction, but as I find Starlark's strength to be in its Turing-incompleteness guarantees I am wondering if this was / is being discussed as an addition to the language.

alandonovan commented 5 years ago

After time and json,

time https://github.com/google/starlark-go/tree/wip-skylark-time json https://github.com/google/starlark-go/tree/json

regexp is the most requested package for the Starlark "standard library". No-one has sketched a prototype yet, but it shouldn't be hard to come up with a sound Starlark API for regular expressions that can be backed by the Go, Java, and C++ implementations of RE2, which is linear time (non-backtracking; see github.com/google/re2), though we will need to disallow \C.

I imagine the basic API would look something like the Go regexp API:

   load("regexp")
   rx = regexp.compile("b(an)*a")
   rx.matches(str src) # returns bool
   rx.find(str src) # returns text of match
   rx.find_all(str src, int max) # returns list of matches
   rx.replace_all(str src , str or function) # returns replaced text
   rx.find_submatches(str) # returns list of complete match and each subpattern match

We might also want variants that return the indices (not text) of each match.

bobjackman commented 4 years ago

Any update on this, yet?

alandonovan commented 4 years ago

Sorry, no.

tv42 commented 4 years ago

For what it's worth, I stumbled on https://godoc.org/github.com/qri-io/starlib/re

essobedo commented 3 years ago

@b5 What about this one? Ready for it? :-)

b5 commented 3 years ago

I think this would be great to get to attack next given the roll we're on with the time and math packages, but the design & implementation of any regex package or language level support requires more of upfront design discussion:

  1. do we write this a package? write a proposal for inclusion in the language? both?
  2. If we're designing a package, does it take inspiration from go's regexp or python's re?
  3. What's providing the underlying regex parser in go?

My answers to these questions:

  1. It should be a package. We should avoid introducing overhead into the language, even if it adds parity to python. I'm thinking specifically of the cost of implementing a starlark interpreter in another language that doesn't have easy support for RE2-style regex syntax. Having it as a package makes it easier to opt out.
  2. Our initial regex implementation follows the python re style, but I think sitting down & working through a minimal design draft is in order.
  3. In my opinion using anything other than go's regexp would be a mistake. But with the choice to use regexp it's important to document any differences for users coming from python.

That said, I'd love to get input from others. Would a re package suit your needs? Any reason we should instead be dedicating time to writing a regex proposal for language inclusion instead?

@alandonovan, would you accept a package if we could get one to pass muster? What's your bandwidth like for either review and/or design input?

b5 commented 3 years ago

🤦 I literally hadn't read the comments above where @adonovan laying out answers to all of my questions. Ok @essobedo I think we can take a stab at implementing the prototype laid out above & get a PR going from there.

mahmoudimus commented 3 years ago

I actually implemented this for the Java version using re2. It works really well for our use case. You can see it here:

It exposed a bug in re2j implementation (PR is coming soon-ish).

adonovan commented 3 years ago

I actually implemented this for the Java version using re2. It works really well for our use case. You can see it here: ...

Interesting, thanks.

It's funny, I wrote RE2/J a decade back for use in the regexp package of a Google in-house embedded configuration language that had implementations in multiple languages and needed a consistent regexp interface. That description now applies equally to Starlark, and it's using RE2/J for the same reason.

BTW, although RE2/J is asymptotically faster than j.u.regex, in practice it's usually significantly slower. Not that that really matters here.

It appear that you are using Starlark as a "secure" embedded language within your VGS product. By design, Starlark is hermetic in many ways, but we've never claimed that it is secure for running untrusted code. Scripts can easily cause denial of service by exhausting all memory, or by hash flooding. Also, RE2/J can be induced to run out of stack when parsing a deep expression such as "(((...(((". (The Go version is somewhat less susceptible to such attacks because Go stacks can grow 100x larger than JVM threads.)

mahmoudimus commented 3 years ago

Scripts can easily cause denial of service by exhausting all memory, or by hash flooding

Yes, I'm aware of this (actually noticed this discussion re: hash flooding in the Fnv32 hash function discussions in the bytes PR :D). We've enlisted Trail of Bits for a code audit (and we'll be happy to contribute whatever findings upstream) and what you've mentioned so far was on our roadmap. This isn't in production yet, not until we've checked all the bells and whistles, but it's actually a pretty neat idea if it works for what we want to use it for.

Also, RE2/J can be induced to run out of stack when parsing a deep expression such as "(((...(((". (The Go version is somewhat less susceptible to such attacks because Go stacks can grow 100x larger than JVM threads.)

I was not aware of this problem, that's good to know, thank you!

essobedo commented 3 years ago

@b5 please, let me know if you cannot work on this feature because if so, no worries, I can do it alone this time 😄

b5 commented 3 years ago

@essobedo sounds great! I'd be delighted to jump on peer review if you can write the implementation! I think it'll get done much quicker that way 😊

essobedo commented 3 years ago

@b5 If so, may I rely on what you have done for https://github.com/qri-io/starlib/commits/master/re?

adonovan commented 3 years ago

Some preliminary questions:

b5 commented 3 years ago

@b5 If so, may I rely on what you have done for https://github.com/qri-io/starlib/commits/master/re?

Absolutely I see two ways to get started:

  1. If you'd like to file a PR into github.com/qri-io/starlib I can promise a through review there before we open a PR into this repo.
  2. We can use the same workflow we've been using, I'll open an initial PR that you can take over.

Either works for me, just let me know which you'd prefer @essobedo.

@adonovan:

What regexp syntax are implementations to use? I think the only sensible answer, for reasons of portability and asymptotic efficiency, is RE2. We should link to the RE2 Syntax page.

Very much agreed here. The PR we file should use RE2 syntax and link to current spec in package documentation.

We should also make clear that byte-oriented features (i.e. just \C) are not supported, since these are infeasible to implement in Java or other languages whose strings are encoded as UTF-16.

Noted. Bonus points if we can provide a clear error when a user attempts to construct a regex with \C.

Should we offer separate compile/match operations?

I think we should go the interpreted language route, at least at first. It's a non-breaking change to add regex compilation later. Python's re package offers compile.

My question for you @adonovan: can we compile expressions on .Freeze()?

Do we need to offer variants of the match function that return indices, not substrings? What does Python do here?

I need to do a little more digging on this & report back. re.Match objects (the result of calling re.match) can be addressed by index. Your initial package API from above in this thread does not spec any kind of Match return type, which may allow us to sidestep the issue at first?

We should offer a function to quote RE metacharacters so that quote(s) is a regexp matching exactly and only s.

Noted!

Another thing to consider long-term is a cache of compiled regular expressions, my question here is more about where that should be stored, and establishing a policy on how packages should handle something like a cache. I'm assuming this should be stored on the starlark thread. I don't think this is needed for an initial implementation, but would be great to get some initial guidance

essobedo commented 3 years ago

Absolutely I see two ways to get started:

  1. If you'd like to file a PR into github.com/qri-io/starlib I can promise a through review there before we open a PR into this repo.
  2. We can use the same workflow we've been using, I'll open an initial PR that you can take over.

The way #2 would be great if it is possible for you

adonovan commented 3 years ago

can we compile expressions on .Freeze()?

I'm not sure I understand what you're getting at. Patterns are immutable, so their freeze method is a no-op. And compilation can fail, so you need to report the error immediately. Also, you can compile and run a pattern in the same module, before freeze comes into play.

b5 commented 3 years ago

Your confusion answers my question 😄. I was having trouble building a mental model of the point where compilation is happening

b5 commented 3 years ago

Before moving forward, I think we should double check that this is the API we'd like to target for an initial package. This is @adonovan's suggested API, and the one I think we should go with:

load("regexp")
rx = regexp.compile("b(an)*a")
rx.matches(str src)                       # returns bool
rx.find(str src)                          # returns text of match
rx.find_all(str src, int max)             # returns list of matches
rx.replace_all(str src , str or function) # returns replaced text
rx.find_submatches(str)                   # returns list of complete match and each subpattern match

I made a mistake in my earlier comment regarding compile/match. I should have said that I think se can slip the match operations at first, and just use this compile-only API to start with.