golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.86k stars 17.65k forks source link

proposal: regexp: the regexp package should support context #67278

Open purpleidea opened 5 months ago

purpleidea commented 5 months ago

Go version

latest

Output of go env in your module/workspace:

N/A

What did you do?

N/A

What did you see happen?

Regexp did not cancel early

What did you expect to see?

It would be sensible to have a long running regexp match operation be cancellable with the context package.

I know this package was created before the context package was, but this might be a useful addition for a future extension of this package or golang2.x

Cheers

mauri870 commented 5 months ago

Can you change your proposal to follow the guidelines at https://github.com/golang/proposal? Thanks.

seankhliao commented 5 months ago

How often do you have a match that runs that long? When should the regex check the context for cancellation? there's no io operations

purpleidea commented 5 months ago

When should the regex check the context for cancellation? there's no io operations

This is an excellent question! Here's how you do it. This is from an example where I did something in a similar lib for type unification. You add the ctx select block shown below anywhere there's a long running operation.

func LongTypeUnificationOperation(ctx context.Context, inputData interface{}) error {

        for _, x := range inputData {
                select {
                case <-ctx.Done():
                        return ctx.Err()
                default:
                        // pass
                }

                if err := Unify(x.Actual, x.Expect); err != nil {
                        return err
                }

        }
        // do more stuff...

        return nil
}

HTH and sorry I didn't explain it clearer before.

purpleidea commented 5 months ago

How often do you have a match that runs that long?

It's necessary in latency sensitive contexts, and sometimes it's a user-supplied regexp, so I can't guarantee in advance that it's not slow/exponential or worse/etc. (I forget the specific time complexities possible with/without backtracking and in this lib, but you get the idea...)

purpleidea commented 5 months ago

Can you change your proposal to follow the guidelines at https://github.com/golang/proposal? Thanks.

image

Not sure what I did wrong. Feel free to edit as you see fit.

apparentlymart commented 5 months ago

I guess this proposal calls for inserting a non-blocking read from the context's "Done" channel somewhere in each iteration of this loop:

https://github.com/golang/go/blob/a0a6026bb176d25401d7d188f95c1fe769d71db8/src/regexp/exec.go#L197-L240

(and perhaps some other loops too, but this one seemed the most likely to my not-so-familiar-with-this-package eyes)

However, I wouldn't want to significantly regress the performance of the non-cancelable case given that long-running regex matches seem like an unusual situation. Maybe it would be worth experimenting in a local branch to see what impact it would have to repeatedly try to read a channel in that loop.

If the overhead is material then it seems like there'd either need to be two separate variations of the same function, or there'd need to be some way to avoid that overhead in the case where there's no context provided.

purpleidea commented 5 months ago

@apparentlymart Your whole comment was very helpfully phrased. Thanks for commenting!

If the overhead is material then it seems like there'd either need to be two separate variations of the same function, or there'd need to be some way to avoid that overhead in the case where there's no context provided.

Indeed, it probably makes sense to add a new function that takes a context, and the remaining ones get ported to internally just pass context.Background(). Fully backwards compatibility, but new consumers can use the cancellable version if they want to.

ianlancetaylor commented 5 months ago

It would help to know whether this is a problem that has been observed in practice, or whether this is a theoretical problem. Go's regexp package is based on RE2, which was designed to not have exponential behavior. See https://swtch.com/~rsc/regexp/regexp3.html.

purpleidea commented 5 months ago

Go's regexp package is based on RE2, which was designed to not have exponential behavior. See https://swtch.com/~rsc/regexp/regexp3.html.

Yes awesome, I remember reading this article, thanks for adding the link! That's what I was referring to above.

It would help to know whether this is a problem that has been observed in practice, or whether this is a theoretical problem.

Great question. It's somewhere in the middle. I think this might be generally useful, but I can only offer my personal use-case.

As part of https://github.com/purpleidea/mgmt/ (think a general automation tool like Ansible) users write code in a small DSL. If the user uses a regexp function, we implement that by wrapping the golang regexp library to run their user-supplied regexp when that function runs.

The catch is that we have no control over what kind of regexp the user provides. We are also very good about shutting everything down when there is a global request to cancel. Many others tools when asked to cancel have a long shutdown process, which usually ends up with the user hitting ^\ which can cause data corruption.

To avoid these cases, we want to guarantee we can shutdown very quickly. Say in under 500ms. (Arbitrary, but shorter is better.) I identified the regexp library as one situation where we wouldn't be able to guarantee that.

So TL;DR: there isn't a specific user that I know of personally experiencing this pain point today, but if possible I would like to proactively plumb this in to prevent a future negative user experience or worse.

Happy to answer any other questions.

(Our internal function API that wraps this is a: func(context.Context, []types.Value) (types.Value, error) where types is our internal types/values package (similar to golang's reflect) and the context is how we ask any running function to shutdown sooner where possible.)