proposal: Go 2: interruptible goroutines

KevinWang15 commented 2 years ago

Hi,

Can we have the feature to kill one goroutine from another goroutine?

I have read https://github.com/golang/go/issues/32610 and several other disucssions. I believe the answer was NO, but I still want to present my use cases and experiments for your consideration.

I was working on Kubernetes APIServer. When APIServer was OOMKilled and restarted, all clients are going to reconnect to APIServer and try to download all the data they need from APIServer to their local cache. They do this by sending a LIST request.

After an APIServer restarts, there would be tons of concurrent LIST requests, so CPU would be very overloaded. The HTTP response was quite big, with every CPU core saturated, json.Marshal of the response would take on average 300 seconds to complete (whereas they would take only 10 seconds to complete had the CPU not been so busy).

The interesting thing is, APIServer will return a GatewayTimeout response after 60 seconds, and the client will retry the request on receiving this error.

So at the 60 second mark, APIServer and the client have both given up on the request, the request context was cancelled, but the goroutine was still actively doing json.Marshal of the response for the request! (because json.Marshal wasn't supposed to be context-aware.)

That's the last thing we want when the server was already overloaded. In the next 240 seconds that goroutine was wasting CPU for nothing, and it was not releasing the memory. Soon the clients would retry the LIST request and another league of goroutines would be created to serve the requests - the overload just got worse!

In my observation, due to the lingering json.Marshal goroutine, each client will result in the creation of 5 goroutines, and use 5x the CPU and memory.

I actually did a little experiment and hacked src/runtime.

a goroutine can get its own goid, so I stored the goid of the serving goroutine in the request context
a function to kill any goroutine (make that goroutine panic) by goid is added, so when GatewayTimeout occurs, I could read the goroutine that was doing json.Marshal from request context and make it panic

The results are exactly what we want! 60% memory usage reduction (and much less prone to OOMKills) and faster recovery.

(P.S. we changed some other behaviors of APIServer, e.g. returning Retry-After header to disperse client-side retries and imposing a smaller MaxRequestsInflight when a LIST flood is detected. And the above experiment already had those optimizations turned on)

This issue is not limited to Kubernetes APIServer. I guess every server powered by golang would have more or less the same problem, and would all benefit from an API for killing goroutines (non-cooperatively, because using context / channels to cooperatively make goroutines exit is sometimes not viable).

If this use case could be considered valid, I would like to humbly request to add such an API. It doesn't have to expose the goid to the user, just allowing the user to get a handle of the current goroutine, store it somewhere, and then provide an Abort() method that will make the goroutine panic at the earliest convenience, would be good enough.

Thanks!

davecheney commented 2 years ago

This use case sounds like the the kubernetes Api server should add some basic rate limiting.

KevinWang15 commented 2 years ago

This use case sounds like the the kubernetes Api server should add some basic rate limiting.

Right, there are some rate limiting in place, and I also worked on tuning the algorithms there.

But if we could have this API, it would be invaluable, and bring the whole thing to the next level.

beoran commented 2 years ago

I would rather say you need a better json parser that is interruptable through a context.

KevinWang15 commented 2 years ago

I would rather say you need a better json parser that is interruptable through a context.

Well, a better JSON parser certainly will help. Apart from json.Marshal, there are some other CPU-intensive function calls that are not context-aware. It would be much work if we replace everything with a context-aware alternative.

And is it good to make everything context-aware? Is it going to add a lot of overhead? Quoting this comment: https://github.com/golang/go/issues/50422#issuecomment-1004768507

mvdan commented: Usually, a context or timeout/deadline is required for asynchronous work where it's practically impossible to know how long an operation will take. I also imagine it would be useful for libraries that interpret or execute code, such as interpreting a script that might loop forever. My first impression is that encoding/json falls under neither of those categories.

When all these problems can be solved with one single added API in golang runtime, and it isn't that hard to implement, I am tempted not to do things the other way around.

I guess it all boils down to, is it so abominably against the philosophies and principles of golang design? Or is it negotiable?

seankhliao commented 2 years ago

Please fill out https://github.com/golang/proposal/blob/master/go2-language-changes.md when proposing language changes

KevinWang15 commented 2 years ago

ACK, I will fill it out, thanks

KevinWang15 commented 2 years ago

Would you consider yourself a novice, intermediate, or experienced Go programmer?

I consider myself quite experienced.

What other languages do you have experience with?

Java, JavaScript

Would this change make Go easier or harder to learn, and why?

I think it should make no difference. It is only an extra API that a developer can choose to use. They can learn it when they want to use it.

Has this idea, or one like it, been proposed before?

The closest is question https://github.com/golang/go/issues/32610. We are asking for the same feature. The previous discussion was abandoned and didn't offer a use case. So I decided to bring it up again.

Who does this proposal help, and why?

Everyone who uses Golang to write a server, and potentially some other use cases. It will enable developers to make sure that when the server has finished sending a response for a request, every goroutine that was used to serve the request can be aborted immediately. No lingering goroutines will get stuck in expensive function calls that are not context-aware, and waste CPU / memory.

What is the proposed change?

Add runtime.GetGoroutineHandle() that will return a *GoroutineHandle of the caller goroutine.
Provide .Abort() method on *GoroutineHandle that will make the related goroutine panic with "aborted" at the earliest convenience (when it is safe to do so). All defers will be executed in order not to break invariants.

Nothing would change in the language spec.

Please also describe the change informally, as in a class teaching Go.

"Now it's possible to kill a goroutine just like you can kill a thread. Of course, it is dangerous and not encouraged; Context and other cooperative methods are still your go to solution, but it is a possibility if you know what you are doing"

Is this change backward compatible?

Yes.

Show example code before and after the change.

func main() {
    goroutineHandle := make(chan *GoroutineHandle)
    go func() {
        goroutineHandle <- runtime.GetGoroutineHandle()
        json.Marshal(aVeryBigObject)
    }()

    timer := time.NewTimer(10 * time.Second)
    <-timer.C
    (<-goroutineHandle).Abort()
}

What is the cost of this proposal?

It will not affect tools or add compile time / run time cost.

Can you describe a possible implementation?

Put a goid into the ToBeAborted set, wait for it to get descheduled naturally (either by Gosched or preemptive scheduling). Then, when it is about to resume, change its PC to a function that will panic immediately; Or, add such a check in Gosched.

How would the language spec change?

The language spec wouldn't change.

Orthogonality: how does this change interact or overlap with existing features?

No, it's just another API. Just like Thread.stop provided by Java.

Is the goal of this change a performance improvement?

Yes. We can expect server programs to waste less CPU and memory. See my analysis and experiments in https://github.com/golang/go/issues/50678

Does this affect error handling? / Is this about generics?

No

PTAL @seankhliao

bcmills commented 2 years ago

(emphasis mine)

make the related goroutine panic with "aborted" at the earliest convenience (when it is safe to do so)

That's the problem: it is never safe to make a goroutine panic unexpectedly.

Go programs can (and should!) be written and tested such that unexpected panics cannot and do not occur — especially given Go's new fuzzing support set to be released in Go 1.18. In particular, panics should not cross package boundaries except in the case of programmer error: if package A calls into package B, package B should only panic if A has violated some documented or obvious invariant of its API.

If a particular package knows that it is safe to panic at some particular point, that package could just as easily check a Context for cancellation at that point too.

KevinWang15 commented 2 years ago

Right, it is never safe to make a goroutine panic unexpectedly, as unsafe as killing a thread in Java. Maybe it should be put into the unsafe package, and people should never use it unless they know what they are doing. Shall we give developers the power to do it if they really want to? (Java did). If the program crashes, then the developer is to blame because he played with fire and got burned.

By when it is safe to do so I meant, from go runtime's point of view. It's indeed impossible to guarantee there will be no logic error or deadlocks when a goroutine is aborted abruptly. I guess that responsibility falls on the developer.

bcmills commented 2 years ago

Shall we give developers the power to do it if they really want to? (Java did).

Go is not Java, nor should it be: Java is already a pretty good Java, so Go making different decisions adds diversity to the language ecosystem.

If the program crashes, then the developer is to blame because he played with fire and got burned.

Go consistently chooses not to take that approach. (That's why we have automatic bounds checks and a fairly permissive memory model, and also why the language discourages pointer arithmetic and subtle memory-management patterns.)

Besides, the failure mode of an unexpected panic isn't always a crash. Sometimes it's much more difficult to diagnose, like a deadlock in a production server.

KevinWang15 commented 2 years ago

Okay, I see your point, I can understand. Thanks

beoran commented 2 years ago

If you use a Context-aware io.Reader like here: https://pace.dev/blog/2020/02/03/context-aware-ioreader-for-golang-by-mat-ryer.html, together with a Json parser that can unmarshal from an io.Reader, and there are several like that, then your problem should be solved.

KevinWang15 commented 2 years ago

If you use a Context-aware io.Reader like here: https://pace.dev/blog/2020/02/03/context-aware-ioreader-for-golang-by-mat-ryer.html, together with a Json parser that can unmarshal from an io.Reader, and there are several like that, then your problem should be solved.

Thanks for the suggestion. The current encoding/json has func NewEncoder(w io.Writer) which can take in a io.Writer, but its behavior isn't ideal for this use case ( https://github.com/golang/go/issues/33714 , basically it completes the whole marshaling in a buffer and writes to io.Writer in one shot ).

Maybe there open-source alternatives to encoding/json that behaves differently (it needs to work with json.Marshaler interface or we will have to rewrite a lot of code), I will take a look.

Our solution for now is to hack src/runtime and src/encoding/json and make the goroutine panic when the request was cancelled. This solution has the upside of being very easy to implement (100 lines of code), but the downside is we will have to use a forked (hacked) Golang. I guess when https://github.com/golang/go/issues/33714 is solved by upstream we can try that way too.

mvdan commented 2 years ago

I agree with @beoran that a context-aware reader with a streaming decoder (or a context-aware writer with a streaming encoder for marshals) seems like a good approach. encoding/json doesn't currently support streaming encodes (https://github.com/golang/go/issues/33714), but that's certainly something we want to fix in the future.

smasher164 commented 2 years ago

If you use a Context-aware io.Reader like here: https://pace.dev/blog/2020/02/03/context-aware-ioreader-for-golang-by-mat-ryer.html, together with a Json parser that can unmarshal from an io.Reader, and there are several like that, then your problem should be solved.

Interesting. So this basically turns the early exits from error handling into preemption points. At least, most of the error handling, since the errors returned from a Reader are only a subset of all the errors returned by a JSON parser.

ianlancetaylor commented 2 years ago

Based on the discussion above, this is a likely decline. Leaving open for four weeks for final comments.

KevinWang15 commented 2 years ago

Thank you, I'm fine with closing this

ianlancetaylor commented 2 years ago

No change in consensus.

golang / go