Open tombergan opened 7 years ago
@bcmills, I'm reconsidering whether this and #21597 are actually bugs. The documentation for httptrace explicitly says that "Functions may be called concurrently from different goroutines and some may be called after the request has completed or failed." https://golang.org/pkg/net/http/httptrace/#ClientTrace
I don't see any documentation in context
which states that the above is illegal, or that it's illegal to call Value on a context that has gone out-of-scope. The notion of a context being out-of-scope is not even defined in the API documentation:
https://golang.org/pkg/context/
Both this issue and #21597 arise from an internal Google package that implements a context type, where the Value method of that type panics if called after the context has gone out-of-scope (where "out-of-scope" has a well-defined meaning for that specific package). In terms of the vanilla context
package, I'm not convinced this is actually a bug. Thoughts?
In terms of the vanilla
context
package, I'm not convinced this is actually a bug. Thoughts?
Over-retaining a context.Context
argument is a bug regardless of whether the particular values accessed are safe to use. The Context
can carry references to arbitrarily large data structures (e.g. logs for large requests or entire sessions of requests) limited by semaphores or other means, so unexpectedly over-retaining the Context
in background operations — particularly on unusual paths (such as reestablishing connections, which can happen in bursts when a network outage or slowdown resolves) — can lead to large, unpredictable spikes in memory usage.
This is more of an issue with Context
than with other data structures in general, because the Context
can vary in size from a tiny cancellation channel to a very large tree of data.
I don't think there is any such thing as a "stale context." A better way to put this might be that the current happy eyeballs code can keep a context value live unnecessarily. But I don't think this is true. The happy eyeballs code calls context.WithCancel
to get a cancelable context, and as soon as one serial call succeeds, it cancels the context, which will cancel the pending serial calls.
So I don't see a bug here.
I haven't looked at the code for #21597, but from reading the issue my impression is that that code does not currently use a cancelable context to cancel pending actions. If that is correct, then the issues do not seem similar.
as soon as one serial call succeeds, it cancels the context, which will cancel the pending serial calls.
That's true, and makes the over-retention here much less severe in practice than the one in #21597, but because dialParallel
doesn't wait for the second (canceled) call to return, it can still lead to arbitrary increases in memory usage under CPU pressure, where there can be an arbitrarily large scheduling delay between cancelling the Context
and actually returning from the background function call.
Agreed, but my guess is that on balance the current behavior is preferable. It doesn't penalize the common case.
on balance the current behavior is preferable. It doesn't penalize the common case.
In the common case (where we are not overloaded on CPU), it shouldn't make much difference one way or the other. If dialSingle really is responsive to cancellation, the amount of latency added by waiting for the goroutine to return should be negligible compared to the latency of dialing a network connection.
That is: the difference only matters when dialSingle is not responsive, and that is precisely the condition under which the over-retention is likely to matter too.
The common case is that a Context
value is tiny.
AFAIK there are no actual cases where someone suffered an OOM due to net
or net/http
retaining httptrace structs in contexts. This issue and #21597 arose because of a Google-internal package that does the following:
func Go(ctx context, f func(context)) {
c := newCtx(ctx)
go func() {
defer close(c.closed)
f(c)
}()
}
func (c ctx) Value(key interface{}) interface{} {
select{
case <-c.closed:
panic("Value called after ctx done")
default:
}
...
}
The open question is whether that panic should exist. There are two ways to view the question:
Any function that creates goroutines should wait for all of those goroutines to complete before returning (in the absence of a detach feature like #19643). I believe this is @bcmills' position. net.dialParallel
cancels its goroutines but does not wait for them to complete, which means the above panic can be triggered if net.dialParallel
is called by f(ctx)
. Instead, net.dialParallel
should wait for all goroutines to complete before returning.
The above Value implementation should not panic. The context API does not make it illegal to hold onto a context in a background goroutine, therefore, it's not reasonable for Context.Value to panic as in the above implementation.
/cc @zombiezen @Sajmani
The common case is that a
Context
value is tiny.
That's not at all obvious to me.
Yes, an individual Context
value is tiny: but the set of memory that it pins is arbitrarily large. (For example, it might hold a pointer to an arbitrarily large log entry or journal to be flushed at the end of the call to which the Context was passed.)
When happy eyeballs is enabled, DialContext makes parallel calls to dialSingle via dialParallel: https://github.com/golang/go/blob/424b0654f8e6c72f69e096f69009096de16e30fa/src/net/dial.go#L415
Each dialSingle call may call ctx.Value, such as here: https://github.com/golang/go/blob/424b0654f8e6c72f69e096f69009096de16e30fa/src/net/dial.go#L533
dialParallel doesn't wait for all dialSingle calls to complete. This means, in theory, dialSingle can call ctx.Value on a ctx that has gone out of scope. This is effectively the same problem as #21597.
/cc @bcmills