For what it's worth, we noticed this after upgrading from jwx v2.0.21 to v2.1.0, which upgrades httprc from v1.0.4 to v1.0.5.
The only suspicious change (to me) there is the change in where the queue mutex is locked in `fetchAndStore` here.
I see similar panics with httprc 1.0.5, stemming from code that uses jwx 2.1.0 like LeslieLeung's.
> The only suspicious change (to me) there is the change in where the queue mutex is locked in `fetchAndStore` here.
I think so too, the mutex isn't held during the `q.fetch.fetch(...)` call anymore. This probably allows a foreground `Refresh(ctx)` and the background fetcher to call `q.fetch.fetch(...)` concurrently with the same `*fetchRequest` pointer, leading to the channel being overwritten and closed twice.
So I realized that my statement above was not quite correct: jwx v2.0.21 is where the upgrade from httprc v1.0.4 to v1.0.5 happened.
@natenjoy I see that you created the PR that made this mutex change; have you observed these issues at all? @lestrrat any thoughts on this issue?
I don't see it in our logs for the past 3 months. Are you able to reproduce this issue somewhat regularly? If so, could you set `GOTRACEBACK=all` and rerun the service, looking for likely causes in other goroutines? Looking at httprc 1.0.5 and those line numbers from the stack trace, the code in the trace appears to be trying to receive from a channel and not send on a closed channel. Sending on a closed channel is what is panicking.
> the code in the trace appears to be trying to receive from a channel and not send on a closed channel.
The line number indicated at the top of the traceback is a `select` block, which does contain a send case: `case reply <- fr:`. This is where the panic originates from; I guess Go's tracebacks just don't distinguish which case caused the panic.
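To illustrate that point about the traceback, here is a standalone sketch (not httprc's actual code; the channel and variable names are made up): a `select` whose send case targets an already-closed channel panics with "send on closed channel", and the traceback points at the `select` statement rather than the individual case.

```go
package main

func main() {
	done := make(chan struct{}) // stands in for a cancellation channel; never closed here
	reply := make(chan int)
	close(reply) // a previous completion already closed the reply channel

	fr := 42
	select {
	case <-done:
		// would handle cancellation; never ready in this sketch
	case reply <- fr:
		// the send itself panics with "panic: send on closed channel";
		// the traceback points at the select statement above
	}
}
```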
As I mentioned above, I'm pretty sure that the reason for the channel being closed is that 2 goroutines are executing `(*fetcher).fetch(...)` with the same `*fetchRequest` in parallel. While each `fetch()` does create a new reply channel, the channel is written into the same `*fetchRequest` pointer. This pointer is then added to `f.requests` here once by each call. After the first of those 2 entries in `f.requests` is done, the channel is closed in `reply(...)`. This leaves the 2nd entry with a closed channel from the start, ultimately causing the panic.
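A standalone sketch of that sequence (simplified types and names, not httprc's actual code): the same request object is enqueued twice, each enqueue overwrites its reply channel, and the worker closes the channel after serving the first entry, so serving the second entry sends on a closed channel.

```go
package main

// fetchRequest mimics the shared request object: every caller writes its
// reply channel into the same struct field.
type fetchRequest struct {
	reply chan string
}

func main() {
	fr := &fetchRequest{}
	var requests []*fetchRequest

	// Two callers (e.g. a foreground Refresh and the background refresher)
	// each run the same steps: make a fresh channel, overwrite fr.reply,
	// and enqueue the same pointer. Shown sequentially for determinism;
	// in the real bug they race.
	for i := 0; i < 2; i++ {
		fr.reply = make(chan string, 1) // the 2nd call overwrites the 1st channel
		requests = append(requests, fr)
	}

	// The worker drains the queue: deliver the result, then close the channel.
	for _, r := range requests {
		r.reply <- "payload" // 2nd entry: same channel, already closed -> panic
		close(r.reply)
	}
}
```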
> The line number indicated at the top of the traceback is a `select` block, which does contain a send case: `case reply <- fr:`
Correct, I misread that second case. I opened https://github.com/lestrrat-go/httprc/pull/31 as a revert commit of the prior change.
Sorry, I completely missed the original report 2 weeks ago. I'm a bit busy today; please give me a day or two to sort things out.
Okay. I understand the analysis, and I will take everybody's word for it, but the truth is that I can't reproduce it, so tbh I feel a bit uneasy.
I'll merge #31 and whatnot, but if anybody can come up with a reliable reproduction test case in the future, I think it will benefit all of us.
Merged #31, so I will close this for now. Please open a new issue if the problem persists.
Thanks for releasing the fix @lestrrat! I built a small reproduction example as a gist here, in case you are interested: https://gist.github.com/TheJokr/d5b836cca484d4a00967504c553987cf
It panics consistently with v1.0.5, but I had to rely on timing to make sure the prerequisites for the bug are met. A debugger with breakpoints would likely mess the timing up, so access to httprc's internals would be necessary for proper debugging. The panic does not occur with v1.0.6 :+1:
@TheJokr Thanks, merged as #32
I am using `github.com/lestrrat-go/jwx/v2 v2.1.0` and `github.com/lestrrat-go/httprc v1.0.5`, and the fetcher would randomly panic with the following trace:

Given the goroutine is launched inside httprc, I don't have any means to recover from this panic.
Here is the code on the jwx side if it helps:
```go
type Verifier struct {
	self       string
	pkEndpoint string
	c          *jwk.Cache
	cs         jwk.Set
}

func NewVerifier(host, self string) (*Verifier, error) {
	ctx := context.Background()
	pkEndpoint := host + ApiGetPublicKey
	c := jwk.NewCache(ctx)
	err := c.Register(pkEndpoint, jwk.WithMinRefreshInterval(5*time.Minute))
	if err != nil {
		log.Errorf(ctx, "Error registering public key endpoint: %v", err)
		return nil, err
	}
	_, err = c.Refresh(ctx, pkEndpoint)
	if err != nil {
		log.Errorf(ctx, "Error fetching public key: %v", err)
		return nil, err
	}
	cached := jwk.NewCachedSet(c, pkEndpoint)
	v := &Verifier{
		self:       self,
		pkEndpoint: pkEndpoint,
		c:          c,
		cs:         cached,
	}
	return v, nil
}

func (v *Verifier) Verify(token string, maxRetry int) (bool, error) {
	ctx := context.Background()
	for i := 0; i < maxRetry; i++ {
		// allow refreshing public keys if fails
		_, err := jwt.Parse([]byte(token), jwt.WithKeySet(v.cs))
		if err == nil {
			return true, nil
		}
		log.Errorf(ctx, "Error verifying token: %v", err)
		_, err = v.c.Refresh(ctx, v.pkEndpoint)
		if err != nil {
			log.Errorf(ctx, "Error refreshing public key: %v", err)
			return false, err
		}
	}
	return false, nil
}
```
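For context on why the panic cannot be handled from the application side (a general Go property, not specific to the code above): `recover` only catches panics raised in its own goroutine, so a deferred `recover` in the caller cannot intercept a panic inside a goroutine started by httprc. A minimal sketch:

```go
package main

import "fmt"

func main() {
	defer func() {
		// This recover never fires: it only covers panics in main's goroutine.
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()

	go func() {
		// A panic here (like the fetcher's "send on closed channel")
		// crashes the whole process; no other goroutine can recover it.
		panic("send on closed channel")
	}()

	select {} // block so the goroutine's panic terminates the program
}
```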