lestrrat-go / httprc

Quasi Up-to-date HTTP In-memory Cache
MIT License

fetcher autonomously panics #30

Closed LeslieLeung closed 4 months ago

LeslieLeung commented 4 months ago

I am using github.com/lestrrat-go/jwx/v2 v2.1.0 and github.com/lestrrat-go/httprc v1.0.5, and the fetcher randomly panics with the following trace:

panic: send on closed channel

goroutine 194 [running]:
github.com/lestrrat-go/httprc.(*fetchResult).reply(0xc000ba8b70, {0x2104a00, 0x3007980}, 0xc004ba5b60)
/go/pkg/mod/github.com/lestrrat-go/httprc@v1.0.5/fetcher.go:37 +0x86
github.com/lestrrat-go/httprc.runFetchWorker({0x2104a00, 0x3007980}, 0xc0000bbbc0, {0x0, 0x0})
/go/pkg/mod/github.com/lestrrat-go/httprc@v1.0.5/fetcher.go:177 +0x2ac
created by github.com/lestrrat-go/httprc.NewFetcher in goroutine 51
/go/pkg/mod/github.com/lestrrat-go/httprc@v1.0.5/fetcher.go:75 +0x1bc

Given that the goroutine is launched inside httprc, I don't have any means to recover from this panic.

Here is the code on the jwx side if it helps:

jwx code:

```go
type Verifier struct {
	self       string
	pkEndpoint string
	c          *jwk.Cache
	cs         jwk.Set
}

func NewVerifier(host, self string) (*Verifier, error) {
	ctx := context.Background()
	pkEndpoint := host + ApiGetPublicKey
	c := jwk.NewCache(ctx)
	err := c.Register(pkEndpoint, jwk.WithMinRefreshInterval(5*time.Minute))
	if err != nil {
		log.Errorf(ctx, "Error registering public key endpoint: %v", err)
		return nil, err
	}
	_, err = c.Refresh(ctx, pkEndpoint)
	if err != nil {
		log.Errorf(ctx, "Error fetching public key: %v", err)
		return nil, err
	}
	cached := jwk.NewCachedSet(c, pkEndpoint)
	v := &Verifier{
		self:       self,
		pkEndpoint: pkEndpoint,
		c:          c,
		cs:         cached,
	}
	return v, nil
}

func (v *Verifier) Verify(token string, maxRetry int) (bool, error) {
	ctx := context.Background()
	for i := 0; i < maxRetry; i++ {
		// allow refreshing public keys if verification fails
		_, err := jwt.Parse([]byte(token), jwt.WithKeySet(v.cs))
		if err == nil {
			return true, nil
		}
		log.Errorf(ctx, "Error verifying token: %v", err)
		_, err = v.c.Refresh(ctx, v.pkEndpoint)
		if err != nil {
			log.Errorf(ctx, "Error refreshing public key: %v", err)
			return false, err
		}
	}
	return false, nil
}
```
milescrabill commented 4 months ago

For what it's worth we noticed this after upgrading from jwx v2.0.21 to v2.1.0 which upgrades httprc from v1.0.4 to v1.0.5.

The only suspicious change (to me) there is the change in where the queue mutex is locked in fetchAndStore here.

TheJokr commented 4 months ago

I see similar panics with httprc 1.0.5, stemming from code that uses jwx 2.1.0 like LeslieLeung's.

> The only suspicious change (to me) there is the change in where the queue mutex is locked in fetchAndStore here.

I think so too; the mutex is no longer held during the q.fetch.fetch(...) call. This probably allows a foreground Refresh(ctx) and the background fetcher to call q.fetch.fetch(...) concurrently with the same *fetchRequest pointer, leading to the channel being overwritten and closed twice.

milescrabill commented 4 months ago

So I realized that my statement above was not quite correct: jwx v2.0.21 is where the upgrade from httprc v1.0.4 to v1.0.5 happened.

@natenjoy I see that you created the PR that made this mutex change, have you observed these issues at all? @lestrrat any thoughts on this issue?

natenjoy commented 4 months ago

I don't see it in our logs for the past 3 months. Are you able to reproduce this issue somewhat regularly? If so, could you set GOTRACEBACK=all and rerun the service, looking for likely causes in other goroutines? Looking at httprc 1.0.5 and those line numbers from the stack trace, the code in the trace appears to be trying to receive from a channel and not send on a closed channel. Sending on a closed channel is what is panicking.
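
In case setting environment variables on the deployed service is inconvenient, here is a minimal sketch of getting the same effect from code; it only uses the standard library's runtime/debug package, nothing httprc-specific:

```go
package main

import "runtime/debug"

func main() {
	// Equivalent to running with GOTRACEBACK=all: print every goroutine's
	// stack on an unrecovered panic. SetTraceback is ignored if it would
	// lower the level already set via the GOTRACEBACK environment variable.
	debug.SetTraceback("all")

	// ... start the service as usual ...
}
```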

TheJokr commented 4 months ago

> the code in the trace appears to be trying to receive from a channel and not send on a closed channel.

The line number indicated at the top of the traceback is a select block, which does contain a send case: case reply <- fr:. This is where the panic originates; I guess Go's tracebacks just don't distinguish which case caused the panic.
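
A tiny standalone sketch (names are illustrative, not httprc's code) of the same shape of failure: a select whose send case fires on an already-closed channel panics with "send on closed channel", and the traceback points at the select block as a whole, much like fetcher.go:37 in the trace above:

```go
package main

func main() {
	reply := make(chan int, 1)
	close(reply)
	done := make(chan struct{})

	// The receive case is never ready, so the select takes the send case;
	// because reply is already closed, this panics with
	// "send on closed channel".
	select {
	case <-done:
	case reply <- 1:
	}
}
```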

As I mentioned above, I'm pretty sure that the reason for the channel being closed is that 2 goroutines are executing (*fetcher).fetch(...) with the same *fetchRequest in parallel. While each fetch() does create a new reply channel, the channel is written into the same *fetchRequest pointer. This pointer is then added to f.requests here once by each call. After the first of those 2 entries in f.requests is done, the channel is closed in reply(...). This leaves the 2nd entry with a closed channel from the start, ultimately causing the panic.
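
Below is a self-contained sketch of that sequence. It is illustrative only, not httprc's actual code; the types and names merely mimic fetchRequest and runFetchWorker. Two callers share one request pointer, each installs a fresh reply channel and enqueues the same pointer, and the worker closes the channel after answering the first queue entry, so answering the second entry sends on a closed channel:

```go
package main

import (
	"context"
	"sync"
	"time"
)

type result struct{ data string }

// request mimics httprc's *fetchRequest: the reply channel lives inside the
// struct and is overwritten by every fetch() call.
type request struct {
	reply chan *result
}

type fetcher struct {
	mu       sync.Mutex
	requests chan *request
}

// fetch mimics the v1.0.5 behaviour: the mutex only guards installing the
// new reply channel, not the whole request/response round trip.
func (f *fetcher) fetch(req *request) *result {
	f.mu.Lock()
	req.reply = make(chan *result, 1)
	f.mu.Unlock()

	f.requests <- req
	return <-req.reply
}

// worker mimics runFetchWorker: send the result back, then close the channel.
func (f *fetcher) worker(ctx context.Context) {
	for req := range f.requests {
		res := &result{data: "fetched"}
		select {
		case <-ctx.Done():
			return
		case req.reply <- res: // 2nd queue entry: channel already closed -> panic
		}
		close(req.reply)
	}
}

func main() {
	f := &fetcher{requests: make(chan *request, 2)}
	shared := &request{} // the same *request used by both callers

	// Two concurrent fetch() calls with the same request pointer,
	// e.g. a foreground Refresh racing the background refresher.
	go f.fetch(shared)
	go f.fetch(shared)

	// Let both callers enqueue (and overwrite shared.reply) before the
	// worker starts; this forces the interleaving that httprc can hit
	// under unlucky real-world timing.
	time.Sleep(100 * time.Millisecond)
	f.worker(context.Background()) // panic: send on closed channel
}
```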

natenjoy commented 4 months ago

> The line number indicated at the top of the traceback is a select block, which does contain a send case: case reply <- fr:

Correct, I misread that second case. I opened https://github.com/lestrrat-go/httprc/pull/31 as a revert commit of the prior change.

lestrrat commented 4 months ago

Sorry, I completely missed the original report 2 weeks ago. I'm a bit busy today, please give me a day or two to sort things out

lestrrat commented 4 months ago

Okay. I understand the analysis, and I will take everybody's word for it, but the truth is that I can't reproduce it, so tbh I feel a bit uneasy.

I'll merge #31 and whatnot, but if anybody can come up with a reliable reproduction test case in the future, I think it will benefit all of us.

lestrrat commented 4 months ago

Merged #31, so I'll close this for now. Please open a new issue if the problem persists.

TheJokr commented 4 months ago

Thanks for releasing the fix @lestrrat! I built a small reproduction example as a gist here, in case you are interested: https://gist.github.com/TheJokr/d5b836cca484d4a00967504c553987cf

It panics consistently with v1.0.5, but I had to rely on timing to make sure the prerequisites for the bug are met. A debugger with breakpoints would likely mess the timing up, so access to httprc's internals would be necessary for proper debugging. The panic does not occur with v1.0.6 :+1:

lestrrat commented 4 months ago

@TheJokr Thanks, merged as #32