kevburnsjr / microcache

A non-standard HTTP cache implemented as Go middleware
MIT License
155 stars 5 forks source link

Microcache sometimes reports errors #16

Closed kevburnsjr closed 5 years ago

kevburnsjr commented 5 years ago

Under load, Microcache sometimes reports errors.

> GOMAXPROCS=4 go run examples/basic/example.go
2019/10/02 10:02:35 Size: 1, Total: 0, Hits: 0, Misses: 0, Stales: 0, Backend: 0, Errors: 0
2019/10/02 10:02:40 Size: 1, Total: 10241, Hits: 10241, Misses: 0, Stales: 0, Backend: 0, Errors: 0
2019/10/02 10:02:45 Size: 1, Total: 44506, Hits: 44506, Misses: 0, Stales: 0, Backend: 0, Errors: 0
2019/10/02 10:02:50 Size: 1, Total: 41164, Hits: 41159, Misses: 0, Stales: 5, Backend: 4, Errors: 4
2019/10/02 10:02:55 Size: 1, Total: 0, Hits: 0, Misses: 0, Stales: 0, Backend: 0, Errors: 0

In the example above we're only requesting one URL and it never reports any status other than 200 so there should be at most 1 backend request and 0 errors, not 4 backend and 4 errors. Likely has something to do with stale-while-revalidating. Requires further testing.

erikdubbelboer commented 5 years ago

example.go sets a Timeout of 3 seconds, which means http.TimeoutHandler is being used. This will return 503 (Service Unavailable) on a timeout. So I'm guessing you were maxing out your CPU and the Goroutines didn't get enough CPU to finish within 3 seconds?

The 4 backends instead of one can then be explained by the 4 errors.

erikdubbelboer commented 5 years ago

Try the same again without setting Timeout.

kevburnsjr commented 5 years ago

You're right that the error is coming from TimeoutHandler, but for a different reason.

Turns out that reusing the original request context in a background revalidation request causes TimeoutHandler to short circuit when context.Done() channel is closed after the client disconnects. TimeoutHandler thinks the background request timed out when it didn't.

Cloning a context isn't really a thing, so I'm looking at ways to wrap the original context for reuse downstream. The context may still contain important information for downstream handlers (ie. distributed tracing info), so we can't just throw it away.

Here's what I'm thinking

https://github.com/kevburnsjr/microcache/pull/17/files

So far it works. Preparing a test now.

kevburnsjr commented 5 years ago

BTW, this is a serious error and is almost certainly the root cause of the stale object problem we saw in Crunchyroll's production CMS API with a similar configuration.

Wrapping the parent context with a separate done channel feels awkward. Open to alternate solutions.

kevburnsjr commented 5 years ago

Fixed in #17