shurcooL / home

home is Dmitri Shuralyov's personal website.
https://dmitri.shuralyov.com
MIT License

Go packages were inaccessible from origin server on Oct 28, 2019 from 3:37 am to 7:56 am (UTC−04) #31

Closed JehandadK closed 5 years ago

JehandadK commented 5 years ago
Fetching https://dmitri.shuralyov.com/gpu/mtl?go-get=1
https fetch failed: Get https://dmitri.shuralyov.com/gpu/mtl?go-get=1: dial tcp 172.93.50.41:443: connect: connection refused

Hi, I wonder if this can be fixed again. I see you already faced this once. I guess our servers are in Japan.

Thanks!

dmitshur commented 5 years ago

Hi @JehandadK,

Thanks a lot for letting me know. There was an issue on the web server causing the website to be unavailable. It should be fixed now. Please try again and let me know if you're still seeing any issues.

I'm going to look into improving it to prevent (and detect more quickly) this kind of problem in the future.

I suggest using a module proxy, for example the Go module mirror (https://proxy.golang.org), in order to be able to download modules even when the origin server is temporarily unavailable. @katiehockman's excellent GopherCon 2019 talk covered how module proxies can help mitigate issues such as this one in more detail.
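To illustrate what a module proxy exposes, here is a minimal sketch in Go that builds the URL at which a GOPROXY-compatible proxy lists the known versions of a module. The function names `escapePath` and `versionListURL` are hypothetical helpers written for this example; the `/@v/list` endpoint and the rule that uppercase letters are escaped as `!` plus the lowercase letter come from the GOPROXY protocol. The sketch only builds the URL and makes no network request.

```go
package main

import (
	"fmt"
	"strings"
)

// escapePath applies the GOPROXY case-encoding to a module path: each
// uppercase ASCII letter is replaced by '!' followed by its lowercase form.
// (Hypothetical helper for this example.)
func escapePath(p string) string {
	var b strings.Builder
	for _, r := range p {
		if r >= 'A' && r <= 'Z' {
			b.WriteByte('!')
			b.WriteRune(r + ('a' - 'A'))
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

// versionListURL returns the URL at which the given proxy lists the known
// versions of modulePath. (Hypothetical helper for this example.)
func versionListURL(proxy, modulePath string) string {
	return strings.TrimSuffix(proxy, "/") + "/" + escapePath(modulePath) + "/@v/list"
}

func main() {
	fmt.Println(versionListURL("https://proxy.golang.org", "dmitri.shuralyov.com/gpu/mtl"))
	// Prints: https://proxy.golang.org/dmitri.shuralyov.com/gpu/mtl/@v/list
}
```

Because the proxy answers such requests from its own cache, versions it has already cached can still be downloaded while the origin server is down.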

benjaminkomen commented 5 years ago

Am I interpreting https://proxy.golang.org/ correctly that if you use Go 1.13 you will already use the Go module mirror without any specific configuration? When I do go env on my CircleCi build server I see GOPROXY="https://proxy.golang.org,direct"

dmitshur commented 5 years ago

@benjaminkomen That is correct. Also see the second paragraph of the introduction in the Go 1.13 release notes.

dmitshur commented 5 years ago

Here's a timeline of the outage, showing a graph of go get requests being handled:

[Image: graph of go get requests being handled during the outage]

(Times are in EDT, aka UTC−04 timezone.)

The root cause was a bug in the golang.org/x/crypto/acme/autocert package that caused a nil pointer dereference in the HTTPS proxy in front of the home server:

Stack trace
```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x80 pc=0x6cba45]

goroutine 17196253 [running]:
golang.org/x/crypto/acme/autocert.(*Manager).verifyRFC.func1(0xc0000e0000, 0xc00008e0a8)
	/Users/dmitri/go/src/golang.org/x/crypto/acme/autocert/autocert.go:774 +0x25
golang.org/x/crypto/acme/autocert.(*Manager).verifyRFC(0xc0000e0000, 0x802040, 0xc000414ea0, 0xc0001ea5b0, 0xc000014be0, 0xc, 0x0, 0x7fc500, 0xa46700)
	/Users/dmitri/go/src/golang.org/x/crypto/acme/autocert/autocert.go:769 +0x7f0
golang.org/x/crypto/acme/autocert.(*Manager).authorizedCert(0xc0000e0000, 0x802040, 0xc000414ea0, 0x7fdf80, 0xc00037ee40, 0xc000014be0, 0xc, 0x0, 0x0, 0x8, ...)
	/Users/dmitri/go/src/golang.org/x/crypto/acme/autocert/autocert.go:676 +0x4b5
golang.org/x/crypto/acme/autocert.(*domainRenewal).do(0xc0003b0440, 0x802040, 0xc000414ea0, 0x802040, 0xc000414ea0, 0xc000436500)
	/Users/dmitri/go/src/golang.org/x/crypto/acme/autocert/renewal.go:110 +0xfb
golang.org/x/crypto/acme/autocert.(*domainRenewal).renew(0xc0003b0440)
	/Users/dmitri/go/src/golang.org/x/crypto/acme/autocert/renewal.go:65 +0x132
created by time.goFunc
	/usr/local/go/src/time/sleep.go:168 +0x44
```

The HTTPS proxy program is not set up to be automatically restarted on crash, so all requests stopped being served until it was manually restarted. If automatic restarts were implemented, the outage would've been largely mitigated, but the crash would also have been less noticeable, making it less likely that the root cause would be found and fixed (since it'd be easier to ignore). My personal website prioritizes experimentation and development over stability, and so automatic restarts are not used.
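For readers who do want automatic restarts, the idea can be sketched as a small supervisor loop in Go. This is not how the home server is run (it deliberately has no automatic restarts). The names `supervise` and `errSimulated` are hypothetical, and the "crash" is simulated by a function returning an error. In practice `run` would start the real program as a child process, for example via `exec.Command(...).Run()`. A process-level restart is needed here because the panic came from a goroutine started inside the library, and a panic in such a goroutine cannot be recovered from the caller's goroutine.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errSimulated stands in for a crash of the supervised program.
// (Hypothetical, for this example only.)
var errSimulated = errors.New("simulated crash")

// supervise calls run repeatedly until it returns nil or until maxRestarts
// restarts have been used, sleeping for backoff between attempts. It returns
// the number of restarts performed and the final error, if any.
func supervise(run func() error, maxRestarts int, backoff time.Duration) (int, error) {
	restarts := 0
	for {
		err := run()
		if err == nil {
			return restarts, nil
		}
		if restarts == maxRestarts {
			return restarts, fmt.Errorf("giving up after %d restarts: %w", restarts, err)
		}
		restarts++
		time.Sleep(backoff)
	}
}

func main() {
	attempts := 0
	restarts, err := supervise(func() error {
		attempts++
		if attempts < 3 {
			return errSimulated // the first two attempts "crash"
		}
		return nil
	}, 5, 10*time.Millisecond)
	fmt.Println(restarts, err) // Prints: 2 <nil>
}
```

Logging every restart, or alerting on it, would address the concern above that restarts make crashes easier to ignore.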

The panic happened due to issue golang/go#35225. That issue has since been fixed via CL golang.org/cl/203919, so it should not re-occur.

I've also added an alert that should help notify me of similar issues in the future.
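The thread does not say how the alert works. As one possible shape for such a check, here is a minimal Go sketch of an HTTP probe that reports an error when a URL is unreachable or does not answer with status 200. The names `probe` and `newDemoServer` are hypothetical, and the demo runs against a local test server rather than the real website.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// probe requests url and returns an error if the request fails or the
// response status is not 200 OK. (Hypothetical helper for this example.)
func probe(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return fmt.Errorf("probe failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("probe %s: unexpected status %s", url, resp.Status)
	}
	return nil
}

// newDemoServer starts a local test server that answers every request with
// the given status code. It exists only to make the example self-contained.
func newDemoServer(status int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	}))
}

func main() {
	srv := newDemoServer(http.StatusOK)
	client := &http.Client{Timeout: 5 * time.Second}

	fmt.Println("server up, probe error:", probe(client, srv.URL))

	srv.Close() // simulate the outage: connections are now refused
	fmt.Println("server down, probe failed:", probe(client, srv.URL) != nil)
}
```

Running such a probe on a schedule from a separate machine, and sending a notification when it fails, would surface a "connection refused" outage like the one reported here.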

As mentioned in https://github.com/shurcooL/home/issues/31#issuecomment-546918737, if reliability of your build is of high importance, then it's recommended to use a caching module proxy (such as the Go module mirror at https://proxy.golang.org), so that your module's build can be successful even when some origin servers are temporarily unavailable. My personal website only has a 95%+ uptime SLA.

Closing since this is resolved.