Some relevant info from Slack:
1) This operation is costly. I ran it 20 minutes ago and it just finished. My steps:
kubectl -n mainnet exec -i -t mainnet-all-v2202-oasis-client-0 -- bash
oasis-node control status -a unix:/storage/node/internal.sock | grep latest_height
time oasis-node genesis dump --height 10957153 --genesis.file /tmp/mydump.json -a unix:/storage/node/internal.sock
real 18m10.842s
and the resulting file is 96MB in size (60MB with no pretty-printing). The success here does mean that the UNIX socket doesn't impose timeouts.
2) Might be a gRPC timeout: the gRPC server we want seems to start here, and it fails to explicitly set a timeout, either here or 25 lines higher (not sure which of the two we end up using). That said, Andy reports things timing out after 70-90 seconds.
3) @gw0 fixed nginx timeouts, also with no observable change in behavior. Quote: There are basically two Nginx instances between the endpoint and oasis-node. :slightly_smiling_face: Ingress is implemented with the nginx-ingress-controller, which has a timeout (overridable with annotations, like https://github.com/oasislabs/private-charts/blob/master/charts/parcel-gateway/templates/ingress.yaml#L18-L20, docs https://docs.nginx.com/nginx-ingress-controller/configuration/ingress-resources/advanced-configuration-with-annotations/#summary-of-annotations); it sends traffic to the K8s Service. To bypass it, you can port-forward to the K8s Service. But there is another Nginx in a container inside the charts/oasis-client; its purpose is to proxy TCP/HTTP traffic to the internal.sock UNIX socket. This one also sets some timeouts, see https://github.com/oasislabs/private-charts/blob/master/charts/oasis-client/templates/nginx_config.yaml. [...] increased both Ingress and internal nginx-proxy timeouts to 1 hour (https://github.com/oasislabs/private-ops/pull/2592/commits/9e9cf11e2629cc097ada23c5d8cf442b5a77cfd7), already applied to new *-all nodes.
4) nginx-proxy logs contain an error line with HTTP 499, which means that the client (!) closed the connection while the server was still processing it. (See the sketch after these log lines for one way a client can end up doing that.)
(k8s: gcp/production/us-west2) root@ca4d2e1059f2 /workdir# kubectl -n mainnet logs pods/mainnet-grpc-v2202-oasis-client-0 --container nginx-proxy | grep -v status:200
remote_addr:10.31.3.33 time_local:28/Oct/2022:19:23:03 +0000 host:upstream_balancer status:499 bytes_sent:0 useragent:grpc-go/1.48.0 request_time:0.006
remote_addr:10.31.3.33 time_local:28/Oct/2022:19:23:27 +0000 host:upstream_balancer status:499 bytes_sent:0 useragent:grpc-go/1.48.0 request_time:0.002
remote_addr:10.31.3.33 time_local:28/Oct/2022:19:23:56 +0000 host:upstream_balancer status:499 bytes_sent:0 useragent:grpc-go/1.48.0 request_time:0.001
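For context on item 4: nginx logs 499 whenever the caller hangs up before the response is finished; with gRPC that typically means the client cancelled the call or hit its own context deadline. A minimal sketch of that client-side behaviour (not the indexer's actual code; the address and the 60-second deadline are only illustrative):

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"

	oasisGrpc "github.com/oasisprotocol/oasis-core/go/common/grpc"
	consensus "github.com/oasisprotocol/oasis-core/go/consensus/api"
)

func main() {
	// Illustrative address: the nginx-proxy sidecar in front of internal.sock.
	nodeConn, err := oasisGrpc.Dial("127.0.0.1:9000", grpc.WithInsecure())
	if err != nil {
		panic(err)
	}
	cons := consensus.NewConsensusClient(nodeConn)

	status, err := cons.GetStatus(context.Background())
	if err != nil {
		panic(err)
	}

	// A client-side deadline shorter than the ~20 min the dump takes makes grpc-go
	// give up with DeadlineExceeded; the proxy only sees the client hang up and logs 499.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second) // illustrative value
	defer cancel()
	if _, err := cons.StateToGenesis(ctx, status.LatestHeight); err != nil {
		fmt.Println("err:", err)
	}
}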
@gw0 do you have time to pair on this with @Andrew7234 ? He is currently stuck.
We don't have a node up in staging or prod right now, but you can deploy a fresh one easily; I hope/think its state will be immediately large enough to time out for StateToGenesis calls.
There are many "dump" commands available in oasis-node: https://github.com/oasisprotocol/oasis-core/blob/master/go/oasis-node/cmd/debug/dumpdb/dumpdb.go
Dumping a code snippet here in order to not waste time while debugging a specific gRPC call....
package main

import (
	"context"
	"fmt"
	"os"
	"runtime/debug"

	"google.golang.org/grpc"

	oasisGrpc "github.com/oasisprotocol/oasis-core/go/common/grpc"
	consensus "github.com/oasisprotocol/oasis-core/go/consensus/api"
)

const (
	//grpcAddress = "unix:/storage/node/internal.sock"
	grpcAddress = "127.0.0.1:9000"
)

func exitErr(err error) {
	if err != nil {
		fmt.Println("err:", err)
		debug.PrintStack()
		os.Exit(1)
	}
}

func main() {
	// Connect to gRPC
	ctx := context.Background()
	fmt.Println("connect to", grpcAddress)
	nodeConn, err := oasisGrpc.Dial(grpcAddress, grpc.WithInsecure())
	exitErr(err)
	cons := consensus.NewConsensusClient(nodeConn)

	// Get node status (genesis height and latest height)
	status, err := cons.GetStatus(ctx)
	exitErr(err)
	fmt.Println("latest height", status.LatestHeight)

	// Dumping consensus state to genesis file
	doc, err := cons.StateToGenesis(ctx, status.LatestHeight)
	exitErr(err)
	canonJSON, err := doc.CanonicalJSON()
	exitErr(err)
	fmt.Println("genesis dump", canonJSON[:200], "...")
}
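(If you want the same artifact as the oasis-node genesis dump run above, the snippet can also write the document to a file instead of printing a prefix; a two-line addition at the end of main, path is arbitrary:)

	// Optionally persist the dump, mirroring --genesis.file from the CLI run above.
	err = os.WriteFile("/tmp/mydump.json", canonJSON, 0o644)
	exitErr(err)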
I used the above code snippet to perform an experiment inside pod/mainnet-grpc-v2202-oasis-client-tmp-0 in gcp/staging/us-west2. Everything worked when connecting directly to the UNIX socket (unix:/storage/node/internal.sock), but it timed out after 60 sec when connecting to the nginx-proxy sidecar container (127.0.0.1:9000).
$ kmn logs pod/mainnet-grpc-v2202-oasis-client-tmp-0 -c nginx-proxy
2022/11/11 10:44:25 [error] 7#7: *6 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: , request: "POST /oasis-core.Consensus/StateToGenesis HTTP/2.0", upstream: "grpc://unix:/storage/node/internal.sock", host: "127.0.0.1:9000"
remote_addr:127.0.0.1 time_local:11/Nov/2022:10:44:25 +0000 host:127.0.0.1 status:504 bytes_sent:167 useragent:grpc-go/1.46.2 request_time:60.060
It turned out the timeout configuration in nginx-proxy did not work as expected (it configured HTTP proxy timeouts instead of gRPC proxy timeouts). Manually tested that the fix in PR https://github.com/oasislabs/private-charts/pull/544 works.
Tried re-running the indexer healthcheck and it's still timing out with 504 errors after ~70s:
genesis_test.go:228: staking genesis error: rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 504 (Gateway Timeout); transport: received unexpected content-type "text/html"
genesis_test.go:229:
Error Trace: /oasis-indexer/tests/genesis/genesis_test.go:229
Error: Expected nil, but got: &status.Error{s:(*status.Status)(0xc0001180e0)}
Test: TestGenesisFull
--- FAIL: TestGenesisFull (69.21s)
Gregor's tests indicate that the oasis-client nginx instance is working correctly. I thought that the issue might be the other nginx instance, for the ingress; but the docs indicate that setting the proxy-read-timeout to 3600s should also set the grpc-read-timeout.
Bummer. And very surprising to see Gregor's results, given your documentation find ...
Some more tests come to mind:
Ah, also: the error you show is 504 (Gateway Timeout); transport: received unexpected content-type "text/html", but I expect this change should stop text/html responses from nginx. Either that config change doesn't do what I expected it would do, or maybe you're running against a non-updated node? Note that the -tmp deployment is not updated yet.
Hm, I was testing against the ingress mainnet-grpc-v2202.gcp-prod-usw2.local:443; perhaps it was pointing to the tmp deployment? Will investigate further.
Can you repro Gregor's success at a lower height?
This shouldn't be a factor currently because the indexer is synced and thus the healthcheck tries to fetch the state at a recent height.
Can you repro success using Gregor's script from the indexer container?
Tried from the production indexer container: was unable to connect. Will keep digging
root@mainnet-oasis-indexer-c65bd9bbf-hkdvf:/go/src# ./node-test
connect to mainnet-grpc-v2202.gcp-prod-usw2.local:443
err: rpc error: code = Unavailable desc = connection closed before server preface received
goroutine 1 [running]:
runtime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
/usr/local/go/src/runtime/debug/stack.go:16 +0x19
main.exitErr({0xbb77a0?, 0xc0000ac038?})
/go/src/test.go:22 +0x85
main.main()
/go/src/test.go:37 +0x135
Edit:
Doesn't look like it was pointing to the tmp deployment when I tested this morning.
Maybe Gregor's successful test just happened to take less than a minute.
No, the connection to 127.0.0.1:9000 (pod:nginx-proxy) took quite a while to dump the genesis file. I just reconfirmed that it also does not time out on gcp/prod/usw2 (yesterday I was testing on gcp/stag/usw2).
Let me continue debugging on gcp/prod/usw2 from inside the mainnet-grpc-v2202-oasis-client-0 pod (compiled binaries in /tmp) with the following gRPC addresses:
grpcAddress = "unix:/storage/node/internal.sock" // pod:oasis-client -- OK
grpcAddress = "127.0.0.1:9000" // pod:nginx-proxy -- OK
grpcAddress = "mainnet-grpc-v2202-oasis-client.mainnet.svc:9000" // svc -- OK
grpcAddress = "10.30.32.2:443" // ing-ip -- "connection closed before server preface received"
grpcAddress = "mainnet-grpc-v2202.gcp-prod-usw2.local:443" // ing -- "connection closed before server preface received"
Ingress controller nginx-private logs are full of:
$ k logs pod/nginx-ingress-nginx-private-controller-55f57d4bc-v54ph
2022/11/15 14:16:06 [error] 747#747: *14721058 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: 10.30.0.38, server: mainnet-grpc-v2202.gcp-prod-usw2.local, request: "POST /oasis-core.Consensus/GetBlock HTTP/2.0", upstream: "grpcs://10.31.3.32:9000", host: "mainnet-grpc-v2202.gcp-prod-usw2.local:443"
nginx-ingress-nginx-private-controller-55f57d4bc-v54ph is using nginx/1.19.1 (from 2018... gulp). The mainnet-grpc-v2202-oasis-client-0 pod is using nginx/1.23.2. gRPC support was added in Nginx 1.13.10 (https://www.nginx.com/blog/nginx-1-13-10-grpc/) and TLSv1.3 was added in Nginx 1.13 (https://www.howtoforge.com/how-to-enable-tls-13-in-nginx/), so this should not be the reason.
@ravenac95 Some time ago, we decided to use Kong ingress controller for gRPC calls. Was this due to gRPC issues with Nginx ingress controller?
Hrm. So we used kong on parcel for reasons unrelated to grpc at the time. We do, however, use envoy for the grpc frontend on our public grpc endpoints. It just had the features we needed. Envoy was more purposefully built with grpc as an intended backend. Perhaps it would behoove us to use that internally for this.
@Andrew7234 In the following days I will experimentally deploy a newer version of the Nginx ingress controller, so that we can see if version 1.19.1 is causing issues. In the meantime, I believe it is best to use the K8s Service directly. So, please switch the tests to using mainnet-grpc-v2202-oasis-client.mainnet.svc:9000.
Thank you. @Andrew7234 note that the Service (:9000) is not encrypted, and depending on which oasis-core API you use to connect, the connection function might assume that everything except unix sockets has encrypted traffic. So if you get errors related to TLS, you'll have to use the lower-level oasisGrpc.Dial() from Gregor's example above, or adapt in some other way.
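For example, a plaintext dial against the Service looks roughly like this (a sketch based on Gregor's first snippet; only the address and the explicit insecure credentials differ):

package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	oasisGrpc "github.com/oasisprotocol/oasis-core/go/common/grpc"
	consensus "github.com/oasisprotocol/oasis-core/go/consensus/api"
)

func main() {
	// The K8s Service exposes plaintext gRPC on :9000, so dial without TLS.
	nodeConn, err := oasisGrpc.Dial(
		"mainnet-grpc-v2202-oasis-client.mainnet.svc:9000",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		panic(err)
	}
	cons := consensus.NewConsensusClient(nodeConn)

	status, err := cons.GetStatus(context.Background())
	if err != nil {
		panic(err)
	}
	fmt.Println("latest height", status.LatestHeight)
}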
Still not sure what is going on with "connection closed before server preface received" (maybe an Nginx version issue, or it expects a TLS/non-TLS connection)... But here is confirmation that, regarding the 60s timeout, we indeed misconfigured the Ingress (verified by looking at the nginx-private ingress controller's internal configuration):
## before with original Ingress annotations
$ kmn describe ing/mainnet-grpc-v2202-oasis-client
nginx.ingress.kubernetes.io/proxy-connect-timeout: 3600s
nginx.ingress.kubernetes.io/proxy-read-timeout: 3600s
nginx.ingress.kubernetes.io/proxy-send-timeout: 3600s
$ k exec -it pod/nginx-ingress-nginx-private-controller-55f57d4bc-v54ph -- cat /etc/nginx/nginx.conf | grep -A200 '## start server mainnet-grpc-v2202.gcp-prod-usw2.local' | grep -B200 '## end server mainnet-grpc-v2202.gcp-prod-usw2.local'
proxy_connect_timeout 5s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;
## after removing the "s" in Ingress annotations
$ kmn describe ing/mainnet-grpc-v2202-oasis-client
nginx.ingress.kubernetes.io/proxy-connect-timeout: 3600
nginx.ingress.kubernetes.io/proxy-read-timeout: 3600
nginx.ingress.kubernetes.io/proxy-send-timeout: 3600
$ k exec -it pod/nginx-ingress-nginx-private-controller-55f57d4bc-v54ph -- cat /etc/nginx/nginx.conf | grep -A200 '## start server mainnet-grpc-v2202.gcp-prod-usw2.local' | grep -B200 '## end server mainnet-grpc-v2202.gcp-prod-usw2.local'
proxy_connect_timeout 3600s;
proxy_send_timeout 3600s;
proxy_read_timeout 3600s;
Dumping the code snippet used for debugging the timeout issue (it takes the gRPC address and TLS mode, plus optional basic-auth credentials, as command-line arguments):
package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"encoding/base64"
	"fmt"
	"os"
	"runtime/debug"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/credentials/insecure"

	oasisGrpc "github.com/oasisprotocol/oasis-core/go/common/grpc"
	consensus "github.com/oasisprotocol/oasis-core/go/consensus/api"
)

func exitErr(err error) {
	if err != nil {
		fmt.Println("err:", err)
		debug.PrintStack()
		os.Exit(1)
	}
}

type basicAuth struct {
	username string
	password string
}

func (b basicAuth) GetRequestMetadata(ctx context.Context, in ...string) (map[string]string, error) {
	auth := b.username + ":" + b.password
	enc := base64.StdEncoding.EncodeToString([]byte(auth))
	return map[string]string{
		"authorization": "Basic " + enc,
	}, nil
}

func (basicAuth) RequireTransportSecurity() bool {
	return true
}

func main() {
	var utcZone = time.FixedZone("UTC", 0)
	time.Local = utcZone

	// Parse options
	grpcAddress, grpcTlsMode := "unix:/storage/node/internal.sock", "off" // pod:oasis-client
	// grpcAddress, grpcTlsMode := "127.0.0.1:9000", "off" // pod:nginx-proxy
	// grpcAddress, grpcTlsMode := "mainnet-grpc-v2202-oasis-client.mainnet.svc:9000", "off" // svc
	// grpcAddress, grpcTlsMode := "mainnet-grpc-v2202.gcp-stag-usw2.local:443", "ca" // ing-local
	// grpcAddress, grpcTlsMode := "testnet-grpc-v2202.oasiscloud-staging.net:443", "system" // ing
	grpcTlsCustomCa := "ca.crt"
	grpcUsername := ""
	grpcPassword := ""
	switch len(os.Args) {
	case 5:
		grpcUsername = os.Args[3]
		grpcPassword = os.Args[4]
		fallthrough
	case 3:
		grpcAddress = os.Args[1]
		grpcTlsMode = os.Args[2]
	}

	grpcDialOpts := []grpc.DialOption{}
	switch grpcTlsMode {
	case "off": // non-encrypted connections
		grpcDialOpts = append(grpcDialOpts, grpc.WithInsecure())
	case "off-alt": // non-encrypted connections (alternative)
		grpcDialOpts = append(grpcDialOpts, grpc.WithTransportCredentials(insecure.NewCredentials()))
	case "insecure": // TLS without certificate checks
		grpcDialOpts = append(grpcDialOpts, grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{InsecureSkipVerify: true})))
	case "ca": // TLS with custom CA
		certCreds, _ := credentials.NewClientTLSFromFile(grpcTlsCustomCa, "")
		grpcDialOpts = append(grpcDialOpts, grpc.WithTransportCredentials(certCreds))
		// Alternative: TLS with custom CA
		// caPem, _ := ioutil.ReadFile(grpcTlsCustomCa)
		// certPool := x509.NewCertPool()
		// err := certPool.AppendCertsFromPEM(caPem)
		// grpcDialOpts = append(grpcDialOpts, grpc.WithTransportCredentials(credentials.NewClientTLSFromCert(certPool, "")))
	case "system": // TLS with system certificates and TLS v1.2
		grpcDialOpts = append(grpcDialOpts, grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{MinVersion: tls.VersionTLS12})))
	case "system-alt": // TLS with system certificates (alternative)
		certPool, _ := x509.SystemCertPool()
		grpcDialOpts = append(grpcDialOpts, grpc.WithTransportCredentials(credentials.NewClientTLSFromCert(certPool, "")))
	}
	if grpcUsername != "" {
		grpcDialOpts = append(grpcDialOpts, grpc.WithPerRPCCredentials(basicAuth{
			username: grpcUsername,
			password: grpcPassword,
		}))
	}

	// Connect to gRPC
	ctx := context.Background()
	fmt.Println("connect to", grpcAddress)
	nodeConn, err := oasisGrpc.Dial(grpcAddress, grpcDialOpts...)
	exitErr(err)
	cons := consensus.NewConsensusClient(nodeConn)

	// Get chain context
	chainContext, err := cons.GetChainContext(ctx)
	exitErr(err)
	fmt.Println("chain context", chainContext)

	// Get node status (genesis height and latest height)
	status, err := cons.GetStatus(ctx)
	exitErr(err)
	fmt.Println("latest height", status.LatestHeight)
	fmt.Println("node status", status)

	// Dumping consensus state to genesis file
	doc, err := cons.StateToGenesis(ctx, status.LatestHeight)
	exitErr(err)
	canonJSON, err := doc.CanonicalJSON()
	exitErr(err)
	fmt.Println("genesis dump", canonJSON[:200], "...")
}
Update regarding timeouts: I have been trying all possible things and it seems there is a gRPC-specific issue in the ingress-nginx-controller itself. Details:
- /etc/nginx/nginx.conf of the ingress-nginx-controller.ingress-nginx-tmp (ingress class nginx-tmp) in gcp/staging/us-west2: tried everything with it.
- With verbose logging (--v=5), the output indicates that an internal delete timer triggers the closing of the connection:
2022/11/23 11:24:48 [debug] 27#27: *405 event timer del: 19: 1885333117
2022/11/23 11:24:48 [debug] 27#27: *405 http upstream request: "/oasis-core.Consensus/StateToGenesis?"
2022/11/23 11:24:48 [debug] 27#27: *405 http upstream process header
2022/11/23 11:24:48 [debug] 27#27: *405 http next upstream, 4
2022/11/23 11:24:48 [debug] 27#27: *405 free keepalive peer
2022/11/23 11:24:48 [debug] 27#27: *405 lua balancer free peer, tries: 2
2022/11/23 11:24:48 [error] 27#27: *405 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 10.20.3.234, server: tmp-mainnet-grpc-v2202.gcp-stag-usw2.local, request: "POST /oasis-core.Consensus/StateToGenesis HTTP/2.0", upstream: "grpc://10.21.2.52:9000", host: "tmp-mainnet-grpc-v2202.gcp-stag-usw2.local:443"
2022/11/23 11:24:48 [debug] 27#27: *405 finalize http upstream request: 504
10.20.3.234 - - [23/Nov/2022:11:24:48 +0000] "POST /oasis-core.Consensus/StateToGenesis HTTP/2.0" 504 160 "-" "grpc-go/1.46.2" 43 60.001 [mainnet-mainnet-grpc-v2202-oasis-client-oasis-client] [] 10.21.2.52:9000 0 60.001 504 25c4e7891d507ed6b3e4e0c411bd4a09
2022/11/23 11:24:48 [debug] 27#27: *405 finalize grpc request
2022/11/23 11:24:48 [debug] 27#27: *405 close http upstream connection: 19
- timeoutpy (produces output after 90s, also 10min; tested with curl): set up a K8s Ingress with an HTTP backend, and everything works there, even with TLS termination.
- timeoutgo (produces output after 90s; tested with grpcurl): it seems to just hang, no timeout, but also no response... In progress...

I came out of the rabbit hole. Contrary to the docs, the Nginx server does not apply the proxy_read_timeout setting when gRPC connections are being used, and consequently the gRPC parameters must be set explicitly. It seems that incoming gRPC connections and backend gRPC connections do not share the settings. Already applied PR https://github.com/oasislabs/private-ops/pull/2601.
## in `gcp/production/us-west2`
$ time GODEBUG=http2debug=2 ./grpc-client mainnet-grpc-v2202.gcp-prod-usw2.local:443 ca
...
genesis dump [123 10 32 32 34 104 101 105 103 104 116 34 58 32 49 49 51 52 57 56 49 49 44 10 32 32 34 103 101 110 101 115 105 115 95 116 105 109 101 34 58 32 34 50 48 50 50 45 49 49 45 50 51 84 49 56 58 48 53 58 53 51 90 34 44 10 32 32 34 99 104 97 105 110 95 105 100 34 58 32 34 111 97 115 105 115 45 51 34 44 10 32 32 34 114 101 103 105 115 116 114 121 34 58 32 123 10 32 32 32 32 34 112 97 114 97 109 115 34 58 32 123 10 32 32 32 32 32 32 34 103 97 115 95 99 111 115 116 115 34 58 32 123 10 32 32 32 32 32 32 32 32 34 100 101 114 101 103 105 115 116 101 114 95 101 110 116 105 116 121 34 58 32 49 48 48 48 44 10 32 32 32 32 32 32 32 32 34 114 101 103 105 115 116 101 114 95 101 110 116] ...
real 21m57.070s
user 0m4.133s
sys 0m1.432s
Periodically run the tests/genesis test (which takes a snapshot of the node state and diffs it against the indexer DB; sadly only for consensus state). The test occupies the node for 5-10 minutes, according to Nikhil. First see (with a local node?) if we can do this against the regular node, or if this is too disruptive. If it is disruptive, maybe let's still do it the easy way (= same node), but only once a week or once a day, in the quiet hours.
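A sketch of what the periodic variant could look like (the interval and per-run budget are placeholders, and runGenesisTest stands in for whatever ends up invoking tests/genesis):

package main

import (
	"context"
	"log"
	"time"
)

// runGenesisTest is a placeholder for the code that runs the tests/genesis
// comparison between the node state and the indexer DB.
func runGenesisTest(ctx context.Context) error {
	_ = ctx
	return nil
}

func main() {
	const interval = 24 * time.Hour       // placeholder: once a day, in quiet hours
	const perRunBudget = 30 * time.Minute // placeholder: comfortably above the observed 5-10 min

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		ctx, cancel := context.WithTimeout(context.Background(), perRunBudget)
		if err := runGenesisTest(ctx); err != nil {
			log.Printf("genesis check failed: %v", err)
		}
		cancel()
		<-ticker.C
	}
}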