Some relevant info from Slack:
1) This operation is costly. I ran it 20 minutes ago and it just finished. My steps:
kubectl -n mainnet exec -i -t mainnet-all-v2202-oasis-client-0 -- bash
oasis-node control status -a unix:/storage/node/internal.sock | grep latest_height
time oasis-node genesis dump --height 10957153 --genesis.file /tmp/mydump.json -a unix:/storage/node/internal.sock
real 18m10.842s
and the resulting file is 96MB in size (60MB with no pretty-printing). The success here does mean that the UNIX socket doesn't impose timeouts.
2) Might be a gRPC timeout: the gRPC server we want seems to start here, and it fails to explicitly set a timeout, either here or 25 lines higher (not sure which of the two we end up using). That said, Andy reports things timing out after 70-90 seconds.
3) @gw0 fixed nginx timeouts, also with no observable change in behavior. Quote: There are basically two Nginx instances between the endpoint and oasis-node. :slightly_smiling_face: Ingress is implemented with the nginx-ingress-controller, which has a timeout (overridable with annotations, like https://github.com/oasislabs/private-charts/blob/master/charts/parcel-gateway/templates/ingress.yaml#L18-L20, docs https://docs.nginx.com/nginx-ingress-controller/configuration/ingress-resources/advanced-configuration-with-annotations/#summary-of-annotations); it sends traffic to the K8s Service. To bypass it, you can port-forward to the K8s Service. But there is another Nginx in a container inside the charts/oasis-client; its purpose is to proxy TCP/HTTP traffic to the internal.sock UNIX socket. This one also sets some timeouts, see https://github.com/oasislabs/private-charts/blob/master/charts/oasis-client/templates/nginx_config.yaml. [...] increased both Ingress and internal nginx-proxy timeouts to 1 hour (https://github.com/oasislabs/private-ops/pull/2592/commits/9e9cf11e2629cc097ada23c5d8cf442b5a77cfd7), already applied to new *-all nodes.
4) nginx-proxy logs contain an error line with HTTP 499, which means that the client (!) closed the connection while the server was still processing it. (See the sketch after these log lines for one way a client can end up doing that.)
(k8s: gcp/production/us-west2) root@ca4d2e1059f2 /workdir# kubectl -n mainnet logs pods/mainnet-grpc-v2202-oasis-client-0 --container nginx-proxy | grep -v status:200
remote_addr:10.31.3.33 time_local:28/Oct/2022:19:23:03 +0000 host:upstream_balancer status:499 bytes_sent:0 useragent:grpc-go/1.48.0 request_time:0.006
remote_addr:10.31.3.33 time_local:28/Oct/2022:19:23:27 +0000 host:upstream_balancer status:499 bytes_sent:0 useragent:grpc-go/1.48.0 request_time:0.002
remote_addr:10.31.3.33 time_local:28/Oct/2022:19:23:56 +0000 host:upstream_balancer status:499 bytes_sent:0 useragent:grpc-go/1.48.0 request_time:0.001
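For context on item 4: nginx logs 499 whenever the caller hangs up before the response is finished; with gRPC that typically means the client cancelled the call or hit its own context deadline. A minimal sketch of that client-side behaviour (not the indexer's actual code; the address and the 60-second deadline are only illustrative):

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"

	oasisGrpc "github.com/oasisprotocol/oasis-core/go/common/grpc"
	consensus "github.com/oasisprotocol/oasis-core/go/consensus/api"
)

func main() {
	// Illustrative address: the nginx-proxy sidecar in front of internal.sock.
	nodeConn, err := oasisGrpc.Dial("127.0.0.1:9000", grpc.WithInsecure())
	if err != nil {
		panic(err)
	}
	cons := consensus.NewConsensusClient(nodeConn)

	status, err := cons.GetStatus(context.Background())
	if err != nil {
		panic(err)
	}

	// A client-side deadline shorter than the ~20 min the dump takes makes grpc-go
	// give up with DeadlineExceeded; the proxy only sees the client hang up and logs 499.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second) // illustrative value
	defer cancel()
	if _, err := cons.StateToGenesis(ctx, status.LatestHeight); err != nil {
		fmt.Println("err:", err)
	}
}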
@gw0 do you have time to pair on this with @Andrew7234 ? He is currently stuck.
We don't have a node up in staging or prod right now, but you can deploy a fresh one easily; I hope/think its state will be immediately large enough to time out for StateToGenesis calls.
There are many "dump" commands available in oasis-node: https://github.com/oasisprotocol/oasis-core/blob/master/go/oasis-node/cmd/debug/dumpdb/dumpdb.go
Dumping a code snippet here in order to not waste time while debugging a specific gRPC call....
package main

import (
	"context"
	"fmt"
	"os"
	"runtime/debug"

	"google.golang.org/grpc"

	oasisGrpc "github.com/oasisprotocol/oasis-core/go/common/grpc"
	consensus "github.com/oasisprotocol/oasis-core/go/consensus/api"
)

const (
	//grpcAddress = "unix:/storage/node/internal.sock"
	grpcAddress = "127.0.0.1:9000"
)

func exitErr(err error) {
	if err != nil {
		fmt.Println("err:", err)
		debug.PrintStack()
		os.Exit(1)
	}
}

func main() {
	// Connect to gRPC
	ctx := context.Background()
	fmt.Println("connect to", grpcAddress)
	nodeConn, err := oasisGrpc.Dial(grpcAddress, grpc.WithInsecure())
	exitErr(err)
	cons := consensus.NewConsensusClient(nodeConn)

	// Get node status (genesis height and latest height)
	status, err := cons.GetStatus(ctx)
	exitErr(err)
	fmt.Println("latest height", status.LatestHeight)

	// Dumping consensus state to genesis file
	doc, err := cons.StateToGenesis(ctx, status.LatestHeight)
	exitErr(err)
	canonJSON, err := doc.CanonicalJSON()
	exitErr(err)
	fmt.Println("genesis dump", canonJSON[:200], "...")
}
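(If you want the same artifact as the oasis-node genesis dump run above, the snippet can also write the document to a file instead of printing a prefix; a two-line addition at the end of main, path is arbitrary:)

	// Optionally persist the dump, mirroring --genesis.file from the CLI run above.
	err = os.WriteFile("/tmp/mydump.json", canonJSON, 0o644)
	exitErr(err)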
I used the above code snippet to perform an experiment inside pod/mainnet-grpc-v2202-oasis-client-tmp-0 in gcp/staging/us-west2. Everything worked when connecting directly to the UNIX socket (unix:/storage/node/internal.sock), but it timed out after 60 sec when connecting to the nginx-proxy sidecar container (127.0.0.1:9000).
$ kmn logs pod/mainnet-grpc-v2202-oasis-client-tmp-0 -c nginx-proxy
2022/11/11 10:44:25 [error] 7#7: *6 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: , request: "POST /oasis-core.Consensus/StateToGenesis HTTP/2.0", upstream: "grpc://unix:/storage/node/internal.sock", host: "127.0.0.1:9000"
remote_addr:127.0.0.1 time_local:11/Nov/2022:10:44:25 +0000 host:127.0.0.1 status:504 bytes_sent:167 useragent:grpc-go/1.46.2 request_time:60.060
It turned out the timeout configuration in nginx-proxy did not work as expected (it configured HTTP proxy timeouts instead of gRPC proxy timeouts). Manually tested that the fix in PR https://github.com/oasislabs/private-charts/pull/544 works.
Tried re-running the indexer healthcheck and it's still timing out with 504 errors after ~70s:
genesis_test.go:228: staking genesis error: rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 504 (Gateway Timeout); transport: received unexpected content-type "text/html"
genesis_test.go:229:
Error Trace: /oasis-indexer/tests/genesis/genesis_test.go:229
Error: Expected nil, but got: &status.Error{s:(*status.Status)(0xc0001180e0)}
Test: TestGenesisFull
--- FAIL: TestGenesisFull (69.21s)
Gregor's tests indicate that the oasis-client nginx instance is working correctly. I thought that the issue might be the other nginx instance, for the ingress; but the docs indicate that setting the proxy-read-timeout to 3600s should also set the grpc-read-timeout.
Bummer. And very surprising to see Gregor's results, given your documentation find ...
Some more tests come to mind:
Ah, also: the error you show is 504 (Gateway Timeout); transport: received unexpected content-type "text/html", but I expect this change should stop text/html responses from nginx. Either that config change doesn't do what I expected it would do, or maybe you're running against a non-updated node? Note that the -tmp deployment is not updated yet.
Hm, I was testing against the ingress mainnet-grpc-v2202.gcp-prod-usw2.local:443; perhaps it was pointing to the tmp deployment? Will investigate further.
Can you repro Gregor's success at a lower height?
This shouldn't be a factor currently because the indexer is synced and thus the healthcheck tries to fetch the state at a recent height.
Can you repro success using Gregor's script from the indexer container?
Tried from the production indexer container: was unable to connect. Will keep digging
root@mainnet-oasis-indexer-c65bd9bbf-hkdvf:/go/src# ./node-test
connect to mainnet-grpc-v2202.gcp-prod-usw2.local:443
err: rpc error: code = Unavailable desc = connection closed before server preface received
goroutine 1 [running]:
runtime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
/usr/local/go/src/runtime/debug/stack.go:16 +0x19
main.exitErr({0xbb77a0?, 0xc0000ac038?})
/go/src/test.go:22 +0x85
main.main()
/go/src/test.go:37 +0x135
Edit:
Doesn't look like it was pointing to the tmp deployment when I tested this morning.
Maybe Gregor's successful test just happened to take less than a minute.
No, the connection to 127.0.0.1:9000 (pod:nginx-proxy) took quite a while to dump the genesis file. I just reconfirmed that it also does not time out on gcp/prod/usw2 (yesterday I was testing on gcp/stag/usw2).
Let me continue debugging on gcp/prod/usw2 from inside the mainnet-grpc-v2202-oasis-client-0 pod (compiled binaries in /tmp) with the following gRPC addresses:
grpcAddress = "unix:/storage/node/internal.sock" // pod:oasis-client -- OK
grpcAddress = "127.0.0.1:9000" // pod:nginx-proxy -- OK
grpcAddress = "mainnet-grpc-v2202-oasis-client.mainnet.svc:9000" // svc -- OK
grpcAddress = "10.30.32.2:443" // ing-ip -- "connection closed before server preface received"
grpcAddress = "mainnet-grpc-v2202.gcp-prod-usw2.local:443" // ing -- "connection closed before server preface received"
Ingress controller nginx-private logs are full of:
$ k logs pod/nginx-ingress-nginx-private-controller-55f57d4bc-v54ph
2022/11/15 14:16:06 [error] 747#747: *14721058 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: 10.30.0.38, server: mainnet-grpc-v2202.gcp-prod-usw2.local, request: "POST /oasis-core.Consensus/GetBlock HTTP/2.0", upstream: "grpcs://10.31.3.32:9000", host: "mainnet-grpc-v2202.gcp-prod-usw2.local:443"
nginx-ingress-nginx-private-controller-55f57d4bc-v54ph is using nginx/1.19.1 (from 2018... gulp). The mainnet-grpc-v2202-oasis-client-0 pod is using nginx/1.23.2. gRPC support was added in Nginx 1.13.10 (https://www.nginx.com/blog/nginx-1-13-10-grpc/) and TLSv1.3 was added in Nginx 1.13 (https://www.howtoforge.com/how-to-enable-tls-13-in-nginx/), so this should not be the reason.
@ravenac95 Some time ago, we decided to use Kong ingress controller for gRPC calls. Was this due to gRPC issues with Nginx ingress controller?
Hrm. So we used kong on parcel for reasons unrelated to grpc at the time. We do, however, use envoy for the grpc frontend on our public grpc endpoints. It just had the features we needed. Envoy was more purposefully built with grpc as an intended backend. Perhaps it would behoove us to use that internally for this.
@Andrew7234 In the following days I will experimentally deploy a newer version of the Nginx ingress controller, so that we can see if version 1.19.1 is causing issues. In the meantime, I believe it is best to use the K8s Service directly. So, please switch the tests to using mainnet-grpc-v2202-oasis-client.mainnet.svc:9000.
Thank you. @Andrew7234 note that the Service (:9000) is not encrypted, and depending on which oasis-core API you use to connect, the connection function might assume that everything except unix sockets has encrypted traffic. So if you get errors related to TLS, you'll have to use the lower-level oasisGrpc.Dial() from Gregor's example above, or adapt in some other way.
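For example, a plaintext dial against the Service looks roughly like this (a sketch based on Gregor's first snippet; only the address and the explicit insecure credentials differ):

package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	oasisGrpc "github.com/oasisprotocol/oasis-core/go/common/grpc"
	consensus "github.com/oasisprotocol/oasis-core/go/consensus/api"
)

func main() {
	// The K8s Service exposes plaintext gRPC on :9000, so dial without TLS.
	nodeConn, err := oasisGrpc.Dial(
		"mainnet-grpc-v2202-oasis-client.mainnet.svc:9000",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		panic(err)
	}
	cons := consensus.NewConsensusClient(nodeConn)

	status, err := cons.GetStatus(context.Background())
	if err != nil {
		panic(err)
	}
	fmt.Println("latest height", status.LatestHeight)
}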
Still not sure what is going on with "connection closed before server preface received" (maybe an Nginx version issue, or it expects a TLS/non-TLS connection)... But here is confirmation that, regarding the 60s timeout, we indeed misconfigured the Ingress (verified by looking at the nginx-private ingress controller's internal configuration):
## before with original Ingress annotations
$ kmn describe ing/mainnet-grpc-v2202-oasis-client
nginx.ingress.kubernetes.io/proxy-connect-timeout: 3600s
nginx.ingress.kubernetes.io/proxy-read-timeout: 3600s
nginx.ingress.kubernetes.io/proxy-send-timeout: 3600s
$ k exec -it pod/nginx-ingress-nginx-private-controller-55f57d4bc-v54ph -- cat /etc/nginx/nginx.conf | grep -A200 '## start server mainnet-grpc-v2202.gcp-prod-usw2.local' | grep -B200 '## end server mainnet-grpc-v2202.gcp-prod-usw2.local'
proxy_connect_timeout 5s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;
## after removing the "s" in Ingress annotations
$ kmn describe ing/mainnet-grpc-v2202-oasis-client
nginx.ingress.kubernetes.io/proxy-connect-timeout: 3600
nginx.ingress.kubernetes.io/proxy-read-timeout: 3600
nginx.ingress.kubernetes.io/proxy-send-timeout: 3600
$ k exec -it pod/nginx-ingress-nginx-private-controller-55f57d4bc-v54ph -- cat /etc/nginx/nginx.conf | grep -A200 '## start server mainnet-grpc-v2202.gcp-prod-usw2.local' | grep -B200 '## end server mainnet-grpc-v2202.gcp-prod-usw2.local'
proxy_connect_timeout 3600s;
proxy_send_timeout 3600s;
proxy_read_timeout 3600s;
Dumping the code snippet used for debugging the timeout issue (it takes the gRPC address and TLS mode, plus optional basic-auth credentials, as command-line arguments):
package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"encoding/base64"
	"fmt"
	"os"
	"runtime/debug"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/credentials/insecure"

	oasisGrpc "github.com/oasisprotocol/oasis-core/go/common/grpc"
	consensus "github.com/oasisprotocol/oasis-core/go/consensus/api"
)

func exitErr(err error) {
	if err != nil {
		fmt.Println("err:", err)
		debug.PrintStack()
		os.Exit(1)
	}
}

type basicAuth struct {
	username string
	password string
}

func (b basicAuth) GetRequestMetadata(ctx context.Context, in ...string) (map[string]string, error) {
	auth := b.username + ":" + b.password
	enc := base64.StdEncoding.EncodeToString([]byte(auth))
	return map[string]string{
		"authorization": "Basic " + enc,
	}, nil
}

func (basicAuth) RequireTransportSecurity() bool {
	return true
}

func main() {
	var utcZone = time.FixedZone("UTC", 0)
	time.Local = utcZone

	// Parse options
	grpcAddress, grpcTlsMode := "unix:/storage/node/internal.sock", "off" // pod:oasis-client
	// grpcAddress, grpcTlsMode := "127.0.0.1:9000", "off" // pod:nginx-proxy
	// grpcAddress, grpcTlsMode := "mainnet-grpc-v2202-oasis-client.mainnet.svc:9000", "off" // svc
	// grpcAddress, grpcTlsMode := "mainnet-grpc-v2202.gcp-stag-usw2.local:443", "ca" // ing-local
	// grpcAddress, grpcTlsMode := "testnet-grpc-v2202.oasiscloud-staging.net:443", "system" // ing
	grpcTlsCustomCa := "ca.crt"
	grpcUsername := ""
	grpcPassword := ""
	switch len(os.Args) {
	case 5:
		grpcUsername = os.Args[3]
		grpcPassword = os.Args[4]
		fallthrough
	case 3:
		grpcAddress = os.Args[1]
		grpcTlsMode = os.Args[2]
	}

	grpcDialOpts := []grpc.DialOption{}
	switch grpcTlsMode {
	case "off": // non-encrypted connections
		grpcDialOpts = append(grpcDialOpts, grpc.WithInsecure())
	case "off-alt": // non-encrypted connections (alternative)
		grpcDialOpts = append(grpcDialOpts, grpc.WithTransportCredentials(insecure.NewCredentials()))
	case "insecure": // TLS without certificate checks
		grpcDialOpts = append(grpcDialOpts, grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{InsecureSkipVerify: true})))
	case "ca": // TLS with custom CA
		certCreds, _ := credentials.NewClientTLSFromFile(grpcTlsCustomCa, "")
		grpcDialOpts = append(grpcDialOpts, grpc.WithTransportCredentials(certCreds))
		// Alternative: TLS with custom CA
		// caPem, _ := ioutil.ReadFile(grpcTlsCustomCa)
		// certPool := x509.NewCertPool()
		// err := certPool.AppendCertsFromPEM(caPem)
		// grpcDialOpts = append(grpcDialOpts, grpc.WithTransportCredentials(credentials.NewClientTLSFromCert(certPool, "")))
	case "system": // TLS with system certificates and TLS v1.2
		grpcDialOpts = append(grpcDialOpts, grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{MinVersion: tls.VersionTLS12})))
	case "system-alt": // TLS with system certificates (alternative)
		certPool, _ := x509.SystemCertPool()
		grpcDialOpts = append(grpcDialOpts, grpc.WithTransportCredentials(credentials.NewClientTLSFromCert(certPool, "")))
	}
	if grpcUsername != "" {
		grpcDialOpts = append(grpcDialOpts, grpc.WithPerRPCCredentials(basicAuth{
			username: grpcUsername,
			password: grpcPassword,
		}))
	}

	// Connect to gRPC
	ctx := context.Background()
	fmt.Println("connect to", grpcAddress)
	nodeConn, err := oasisGrpc.Dial(grpcAddress, grpcDialOpts...)
	exitErr(err)
	cons := consensus.NewConsensusClient(nodeConn)

	// Get chain context
	chainContext, err := cons.GetChainContext(ctx)
	exitErr(err)
	fmt.Println("chain context", chainContext)

	// Get node status (genesis height and latest height)
	status, err := cons.GetStatus(ctx)
	exitErr(err)
	fmt.Println("latest height", status.LatestHeight)
	fmt.Println("node status", status)

	// Dumping consensus state to genesis file
	doc, err := cons.StateToGenesis(ctx, status.LatestHeight)
	exitErr(err)
	canonJSON, err := doc.CanonicalJSON()
	exitErr(err)
	fmt.Println("genesis dump", canonJSON[:200], "...")
}
Update regarding timeouts: I have been trying all possible things and it seems there is a gRPC-specific issue in the ingress-nginx-controller itself. Details:
- /etc/nginx/nginx.conf of the ingress-nginx-controller.ingress-nginx-tmp (ingress class nginx-tmp) in gcp/staging/us-west2: tried everything with it.
- With verbose logging (--v=5), the output indicates that an internal delete timer triggers the closing of the connection:
2022/11/23 11:24:48 [debug] 27#27: *405 event timer del: 19: 1885333117
2022/11/23 11:24:48 [debug] 27#27: *405 http upstream request: "/oasis-core.Consensus/StateToGenesis?"
2022/11/23 11:24:48 [debug] 27#27: *405 http upstream process header
2022/11/23 11:24:48 [debug] 27#27: *405 http next upstream, 4
2022/11/23 11:24:48 [debug] 27#27: *405 free keepalive peer
2022/11/23 11:24:48 [debug] 27#27: *405 lua balancer free peer, tries: 2
2022/11/23 11:24:48 [error] 27#27: *405 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 10.20.3.234, server: tmp-mainnet-grpc-v2202.gcp-stag-usw2.local, request: "POST /oasis-core.Consensus/StateToGenesis HTTP/2.0", upstream: "grpc://10.21.2.52:9000", host: "tmp-mainnet-grpc-v2202.gcp-stag-usw2.local:443"
2022/11/23 11:24:48 [debug] 27#27: *405 finalize http upstream request: 504
10.20.3.234 - - [23/Nov/2022:11:24:48 +0000] "POST /oasis-core.Consensus/StateToGenesis HTTP/2.0" 504 160 "-" "grpc-go/1.46.2" 43 60.001 [mainnet-mainnet-grpc-v2202-oasis-client-oasis-client] [] 10.21.2.52:9000 0 60.001 504 25c4e7891d507ed6b3e4e0c411bd4a09
2022/11/23 11:24:48 [debug] 27#27: *405 finalize grpc request
2022/11/23 11:24:48 [debug] 27#27: *405 close http upstream connection: 19
- timeoutpy (produces output after 90s, also 10min; tested with curl): set up a K8s Ingress with an HTTP backend, and everything works there, even with TLS termination.
- timeoutgo (produces output after 90s; tested with grpcurl): it seems to just hang, no timeout, but also no response... In progress...

I came out of the rabbit hole. Contrary to the docs, the Nginx server does not apply the proxy_read_timeout setting when gRPC connections are being used, and consequently the gRPC parameters must be set explicitly. It seems that incoming gRPC connections and backend gRPC connections do not share the settings. Already applied PR https://github.com/oasislabs/private-ops/pull/2601.
## in `gcp/production/us-west2`
$ time GODEBUG=http2debug=2 ./grpc-client mainnet-grpc-v2202.gcp-prod-usw2.local:443 ca
...
genesis dump [123 10 32 32 34 104 101 105 103 104 116 34 58 32 49 49 51 52 57 56 49 49 44 10 32 32 34 103 101 110 101 115 105 115 95 116 105 109 101 34 58 32 34 50 48 50 50 45 49 49 45 50 51 84 49 56 58 48 53 58 53 51 90 34 44 10 32 32 34 99 104 97 105 110 95 105 100 34 58 32 34 111 97 115 105 115 45 51 34 44 10 32 32 34 114 101 103 105 115 116 114 121 34 58 32 123 10 32 32 32 32 34 112 97 114 97 109 115 34 58 32 123 10 32 32 32 32 32 32 34 103 97 115 95 99 111 115 116 115 34 58 32 123 10 32 32 32 32 32 32 32 32 34 100 101 114 101 103 105 115 116 101 114 95 101 110 116 105 116 121 34 58 32 49 48 48 48 44 10 32 32 32 32 32 32 32 32 34 114 101 103 105 115 116 101 114 95 101 110 116] ...
real 21m57.070s
user 0m4.133s
sys 0m1.432s
Periodically run the tests/genesis test (which takes a snapshot of the node state and diffs it against the indexer DB; sadly only for consensus state). The test occupies the node for 5-10 minutes, according to Nikhil. First see (with a local node?) if we can do this against the regular node, or if this is too disruptive. If it is disruptive, maybe let's still do it the easy way (= same node), but only once a week or once a day, in the quiet hours.
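A sketch of what the periodic variant could look like (the interval and per-run budget are placeholders, and runGenesisTest stands in for whatever ends up invoking tests/genesis):

package main

import (
	"context"
	"log"
	"time"
)

// runGenesisTest is a placeholder for the code that runs the tests/genesis
// comparison between the node state and the indexer DB.
func runGenesisTest(ctx context.Context) error {
	_ = ctx
	return nil
}

func main() {
	const interval = 24 * time.Hour       // placeholder: once a day, in quiet hours
	const perRunBudget = 30 * time.Minute // placeholder: comfortably above the observed 5-10 min

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		ctx, cancel := context.WithTimeout(context.Background(), perRunBudget)
		if err := runGenesisTest(ctx); err != nil {
			log.Printf("genesis check failed: %v", err)
		}
		cancel()
		<-ticker.C
	}
}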