stellar / stellar-rpc

RPC server for Soroban contracts.

Enhance RPC Health Status #148

Open sreuland opened 7 months ago

sreuland commented 7 months ago

What problem does your feature solve?

What would you like to see?

Note: these requirements were taken from the design discussion at https://github.com/stellar/kube/pull/2098#pullrequestreview-2005913742

The new endpoint supports a notion of QoS levels representing the different runtime states that RPC can be in:

level 1 - service is completely unhealthy: the process is running, but ingestion from the network isn't stable yet and it is unable to process requests.
level 2 - service is running and forward ingestion from the network is happening; the data retention window is not fully caught up yet, but it can process some json-rpc request endpoints.
level 3 - service is running, forward ingestion from the network is happening, the data retention window is full, and all rpc request endpoints are up.
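To make the three levels concrete, here is a minimal sketch of how they might be derived from a getHealth result. It assumes the result fields used in the readinessProbe later in this thread (status, latestLedger, oldestLedger, ledgerRetentionWindow). The mapping itself is an assumption for illustration, not a settled design: the names QoSLevel and qos_level are hypothetical, and treating any status other than "healthy" as level 1 is a guess at what "ingestion isn't stable yet" would look like.

```python
from enum import IntEnum


class QoSLevel(IntEnum):
    """Hypothetical names for the three levels described in the issue."""
    UNHEALTHY = 1    # process is up but cannot serve requests
    CATCHING_UP = 2  # ingesting, retention window not yet full
    FULL = 3         # ingesting, retention window full


def qos_level(health: dict, slack: int = 0) -> QoSLevel:
    """Classify a getHealth result dict into one of the three levels.

    `slack` is an assumed tolerance (in ledgers) for calling the window full.
    """
    # Assumption: anything other than "healthy" means ingestion is not stable.
    if health.get("status") != "healthy":
        return QoSLevel.UNHEALTHY
    # Number of ledgers currently retained.
    span = health["latestLedger"] - health["oldestLedger"]
    if span < health["ledgerRetentionWindow"] - slack:
        return QoSLevel.CATCHING_UP
    return QoSLevel.FULL
```

A load balancer could then route all traffic only to level 3 instances, while a level 2 instance might still serve the subset of endpoints that do not depend on the full retention window.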

What alternatives are there?

use the current json-rpc getHealth

overcat commented 2 weeks ago

I strongly support this feature. When I was configuring failover for sorobanrpc.com, I had to write an additional simple API service to proxy the getHealth interface, and then have the health checker access this API service. If soroban-rpc supported direct GET access to the getHealth interface, I wouldn't need an extra API service.

(I'm unsure how many health checkers support posting JSON body during their health checks.)
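For readers wondering what such a workaround involves, below is a rough sketch of a GET-to-POST health proxy using only the Python standard library. It is not the service described above. The names RPC_URL, build_request, status_code_for, HealthProxy and serve are hypothetical, the RPC address is assumed to be the local one used in the probe later in this thread, and mapping "healthy" to HTTP 200 and everything else to 503 is an assumed convention.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: RPC listens locally on port 8000.
RPC_URL = "http://127.0.0.1:8000/"


def build_request(url: str = RPC_URL) -> urllib.request.Request:
    """Build the JSON-RPC POST that a plain GET health checker cannot send."""
    body = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "getHealth"}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_code_for(rpc_response: dict) -> int:
    """Map a getHealth response to an HTTP status (assumed convention)."""
    result = rpc_response.get("result") or {}
    return 200 if result.get("status") == "healthy" else 503


class HealthProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            with urllib.request.urlopen(build_request(), timeout=2) as resp:
                payload = json.load(resp)
            code = status_code_for(payload)
        except (OSError, ValueError, AttributeError):
            # RPC unreachable, or the reply was not a JSON object.
            payload, code = {"error": "rpc unavailable"}, 503
        body = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the proxy until interrupted; the port is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), HealthProxy).serve_forever()
```

Calling serve() and pointing a health checker at http://127.0.0.1:8080/ would then give it a plain GET endpoint whose status code reflects getHealth.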

mollykarcher commented 3 days ago

After @overcat's comment, I'm realizing that the link referenced in the issue description is to an internal repository, which defines the k8s manifests for SDF's service deployments. So it's not broadly public what we do, nor do we publish any recommendations about this anywhere in our docs (which we should fix also!). We currently deploy it via k8s, and define the readinessProbe as follows:

readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        curl -s --location --request POST 'http://127.0.0.1:8000/' \
          --header 'Content-Type: application/json' \
          --data-raw '{
            "jsonrpc": "2.0",
            "id": 10235,
            "method": "getHealth"
          }' | jq -es 'if (. | length) == 0 then null else .[0] end | .result | .status == "healthy" and (.latestLedger - .oldestLedger >= (.ledgerRetentionWindow - 10))' > /dev/null;
  failureThreshold: 1
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 2

So we are effectively parsing out the ledger range in order to determine health of the instance.
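The jq condition in the probe can be easier to follow when spelled out step by step. The sketch below restates it in Python: an empty response fails, the status must be "healthy", and the retained ledger span must be within 10 ledgers of ledgerRetentionWindow. The function name probe_passes and the slack parameter are hypothetical. Returning False for malformed or missing fields is a simplification and may not match jq's behaviour in every edge case.

```python
import json


def probe_passes(curl_output: str, slack: int = 10) -> bool:
    """Return True when the readiness condition from the probe holds."""
    text = curl_output.strip()
    if not text:
        # Mirrors the `length == 0` branch: no response means not ready.
        return False
    try:
        result = json.loads(text).get("result") or {}
    except (json.JSONDecodeError, AttributeError):
        # Not JSON, or not a JSON object (simplification of jq's behaviour).
        return False
    if result.get("status") != "healthy":
        return False
    try:
        # Ledgers currently retained versus the configured window, minus slack.
        span = result["latestLedger"] - result["oldestLedger"]
        return span >= result["ledgerRetentionWindow"] - slack
    except (KeyError, TypeError):
        return False
```

For example, with made-up values latestLedger=1000, oldestLedger=0 and ledgerRetentionWindow=1005, the span of 1000 meets the threshold of 995 and the check passes. With a window of 1011 the threshold becomes 1001 and it fails.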