stackabletech / hdfs-operator

Apache Hadoop HDFS operator for Stackable

Adjust the liveness probes to allow longer startup times #488

Closed siegfriedweber closed 8 months ago

siegfriedweber commented 8 months ago

The current liveness probes require the pods to start up within 40 seconds:

livenessProbe:
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    port: rpc
  timeoutSeconds: 1

Large HDFS clusters could require more time. In this case, pods are terminated too early, enter a startup loop, and the cluster never reaches a working state.

siegfriedweber commented 8 months ago

NameNode

Start sequence of a NameNode:

⇒ The RPC port 8020 is a good indicator for the readiness of a NameNode but not its liveness. The HTTP port 9870 should be used instead.
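
As a rough sketch of that split for the NameNode, assuming the container exposes both ports and keeping the existing thresholds: the port numbers come from the note above, while the `readinessProbe` block and its timing values are illustrative assumptions, not the operator's actual configuration.

```yaml
# Sketch only, not the operator's generated manifest.
livenessProbe:
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    port: 9870        # NameNode HTTP port
  timeoutSeconds: 1
# Assumption: a readiness probe on the RPC port, since that port is
# described above as a good readiness indicator.
readinessProbe:
  failureThreshold: 3
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    port: 8020        # NameNode RPC port
  timeoutSeconds: 1
```

A failing readiness probe only removes the pod from the Service endpoints, whereas a failing liveness probe restarts the container, which is why the RPC port fits the former better.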

JournalNode

For the JournalNode, the RPC port 8485 is used for the liveness probe. This does not seem to be a problem because it is bound right after the HTTP port 8480, and no communication with other nodes is needed. However, the liveness probe could also be changed to the HTTP port.

Both ports were available after 8 seconds in a test:

server.Server (Server.java:doStart(415)) - Started @7574ms

DataNode

For the DataNode, the IPC port 9867 is used for the liveness probe. This also does not pose a problem because the IPC port is bound right after the HTTP port 9864, and both ports are available before the DataNode tries to connect to the NameNode.

The startup took about 8 seconds in a test.

Proposal

The thresholds should not be changed, but the liveness probes should use the HTTP/HTTPS ports instead of the RPC/IPC ports.
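
A minimal sketch of the proposed change, applied to the probe quoted at the top of this issue. The named port `http` is a hypothetical name (the existing probe only shows `rpc`); the actual port names in the operator may differ, and an HTTPS-enabled cluster would need the corresponding HTTPS port instead.

```yaml
# Sketch only: thresholds unchanged, only the probed port differs.
livenessProbe:
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    # Hypothetical named port; HTTP ports mentioned in this thread:
    # NameNode 9870, JournalNode 8480, DataNode 9864.
    port: http
  timeoutSeconds: 1
```

A `tcpSocket` check only tests that the port accepts connections, so the same probe shape would work whether the web server speaks HTTP or HTTPS.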

@lfrancke, what do you think?