Closed siegfriedweber closed 8 months ago
Start sequence of a NameNode:
server.Server (Server.java:doStart(415)) - Started @3257ms
⇒ The RPC port 8020 is a good indicator for the readiness of a NameNode but not its liveness. The HTTP port 9870 should be used instead.
For the JournalNode, the RPC port 8485 is used for the liveness probe. This does not seem to be a problem because it is bound right after the HTTP port 8480, and no communication to other nodes is needed. However, the liveness probe could also be changed to the HTTP port.
Both ports were available after 8 seconds in a test:
server.Server (Server.java:doStart(415)) - Started @7574ms
For the DataNode, the IPC port 9867 is used for the liveness probe. This also does not pose a problem because the IPC port is bound right after the HTTP port 9864, and both ports are available before the DataNode tries to connect to the NameNode.
The startup took about 8 seconds in a test.
The thresholds should not be changed, but the liveness probes should use the HTTP/HTTPS ports instead of the RPC/IPC ports.
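To make the proposal concrete, here is a minimal sketch of a TCP check that always targets the HTTP port of a role. The port numbers are the ones quoted in this issue; the names `HDFS_PORTS`, `liveness_port`, and `tcp_port_open` are hypothetical helpers for illustration and not part of the operator. HTTPS ports are left out because they are not listed here.

```python
import socket

# Ports as quoted in this issue (HTTPS ports are not listed there).
HDFS_PORTS = {
    "namenode": {"http": 9870, "rpc": 8020},
    "journalnode": {"http": 8480, "rpc": 8485},
    "datanode": {"http": 9864, "ipc": 9867},
}


def liveness_port(role: str) -> int:
    """Return the port proposed for the liveness probe: the HTTP port."""
    return HDFS_PORTS[role]["http"]


def tcp_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

For example, `tcp_port_open("localhost", liveness_port("namenode"))` would check the NameNode HTTP port, which is bound early during startup, whereas the RPC port 8020 is better suited to a readiness check.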
@lfrancke, what do you think?
The current liveness probes require the pods to start up within 40 seconds:
Large HDFS clusters could require more time. In this case, pods are terminated too early, enter a startup loop, and the cluster never reaches a working state.
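As a rough illustration of where such a limit comes from, the time a pod has before a failing liveness probe triggers a restart can be approximated from the probe settings. The sketch below uses the common approximation of initial delay plus failure threshold times period; the function name and the example values are assumptions chosen only so that the result matches the 40 seconds mentioned above, and they are not taken from the operator's actual configuration.

```python
def startup_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate time a pod has to start before the liveness probe restarts it.

    Rough upper bound: the first probe runs after `initial_delay`, and the
    container is restarted after `failure_threshold` consecutive failures
    spaced `period` seconds apart. Probe timeouts and jitter are ignored.
    """
    return initial_delay + failure_threshold * period


# Hypothetical probe settings (assumed, not read from the operator):
budget = startup_budget_seconds(initial_delay=10, period=10, failure_threshold=3)
print(budget)  # 40
```

A NameNode or DataNode whose probed port is bound later than this budget would be restarted before it finishes starting, which is the startup loop described above.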