Node add isn't consistently working

gsstoykov commented 1 month ago

To Reproduce

Follow the initialisation steps from https://github.com/hashgraph/solo/issues/727, then run:

npm run solo -- node add --gossip-keys true --tls-keys true --release-tag v0.54.0-alpha.4 --namespace solo-e2e

Describe the bug

◼ Finalize
node:internal/process/promises:289
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "#<ErrorEvent>".] {
  code: 'ERR_UNHANDLED_REJECTION'
}

Node.js v21.7.1

I've also seen failures with the error from https://github.com/hashgraph/solo/issues/727.
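
The trace above shows only `#<ErrorEvent>` as the rejection reason, which hides the underlying cause. The following is a minimal sketch of a handler that unpacks such a reason into readable text. It assumes the rejected value is an event-style object exposing `type`, `message`, and `error` fields; that is a guess about the object's shape, and `describeRejection` is a hypothetical helper name, not part of Solo.

```javascript
// Sketch: turn an opaque rejection reason (such as "#<ErrorEvent>") into readable text.
// The fields checked here (type, message, error) are assumptions about what an
// ErrorEvent-like object carries; they are not taken from Solo's code.
function describeRejection(reason) {
  if (reason instanceof Error) {
    return reason.stack || reason.message;
  }
  if (reason && typeof reason === 'object') {
    const name = reason.constructor ? reason.constructor.name : 'Object';
    const parts = [name];
    if (reason.type) parts.push(`type=${reason.type}`);
    if (reason.message) parts.push(`message=${reason.message}`);
    // Event-style wrappers often keep the underlying Error in `error`.
    if (reason.error instanceof Error) {
      parts.push(`cause=${reason.error.stack || reason.error.message}`);
    }
    return parts.join(' ');
  }
  return String(reason);
}

// Log the details and mark the run as failed so CI still sees a non-zero exit code.
process.on('unhandledRejection', (reason) => {
  console.error('Unhandled rejection:', describeRejection(reason));
  process.exitCode = 1;
});
```

Once an `unhandledRejection` handler is registered, Node.js no longer aborts on the rejection by default, which is why the sketch sets `process.exitCode` itself.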

Describe the expected behavior

The node is added and functions correctly. The failure does not occur on every run, but the command is not reliable enough for our testing.

Whole JUnit/CLI Logs

npm run solo -- node add --gossip-keys true --tls-keys true --release-tag v0.54.0-alpha.4 --namespace solo-e2e

> @hashgraph/solo@0.31.0 solo
> NODE_OPTIONS=--experimental-vm-modules node --no-deprecation solo.mjs node add --gossip-keys true --tls-keys true --release-tag v0.54.0-alpha.4 --namespace solo-e2e

******************************* Solo *********************************************
Version         : 0.31.0
Kubernetes Context  : kind-solo-e2e
Kubernetes Cluster  : kind-solo-e2e
Kubernetes Namespace    : solo-e2e
**********************************************************************************
✔ Initialize [0.1s]
✔ Check that PVCs are enabled
✔ Identify existing network nodes
  ✔ Check network pod: node1
✔ Determine new node account number
✔ Generate Gossip key [0.3s]
  ✔ Backup old files
  ✔ Gossip key for node: node2 [0.3s]
✔ Generate gRPC TLS key [0.4s]
  ✔ Backup old files
  ✔ TLS key for node: node2 [0.4s]
✔ Load signing key certificate
✔ Compute mTLS certificate hash
✔ Prepare gossip endpoints
✔ Prepare grpc service endpoints
✔ Prepare upgrade zip file for node upgrade process [2s]
✔ Check existing nodes staked amount [2s]
✔ Send node create transaction [2s]
✔ Send prepare upgrade transaction [4s]
✔ Send freeze upgrade transaction [2s]
✔ Download generated files from an existing node [0.5s]
✔ Prepare staging directory
  ✔ Copy Gossip keys to staging
  ✔ Copy gRPC TLS keys to staging
✔ Copy node keys to secrets [0.1s]
  ✔ Copy TLS keys [0.1s]
  ✔ Node: node1
    ✔ Copy Gossip keys
  ✔ Node: node2
    ✔ Copy Gossip keys
✔ Check network nodes are frozen [9s]
  ✔ Check network pod: node1  - status FREEZE_COMPLETE, attempt: 3/120 [9s]
✔ Get node logs and configs [2s]
✔ Deploy new network node [5s]
✔ Kill nodes to pick up updated configMaps
✔ Check node pods are running [58s]
  ✔ Check Node: node1
  ✔ Check Node: node2 [58s]
❯ Fetch platform software into all network nodes
  ⠇ Update node: node1 [ platformVersion = v0.54.0-alpha.4 ]
  ⠇ Update node: node2 [ platformVersion = v0.54.0-alpha.4 ]
◼ Download last state from an existing node
◼ Upload last saved state to new network node
◼ Setup new network node
◼ Start network nodes
◼ Enable port forwarding for JVM debugger
◼ Check all nodes are ACTIVE
◼ Check all node proxies are ACTIVE
◼ Stake new node
◼ Trigger stake weight calculate
◼ Finalize
node:internal/process/promises:289
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "#<ErrorEvent>".] {
  code: 'ERR_UNHANDLED_REJECTION'
}

Node.js v21.7.1

Additional Context

No response

gsstoykov commented 1 week ago

I also tried the same flow using the C++ SDK's NodeCreateTransaction followed by npm run solo -- node add-execute --input-dir context. The node pod appears to be created correctly, and the setup and start steps pass as well, but I got the following log:

npm run solo -- node add-execute --input-dir context

> @hashgraph/solo@0.31.0 solo
> NODE_OPTIONS=--experimental-vm-modules node --no-deprecation solo.mjs node add-execute --input-dir context

******************************* Solo *********************************************
Version         : 0.31.0
Kubernetes Context  : kind-solo-e2e
Kubernetes Cluster  : kind-solo-e2e
Kubernetes Namespace    : solo-e2e
**********************************************************************************
✔ Initialize [0.1s]
✔ Identify existing network nodes
  ✔ Check network pod: node1
✔ Load context data
✔ Download generated files from an existing node [0.4s]
✔ Prepare staging directory
  ✔ Copy Gossip keys to staging
  ✔ Copy gRPC TLS keys to staging
✔ Copy node keys to secrets
  ✔ Copy TLS keys
  ✔ Node: node1
    ✔ Copy Gossip keys
  ✔ Node: node2
    ✔ Copy Gossip keys
✔ Check network nodes are frozen [6s]
  ✔ Check network pod: node1  - status FREEZE_COMPLETE, attempt: 0/120 [6s]
✔ Get node logs and configs [8s]
✔ Deploy new network node [2s]
✔ Kill nodes to pick up updated configMaps
✔ Check node pods are running [30s]
  ✔ Check Node: node1
  ✔ Check Node: node2 [30s]
✔ Fetch platform software into all network nodes [5s]
  ✔ Update node: node1 [ platformVersion = v0.54.0-alpha.4 ] [5s]
  ✔ Update node: node2 [ platformVersion = v0.54.0-alpha.4 ] [5s]
✔ Download last state from an existing node [0.4s]
✔ Upload last saved state to new network node [0.4s]
✔ Setup new network node [0.1s]
  ✔ Node: node1 [0.1s]
    ✔ Set file permissions [0.1s]
  ✔ Node: node2
    ✔ Set file permissions
✔ Start network nodes [0.1s]
  ✔ Start node: node1
  ✔ Start node: node2
↓ Enable port forwarding for JVM debugger
❯ Check all nodes are ACTIVE
  ✔ Check network pod: node1  - status ACTIVE, attempt: 16/120 [24s]
  ✖ node 'node2' is not ACTIVE[ attempt = 120/120 ]
◼ Check all node proxies are ACTIVE
◼ Stake new node
◼ Trigger stake weight calculate
◼ Finalize
*********************************** ERROR *****************************************
Error in setting up nodes: node 'node2' is not ACTIVE[ attempt = 120/120 ]
***********************************************************************************
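
The failure message above reports that node2 never reached ACTIVE within 120 attempts, but not which status it was stuck in. The following is a rough sketch of a polling loop that keeps the last observed status and includes it in the error. `waitForStatus` and `checkStatus` are hypothetical names for illustration, and the defaults of 120 attempts and a one-second delay are assumptions; this is not how Solo implements the check.

```javascript
// Sketch: poll a status check until it reports the wanted value or attempts run out.
// `checkStatus` is a hypothetical async function supplied by the caller; it is
// not a Solo API.
async function waitForStatus(checkStatus, wanted, maxAttempts = 120, delayMs = 1000) {
  let last;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      last = await checkStatus();
    } catch (err) {
      // Remember the failure so the final error can show it.
      last = `error: ${err && err.message ? err.message : err}`;
    }
    if (last === wanted) return { attempt, status: last };
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(
    `status is not ${wanted} [ attempt = ${maxAttempts}/${maxAttempts}, last = ${last} ]`
  );
}
```

Knowing the last status would help tell a node that is still catching up apart from one that never started.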

jeromy-cannon commented 2 days ago

We discovered that there is currently an issue in platform/services with NodeCreateTransaction. After the node has been added and one of the nodes goes into teach mode for the newly added node, the teacher gets JVM out-of-memory errors after finishing teaching and reconnecting to the network. I'm not sure of the exact amount, but Nathan quoted 22GB of memory (I'm not sure what this 22GB refers to). You might be able to work around this by setting the JVM memory settings very high, but we haven't configured Solo to do that by default.

We have disabled our E2E tests involving solo node add until this is resolved in a patch. I'm reaching out to find an issue that we can use to track this.
