hashgraph / solo

An opinionated CLI tool to deploy and manage standalone test networks.
Apache License 2.0

Inconsistent successful starts of Solo #727

Open gsstoykov opened 1 week ago

gsstoykov commented 1 week ago

To Reproduce

rm -rf ~/.solo/cache
rm ~/.solo/solo.config
export SOLO_CLUSTER_NAME=solo-e2e
export SOLO_NAMESPACE=solo-e2e
export SOLO_CLUSTER_SETUP_NAMESPACE=fullstack-setup
kind delete cluster -n "${SOLO_CLUSTER_NAME}"
kind create cluster -n "${SOLO_CLUSTER_NAME}"
npm run solo -- init --namespace "${SOLO_NAMESPACE}" -i node1,node2 -s "${SOLO_CLUSTER_SETUP_NAMESPACE}"
npm run solo -- node keys --gossip-keys --tls-keys
npm run solo -- cluster setup
npm run solo -- network deploy --pvcs true
npm run solo -- node setup
npm run solo -- node start

Describe the bug

******************************* Solo *********************************************
Version         : 0.31.0
Kubernetes Context  : kind-solo-e2e
Kubernetes Cluster  : kind-solo-e2e
Kubernetes Namespace    : solo-e2e
**********************************************************************************
✔ Initialize
✔ Identify existing network nodes
  ✔ Check network pod: node1
✔ Starting nodes
  ✔ Start node: node1
↓ Enable port forwarding for JVM debugger
❯ Check nodes are ACTIVE
  ⠹ Check network pod: node1  - status TIMEOUT, attempt 0/120
◼ Check node proxies are ACTIVE
◼ Add node stakes
/Users/georgistoykov/Projects/solo/node_modules/@kubernetes/client-node/dist/web-socket-handler.js:72
            throw new Error("can't send data to ws");
                  ^

Error: can't send data to ws
    at WebSocketHandler.processData (/Users/georgistoykov/Projects/solo/node_modules/@kubernetes/client-node/dist/web-socket-handler.js:72:19)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async /Users/georgistoykov/Projects/solo/node_modules/@kubernetes/client-node/dist/web-socket-handler.js:84:22

Node.js v21.7.1

It looks like the chance of a failed start increases with the node count.
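
For context, the "can't send data to ws" error appears to be thrown inside @kubernetes/client-node's own websocket data callback (web-socket-handler.js), so it surfaces as a process-level crash rather than being caught by whatever code awaited the call. A minimal sketch of a process-level guard that would turn this into a logged failure instead of a hard crash; this is an assumption about how a CLI entrypoint could be hardened, not Solo's actual code:

```typescript
// Hypothetical safety net for a CLI entrypoint (a sketch, not Solo's actual code).
// The "can't send data to ws" error comes out of the k8s client's websocket
// callback and crashes the process, so only process-level handlers can
// intercept it and exit with a useful message.
process.on('uncaughtException', (err: Error) => {
  console.error(`node start failed: ${err.message}`);
  process.exit(1);
});

process.on('unhandledRejection', (reason: unknown) => {
  console.error(`node start failed (unhandled rejection): ${String(reason)}`);
  process.exit(1);
});
```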

Describe the expected behavior

I would expect the Solo nodes to start and run successfully.

Whole JUnit/CLI Logs

rm -rf ~/.solo/cache
rm ~/.solo/solo.config
export SOLO_CLUSTER_NAME=solo-e2e
export SOLO_NAMESPACE=solo-e2e
export SOLO_CLUSTER_SETUP_NAMESPACE=fullstack-setup
kind delete cluster -n "${SOLO_CLUSTER_NAME}"
kind create cluster -n "${SOLO_CLUSTER_NAME}"
npm run solo -- init --namespace "${SOLO_NAMESPACE}" -i node1 -s "${SOLO_CLUSTER_SETUP_NAMESPACE}"
npm run solo -- node keys --gossip-keys --tls-keys
npm run solo -- cluster setup
npm run solo -- network deploy --pvcs true
npm run solo -- node setup
npm run solo -- node start
rm: /Users/georgistoykov/.solo/solo.config: No such file or directory
Deleting cluster "solo-e2e" ...
Deleted nodes: ["solo-e2e-control-plane"]
Creating cluster "solo-e2e" ...
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-solo-e2e"
You can now use your cluster with:

kubectl cluster-info --context kind-solo-e2e

Thanks for using kind! 😊

> @hashgraph/solo@0.31.0 solo
> NODE_OPTIONS=--experimental-vm-modules node --no-deprecation solo.mjs init --namespace solo-e2e -i node1 -s fullstack-setup

******************************* Solo *********************************************
Version         : 0.31.0
Kubernetes Context  : kind-solo-e2e
Kubernetes Cluster  : kind-solo-e2e
Kubernetes Namespace    : solo-e2e
**********************************************************************************
✔ Setup home directory and cache
✔ Check dependencies [0.1s]
  ✔ Check dependency: helm [OS: darwin, Release: 23.2.0, Arch: arm64] [0.1s]
✔ Setup chart manager [3s]
✔ Copy templates in '/Users/georgistoykov/.solo/cache'

***************************************************************************************
Note: solo stores various artifacts (config, logs, keys etc.) in its home directory: /Users/georgistoykov/.solo
If a full reset is needed, delete the directory or relevant sub-directories before running 'solo init'.
***************************************************************************************

> @hashgraph/solo@0.31.0 solo
> NODE_OPTIONS=--experimental-vm-modules node --no-deprecation solo.mjs node keys --gossip-keys --tls-keys

******************************* Solo *********************************************
Version         : 0.31.0
Kubernetes Context  : kind-solo-e2e
Kubernetes Cluster  : kind-solo-e2e
Kubernetes Namespace    : solo-e2e
**********************************************************************************
✔ Initialize
✔ Generate gossip keys
  ✔ Backup old files
  ✔ Gossip key for node: node1 [0.1s]
✔ Generate gRPC TLS keys
  ✔ Backup old files
  ✔ TLS key for node: node1 [0.4s]
✔ Finalize

> @hashgraph/solo@0.31.0 solo
> NODE_OPTIONS=--experimental-vm-modules node --no-deprecation solo.mjs cluster setup

******************************* Solo *********************************************
Version         : 0.31.0
Kubernetes Context  : kind-solo-e2e
Kubernetes Cluster  : kind-solo-e2e
Kubernetes Namespace    : solo-e2e
**********************************************************************************
✔ Initialize
✔ Prepare chart values
✔ Install 'solo-cluster-setup' chart [1s]

> @hashgraph/solo@0.31.0 solo
> NODE_OPTIONS=--experimental-vm-modules node --no-deprecation solo.mjs network deploy --pvcs true

******************************* Solo *********************************************
Version         : 0.31.0
Kubernetes Context  : kind-solo-e2e
Kubernetes Cluster  : kind-solo-e2e
Kubernetes Namespace    : solo-e2e
**********************************************************************************
✔ Initialize
✔ Prepare staging directory
  ✔ Copy Gossip keys to staging
  ✔ Copy gRPC TLS keys to staging
✔ Copy node keys to secrets
  ✔ Copy TLS keys
  ✔ Node: node1
    ✔ Copy Gossip keys
✔ Install chart 'solo-deployment' [1s]
✔ Check node pods are running [2m44s]
  ✔ Check Node: node1 [2m44s]
✔ Check proxy pods are running
  ✔ Check HAProxy for: node1
  ✔ Check Envoy Proxy for: node1
✔ Check auxiliary pods are ready
  ✔ Check MinIO

> @hashgraph/solo@0.31.0 solo
> NODE_OPTIONS=--experimental-vm-modules node --no-deprecation solo.mjs node setup

******************************* Solo *********************************************
Version         : 0.31.0
Kubernetes Context  : kind-solo-e2e
Kubernetes Cluster  : kind-solo-e2e
Kubernetes Namespace    : solo-e2e
**********************************************************************************
✔ Initialize
✔ Identify network pods
  ✔ Check network pod: node1
✔ Fetch platform software into network nodes [4s]
  ✔ Update node: node1 [ platformVersion = v0.54.0-alpha.4 ] [4s]
✔ Setup network nodes [0.1s]
  ✔ Node: node1 [0.1s]
    ✔ Set file permissions [0.1s]

> @hashgraph/solo@0.31.0 solo
> NODE_OPTIONS=--experimental-vm-modules node --no-deprecation solo.mjs node start

******************************* Solo *********************************************
Version         : 0.31.0
Kubernetes Context  : kind-solo-e2e
Kubernetes Cluster  : kind-solo-e2e
Kubernetes Namespace    : solo-e2e
**********************************************************************************
✔ Initialize
✔ Identify existing network nodes
  ✔ Check network pod: node1
✔ Starting nodes
  ✔ Start node: node1
↓ Enable port forwarding for JVM debugger
❯ Check nodes are ACTIVE
  ⠹ Check network pod: node1  - status TIMEOUT, attempt 0/120
◼ Check node proxies are ACTIVE
◼ Add node stakes
/Users/georgistoykov/Projects/solo/node_modules/@kubernetes/client-node/dist/web-socket-handler.js:72
            throw new Error("can't send data to ws");
                  ^

Error: can't send data to ws
    at WebSocketHandler.processData (/Users/georgistoykov/Projects/solo/node_modules/@kubernetes/client-node/dist/web-socket-handler.js:72:19)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async /Users/georgistoykov/Projects/solo/node_modules/@kubernetes/client-node/dist/web-socket-handler.js:84:22

Node.js v21.7.1

Additional Context

No response

JeffreyDallas commented 5 days ago

Could be due to limited resources. I tried with 4 nodes and 8 nodes on my 64GB MacBook and haven't seen any failure yet.

Here is the Docker configuration:

[image: Docker configuration screenshot]

JeffreyDallas commented 5 days ago

Just tried with 13 nodes; now I can reproduce the error:

↓ Enable port forwarding for JVM debugger
❯ Check nodes are ACTIVE
  ⠼ Check network pod: node1  - status TIMEOUT, attempt 87/120
  ⠼ Check network pod: node2  - status TIMEOUT, attempt 88/120
  ✔ Check network pod: node3  - status ACTIVE, attempt: 24/120 [35s]
  ⠼ Check network pod: node4  - status TIMEOUT, attempt 88/120
  ✔ Check network pod: node5  - status ACTIVE, attempt: 23/120 [35s]
  ⠼ Check network pod: node6  - status TIMEOUT, attempt 89/120
  ✔ Check network pod: node7  - status ACTIVE, attempt: 23/120 [36s]
  ⠼ Check network pod: node8  - status TIMEOUT, attempt 88/120
  ⠼ Check network pod: node9  - status TIMEOUT, attempt 89/120
  ⠼ Check network pod: node10  - status TIMEOUT, attempt 87/120
  ⠼ Check network pod: node11  - status TIMEOUT, attempt 86/120
  ⠼ Check network pod: node12  - status TIMEOUT, attempt 86/120
  ⠼ Check network pod: node13  - status TIMEOUT, attempt 87/120
◼ Check node proxies are ACTIVE
◼ Add node stakes
node:internal/process/promises:289
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "#<ErrorEvent>".] {
  code: 'ERR_UNHANDLED_REJECTION'
}
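
The 13-node run fails the same way, just surfaced as an unhandled rejection: the websocket's ErrorEvent rejects a promise that nothing handles with a catch. One way a poller can absorb that is to treat every rejected status check as a missed attempt of the existing 0/120 retry loop. The sketch below is a generic illustration under that assumption, not Solo's actual implementation:

```typescript
// Sketch of a retry wrapper (hypothetical, not Solo's actual code): each poll
// attempt that rejects -- e.g. because the exec websocket emitted an ErrorEvent --
// is logged and counted as a failed attempt instead of crashing the process.
async function pollUntilActive(
  check: () => Promise<boolean>,
  maxAttempts = 120,
  delayMs = 1500,
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      if (await check()) return true;   // node reported ACTIVE
    } catch (e) {
      // A dropped websocket or failed exec is just one missed attempt.
      console.warn(`attempt ${attempt}/${maxAttempts} failed: ${(e as Error).message}`);
    }
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  return false;                          // caller reports TIMEOUT
}
```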
gsstoykov commented 3 days ago

So we can say that this error is expected?

JeffreyDallas commented 3 days ago

> So we can say that this error is expected?

Yes, if the host machine has limited resources.
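
Since limited host resources seem to be the trigger, a pre-flight warning could make the failure mode clearer than a websocket crash. The sketch below is purely hypothetical: the 4 GiB-per-node budget is an assumption, and with Docker Desktop the VM's resource limits (see the configuration screenshot above) are what actually matter, not the host totals this code inspects.

```typescript
// Hypothetical pre-flight check (not part of Solo): warn when the host is
// unlikely to have enough memory/CPU for the requested number of nodes.
import os from 'node:os';

const GIB = 1024 ** 3;
// Rough per-node budget; an assumption, not a measured Solo requirement.
const MEM_PER_NODE_GIB = 4;

export function warnIfUnderProvisioned(nodeCount: number): void {
  const totalGib = os.totalmem() / GIB;
  const cpus = os.cpus().length;
  if (totalGib < nodeCount * MEM_PER_NODE_GIB || cpus < nodeCount) {
    console.warn(
      `Host reports ${cpus} CPUs and ${totalGib.toFixed(1)} GiB RAM; ` +
      `${nodeCount} nodes may time out while waiting to become ACTIVE.`,
    );
  }
}
```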