jeromy-cannon opened 1 week ago
New symptom: it is hanging while cleaning healthy nodes. This step always worked fine before.
hashsphere1@s05:~/workspaces/10nodes/solo$ task -t Taskfile.yml clean
task: [solo:node:stop] npm run solo-test -- node stop --namespace "${SOLO_NAMESPACE}" --node-aliases-unparsed node0,node1,node2,node3,node4,node5,node6
[solo:node:stop]
[solo:node:stop] > @hashgraph/solo@0.99.0 solo-test
[solo:node:stop] > node --no-deprecation --no-warnings --loader ts-node/esm solo.ts node stop --namespace solo-hashsphere1 --node-aliases-unparsed node0,node1,node2,node3,node4,node5,node6
[solo:node:stop]
[solo:node:stop]
[solo:node:stop] ******************************* Solo *********************************************
[solo:node:stop] Version : 0.99.0
[solo:node:stop] Kubernetes Context : gke_hashsphere-staging_us-central1_sphere-load-test-us-central
[solo:node:stop] Kubernetes Cluster : gke_hashsphere-staging_us-central1_sphere-load-test-us-central
[solo:node:stop] Kubernetes Namespace : solo-hashsphere1
[solo:node:stop] **********************************************************************************
[solo:node:stop] ❯ Initialize
[solo:node:stop] ❯ Acquire lease
[solo:node:stop] ✔ Acquire lease - lease acquired successfully, attempt: 1/10
[solo:node:stop] ✔ Initialize
[solo:node:stop] ❯ Identify network pods
[solo:node:stop] ❯ Check network pod: node0
[solo:node:stop] ❯ Check network pod: node1
[solo:node:stop] ❯ Check network pod: node2
[solo:node:stop] ❯ Check network pod: node3
[solo:node:stop] ❯ Check network pod: node4
[solo:node:stop] ❯ Check network pod: node5
[solo:node:stop] ❯ Check network pod: node6
^\SIGQUIT: quit
PC=0x473721 m=0 sigcode=128
goroutine 7 gp=0xc000133c00 m=0 mp=0x135b960 [syscall]:
runtime.notetsleepg(0x13bc500, 0xffffffffffffffff)
runtime/lock_futex.go:246 +0x29 fp=0xc00049a7a0 sp=0xc00049a778 pc=0x4105a9
os/signal.signal_recv()
runtime/sigqueue.go:152 +0x29 fp=0xc00049a7c0 sp=0xc00049a7a0 pc=0x46e589
os/signal.loop()
os/signal/signal_unix.go:23 +0x13 fp=0xc00049a7e0 sp=0xc00049a7c0 pc=0xa6c093
runtime.goexit({})
runtime/asm_amd64.s:1695 +0x1 fp=0xc00049a7e8 sp=0xc00049a7e0 pc=0x471921
created by os/signal.Notify.func1.1 in goroutine 1
os/signal/signal.go:151 +0x1f
goroutine 1 gp=0xc0000061c0 m=3 mp=0xc0000b3008 [syscall]:
syscall.Syscall6(0xf7, 0x1, 0xba98b, 0xc00010d978, 0x1000004, 0x0, 0x0)
syscall/syscall_linux.go:91 +0x39 fp=0xc00010d940 sp=0xc00010d8e0 pc=0x4886f9
os.(*Process).blockUntilWaitable(0xc0003da3c0)
os/wait_waitid.go:32 +0x76 fp=0xc00010da18 sp=0xc00010d940 pc=0x4f65b6
os.(*Process).wait(0xc0003da3c0)
os/exec_unix.go:22 +0x25 fp=0xc00010da78 sp=0xc00010da18 pc=0x4f04a5
os.(*Process).Wait(...)
os/exec.go:134
os/exec.(*Cmd).Wait(0xc00001e180)
os/exec/exec.go:906 +0x45 fp=0xc00010dad8 sp=0xc00010da78 pc=0x6e0b45
So it starts hanging at the node stop step, before the node destroy step?
Can you attach ~/.solo/logs/solo.log and also use k9s to check the status of the network pods?
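For reference, the same check can be done with plain kubectl (a minimal sketch; the namespace is taken from the log output above):

```bash
# Show all pods in the Solo namespace, including node placement and readiness
kubectl get pods -n solo-hashsphere1 -o wide

# Print just pod name and phase, to spot any pods stuck in Pending
kubectl get pods -n solo-hashsphere1 \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
```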
Per Alex Kuzmin, when there are pods currently in the Pending state, then the following command:

will run/hang indefinitely.
Note: he is using the `clean` target of the Taskfile.yml in the examples folder, so we should also update the [solo:network:destroy](https://github.com/hashgraph/solo/blob/ec63a659ae325ab1631409a76e6894deccdb0ed4/examples/custom-network-config/Taskfile.yml#L110-L110) target in examples/custom-network-config/Taskfile.yml with the recommended timeout.
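For example, the destroy step could be bounded with GNU `timeout` so pods stuck in Pending fail the task instead of hanging `task clean` forever (a minimal sketch; the exact solo command line and the timeout value are assumptions, not the current contents of that target):

```yaml
# examples/custom-network-config/Taskfile.yml (sketch)
solo:network:destroy:
  cmds:
    # Bound the destroy call: if pods never leave Pending, give up after 10 minutes
    # instead of blocking indefinitely. The solo arguments below are illustrative.
    - timeout 10m npm run solo-test -- network destroy --namespace "${SOLO_NAMESPACE}"
```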