hashgraph / solo

An opinionated CLI tool to deploy and manage standalone test networks.

Add a timeout flag for the network command destroy subcommand #815

Open · jeromy-cannon opened this issue 1 week ago

jeromy-cannon commented 1 week ago

Per Alex Kuzmin, when there are pods currently in the Pending state, the following command:

solo network destroy --namespace "${SOLO_NAMESPACE}" --delete-pvcs --delete-secrets --force

hangs indefinitely.

Note: he is using the Taskfile.yml `clean` target in the examples folder, so we should also update the [solo:network:destroy](https://github.com/hashgraph/solo/blob/ec63a659ae325ab1631409a76e6894deccdb0ed4/examples/custom-network-config/Taskfile.yml#L110-L110) target in examples/custom-network-config/Taskfile.yml with the recommended timeout.

### Tasks
- [x] add a `--timeout` flag to `solo network destroy`, defaulting to 120 seconds. If the timeout is reached while the command is still running, abort the command and continue (see the sketch after this list).
- [x] if the `--timeout` flag is used together with `--delete-pvcs --delete-secrets --force`, then when the timeout is reached, abort the command and fall back to deleting the namespace itself; apply the same timeout to that delete and abort if it also runs over.
- [x] log appropriate error messages to the user when the timeout is reached and the command is aborted
- [x] make sure that if the timeout is reached we still exit with a non-zero return code
- [x] update the `solo:network:destroy` target in `examples/custom-network-config/Taskfile.yml` to include the new `--timeout` option
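
For reference, here is a minimal TypeScript sketch of the timeout-plus-fallback flow described above. It only illustrates the intended behavior: `destroyNetwork`, `deleteNamespace`, and `DestroyOpts` are hypothetical stand-ins, not solo's actual internals.

```ts
// Illustrative sketch only -- not solo's actual code. `destroyNetwork`,
// `deleteNamespace`, and `DestroyOpts` are hypothetical stand-ins.

interface DestroyOpts {
  namespace: string;
  timeoutMs: number;      // from --timeout, default 120 seconds
  deletePvcs: boolean;    // --delete-pvcs
  deleteSecrets: boolean; // --delete-secrets
  force: boolean;         // --force
}

// Hypothetical stand-ins for the existing destroy and namespace-delete steps.
async function destroyNetwork(opts: DestroyOpts): Promise<void> {
  /* ... existing destroy logic ... */
}
async function deleteNamespace(namespace: string): Promise<void> {
  /* ... delete the whole namespace ... */
}

// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms / 1000}s`)),
      ms,
    );
    promise
      .then((value) => { clearTimeout(timer); resolve(value); })
      .catch((err) => { clearTimeout(timer); reject(err); });
  });
}

async function destroyWithTimeout(opts: DestroyOpts): Promise<void> {
  try {
    await withTimeout(destroyNetwork(opts), opts.timeoutMs, 'network destroy');
  } catch (err) {
    console.error(`ERROR: ${(err as Error).message}, aborting`);
    // Fallback: only when --delete-pvcs --delete-secrets --force were all
    // given, delete the namespace itself, bounded by the same timeout.
    if (opts.deletePvcs && opts.deleteSecrets && opts.force) {
      try {
        await withTimeout(deleteNamespace(opts.namespace), opts.timeoutMs, 'namespace delete');
      } catch (err2) {
        console.error(`ERROR: ${(err2 as Error).message}, giving up`);
      }
    }
    process.exitCode = 1; // non-zero exit code whenever a timeout was hit
  }
}
```

The important property is that the fallback namespace delete is bounded by the same timeout, so the command can never hang indefinitely, and `process.exitCode` is set so callers such as the Taskfile `clean` target observe the failure.
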
alex-kuzmin-hg commented 4 days ago

New symptom: it is now hanging while cleaning healthy nodes. This step always worked fine before.

hashsphere1@s05:~/workspaces/10nodes/solo$ task -t Taskfile.yml clean
task: [solo:node:stop] npm run solo-test -- node stop --namespace "${SOLO_NAMESPACE}" --node-aliases-unparsed node0,node1,node2,node3,node4,node5,node6 
[solo:node:stop] 
[solo:node:stop] > @hashgraph/solo@0.99.0 solo-test
[solo:node:stop] > node --no-deprecation --no-warnings --loader ts-node/esm solo.ts node stop --namespace solo-hashsphere1 --node-aliases-unparsed node0,node1,node2,node3,node4,node5,node6
[solo:node:stop] 
[solo:node:stop] 
[solo:node:stop] ******************************* Solo *********************************************
[solo:node:stop] Version            : 0.99.0
[solo:node:stop] Kubernetes Context : gke_hashsphere-staging_us-central1_sphere-load-test-us-central
[solo:node:stop] Kubernetes Cluster : gke_hashsphere-staging_us-central1_sphere-load-test-us-central
[solo:node:stop] Kubernetes Namespace   : solo-hashsphere1
[solo:node:stop] **********************************************************************************
[solo:node:stop] ❯ Initialize
[solo:node:stop] ❯ Acquire lease
[solo:node:stop] ✔ Acquire lease - lease acquired successfully, attempt: 1/10
[solo:node:stop] ✔ Initialize
[solo:node:stop] ❯ Identify network pods
[solo:node:stop] ❯ Check network pod: node0
[solo:node:stop] ❯ Check network pod: node1
[solo:node:stop] ❯ Check network pod: node2
[solo:node:stop] ❯ Check network pod: node3
[solo:node:stop] ❯ Check network pod: node4
[solo:node:stop] ❯ Check network pod: node5
[solo:node:stop] ❯ Check network pod: node6
^\SIGQUIT: quit
PC=0x473721 m=0 sigcode=128

goroutine 7 gp=0xc000133c00 m=0 mp=0x135b960 [syscall]:
runtime.notetsleepg(0x13bc500, 0xffffffffffffffff)
    runtime/lock_futex.go:246 +0x29 fp=0xc00049a7a0 sp=0xc00049a778 pc=0x4105a9
os/signal.signal_recv()
    runtime/sigqueue.go:152 +0x29 fp=0xc00049a7c0 sp=0xc00049a7a0 pc=0x46e589
os/signal.loop()
    os/signal/signal_unix.go:23 +0x13 fp=0xc00049a7e0 sp=0xc00049a7c0 pc=0xa6c093
runtime.goexit({})
    runtime/asm_amd64.s:1695 +0x1 fp=0xc00049a7e8 sp=0xc00049a7e0 pc=0x471921
created by os/signal.Notify.func1.1 in goroutine 1
    os/signal/signal.go:151 +0x1f

goroutine 1 gp=0xc0000061c0 m=3 mp=0xc0000b3008 [syscall]:
syscall.Syscall6(0xf7, 0x1, 0xba98b, 0xc00010d978, 0x1000004, 0x0, 0x0)
    syscall/syscall_linux.go:91 +0x39 fp=0xc00010d940 sp=0xc00010d8e0 pc=0x4886f9
os.(*Process).blockUntilWaitable(0xc0003da3c0)
    os/wait_waitid.go:32 +0x76 fp=0xc00010da18 sp=0xc00010d940 pc=0x4f65b6
os.(*Process).wait(0xc0003da3c0)
    os/exec_unix.go:22 +0x25 fp=0xc00010da78 sp=0xc00010da18 pc=0x4f04a5
os.(*Process).Wait(...)
    os/exec.go:134
os/exec.(*Cmd).Wait(0xc00001e180)
    os/exec/exec.go:906 +0x45 fp=0xc00010dad8 sp=0xc00010da78 pc=0x6e0b45
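
(Reading the dump: goroutine 1 is blocked in `os/exec.(*Cmd).Wait`, i.e. the Go-based `task` runner is still waiting for the child `npm run solo-test` process to exit, which suggests the hang is inside solo itself rather than in `task`.)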
JeffreyDallas commented 4 days ago

So it starts hanging at the node stop step, before the node destroy step?

Can you attach ~/.solo/logs/solo.log and also use k9s to check the status of the network pods?
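
(For anyone reproducing this without k9s, the equivalent check is `kubectl get pods -n solo-hashsphere1`, using the namespace shown in the log above.)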

alex-kuzmin-hg commented 4 days ago

(screenshot attached: image (1))