krkn-chaos / krkn

Chaos and resiliency testing tool for Kubernetes with a focus on improving performance under failure conditions. A CNCF sandbox project.
Apache License 2.0

Attempt to run a container scenario for api while count is bigger than 1 results in crash #430

Open achuzhoy opened 1 year ago

achuzhoy commented 1 year ago

How to reproduce: config.yaml should contain this scenario: ` chaos_scenarios: # List of policies/chaos scenarios to load

The content of the scenario file: ` scenarios:

python3.9 run_kraken.py --config config/kill-api.yaml

2023-05-25 11:58:39,485 [INFO] Starting kraken
2023-05-25 11:58:39,495 [INFO] Initializing client to talk to the Kubernetes cluster
2023-05-25 11:58:42,998 [INFO] Publishing kraken status at http://0.0.0.0:8085
2023-05-25 11:58:42,998 [INFO] Publishing kraken status at http://0.0.0.0:8085
2023-05-25 11:58:42,999 [INFO] Starting http server at http://0.0.0.0:8085

2023-05-25 11:58:43,000 [INFO] Fetching cluster info
2023-05-25 11:58:43,008 [INFO] Cluster version is 4.13.0
2023-05-25 11:58:43,008 [INFO] Server URL: https://api.elvis2.qe.lab.redhat.com:6443
2023-05-25 11:58:43,008 [INFO] Generated a uuid for the run: a713f10c-8b26-4b2c-8a81-8356cff6ef58
2023-05-25 11:58:43,008 [INFO] Daemon mode not enabled, will run through 1 iterations

2023-05-25 11:58:43,009 [INFO] Executing scenarios for iteration 0
2023-05-25 11:58:43,009 [INFO] connection set up
127.0.0.1 - - [25/May/2023 11:58:43] "GET / HTTP/1.1" 200 -
2023-05-25 11:58:43,010 [INFO] response RUN
2023-05-25 11:58:43,010 [INFO] Running container scenarios
2023-05-25 11:58:44,823 [INFO] Killing container openshift-apiserver in pod apiserver-5d45f6d58f-hmpsj (ns openshift-apiserver)
2023-05-25 11:58:44,959 [INFO] Killing container openshift-apiserver in pod apiserver-5d45f6d58f-cd7bv (ns openshift-apiserver)
2023-05-25 11:58:45,071 [INFO] Scenario kill apiserver container successfully injected
Traceback (most recent call last):
  File "/root/krkn/krkn/run_kraken.py", line 421, in <module>
    main(options.cfg)
  File "/root/krkn/krkn/run_kraken.py", line 218, in main
    failed_post_scenarios = pod_scenarios.container_run(
  File "/root/krkn/krkn/kraken/pod_scenarios/setup.py", line 92, in container_run
    failed_post_scenarios = check_failed_containers(
  File "/root/krkn/krkn/kraken/pod_scenarios/setup.py", line 199, in check_failed_containers
    killed_container_list = killed_container_list.remove(item)
AttributeError: 'NoneType' object has no attribute 'remove'

`

The issue reproduced with count set to 3. It didn't reproduce with count set to 1.

Note that the cluster has 3 pods.
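A possible explanation for the count dependence, based only on the traceback line `killed_container_list = killed_container_list.remove(item)`: Python's `list.remove()` mutates the list in place and returns `None`, so rebinding the variable to its result leaves `None` behind after the first removal. With a single container the loop never reaches a second `remove()` call, which would match the count 1 run passing. The sketch below is not the actual krkn code; the function names `prune_buggy` and `prune_fixed`, their arguments, and the loop shape are assumptions made for illustration.

```python
def prune_buggy(killed, recovered):
    """Hypothetical stand-in for the failing pattern seen in the traceback."""
    for item in recovered:
        # list.remove() returns None, so `killed` is None after the first pass
        killed = killed.remove(item)
    return killed


def prune_fixed(killed, recovered):
    """Same intent, but removes in place on a copy without rebinding to None."""
    remaining = list(killed)
    for item in recovered:
        if item in remaining:
            remaining.remove(item)
    return remaining


if __name__ == "__main__":
    # Placeholder pod names, not taken from a real cluster
    pods = ["apiserver-a", "apiserver-b"]

    # One container: the buggy loop runs once and does not raise
    print(prune_buggy(["apiserver-a"], ["apiserver-a"]))

    # Two containers: the second iteration calls None.remove(...)
    try:
        prune_buggy(list(pods), pods)
    except AttributeError as exc:
        print(f"AttributeError: {exc}")

    # The fixed variant returns the containers that have not recovered yet
    print(prune_fixed(pods, pods))
```

If the real loop in `check_failed_containers` has this shape, calling `killed_container_list.remove(item)` without assigning the result should be enough to avoid the crash.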

When the same was attempted against SNO (with a single api pod), the following error was thrown:

2023-05-25 12:06:17,950 [INFO] Killing container openshift-apiserver in pod apiserver-6b77769b8-6j4gg (ns openshift-apiserver)
2023-05-25 12:06:18,083 [ERROR] Trying to kill more containers than were found, try lowering kill count
2023-05-25 12:06:18,083 [ERROR] Scenario kill apiserver container failed

In this case it's an expected error.

achuzhoy commented 1 year ago

Same behavior reproduced with killing etcd:

` chaos_scenarios: # List of policies/chaos scenarios to load

`

` scenarios:

python3.9 run_kraken.py --config config/kill-etcd.yaml

2023-05-25 12:23:02,066 [INFO] Starting kraken
2023-05-25 12:23:02,075 [INFO] Initializing client to talk to the Kubernetes cluster
2023-05-25 12:23:05,649 [INFO] Publishing kraken status at http://0.0.0.0:8085
2023-05-25 12:23:05,649 [INFO] Publishing kraken status at http://0.0.0.0:8085
2023-05-25 12:23:05,650 [INFO] Starting http server at http://0.0.0.0:8085

2023-05-25 12:23:05,650 [INFO] Fetching cluster info
2023-05-25 12:23:05,658 [INFO] Cluster version is 4.13.0
2023-05-25 12:23:05,659 [INFO] Server URL: https://api.elvis2.qe.lab.redhat.com:6443
2023-05-25 12:23:05,659 [INFO] Generated a uuid for the run: 77d465f6-2149-4233-b9f7-4642e84dffb0
2023-05-25 12:23:05,659 [INFO] Daemon mode not enabled, will run through 1 iterations

2023-05-25 12:23:05,659 [INFO] Executing scenarios for iteration 0
2023-05-25 12:23:05,659 [INFO] connection set up
127.0.0.1 - - [25/May/2023 12:23:05] "GET / HTTP/1.1" 200 -
2023-05-25 12:23:05,660 [INFO] response RUN
2023-05-25 12:23:05,660 [INFO] Running container scenarios
2023-05-25 12:23:08,343 [INFO] Killing container etcd in pod etcd-master-1-2 (ns openshift-etcd)
2023-05-25 12:23:08,466 [INFO] Killing container etcd in pod etcd-master-1-1 (ns openshift-etcd)
2023-05-25 12:23:08,657 [INFO] Scenario kill etcd container successfully injected
Traceback (most recent call last):
  File "/root/krkn/krkn/run_kraken.py", line 421, in <module>
    main(options.cfg)
  File "/root/krkn/krkn/run_kraken.py", line 218, in main
    failed_post_scenarios = pod_scenarios.container_run(
  File "/root/krkn/krkn/kraken/pod_scenarios/setup.py", line 92, in container_run
    failed_post_scenarios = check_failed_containers(
  File "/root/krkn/krkn/kraken/pod_scenarios/setup.py", line 199, in check_failed_containers
    killed_container_list = killed_container_list.remove(item)
AttributeError: 'NoneType' object has no attribute 'remove'
`