openshift / svt


add concurrency to reliability test #670

Closed - qiliRedHat closed 3 years ago

qiliRedHat commented 3 years ago

https://issues.redhat.com/browse/OCPQE-4221 Reliability Test Tool Enhancement Plan

Concurrent task run with multiple users - priority 0-1

What

Today the test execution is sequential: first as the default user system:admin and then as the user specified in the "login" action. This task adds concurrency for each type of user added in phase 1: X admin/developer users can run m tasks (oc commands) concurrently (currently there is only one admin user in the config yaml structure; it can be restructured in the future), and Y end users can visit n applications concurrently.

Why

  1. Simulate concurrent operations from different users, as in the real world.
  2. Shorten the test execution time. For example, for 51 users doing the login operation: sequentially it takes 243.6s; concurrently (51 threads) it takes 13.0s.

How

  1. According to the task settings in the config file, choose 'concurrency' users for each 'persona' (admin, developer, user). With the config below, 'testuser-0' and 'testuser-1' will execute the project create operation concurrently, while 10 end users will visit all (100%) of the applications - every application will be visited by 10 users concurrently.

      - action: create
        resource: projects
        quantity: 2
        persona: developer
        concurrency: 2
      - action: visit
        resource: apps
        applyPercent: 100
        persona: user
        concurrency: 10
  2. Modify Task.py so that, for each type of action, multiple threads run the tasks (oc or shell commands); see the thread-pool sketch after this list.

  3. Modify the functions in Projects.py, Apps.py, and Pods.py to accept a 'kubeconfig' parameter. Add the '-n' namespace flag to a command where needed, e.g. when getting pods - a developer user can only list pods in its own namespace. For operations that only an admin can run, such as labeling a namespace, use the admin user's kubeconfig.

  4. Add global locks to protect the shared mutable counters, such as the project id, total project count, app count, and build count; see the lock sketch after this list.

  5. '+=' operations are not thread safe. Is adding and popping an entry in a dict shared across multiple threads thread safe? I think yes: https://docs.python.org/3/glossary.html#term-global-interpreter-lock https://docs.python.org/3/faq/library.html#what-kinds-of-global-value-mutation-are-thread-safe

  6. Handle situations caused by async operations. For example, an app may be chosen to be upgraded while the deletion of its project (--wait=false makes deletion async) actually completes after that.

  7. For app visits, the current solution visits sequentially. Implement concurrency with coroutines (asyncio, see https://github.com/timofurrer/awesome-asyncio/blob/master/README.md) plus aiohttp. Coroutines can support high concurrency for the user persona, and they are a better fit than multithreading here: they are single-threaded and multiplex I/O access, so the cost is lower. Add a new LoadApp module to load the apps and use it to count succeeded and failed application visits. The visit function in the Apps module is kept to verify application route availability after app creation. This way, the succeeded/failed visit counts only include visits made after the route is available; 503s returned before the route is available are not counted, which better reflects the user persona's concern - visiting an app after it is ready. Below is an example of an app visit task config; it means 10 'user' persona users will visit all applications concurrently (see the asyncio sketch after this list).

      - action: visit
        resource: apps
        applyPercent: 100
        persona: user # actually not used
        concurrency: 10
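
A minimal sketch of the thread-pool idea from items 2 and 3 (the oc() helper, the argument tuples, and the kubeconfig paths are illustrative assumptions, not the actual Task.py code):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def oc(args, kubeconfig):
        """Run an oc command with a specific user's kubeconfig; return (rc, output)."""
        cmd = ["oc"] + args + ["--kubeconfig", kubeconfig]
        result = subprocess.run(cmd, capture_output=True, text=True)
        return result.returncode, result.stdout + result.stderr

    # One (args, kubeconfig) tuple per selected user, e.g. two developers creating projects.
    create_args = [
        (["new-project", "testuser-0-1"], "kubeconfigs/kubeconfig_testuser-0"),
        (["new-project", "testuser-1-1"], "kubeconfigs/kubeconfig_testuser-1"),
    ]

    concurrency = 2  # matches the 'concurrency' field of the task in the config
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        results = list(executor.map(lambda t: oc(*t), create_args))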
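
For items 4 and 5, a sketch of guarding a shared counter with a global lock (the counter name is illustrative); '+=' is a read-modify-write, so it needs a lock even though single bytecode operations are protected by the GIL:

    import threading

    total_projects = 0
    total_projects_lock = threading.Lock()

    def increment_total_projects():
        """'total_projects += 1' is not atomic, so take the lock around it."""
        global total_projects
        with total_projects_lock:
            total_projects += 1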
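
For item 7, a sketch of visiting app routes concurrently with asyncio + aiohttp (the route list, counter dict, and function names are placeholders, not the actual LoadApp module):

    import asyncio
    import aiohttp

    async def visit(session, url, counters):
        """Visit one app route and count success or failure."""
        try:
            async with session.get(url) as resp:
                if resp.status == 200:
                    counters["succeeded"] += 1
                else:
                    counters["failed"] += 1
        except aiohttp.ClientError:
            counters["failed"] += 1

    async def visit_all(urls, concurrency):
        """Visit all app routes with at most `concurrency` connections in flight."""
        counters = {"succeeded": 0, "failed": 0}
        connector = aiohttp.TCPConnector(limit=concurrency)
        async with aiohttp.ClientSession(connector=connector) as session:
            await asyncio.gather(*(visit(session, url, counters) for url in urls))
        return counters

    # e.g. counters = asyncio.run(visit_all(app_routes, concurrency=10))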

Exit Criteria: a config with all kinds of tasks can be run with multiple users in parallel: python reliability.py -f config/enhance_reliability.yaml

qiliRedHat commented 3 years ago

@mffiedler PTAL @wabouhamad @paigerube14

mffiedler commented 3 years ago

In my first test, after 33 minutes I got the following stack - looks like it was the first time project deletion was run:

    2021-06-11 20:36:10,038 - INFO - Project deleted : testuser-3-2
    2021-06-11 20:36:10,042 - INFO - 0: project.project.openshift.io "testuser-4-1" deleted
    2021-06-11 20:36:10,042 - INFO - Project deleted : testuser-4-1
    Traceback (most recent call last):
      File "/root/tmp/svt/reliability/reliability.py", line 28, in <module>
        task_manager.start()
      File "/root/tmp/svt/reliability/tasks/TaskManager.py", line 134, in start
        task.execute()
      File "/root/tmp/svt/reliability/tasks/Task.py", line 254, in execute
        login_args.append((name, password, kubeconf))
    NameError: name 'name' is not defined

Have not investigated yet.

qiliRedHat commented 3 years ago

> In my first test, after 33 minutes I got the following stack - looks like it was the first time project deletion was run:
>
>     2021-06-11 20:36:10,038 - INFO - Project deleted : testuser-3-2
>     2021-06-11 20:36:10,042 - INFO - 0: project.project.openshift.io "testuser-4-1" deleted
>     2021-06-11 20:36:10,042 - INFO - Project deleted : testuser-4-1
>     Traceback (most recent call last):
>       File "/root/tmp/svt/reliability/reliability.py", line 28, in <module>
>         task_manager.start()
>       File "/root/tmp/svt/reliability/tasks/TaskManager.py", line 134, in start
>         task.execute()
>       File "/root/tmp/svt/reliability/tasks/Task.py", line 254, in execute
>         login_args.append((name, password, kubeconf))
>     NameError: name 'name' is not defined
>
> Have not investigated yet.

This is not about project deletion, but about the login action after project deletion. It is a bug I introduced when I updated the login code to use the new login function in Session.py. I fixed it in commit https://github.com/openshift/svt/pull/670/commits/4e11837b0c9543650f53a6cfdb844bc6bf06c14e. Thanks!

qiliRedHat commented 3 years ago

@mffiedler Thank you for the first review. Please check my reply and comment above. After the review is done, I'll do a rebase to keep a single commit.

paigerube14 commented 3 years ago

        result = self.fn(*self.args, **self.kwargs)
      File ".../svt/reliability/tasks/Task.py", line 243, in <lambda>
        results = executor.map(lambda t: all_pods.check(*t), pod_check_args)
      File ".../svt/reliability/tasks/Pods.py", line 10, in check
        if user == "kubeadmin":
    NameError: name 'user' is not defined

Qiujie, I am getting this error when running. I think it is coming from inside the check pods function: in the statement around line 240 you are missing passing the user.

After adding the user to that line I get the following issue. It seems that when checking for the pods it's not looking in the correct project; you're only looking in the default project. Probably just need to add the "-n" namespace flag to the line when you are getting the pods.

    2021-06-15 09:54:49,933 - INFO - =>oc get pods --kubeconfig /Users/prubenda/PycharmProjects/pplRepos/qili/svt/reliability/kubeconfigs/kubeconfig_kubeadmin| egrep -v "Running|Complete"
    2021-06-15 09:54:49,933 - INFO - Pod check failed for namespace : kubeadmin-4
    2021-06-15 09:54:50,355 - INFO - 1: No resources found in default namespace.
    2021-06-15 09:54:50,356 - ERROR - get pods: failed

I think you just need to have line 11 match the same format as line 13.

In that case, do you need that if statement at all, since they would be the same line?

qiliRedHat commented 3 years ago

>         result = self.fn(*self.args, **self.kwargs)
>       File ".../svt/reliability/tasks/Task.py", line 243, in <lambda>
>         results = executor.map(lambda t: all_pods.check(*t), pod_check_args)
>       File ".../svt/reliability/tasks/Pods.py", line 10, in check
>         if user == "kubeadmin":
>     NameError: name 'user' is not defined
>
> Qiujie, I am getting this error when running. I think it is coming from inside the check pods function: in the statement around line 240 you are missing passing the user.
>
> After adding the user to that line I get the following issue. It seems that when checking for the pods it's not looking in the correct project; you're only looking in the default project. Probably just need to add the "-n" namespace flag to the line when you are getting the pods.
>
>     2021-06-15 09:54:49,933 - INFO - =>oc get pods --kubeconfig /Users/prubenda/PycharmProjects/pplRepos/qili/svt/reliability/kubeconfigs/kubeconfig_kubeadmin| egrep -v "Running|Complete"
>     2021-06-15 09:54:49,933 - INFO - Pod check failed for namespace : kubeadmin-4
>     2021-06-15 09:54:50,355 - INFO - 1: No resources found in default namespace.
>     2021-06-15 09:54:50,356 - ERROR - get pods: failed
>
> I think you just need to have line 11 match the same format as line 13.
>
> In that case, do you need that if statement at all, since they would be the same line?

Thank you for catching this! Regarding the if statement, my intention was to let the admin persona show all pods in all namespaces - that is more like what a real admin wants to do. You made me realize I need to add '-A' to show pods in all namespaces. See the sketch below.
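
A minimal sketch of what that branching could look like (the check() signature and command strings are assumptions, not the exact Pods.py code):

    import subprocess

    def check(user, namespace, kubeconfig):
        """Look for pods that are not Running/Complete.

        The admin persona checks pods across all namespaces (-A);
        other personas are limited to their own namespace (-n).
        """
        if user == "kubeadmin":
            cmd = "oc get pods -A --kubeconfig " + kubeconfig
        else:
            cmd = "oc get pods -n " + namespace + " --kubeconfig " + kubeconfig
        cmd += ' | egrep -v "Running|Complete"'
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        # egrep exits non-zero when nothing matches, i.e. no abnormal pods were found
        return result.returncode != 0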

mffiedler commented 3 years ago

@qiliRedHat Can you squash the commits and then @paigerube14 and I can do final review/test? Thanks.

qiliRedHat commented 3 years ago

> @qiliRedHat Can you squash the commits and then @paigerube14 and I can do final review/test? Thanks.

@mffiedler @paigerube14, I squashed the commits.

mffiedler commented 3 years ago

One last change and I think this is good to go. All users, especially kubeadmin, have to log in once every 24 hours of real time to keep their sessions alive. Running over a weekend, I saw admin oc commands start to fail with:

    2021-06-21 02:00:44,373 - INFO - =>oc get clusteroperators --kubeconfig /root/tmp/svt2/reliability/kubeconfigs/kubeconfig_kubeadmin -o yaml
    2021-06-21 02:00:44,643 - INFO - 1: apiVersion: v1
    items: []
    kind: List
    metadata:
      resourceVersion: ""
      selfLink: ""
    error: You must be logged in to the server (Unauthorized)

    2021-06-21 02:00:44,643 - ERROR - get clusteroperators: failed
    2021-06-21 02:01:14,673 - INFO - =>oc get pods -A --kubeconfig /root/tmp/svt2/reliability/kubeconfigs/kubeconfig_kubeadmin| egrep -v "Running|Complete"
    2021-06-21 02:01:14,955 - INFO - 1: error: You must be logged in to the server (Unauthorized)

In the past this was handled by making the admin user log in as part of a task, but we could make it part of the test infrastructure as well. @qiliRedHat Let me know if you prefer to handle this as a separate issue after merging this PR; I am OK with that too.

qiliRedHat commented 3 years ago

@mffiedler Thanks a lot for letting me know about this expiration. https://github.com/openshift/svt/pull/670/commits/86059b3ed081620e8898a23621a02ed47adabc25 adds a mechanism to re-login all users before a task run if it has been 23 hours since the last login (a rough sketch of the idea is below). I tested it with a shorter interval (10 minutes) and it works well.
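
The sketch below illustrates the idea; the names RELOGIN_INTERVAL, last_login_time, and relogin_if_needed are placeholders, not the identifiers used in the actual commit:

    import time

    RELOGIN_INTERVAL = 23 * 60 * 60  # re-login after 23h, before the 24h session expiry

    last_login_time = time.time()

    def relogin_if_needed(users):
        """Called before each task run; re-login all users if 23 hours have passed."""
        global last_login_time
        if time.time() - last_login_time >= RELOGIN_INTERVAL:
            for user in users:
                user.login()  # placeholder for refreshing the user's session/kubeconfig
            last_login_time = time.time()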

mffiedler commented 3 years ago

Excellent. Let me do one final test for 24+ hours and then I will plan on merging tomorrow.

qiliRedHat commented 3 years ago

@mffiedler Thank you for the testing and bug catching! I squashed.

paigerube14 commented 3 years ago

Reran yesterday - not for as long as Mike's test - and things looked good to me!

qiliRedHat commented 3 years ago

@paigerube14 Thank you for your time reviewing this PR and for the bug you found!