microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

Database backend - stress test and failure test plan #4818

Open hzy46 opened 4 years ago

hzy46 commented 4 years ago

Test Environment

The cluster has 10,000 existing jobs, about 10 nodes, and 50 GPUs. The HiveD scheduler is enabled.

In each case, we test the latency of listing jobs, getting job detail, and submitting a job (fire 10 requests and average the latency). All list requests use an offset and a limit of 20.
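For reference, the measurement itself can be as simple as the following sketch. The base URL and token are placeholders, and the `/api/v2/jobs` path follows the v2 REST API; verify both against your deployment:

```python
import statistics
import time

import requests

PAI_REST_URL = "http://<master-ip>/rest-server"  # assumption: adjust to your cluster
TOKEN = "<bearer-token>"                         # e.g. copied from the webportal
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def measure(url, n=10):
    """Fire n requests and return (mean, stdev) latency in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        requests.get(url, headers=HEADERS).raise_for_status()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.mean(samples), statistics.stdev(samples)

# List jobs with an offset and a limit of 20, mirroring the setup above.
mean_ms, std_ms = measure(f"{PAI_REST_URL}/api/v2/jobs?offset=0&limit=20")
print(f"list job: {mean_ms:.1f} ms ± {std_ms:.1f} ms")
```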

Under no load, the latency is about:

| list job | get job detail | submit a job |
| --- | --- | --- |
| 54.1 ms ± 17.2 ms | 56.6 ms ± 29.2 ms | 343 ms ± 125 ms |

Stress Test

Job with a large task number

Submit 1 job with 250/1000/5000 tasks, open 20+ job detail pages, and check whether this causes cluster instability. Also check whether we can still submit new jobs and view other jobs' details (a submission sketch follows).
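As a sketch of the submission step: the job protocol YAML can be generated with a configurable instance count and posted to the v2 submission endpoint. The helper name, base URL, token, and docker image are placeholders, and the `POST /api/v2/jobs` + `text/yaml` convention should be checked against your REST server version:

```python
import requests

PAI_REST_URL = "http://<master-ip>/rest-server"  # assumption: adjust to your cluster
TOKEN = "<bearer-token>"

def make_job_yaml(name: str, task_number: int, command: str = "sleep 3600") -> str:
    """Build a minimal OpenPAI v2 job protocol with the given instance count."""
    return f"""\
protocolVersion: 2
name: {name}
type: job
prerequisites:
  - type: dockerimage
    name: image
    uri: openpai/standard:python_3.6-pytorch_1.2.0-gpu  # assumption: any usable image
taskRoles:
  taskrole:
    instances: {task_number}
    dockerImage: image
    resourcePerInstance:
      cpu: 1
      memoryMB: 1024
      gpu: 0
    commands:
      - {command}
"""

resp = requests.post(
    f"{PAI_REST_URL}/api/v2/jobs",
    headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "text/yaml"},
    data=make_job_yaml("stress-250-tasks", 250),
)
resp.raise_for_status()
```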

250 tasks

| get detail of this job | list job | get detail of other job | submit a job |
| --- | --- | --- | --- |
| 186 ms ± 71.5 ms | 168 ms ± 45.6 ms | 112 ms ± 42.2 ms | 396 ms ± 61.4 ms |

1000 tasks

| get detail of this job | list job | get detail of other job | submit a job |
| --- | --- | --- | --- |
| 204 ms ± 59.8 ms | 140 ms ± 93.9 ms | 158 ms ± 93.4 ms | 556 ms ± 202 ms |

5000 tasks

| get detail of this job | list job | get detail of other job | submit a job |
| --- | --- | --- | --- |
| 552 ms ± 109 ms | 180 ms ± 131 ms | 157 ms ± 85.4 ms | 496 ms ± 134 ms |

In real use, users will also experience a long transfer time, because the job detail JSON is now over 8 MB.

Problems found: tasks have unexpected retries, caused by #4841

Large amount of jobs

Quick test: 1000 jobs finish in 410 s/522 s, so the throughput is about 2 jobs/second. The DB controller is not the bottleneck.

Submit 2/10 jobs per second for 1 hour, where each job finishes immediately. Check whether this causes cluster instability. Also check whether we can still submit new jobs and view other jobs' details (a sketch of the submission loop follows).
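A sketch of this submission loop, reusing the hypothetical `make_job_yaml`, `PAI_REST_URL`, and `TOKEN` from the earlier sketches; each job runs `exit 0` so it finishes immediately:

```python
import time

import requests

RATE = 2          # jobs per second: 2 or 10 in the tests below
DURATION = 3600   # seconds

for i in range(RATE * DURATION):
    resp = requests.post(
        f"{PAI_REST_URL}/api/v2/jobs",
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "text/yaml"},
        data=make_job_yaml(f"stress-job-{i}", 1, command="exit 0"),
    )
    resp.raise_for_status()
    # Naive pacing: ignores request latency, so the real rate is slightly lower.
    time.sleep(1.0 / RATE)
```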

2 jobs/second for 1 hour

During submission:

| list job | get job detail | submit a job |
| --- | --- | --- |
| 367 ms ± 85 ms | 319 ms ± 140 ms | 500 ms ± 125 ms |

10 jobs/second for 1 hour

During submission:

| list job | get job detail | submit a job |
| --- | --- | --- |
| 1.19 s ± 895 ms | 623 ms ± 477 ms | 7.32 s ± 2.92 s |

Problems found:

  1. DB controller memory issue and concurrency issue: #4845
  2. Too many jobs trigger #4833

Job with a large task number and a large retry count

Submit 1 job with 250 tasks and 100 retries.

| list job | get job detail | submit a job |
| --- | --- | --- |

Problems found:

Cannot view the retry history of jobs with a large task number and a large retry count: #4846

Failure Test

Please launch a dev-box first, and stop all services. Back up the previous data on the master node: `sudo cp -r /mnt/paiInternal /mnt/paiInternalBak`

  1. Shut down the database with `./paictl.py service stop -n postgresql` and wait for a while. Expect: we cannot query or submit jobs, and other services don't fail. Please record the error messages.

view job list: (screenshots omitted)
submit job: (screenshots omitted)
refresh job detail: (screenshots omitted)
new job detail: (screenshots omitted)

Start the database with `./paictl.py service start -n postgresql`. Expect: all functions should become normal after a while. The probe sketch below can be used to record errors during the outage and to verify recovery.
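A simple probe loop for this step, reusing `PAI_REST_URL` and `HEADERS` from the first sketch:

```python
import time

import requests

def probe(url, interval=10, rounds=30):
    """Poll the list-jobs endpoint and log status codes or connection errors."""
    for _ in range(rounds):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10)
            print(time.strftime("%H:%M:%S"), resp.status_code, resp.text[:120])
        except requests.RequestException as exc:
            print(time.strftime("%H:%M:%S"), "request failed:", exc)
        time.sleep(interval)

probe(f"{PAI_REST_URL}/api/v2/jobs?offset=0&limit=20")
```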

  2. Go to the master node and kill the corresponding database process.

Expect: all functions should become normal after a while.

  3. Data destruction test

    Step 1: Submit a long-running job in OpenPAI.
    Step 2: Destroy all database data: go to the master node and remove or randomly delete files in /mnt/paiInternal.
    Step 3: Restart the PAI cluster with `./paictl.py service stop` and `./paictl.py service start`.
    Expect: the cluster should be OK. All previous job data is lost, but you can still find the long-running job in the webportal.

hzy46 commented 4 years ago

After the stress test, I raised the heap memory limit to 2 GB for the write merger, 8 GB for the watcher, and 4 GB for the poller.

I believe this configuration can handle 30,000 active jobs.