Test Environment
The cluster has 10000 existing jobs, about 10 nodes, and 50 GPUs. The HiveD scheduler is enabled.
In each case, we test the latency of listing jobs, getting job detail, and submitting a job (fire 10 requests and calculate the average latency). All list requests use an offset and a limit of 20.
If there is no load, the latency is about:
list job: 54.1 ms ± 17.2 ms
get job detail: 56.6 ms ± 29.2 ms
submit a job: 343 ms ± 125 ms
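The latencies above were measured against the REST API. Below is a minimal sketch of such a probe; the cluster address, token, job name, and the /rest-server/api/v2 path are assumptions and should be adjusted to the cluster under test.

```python
# Minimal latency probe: fire 10 requests per endpoint and report mean ± std.
# Assumptions: pylon entry point, bearer token, and v2 REST paths; adjust as needed.
import statistics
import time

import requests

PAI_URL = "http://<master-ip>"                          # assumed cluster entry point
HEADERS = {"Authorization": "Bearer <bearer-token>"}    # assumed API token


def measure(url, n=10):
    """Fire n GET requests and return (mean, stdev) latency in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        requests.get(url, headers=HEADERS, timeout=60)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.mean(samples), statistics.stdev(samples)


# List jobs with an offset and a limit of 20, then fetch one job's detail.
print("list job      : %.1f ms ± %.1f ms"
      % measure(f"{PAI_URL}/rest-server/api/v2/jobs?offset=0&limit=20"))
print("get job detail: %.1f ms ± %.1f ms"
      % measure(f"{PAI_URL}/rest-server/api/v2/jobs/<user>~<jobname>"))
```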
Stress Test
Job with a large task number
Submit 1 job with 250/1000/5000 tasks and open 20+ job detail pages, then check whether this causes cluster instability.
Also check whether we can still submit new jobs and view other jobs' details.
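The jobs for this case only need a large instance count in a single task role. The following sketch builds and submits such a job; the fields follow the OpenPAI job protocol, while the image, resources, endpoint, and content type are assumptions to adjust for the cluster under test.

```python
# Sketch: submit a sleep job with a configurable task number (250 / 1000 / 5000).
# Assumptions: pylon endpoint, bearer token, and a generic ubuntu image.
import requests
import yaml

PAI_URL = "http://<master-ip>"
HEADERS = {"Authorization": "Bearer <bearer-token>", "Content-Type": "text/yaml"}


def build_job(name, task_number):
    """Build an OpenPAI protocol v2 job with `task_number` instances in one role."""
    return {
        "protocolVersion": 2,
        "name": name,
        "type": "job",
        "prerequisites": [
            {"type": "dockerimage", "name": "docker_image", "uri": "ubuntu:18.04"}
        ],
        "taskRoles": {
            "taskrole": {
                "instances": task_number,      # 250 / 1000 / 5000
                "dockerImage": "docker_image",
                "resourcePerInstance": {"cpu": 1, "memoryMB": 512, "gpu": 0},
                "commands": ["sleep 3600"],
            }
        },
    }


resp = requests.post(f"{PAI_URL}/rest-server/api/v2/jobs",
                     headers=HEADERS,
                     data=yaml.dump(build_job("stress-250-tasks", 250)),
                     timeout=60)
print(resp.status_code, resp.text)
```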
250 tasks
get detail of this job: 186 ms ± 71.5 ms
list job: 168 ms ± 45.6 ms
get detail of other job: 112 ms ± 42.2 ms
submit a job: 396 ms ± 61.4 ms
1000 tasks
get detail of this job: 204 ms ± 59.8 ms
list job: 140 ms ± 93.9 ms
get detail of other job: 158 ms ± 93.4 ms
submit a job: 556 ms ± 202 ms
5000 tasks
get detail of this job: 552 ms ± 109 ms
list job: 180 ms ± 131 ms
get detail of other job: 157 ms ± 85.4 ms
submit a job: 496 ms ± 134 ms
In real use, users will also experience a long transfer time, because the job detail JSON is now larger than 8 MB.
Problems found: tasks have unexpected retries, caused by #4841
Large number of jobs
Quick test: 1000 jobs finish in 410 s / 522 s, so the throughput is about 2 jobs/second (1000 / 522 ≈ 1.9). The DB controller is not the bottleneck.
Submit 2/10 jobs per second for 1 hour, where each job finishes immediately. Check whether this causes cluster instability. Also check whether we can still submit new jobs and view other jobs' details. A sketch of the rate-limited submission loop is given below.
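A simple pacing loop like the following can keep the submission rate steady; it reuses the same assumed endpoint and protocol fields as the earlier sketch, and the image, command, and job names are placeholders.

```python
# Sketch: submit jobs at a fixed rate (2 or 10 per second) for one hour.
# Assumptions: pylon endpoint and token as before; each job runs `exit 0`
# so that it finishes immediately.
import time

import requests
import yaml

PAI_URL = "http://<master-ip>"
HEADERS = {"Authorization": "Bearer <bearer-token>", "Content-Type": "text/yaml"}
RATE = 2            # jobs per second: 2 or 10 in this test
DURATION = 3600     # one hour


def quick_job(name):
    """A one-instance job that exits immediately."""
    return {
        "protocolVersion": 2, "name": name, "type": "job",
        "prerequisites": [
            {"type": "dockerimage", "name": "docker_image", "uri": "ubuntu:18.04"}
        ],
        "taskRoles": {"taskrole": {
            "instances": 1,
            "dockerImage": "docker_image",
            "resourcePerInstance": {"cpu": 1, "memoryMB": 512, "gpu": 0},
            "commands": ["exit 0"],
        }},
    }


start, i = time.time(), 0
while time.time() - start < DURATION:
    tick = time.time()
    for _ in range(RATE):
        requests.post(f"{PAI_URL}/rest-server/api/v2/jobs",
                      headers=HEADERS,
                      data=yaml.dump(quick_job(f"flood-{i}")),
                      timeout=60)
        i += 1
    # sleep the remainder of the second if this batch finished early
    time.sleep(max(0.0, 1.0 - (time.time() - tick)))
```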
2 jobs/second for 1 hour
During submission:
list job: 367 ms ± 85 ms
get job detail: 319 ms ± 140 ms
submit a job: 500 ms ± 125 ms
10 jobs/second for 1 hour
During submission:
list job: 1.19 s ± 895 ms
get job detail: 623 ms ± 477 ms
submit a job: 7.32 s ± 2.92 s
Problems found:
DB controller memory issue and concurrency issue: #4845
Too many jobs cause #4833
Job with a large task number and large retry times
Submit 1 job with 250 tasks and 100 retries.
list job
get job detail
submit a job
Problems found:
Cannot view retry history of jobs with large task number and large retry times #4846
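For reference, the job used in this case can be sketched as below. The jobRetryCount field and the deliberately failing command are assumptions made for illustration; verify the retry-related fields against the protocol spec of the OpenPAI version under test.

```python
# Sketch: a 250-task job whose command always fails, so the job keeps retrying.
# Assumption: the top-level `jobRetryCount` protocol field bounds the retries;
# check the protocol spec of your OpenPAI version before relying on it.
retry_job = {
    "protocolVersion": 2,
    "name": "stress-250-tasks-100-retries",
    "type": "job",
    "jobRetryCount": 100,                  # assumed retry knob for this test
    "prerequisites": [
        {"type": "dockerimage", "name": "docker_image", "uri": "ubuntu:18.04"}
    ],
    "taskRoles": {
        "taskrole": {
            "instances": 250,
            "dockerImage": "docker_image",
            "resourcePerInstance": {"cpu": 1, "memoryMB": 512, "gpu": 0},
            "commands": ["exit 1"],        # fail on purpose to trigger retries
        }
    },
}
# Submit it the same way as in the earlier sketches (POST to /rest-server/api/v2/jobs).
```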
Failure Test
Please launch a dev-box first, and stop all services.
Back up the previous data with sudo cp -r /mnt/paiInternal /mnt/paiInternalBak on the master node.
Shut down the database with ./paictl.py service stop -n postgresql, and wait for a while.
Expect: we cannot query or submit jobs, while other services do not fail. Please record the error message for each of the following (see the sketch after this list):
view job list:
submit job:
refresh job detail:
new job detail:
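A small script like the following can capture the errors returned by the read APIs while postgresql is down; the endpoint paths, token, and job name are the same assumptions as in the earlier sketches, and the submit-job error can be recorded analogously with a POST.

```python
# Sketch: record the errors returned while postgresql is stopped.
# Assumptions: pylon endpoint, token, and an existing job name; adjust as needed.
import requests

PAI_URL = "http://<master-ip>"
HEADERS = {"Authorization": "Bearer <bearer-token>"}

CHECKS = {
    "view job list": f"{PAI_URL}/rest-server/api/v2/jobs?offset=0&limit=20",
    "refresh job detail": f"{PAI_URL}/rest-server/api/v2/jobs/<user>~<jobname>",
}

for name, url in CHECKS.items():
    try:
        resp = requests.get(url, headers=HEADERS, timeout=30)
        print(f"{name}: HTTP {resp.status_code} {resp.text[:200]}")
    except requests.RequestException as exc:
        print(f"{name}: request failed: {exc}")
```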
Start the database with ./paictl.py service start -n postgresql.
All functions should become normal after a while.
Go to the master node and kill the corresponding process:
postgresql: use ps aux | grep postgres to find it
write-merger/framework-watcher/db-poller: ps aux | grep write-merger; ps aux | grep watcher/framework; ps aux | grep poller/index
rest-server: ps aux | grep 'node index.js'
api server: ps aux | grep kube-api
framework controller: ps aux | grep frameworkcontroller
Expect: all functions should become normal after a while.
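One way to check that functions come back is to poll the job-list API until it answers with HTTP 200 again; a sketch under the same assumptions as the earlier ones:

```python
# Sketch: poll the job-list API until the cluster answers normally again.
# Assumptions: pylon endpoint and token as in the earlier sketches.
import time

import requests

URL = "http://<master-ip>/rest-server/api/v2/jobs?offset=0&limit=20"
HEADERS = {"Authorization": "Bearer <bearer-token>"}

deadline = time.time() + 600          # allow up to 10 minutes for recovery
while time.time() < deadline:
    try:
        if requests.get(URL, headers=HEADERS, timeout=10).status_code == 200:
            print("job list is back to normal")
            break
    except requests.RequestException:
        pass                           # still recovering
    time.sleep(10)
else:
    print("job list did not recover within 10 minutes")
```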
Data destroying test
Step 1: Submit a long-running job in OpenPAI.
Step 2: Destroy all database data: go to the master node, then remove or randomly delete files in /mnt/paiInternal.
Step 3: Restart the PAI cluster with ./paictl.py service stop and ./paictl.py service start.
Expect: The cluster should be OK. All previous job data is lost, but you can still find the long-running job in the webportal.