microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

Database backend - stress test and failure test plan #4818

Open hzy46 opened 4 years ago

hzy46 commented 4 years ago

Test Environment

The cluster has 10,000 existing jobs, about 10 nodes, and 50 GPUs. The HiveD scheduler is enabled.

In each case, we test the latency of listing jobs, getting job detail, and submitting a job (fire 10 requests and average the latency). All list requests use an offset and a limit of 20.
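For reference, the measurement itself can be as simple as the following sketch. The base URL and token are placeholders, and the `/api/v2/jobs` path follows the v2 REST API; verify both against your deployment:

```python
import statistics
import time

import requests

PAI_REST_URL = "http://<master-ip>/rest-server"  # assumption: adjust to your cluster
TOKEN = "<bearer-token>"                         # e.g. copied from the webportal
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def measure(url, n=10):
    """Fire n requests and return (mean, stdev) latency in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        requests.get(url, headers=HEADERS).raise_for_status()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.mean(samples), statistics.stdev(samples)

# List jobs with an offset and a limit of 20, mirroring the setup above.
mean_ms, std_ms = measure(f"{PAI_REST_URL}/api/v2/jobs?offset=0&limit=20")
print(f"list job: {mean_ms:.1f} ms ± {std_ms:.1f} ms")
```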

Under no load, the latency is about:

| list job | get job detail | submit a job |
| --- | --- | --- |
| 54.1 ms ± 17.2 ms | 56.6 ms ± 29.2 ms | 343 ms ± 125 ms |

Stress Test

Job with a large task number

Submit 1 job with 250/1000/5000 tasks, open 20+ job detail pages, and check whether this causes cluster instability. Also check whether we can still submit new jobs and view other jobs' details (a submission sketch follows).
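As a sketch of the submission step: the job protocol YAML can be generated with a configurable instance count and posted to the v2 submission endpoint. The helper name, base URL, token, and docker image are placeholders, and the `POST /api/v2/jobs` + `text/yaml` convention should be checked against your REST server version:

```python
import requests

PAI_REST_URL = "http://<master-ip>/rest-server"  # assumption: adjust to your cluster
TOKEN = "<bearer-token>"

def make_job_yaml(name: str, task_number: int, command: str = "sleep 3600") -> str:
    """Build a minimal OpenPAI v2 job protocol with the given instance count."""
    return f"""\
protocolVersion: 2
name: {name}
type: job
prerequisites:
  - type: dockerimage
    name: image
    uri: openpai/standard:python_3.6-pytorch_1.2.0-gpu  # assumption: any usable image
taskRoles:
  taskrole:
    instances: {task_number}
    dockerImage: image
    resourcePerInstance:
      cpu: 1
      memoryMB: 1024
      gpu: 0
    commands:
      - {command}
"""

resp = requests.post(
    f"{PAI_REST_URL}/api/v2/jobs",
    headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "text/yaml"},
    data=make_job_yaml("stress-250-tasks", 250),
)
resp.raise_for_status()
```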

250 tasks

| get detail of this job | list job | get detail of other job | submit a job |
| --- | --- | --- | --- |
| 186 ms ± 71.5 ms | 168 ms ± 45.6 ms | 112 ms ± 42.2 ms | 396 ms ± 61.4 ms |

1000 tasks

| get detail of this job | list job | get detail of other job | submit a job |
| --- | --- | --- | --- |
| 204 ms ± 59.8 ms | 140 ms ± 93.9 ms | 158 ms ± 93.4 ms | 556 ms ± 202 ms |

5000 tasks

| get detail of this job | list job | get detail of other job | submit a job |
| --- | --- | --- | --- |
| 552 ms ± 109 ms | 180 ms ± 131 ms | 157 ms ± 85.4 ms | 496 ms ± 134 ms |

In real use, users will also experience a long transfer time, because the job detail JSON is now over 8 MB.

Problems found: tasks have unexpected retries, caused by #4841

Large amount of jobs

Quick test: 1000 jobs finish in 410 s/522 s, so the throughput is about 2 jobs/second. The DB controller is not the bottleneck.

Submit 2/10 jobs per second for 1 hour, where each job finishes immediately. Check whether this causes cluster instability. Also check whether we can still submit new jobs and view other jobs' details (a sketch of the submission loop follows).
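A sketch of this submission loop, reusing the hypothetical `make_job_yaml`, `PAI_REST_URL`, and `TOKEN` from the earlier sketches; each job runs `exit 0` so it finishes immediately:

```python
import time

import requests

RATE = 2          # jobs per second: 2 or 10 in the tests below
DURATION = 3600   # seconds

for i in range(RATE * DURATION):
    resp = requests.post(
        f"{PAI_REST_URL}/api/v2/jobs",
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "text/yaml"},
        data=make_job_yaml(f"stress-job-{i}", 1, command="exit 0"),
    )
    resp.raise_for_status()
    # Naive pacing: ignores request latency, so the real rate is slightly lower.
    time.sleep(1.0 / RATE)
```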

2 jobs/second for 1 hour

During submission:

| list job | get job detail | submit a job |
| --- | --- | --- |
| 367 ms ± 85 ms | 319 ms ± 140 ms | 500 ms ± 125 ms |

10 jobs/second for 1 hour

During submission:

| list job | get job detail | submit a job |
| --- | --- | --- |
| 1.19 s ± 895 ms | 623 ms ± 477 ms | 7.32 s ± 2.92 s |

Problems found:

  1. DB controller memory issue and concurrency issue: #4845
  2. Too many jobs trigger #4833

Job with a large task number and a large retry count

Submit 1 job with 250 tasks and 100 retries.

| list job | get job detail | submit a job |
| --- | --- | --- |

Problems found:

Cannot view the retry history of jobs with a large task number and a large retry count: #4846

Failure Test

Please launch a dev-box first, and stop all services. Back up the previous data on the master node: `sudo cp -r /mnt/paiInternal /mnt/paiInternalBak`

  1. Shut down the database with `./paictl.py service stop -n postgresql` and wait for a while. Expect: we cannot query or submit jobs, and other services don't fail. Please record the error messages.

view job list: (screenshots omitted)
submit job: (screenshots omitted)
refresh job detail: (screenshots omitted)
new job detail: (screenshots omitted)

Start the database with `./paictl.py service start -n postgresql`. Expect: all functions should become normal after a while. The probe sketch below can be used to record errors during the outage and to verify recovery.
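A simple probe loop for this step, reusing `PAI_REST_URL` and `HEADERS` from the first sketch:

```python
import time

import requests

def probe(url, interval=10, rounds=30):
    """Poll the list-jobs endpoint and log status codes or connection errors."""
    for _ in range(rounds):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10)
            print(time.strftime("%H:%M:%S"), resp.status_code, resp.text[:120])
        except requests.RequestException as exc:
            print(time.strftime("%H:%M:%S"), "request failed:", exc)
        time.sleep(interval)

probe(f"{PAI_REST_URL}/api/v2/jobs?offset=0&limit=20")
```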

  2. Go to the master node and kill the corresponding database process.

Expect: all functions should become normal after a while.

  3. Data destruction test

    Step 1: Submit a long-running job in OpenPAI.
    Step 2: Destroy all database data: go to the master node and remove or randomly delete files in /mnt/paiInternal.
    Step 3: Restart the PAI cluster with `./paictl.py service stop` and `./paictl.py service start`.
    Expect: the cluster should be OK. All previous job data is lost, but you can still find the long-running job in the webportal.

hzy46 commented 4 years ago

After the stress test, I raised the heap memory limit to 2 GB for the write merger, 8 GB for the watcher, and 4 GB for the poller.

I believe this configuration can handle 30,000 active jobs.