tapis-project / tapis-java

Texas Advanced Computing Center APIs
BSD 3-Clause "New" or "Revised" License

Jobs Load Test #34

Open richcar58 opened 3 years ago

richcar58 commented 3 years ago

Create a test harness that applies a substantial load to the job execution subsystem and verifies that the Jobs service is well behaved under extreme conditions. The harness includes a test program and everything needed to launch it in supported environments. The initial supported environment is an IDE, with documented procedures explaining how to run the tests and check the outcomes manually.

The actual test submits an arbitrary number of jobs to a target environment and determines the outcome of those jobs. The test program will execute one of our SleepSeconds applications (Docker or Singularity) on a VM. The test will submit jobs serially but as quickly as possible, as sketched below. We can start with moderate loads, say by submitting 100 jobs, and progress to submitting 10,000 jobs. We declare victory after two clean 10,000 job runs.
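A minimal sketch of the submission loop, where `JobsClient` and its `submitJob` method are hypothetical stand-ins for the real Tapis client (names and signatures here are illustrative, not the actual API):

```java
import java.util.ArrayList;
import java.util.List;

public class JobsLoadTest {
    /** Hypothetical client interface; the real Tapis Jobs client differs. */
    interface JobsClient {
        String submitJob(String appId, int sleepSeconds) throws Exception;
    }

    /** Submit jobCount jobs serially with no pause between requests. */
    static List<String> submitBatch(JobsClient client, String appId,
                                    int jobCount, int sleepSeconds) throws Exception {
        List<String> jobIds = new ArrayList<>(jobCount);
        for (int i = 0; i < jobCount; i++) {
            jobIds.add(client.submitJob(appId, sleepSeconds));
        }
        return jobIds;
    }
}
```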

Each run will specify a range from which sleep times are randomly assigned; we specify a minimum and maximum sleep time in seconds. For example, for a 10,000 job run we might assign sleep times in the range of 10 to 90 seconds, which yields an average runtime of 50 seconds with staggered starts and stops. We can also set the minimum and maximum to the same value, which configures all jobs to run for the same amount of time. This regularity is good for smaller test runs to establish a baseline, but it does not model the natural variation of real job executions, which is why we need the range configuration.
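A minimal sketch of the assignment, assuming a uniform draw (the class and method names are illustrative):

```java
import java.util.concurrent.ThreadLocalRandom;

public class SleepTimes {
    /** Pick a sleep time uniformly from [minSeconds, maxSeconds].
     *  With min == max, every job runs for the same length (the baseline case). */
    public static int randomSleepSeconds(int minSeconds, int maxSeconds) {
        // nextInt's upper bound is exclusive, so add 1 to include maxSeconds.
        return ThreadLocalRandom.current().nextInt(minSeconds, maxSeconds + 1);
    }
}
```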

The Jobs service is well behaved if all jobs complete successfully and the total time for a batch of jobs increases linearly with the number of jobs. That is, the average job execution time should stay roughly constant no matter how many jobs are in the backlog. The Jobs engine should hum along at 60 miles an hour no matter how long the trip is.
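One way to express this criterion as a post-run check, a sketch only (the baseline, the 20% tolerance, and all names are assumptions, not thresholds defined in this issue):

```java
public class ScalingCheck {
    /** Fail if the average per-job time of a large batch drifts too far above
     *  a baseline measured on a small run. */
    public static void checkLinearScaling(double baselineMillisPerJob,
                                          long batchTotalMillis, int batchJobCount) {
        double observed = (double) batchTotalMillis / batchJobCount;
        if (observed > 1.20 * baselineMillisPerJob) {
            throw new AssertionError(String.format(
                "Throughput degraded: %.1f ms/job vs baseline %.1f ms/job",
                observed, baselineMillisPerJob));
        }
    }
}
```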

Because we want to run thousands of jobs many times over, we'll need to run the Jobs service on a laptop to avoid clogging our shared environments with all the historical data generated by the tests. This means that RabbitMQ and Postgres need to run locally, as do the Jobs web application, worker daemon, and recovery daemon. Jobs will still communicate with other services running in the DEV environment, and we can use the execution and storage systems we usually use in development.

Some capacity planning will be required. We'll need to estimate the amount of disk space used on the execution and storage systems, which depends on the inputs and outputs we specify for our jobs. In addition, the Files service records each transfer, so its database may fill up, again depending on our inputs and outputs. Though it may be interesting to learn something about the scalability of Files, the focus of this test is Jobs, so we may want to minimize file transfers.
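As a rough, illustrative estimate (the sizes are ours to choose): if each job staged 1 MB of input and archived 1 MB of output, a 10,000 job run would consume about 20 GB across the execution and storage systems and generate on the order of 20,000 Files transfer records; with inputs and outputs of a few bytes, both numbers become negligible.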