taskcluster / taskcluster

CI at Scale
https://taskcluster.net
Mozilla Public License 2.0

perf-testing framework #2273

Closed: djmitche closed this issue 4 years ago

djmitche commented 4 years ago

New repo containing some code to simulate workers, task creation, task scanning, and other API calls at approximately the levels we see in production.

TODO:

djmitche commented 4 years ago

WIP at https://github.com/taskcluster/performance-tester

djmitche commented 4 years ago

This is currently topping out at about 45 tasks/second in my dev env. That's with 25 replicas of the queue-web service. It hit about 40 with one replica.

Azure is showing about 60k transactions / minute (so 1k/sec) at that rate.

djmitche commented 4 years ago

I haven't seen any ServerBusy or other Azure errors from this account.

I saw some MAX_MODIFY_ATTEMPTS errors updating the worker-info table when I had 300 "workers" sharing the same workerId, but no errors have been logged since I fixed that.

djmitche commented 4 years ago

Running this with 1000 "workers" in parallel ends up with some 502s from the load balancer.

djmitche commented 4 years ago

I suspect we may be running into limits of the ingress here -- running both the expandscopes and claimwork generators in parallel makes the expandscopes calls fall behind -- but that's a totally different service!

djmitche commented 4 years ago

> I suspect we may be running into limits of the ingress here -- running both the expandscopes and claimwork generators in parallel makes the expandscopes calls fall behind -- but that's a totally different service!

Ah, this might be due to the default limit on the number of sockets that TC-client will use.
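For illustration, here is how you would raise the client-side connection-pool limit in Python with `requests` (an assumption for the sketch; the actual performance-tester uses the Node TC-client, whose agent/socket settings differ):

```python
import requests
from requests.adapters import HTTPAdapter

# Raise the connection-pool size so hundreds of concurrent simulated
# "workers" sharing one session don't serialize on a handful of sockets.
# (Illustrative only -- not the performance-tester's actual code.)
session = requests.Session()
adapter = HTTPAdapter(pool_connections=100, pool_maxsize=1000)
session.mount("https://", adapter)
session.mount("http://", adapter)
```

The analogous fix for a Node client is raising the HTTP agent's `maxSockets`.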

djmitche commented 4 years ago

I got up to 50 tasks/sec yesterday, while also running 200 rq/s of other API methods.

djmitche commented 4 years ago

I sampled 10 minutes of API calls on the firefoxcitc cluster, and here are the counts:

      1 auth.createClient
      2 queue.reportException
      8 auth.currentScopes
     10 index.listTasksPost
     10 notify.email
     15 github.githubWebHookConsumer
     16 auth.gcpCredentials
     19 hooks.triggerHook
     24 index.insertTask
     59 github.ping
     59 hooks.ping
     59 purge-cache.ping
     59 worker-manager.ping
     60 notify.ping
     62 queue.reportFailed
     70 auth.awsS3Credentials
     98 queue.listTaskGroup
    119 secrets.ping
    156 queue.listProvisioners
    206 auth.azureTableSAS
    211 queue.listWorkerTypes
    223 auth.websocktunnelToken
    237 index.ping
    322 worker-manager.registerWorker
    555 auth.expandScopes
    595 auth.ping
    893 queue.ping
-- 99% are below this line --
   1002 queue.reportCompleted
   1055 secrets.get
   1208 queue.reclaimTask
   1640 index.findTask
   1713 index.findArtifactFromTask
   2421 queue.listLatestArtifacts
   2463 queue.createTask
   2511 queue.pendingTasks
   2847 queue.listWorkers
   4357 queue.listArtifacts
   6748 purge-cache.purgeRequests
   6849 queue.getArtifact
  10792 queue.createArtifact
  12355 queue.getLatestArtifact
  18990 queue.task
  23313 queue.claimWork
  40383 auth.authenticateHawk
 252647 queue.status

Overall that's 397442 API calls, or about 663 r/s.
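Those totals are easy to sanity-check; a quick back-of-the-envelope in Python, using the numbers from the sample above:

```python
# Sanity-check the sampled API-call counts (10-minute window).
total_calls = 397442
window_seconds = 10 * 60
rps = total_calls / window_seconds
print(round(rps, 1))          # 662.4 requests/second

# queue.status dominates the sample:
status_calls = 252647
share = status_calls / total_calls
print(round(share * 100, 1))  # 63.6 (% of all calls)
```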

djmitche commented 4 years ago

...and 2/3 of that is queue.status. Yikes!

djmitche commented 4 years ago

@sciurus has suggested using bigquery to get a more representative sample of API calls.

djmitche commented 4 years ago

Google doesn't let you save data from BigQuery with Firefox, apparently. I for one welcome our new Alphabet Overlords.

[attached screenshot]

SELECT
  jsonPayload.servicecontext.service,
  jsonPayload.fields.name,
  ROUND(COUNT(jsonPayload.fields.name)/(3600*24), 3) as rps
FROM
  `moz-fx-taskcluster-prod-4b87.log_storage.stdout_20200226`
GROUP BY
  jsonPayload.fields.name, jsonPayload.servicecontext.service
ORDER BY rps DESC 
djmitche commented 4 years ago

(no great surprises there, really)

djmitche commented 4 years ago

As written, this is not causing TC or Pg to break a sweat in my dev environment. I need to add some more simulators (the checkboxes above), but I'm confident that will work just fine.

djmitche commented 4 years ago

I've been running this successfully with

ders:
  # targetting about 4 tasks/sec, but with some task queues having an excess of workers
  # and some having an excess of tasks (createtask will throttle itself to keep a decent
  # list of pending tasks)
  test-load-create1:
    use: createtask
    rate: 2
    task-queue-id: proj-taskcluster/load-test1
  test-load-create2:
    use: createtask
    rate: 1
    task-queue-id: proj-taskcluster/load-test2
  test-load-create3:
    use: createtask
    rate: 1
    task-queue-id: proj-taskcluster/load-test3
  test-load1:
    use: claimwork
    parallelism: 100
    task-queue-id: proj-taskcluster/load-test1
  test-load2:
    use: claimwork
    parallelism: 100
    capacity: 4
    task-queue-id: proj-taskcluster/load-test2
  test-load3:
    use: claimwork
    parallelism: 300
    task-queue-id: proj-taskcluster/load-test3

  # random general API method load generation
  gettask:
    use: gettask
    rate: 31.7
  gettaskstatus:
    use: gettaskstatus
    rate: 421 # 421!
  getartifacts:
    use: getartifacts
    listrate: 11
    listlatestrate: 11
    getrate: 16
    getlatestrate: 49
  secrets:
    use: secrets
    rate: 2
    secret: garbage
  pendingtasks:
    use: pendingtasks
    task-queue-ids:
      - proj-taskcluster/load-test
      - builtin/success
      - builtin/failure
    rate: 0.5  # on top of the rate from the various load generators, above       

against my dev environment running the latest tc-lib-postgres -- so no Azure queues or tables at all (even roles have now been converted). This is a steady 800 tps on the DB server, with about 75 active DB connections. DB ingress is 300 KiB/s, egress 530 KiB/s. CPU utilization is stable at about 45%.
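The `rate:` settings above imply each generator paces its own calls; a minimal token-bucket-style pacing sketch of that idea (hypothetical code, not the actual performance-tester implementation):

```python
import time

class RateLimiter:
    """Pace an API-call loop to roughly `rate` calls per second."""

    def __init__(self, rate, now=time.monotonic):
        self.interval = 1.0 / rate
        self.now = now
        self.next_at = now()

    def delay(self):
        """Seconds to sleep before the next call (0 if we're behind)."""
        t = self.now()
        wait = max(0.0, self.next_at - t)
        # schedule the following call one interval later, never in the past
        self.next_at = max(t, self.next_at) + self.interval
        return wait

# e.g. a gettaskstatus generator at 421 calls/sec would sleep
# RateLimiter(421).delay() seconds between calls.
```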

I'll add the additional perf testers described in the linked issues and keep trying.