djmitche closed this issue 4 years ago
This is currently topping out at about 45 tasks/second in my dev env. That's with 25 replicas of the queue-web service. It hit about 40 with one replica.
Azure is showing about 60k transactions / minute (so 1k/sec) at that rate.
I haven't seen any ServerBusy or other Azure errors from this account.
I saw some MAX_MODIFY_ATTEMPTS errors updating the worker info table when I had 300 "workers" with the same workerId, but no error logging at all since I fixed that.
Running this with 1000 "workers" in parallel ends up with some 502's from the load balancer.
I suspect we may be running into limits of the ingress here -- running both the expandscopes and claimwork generators in parallel makes the expandscopes calls fall behind -- but that's a totally different service!
Ah, this might be due to the limit on number of sockets that TC-client will use, by default.
I got up to 50 tasks/sec yesterday, while also running 200 rq/s of other API methods.
I sampled 10 minutes of API calls on the firefoxcitc cluster, and here are the counts:
1 auth.createClient
2 queue.reportException
8 auth.currentScopes
10 index.listTasksPost
10 notify.email
15 github.githubWebHookConsumer
16 auth.gcpCredentials
19 hooks.triggerHook
24 index.insertTask
59 github.ping
59 hooks.ping
59 purge-cache.ping
59 worker-manager.ping
60 notify.ping
62 queue.reportFailed
70 auth.awsS3Credentials
98 queue.listTaskGroup
119 secrets.ping
156 queue.listProvisioners
206 auth.azureTableSAS
211 queue.listWorkerTypes
223 auth.websocktunnelToken
237 index.ping
322 worker-manager.registerWorker
555 auth.expandScopes
595 auth.ping
893 queue.ping
-- 99% are below this line --
1002 queue.reportCompleted
1055 secrets.get
1208 queue.reclaimTask
1640 index.findTask
1713 index.findArtifactFromTask
2421 queue.listLatestArtifacts
2463 queue.createTask
2511 queue.pendingTasks
2847 queue.listWorkers
4357 queue.listArtifacts
6748 purge-cache.purgeRequests
6849 queue.getArtifact
10792 queue.createArtifact
12355 queue.getLatestArtifact
18990 queue.task
23313 queue.claimWork
40383 auth.authenticateHawk
252647 queue.status
Overall that's 397442 API calls, or about 663 r/s.
...and 2/3 of that is queue.status. Yikes!
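A quick back-of-the-envelope check of the sampled 10-minute window confirms both numbers:

```javascript
// Totals taken from the sampled counts above.
const totalCalls = 397442;          // all API calls in the 10-minute sample
const windowSeconds = 10 * 60;
const statusCalls = 252647;         // queue.status alone

const rps = totalCalls / windowSeconds;
const statusShare = statusCalls / totalCalls;

console.log(rps.toFixed(1), (statusShare * 100).toFixed(1) + '%');
// → 662.4 63.6%
```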
@sciurus has suggested using bigquery to get a more representative sample of API calls.
Google doesn't let you save data from BigQuery with Firefox, apparently. I for one welcome our new Alphabet Overlords.
```sql
SELECT
  jsonPayload.servicecontext.service,
  jsonPayload.fields.name,
  ROUND(COUNT(jsonPayload.fields.name)/(3600*24), 3) AS rps
FROM
  `moz-fx-taskcluster-prod-4b87.log_storage.stdout_20200226`
GROUP BY
  jsonPayload.fields.name, jsonPayload.servicecontext.service
ORDER BY rps DESC
```
(no great surprises there, really)
As written, this is not breaking a sweat, and not causing TC or Pg to break a sweat, in my dev environment. I need to add some more simulators (the checkboxes above) but I'm confident that will work just fine.
I've been running this successfully with

```yaml
ders:
  # targeting about 4 tasks/sec, but with some task queues having an excess of workers
  # and some having an excess of tasks (createtask will throttle itself to keep a decent
  # list of pending tasks)
  test-load-create1:
    use: createtask
    rate: 2
    task-queue-id: proj-taskcluster/load-test1
  test-load-create2:
    use: createtask
    rate: 1
    task-queue-id: proj-taskcluster/load-test2
  test-load-create3:
    use: createtask
    rate: 1
    task-queue-id: proj-taskcluster/load-test3
  test-load1:
    use: claimwork
    parallelism: 100
    task-queue-id: proj-taskcluster/load-test1
  test-load2:
    use: claimwork
    parallelism: 100
    capacity: 4
    task-queue-id: proj-taskcluster/load-test2
  test-load3:
    use: claimwork
    parallelism: 300
    task-queue-id: proj-taskcluster/load-test3
  # random general API method load generation
  gettask:
    use: gettask
    rate: 31.7
  gettaskstatus:
    use: gettaskstatus
    rate: 421 # 421!
  getartifacts:
    use: getartifacts
    listrate: 11
    listlatestrate: 11
    getrate: 16
    getlatestrate: 49
  secrets:
    use: secrets
    rate: 2
    secret: garbage
  pendingtasks:
    use: pendingtasks
    task-queue-ids:
      - proj-taskcluster/load-test
      - builtin/success
      - builtin/failure
    rate: 0.5 # on top of the rate from the various load generators, above
```
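The `rate:` knobs imply each generator paces its own calls. A common way to do that is a token bucket; this is a hypothetical sketch of such a pacer, not the repo's actual code:

```javascript
// Hypothetical token-bucket pacer: permits `rate` calls per second on
// average, with bursts of up to `burst` calls.
class TokenBucket {
  constructor(rate, burst = rate) {
    this.rate = rate;       // tokens added per second
    this.burst = burst;     // maximum stored tokens
    this.tokens = burst;    // start with a full bucket
    this.last = Date.now();
  }

  // Take one token if available; returns true when the caller may proceed.
  tryTake() {
    const now = Date.now();
    this.tokens = Math.min(
      this.burst,
      this.tokens + ((now - this.last) / 1000) * this.rate,
    );
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

const bucket = new TokenBucket(2);  // ~2 calls/sec, like test-load-create1
console.log(bucket.tryTake(), bucket.tryTake(), bucket.tryTake());
// → true true false (bucket starts with 2 tokens, refills at 2/sec)
```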
against my dev environment running the latest tc-lib-postgres -- so no Azure queues or tables at all (even roles have now been converted). This is a steady 800 tps on the DB server, with about 75 active DB connections. DB ingress is 300 KiB/s, egress 530 KiB/s. CPU utilization is stable at about 45%.
I'll add the additional perf testers described in the linked issues and keep trying.
New repo containing some code to simulate workers, create tasks, scan tasks, and make other API calls at approximately the level we see in production
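The claim/complete loop such a simulator runs can be sketched as below. `queue.claimWork` and `queue.reportCompleted` are real Queue API methods (they appear in the counts above), but the exact call signatures and the injected fake queue here are assumptions for illustration:

```javascript
// Hypothetical worker simulator: repeatedly claims a task from a
// queue-like object, "runs" it, and reports it completed.
async function simulateWorker(queue, taskQueueId, {iterations = 3} = {}) {
  const completed = [];
  for (let i = 0; i < iterations; i++) {
    const {tasks} = await queue.claimWork(taskQueueId, {tasks: 1});
    for (const {taskId, runId} of tasks) {
      // pretend to do the work, then report the result
      await queue.reportCompleted(taskId, runId);
      completed.push(taskId);
    }
  }
  return completed;
}

// A fake in-memory queue, so the sketch runs without a real deployment.
const fakeQueue = {
  counter: 0,
  async claimWork(taskQueueId, {tasks}) {
    this.counter += 1;
    return {tasks: [{taskId: `task-${this.counter}`, runId: 0}]};
  },
  async reportCompleted(taskId, runId) {},
};

// e.g. simulateWorker(fakeQueue, 'proj-taskcluster/load-test1')
//        .then(ids => console.log(ids));
```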
TODO: