pytorch / test-infra

This repository hosts code that supports the testing infrastructure for the PyTorch organization. For example, this repo hosts the logic to track disabled tests and slow tests, as well as our continuous integration jobs HUD/dashboard.
https://hud.pytorch.org/

More robust runner autoscaling #5840

Open ZainRizvi opened 1 month ago

ZainRizvi commented 1 month ago

Goal: Reduce job queuing by increasing the self-hosted runner fleet's autoscaling reliability in the face of failed or dropped scale-up requests.

The approach

Create a new scheduled lambda function that:

  1. Runs every 15 minutes
  2. Queries ClickHouse for jobs that have been queued for more than half an hour
  3. Checks the runner types of those jobs to see which ones are self-hosted
  4. Invokes the scale-up function to request enough runners of each type to handle the outstanding jobs
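
A minimal sketch of steps 2 to 4, to make the intended flow concrete. Everything named here is hypothetical and not taken from the test-infra codebase: the job record shape (`runner_type`, `queued_seconds`), the `fetch_queued_jobs` and `scale_up` callables, and the example runner labels are assumptions for illustration. The ClickHouse query and the real scale-up lambda are passed in as plain callables so the counting logic can be read and tested on its own; the 15-minute schedule would be configured on the lambda trigger rather than in this code.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Mapping, Set

# "Queued for over half an hour", per step 2 of the proposal.
QUEUE_THRESHOLD_SECONDS = 30 * 60


def runners_needed(
    queued_jobs: Iterable[Mapping[str, object]],
    self_hosted_types: Set[str],
    threshold_seconds: int = QUEUE_THRESHOLD_SECONDS,
) -> Dict[str, int]:
    """Count long-queued jobs per self-hosted runner type.

    Each job is assumed (hypothetically) to look like
    {"runner_type": str, "queued_seconds": int}.
    """
    counts: Counter = Counter()
    for job in queued_jobs:
        runner_type = job["runner_type"]
        # Skip jobs that have not been waiting long enough.
        if job["queued_seconds"] <= threshold_seconds:
            continue
        # Skip runner types we do not manage (e.g. GitHub-hosted runners).
        if runner_type not in self_hosted_types:
            continue
        counts[runner_type] += 1
    return dict(counts)


def run_once(
    fetch_queued_jobs: Callable[[], Iterable[Mapping[str, object]]],
    self_hosted_types: Set[str],
    scale_up: Callable[[str, int], None],
) -> Dict[str, int]:
    """One scheduled invocation: query, filter, then request scale-up.

    fetch_queued_jobs stands in for the ClickHouse query and scale_up
    stands in for invoking the existing scale-up function.
    """
    needed = runners_needed(fetch_queued_jobs(), self_hosted_types)
    for runner_type, count in sorted(needed.items()):
        scale_up(runner_type, count)
    return needed
```

This sketch requests one runner per long-queued job. It does not account for runners that are already starting up, so a real implementation would likely need to subtract in-flight capacity or cap the request per runner type to avoid over-provisioning.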