stanfordnmbl / opencap-api

Apache License 2.0

Auto Scaling Central Issue #174

Open suhlrich opened 3 months ago

suhlrich commented 3 months ago

We'd like to have surge GPU capacity using AWS auto-scaling. We will have base capacity that is always running, so auto-scaling will only kick in once the queue reaches a certain length.

@olehkorkh-planeks @sashasimkin @antoinefalisse please read over and update this.

sashasimkin commented 3 months ago

Hi @suhlrich, I have a few small comments on the logic:

add a variable in cloudwatch desired_asg_gpu_instances that will get updated by the celery queue check and checked by the auto-scaling rule. @sashasimkin

There are no variables in CloudWatch per se; all you need to do is call https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/cloudwatch/client/put_metric_data.html from the Celery task.

add celery task that checks number of trials and updates desired_asg_gpu_instances on cloudwatch

I advise that instead of maintaining a desired_asg_gpu_instances metric, we simply call put_metric_data with the count of pending trials, something like the example below.

import datetime
import os

import boto3

# Initialize the CloudWatch client
cloudwatch = boto3.client('cloudwatch')

env = os.environ.get('ENV', 'dev')  # current deployment environment

metric_data = [{
    'MetricName': 'opencap_trials_pending',
    'Dimensions': [
        {
            'Name': 'Environment',
            'Value': env
        }
    ],
    'Timestamp': datetime.datetime.now(datetime.timezone.utc),
    'Value': 100,  # the count of pending trials
    'Unit': 'Count'
}]

try:
    response = cloudwatch.put_metric_data(
        Namespace='YourApplicationNamespace',
        MetricData=metric_data
    )
    print("Metric successfully uploaded")
except Exception as e:
    print("Failed to upload metric:", e)

automatically start EC2 machine with opencap-core docker + IAM roles

I advise that we use ECS on EC2 to simplify running the image you are pushing to ECR. I saw some code related to this in the infra repo, but it needs checking and polishing to make it work in general and with auto-scaling.
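
As a rough sketch of the ECS side (with hypothetical names and values; the real definitions belong in opencap-infrastructure), a GPU task definition targeting EC2 could be registered like this:

import boto3

ecs = boto3.client('ecs')

response = ecs.register_task_definition(
    family='opencap-core-gpu',        # hypothetical family name
    requiresCompatibilities=['EC2'],  # run on the EC2 (not Fargate) launch type
    containerDefinitions=[{
        'name': 'opencap-core',
        'image': '<account>.dkr.ecr.<region>.amazonaws.com/opencap-core:latest',
        'memory': 16384,              # MiB; tune to the chosen instance type
        'resourceRequirements': [{'type': 'GPU', 'value': '1'}],
        'essential': True,
    }],
)

An ECS service backed by an EC2 capacity provider then keeps the desired number of these tasks running, and the capacity provider's managed termination protection covers the scale-in protection mentioned in the next item.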

create ASG scaling logic that gets desired_asg_gpu_instances from cloudwatch and spins up/down machines. Spun up machines should have scale-in protection.

This will just be target tracking (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-autoscaling-targettracking.html) on the agreed value of opencap_trials_pending.
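
For illustration, such a policy could be registered through Application Auto Scaling as sketched below. The cluster/service names, min/max capacity, and cooldowns are placeholders, and it assumes the published metric is expressed as backlog per running task (the 5-per-machine figure discussed further down):

import boto3

autoscaling = boto3.client('application-autoscaling')

# Hypothetical resource ID; the agreed values belong in opencap-infrastructure.
resource_id = 'service/opencap-cluster/opencap-core-gpu'

autoscaling.register_scalable_target(
    ServiceNamespace='ecs',
    ResourceId=resource_id,
    ScalableDimension='ecs:service:DesiredCount',
    MinCapacity=1,   # always-on base capacity
    MaxCapacity=10,  # surge ceiling
)

autoscaling.put_scaling_policy(
    PolicyName='opencap-trials-pending-target-tracking',
    ServiceNamespace='ecs',
    ResourceId=resource_id,
    ScalableDimension='ecs:service:DesiredCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 5.0,  # ~5 pending trials per running task
        'CustomizedMetricSpecification': {
            'MetricName': 'opencap_trials_pending',
            'Namespace': 'YourApplicationNamespace',
            'Dimensions': [{'Name': 'Environment', 'Value': 'prod'}],
            'Statistic': 'Average',
            'Unit': 'Count',
        },
        'ScaleOutCooldown': 60,
        'ScaleInCooldown': 300,
    },
)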

suhlrich commented 3 months ago

@sashasimkin: Is it not possible to use an n_desired_asg_instances variable for the ASG target? This way, we can implement whatever logic we like here (https://github.com/stanfordnmbl/opencap-api/issues/173), accessible both by the ASG and from within the GPU servers so they can know when to shut down.

sashasimkin commented 3 months ago

@suhlrich it is possible to use a desired_asg_gpu_instances metric for scaling the number of instances, but that's not how it's usually done, which is why I suggested a different, simpler approach.

In general, the application doesn't manage the number of instances that process its jobs; that logic is implemented in the infrastructure layer based on various factors.

I've replied here about termination logic.

suhlrich commented 3 months ago

@sashasimkin So we can implement similar logic to here: https://github.com/stanfordnmbl/opencap-api/issues/173 in the infrastructure level?

sashasimkin commented 3 months ago

@suhlrich yes - exactly, and the logic will be simpler.

I.e. instead of calculating the number of instances and tracking the numbers before/after scaling, we will have simpler target tracking that periodically checks whether the number of jobs is above or below 5*n_machines and scales out/in accordingly, between the min and max auto-scaling group size.
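
For example, with a target of 5 pending trials per machine, a queue of 23 trials would drive the group toward ceil(23/5) = 5 instances, clipped to the configured min and max group size.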

antoinefalisse commented 2 months ago

@sashasimkin let's use g5.2xlarge instances.

sashasimkin commented 2 months ago

@antoinefalisse ✔️ https://github.com/stanfordnmbl/opencap-infrastructure/commit/0c06a317bd3efaa84f66ac26d791c5977df9398e