stanfordnmbl / opencap-core

Main OpenCap processing pipeline
Apache License 2.0

Backend scaling -- stop instance #113

Open suhlrich opened 1 year ago

suhlrich commented 1 year ago

We need to make sure the backend machines aren't processing a job when they stop, so when a machine has not received a job for a certain amount of time, it should exit the app.py loop, remove its scale-in protection, and be ready for autoscaling to turn it off. There needs to be an env variable distinguishing always-on machines from asg machines.
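
A minimal sketch of that behavior, assuming a hypothetical IS_AUTOSCALING_INSTANCE env variable and placeholder callables for the existing polling, processing, and unprotect logic (none of these names are from the codebase):

import os
import time

def run_worker_loop(poll_for_job, process_job, unprotect_instance,
                    idle_timeout_minutes=30):
    """Process jobs until an ASG machine has been idle for idle_timeout_minutes."""
    # Env variable distinguishing always-on machines from ASG machines (hypothetical name).
    is_asg_machine = os.environ.get("IS_AUTOSCALING_INSTANCE", "false").lower() == "true"

    last_job_time = time.time()
    while True:
        job = poll_for_job()
        if job is not None:
            process_job(job)
            last_job_time = time.time()
            continue

        idle_minutes = (time.time() - last_job_time) / 60
        if is_asg_machine and idle_minutes > idle_timeout_minutes:
            unprotect_instance()  # remove scale-in protection
            break                 # exit the loop so autoscaling can turn the instance off

        time.sleep(10)  # no job; wait before polling again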

sashasimkin commented 5 months ago
  1. I'll provide the bash script to disable the protection later, but the important part of it is configuring the IAM role & instance profile so they have permission to execute that command.
  2. I believe there's no need to have access to any metric; the logic should be: if no jobs for X minutes --> pause work.
  3. It's important to sleep in the cycle and not stop the process, because upon stopping, ECS would restart the container, which might have adverse effects such as resetting counters.
suhlrich commented 5 months ago
  1. The problem lies with our base capacity. We want to turn off the EC2 instances once the number of jobs falls below some threshold that we think is manageable on our on-prem server. If we only turn off EC2 instances when they don't get a job, then we only turn them off when the queue is empty, which could leave them running longer than necessary if they happen to pick up a job while the on-prem servers are not busy. If we query the number of desired machines, then we can have more complex logic regarding when to turn off.
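
For reference, one way the "number of desired machines" could be read from the worker, assuming it knows (or can look up) its Auto Scaling group name, is describe_auto_scaling_groups; this is only a sketch of that option:

import boto3

def get_desired_capacity(asg_name):
    """Return the DesiredCapacity of the given Auto Scaling group.

    Requires the autoscaling:DescribeAutoScalingGroups permission.
    """
    client = boto3.client('autoscaling')
    response = client.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    return response['AutoScalingGroups'][0]['DesiredCapacity']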
sashasimkin commented 5 months ago

hey @suhlrich, I understand the base capacity problem, and I was wrong regarding the inactivity logic. At the same time, what we can do is change the logic to something like:

if number_of_jobs < settings.ONPREM_CAPACITY:
    unprotect_instance()
    pause_work()

This approach leaves the auto-scaling logic to the policies configured in AWS and separates concerns between what the application does and what the infra control plane does.
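
A sketch of how that check could sit in the worker loop, keeping the earlier point about sleeping rather than exiting in mind (so ECS does not restart the container); get_number_of_pending_jobs and unprotect_instance are placeholders for whatever the final helpers end up being called:

import time

def maybe_pause_worker(get_number_of_pending_jobs, unprotect_instance,
                       onprem_capacity, pause_seconds=60):
    """If the pending-job count is below the on-prem capacity, drop scale-in
    protection and pause, leaving the actual scale-in decision to the
    configured AWS autoscaling policies."""
    number_of_jobs = get_number_of_pending_jobs()
    if number_of_jobs is not None and number_of_jobs < onprem_capacity:
        unprotect_instance()
        time.sleep(pause_seconds)  # pause work; the ASG may stop this instance meanwhile
        return True
    return False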

suhlrich commented 5 months ago

Sounds good. It sounds like we should create an API endpoint that returns number_of_jobs (unless there's an easy way to get it from CloudWatch, but an HTTP request sounds easy to me).
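
If the endpoint route is chosen, the worker side could be as small as the sketch below; the path and field name are purely illustrative, not an existing opencap-api endpoint:

import requests

def get_number_of_jobs(api_base_url, auth_token):
    """Fetch the current pending-job count from a hypothetical status endpoint."""
    response = requests.get(
        f"{api_base_url}/trials/pending_count/",  # illustrative path only
        headers={"Authorization": f"Token {auth_token}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["number_of_jobs"]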

sashasimkin commented 5 months ago

Yes, the endpoint would work; its biggest value is that it provides realtime and the most precise data possible (as it is the source of truth).

But CloudWatch might be easier to implement - just adding AWS permissions to the ECS role - and a better fit: we can get an aggregated value over the last x minutes with get_metric_statistics/get_metric_data.
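
For reference, the same aggregated read via get_metric_data (the other call mentioned above) would look roughly like this; the namespace and metric name match the ones used later in this thread:

import boto3
from datetime import datetime, timedelta

def get_pending_trials_via_metric_data(minutes=5):
    """Average of the opencap_trials_pending metric over the last few minutes,
    using get_metric_data instead of get_metric_statistics."""
    client = boto3.client('cloudwatch')
    end_time = datetime.utcnow()
    response = client.get_metric_data(
        MetricDataQueries=[{
            'Id': 'pending',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'Custom/opencap',
                    'MetricName': 'opencap_trials_pending',
                },
                'Period': 60,
                'Stat': 'Average',
            },
        }],
        StartTime=end_time - timedelta(minutes=minutes),
        EndTime=end_time,
    )
    values = response['MetricDataResults'][0]['Values']
    return sum(values) / len(values) if values else None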

suhlrich commented 5 months ago

How would we query CloudWatch? Would this be another bash script? I guess it is probably better to be looking at the same value that the ASG rule is looking at.

sashasimkin commented 5 months ago

hey @suhlrich, so here's the code that will read the metric submitted by https://github.com/stanfordnmbl/opencap-api/issues/173#issuecomment-2093012528:

This code requires the cloudwatch:GetMetricStatistics permission but, similar to the above, I'll add it to the worker ECS service permissions.

import boto3
from datetime import datetime, timedelta

def get_metric_average(namespace, metric_name, start_time, end_time, period):
    """
    Fetch the average value of a specific metric from AWS CloudWatch.

    Parameters:
    - namespace (str): The namespace for the metric data.
    - metric_name (str): The name of the metric.
    - start_time (datetime): Start time for the data retrieval.
    - end_time (datetime): End time for the data retrieval.
    - period (int): The granularity, in seconds, of the data points returned.
    """
    client = boto3.client('cloudwatch')
    response = client.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        StartTime=start_time,
        EndTime=end_time,
        Period=period,
        Statistics=['Average']  # Request the average statistic
    )
    return response

def get_number_of_pending_trials():
    # Time range setup for the last 1 minute
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(minutes=1)
    period = 60  # Period in seconds

    metric_name = 'opencap_trials_pending'

    # Fetch the metric data
    stats = get_metric_average(
        'Custom/opencap',
        metric_name,
        start_time, end_time, period
    )

    if stats['Datapoints']:
        average = stats['Datapoints'][0]['Average']
        print(f"Average value of '{metric_name}' over the last minute: {average}")
    else:
        print("No data points found.")
        # Maybe raise an exception or do nothing to have the control loop retry this call later
        return None

    return average

sashasimkin commented 5 months ago

@suhlrich here's a bash script (and boto3 code, which should be a better fit) for un-protecting the instance:

#!/bin/bash
# Retrieve the Instance ID
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
# Retrieve the Auto Scaling Group name associated with this instance
ASG_NAME=$(aws autoscaling describe-auto-scaling-instances --instance-ids $INSTANCE_ID --query 'AutoScalingInstances[0].AutoScalingGroupName' --output text)
# Remove protection from this instance in its first (and only) autoscaling group
aws autoscaling set-instance-protection --instance-ids $INSTANCE_ID --auto-scaling-group-name $ASG_NAME --no-protected-from-scale-in

Similar approach, but as boto3 to be called from Python (as in #117), because you will need boto3 for the CloudWatch interactions anyway:

import boto3
import requests

def get_instance_id():
    """Retrieve the instance ID from EC2 metadata."""
    response = requests.get("http://169.254.169.254/latest/meta-data/instance-id")
    return response.text

def get_auto_scaling_group_name(instance_id):
    """Retrieve the Auto Scaling Group name using the instance ID."""
    client = boto3.client('autoscaling')
    response = client.describe_auto_scaling_instances(InstanceIds=[instance_id])
    asg_name = response['AutoScalingInstances'][0]['AutoScalingGroupName']
    return asg_name

def set_instance_protection(instance_id, asg_name, protect):
    """Set or remove instance protection."""
    client = boto3.client('autoscaling')
    client.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ProtectedFromScaleIn=protect
    )

def unprotect_current_instance():
    instance_id = get_instance_id()
    asg_name = get_auto_scaling_group_name(instance_id)
    set_instance_protection(instance_id, asg_name, protect=False)

Either version will require the following permissions for the processing worker process, but this is just for reference, as I'll be adding them in Terraform.

            "Action": [
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:SetInstanceProtection"
            ],

P.S. Regarding your original point 2): you don't need to configure any dedicated AWS credentials for this, as the permissions will be inferred from the environment through the ECS task role.