Open suhlrich opened 5 months ago
Hi @suhlrich. Following this comment I want to provide suggestions on how to implement the "scale up logic", which is in fact - metric submission logic.
cloudwatch:PutMetricsData
permission to submit this metric to CloudWatch. Since you're using Fargate to host the API project I'll create a task to enable this permission on the worker process, so that you can expect the boto3
client just working in the sample code that followssubmit_custom_metric
that should be executed every 1 minute in the scheduled task and submitting the number of trials pending. Make sure that metric_name
is opencap_trials_pending
and namespace
is Custom/opencap-dev
for development and Custom/opencap
for production:
import boto3
def submit_custom_metric(namespace, metric_name, value): """ Submit a custom metric to AWS CloudWatch.
Parameters:
- namespace (str): The namespace for the metric data.
- metric_name (str): The name of the metric.
- value (float): The value associated with the metric.
"""
client = boto3.client('cloudwatch')
response = client.put_metric_data(
Namespace=namespace,
MetricData=[
{
'MetricName': metric_name,
'Value': value,
'Unit': 'Count'
}
]
)
return response
def submit_number_of_pending_trials_to_cloudwatch():
response = submit_custom_metric(
'Custom/opencap-dev', # Or 'Custom/opencap' for production
'opencap_trials_pending',
current_queue_length,
)
print("Metric submitted successfully:", response)
Celery will check the queue length every 20 seconds and update the
desired_asg_gpu_instances
variable in cloudwatch accordingly.For counting stopped trials, we can use some of the logic at the beginning of this (https://github.com/stanfordnmbl/opencap-api/blob/a937c734407c2322001c04eb422a9b1eae1e4316/mcserver/views.py#L1401)
Config variables somewhere when celery starts?
autoscale_gpus_on
- some way to toggle whether we are using autoscalingqueue_length_before_scaling_start
- how long will we let the queue be before starting asg machinesqueue_length_before_new_machine
- how many jobs in the queue per asg machine.Logic that gets executed on celery. This is functional python code: