Open bmtcril opened 8 months ago
Autoscaling can be implemented using tutor-contrib-pod-autoscaling:
from tutorpod_autoscaling.hooks import AUTOSCALING_CONFIG


@AUTOSCALING_CONFIG.add()
def _add_my_autoscaling(autoscaling_config):
    autoscaling_config["ralph"] = {
        "enable_hpa": True,
        "memory_request": "300Mi",
        "cpu_request": 0.25,
        "memory_limit": "1200Mi",
        "cpu_limit": 1,
        "min_replicas": 1,
        "max_replicas": 10,
        "avg_cpu": 300,
        "avg_memory": "",
        "enable_vpa": False,
    }
    autoscaling_config["superset"] = {
        "enable_hpa": True,
        "memory_request": "300Mi",
        "cpu_request": 0.25,
        "memory_limit": "1200Mi",
        "cpu_limit": 1,
        "min_replicas": 1,
        "max_replicas": 10,
        "avg_cpu": 300,
        "avg_memory": "",
        "enable_vpa": False,
    }
    return autoscaling_config
For the actual values, we can reference the Ralph Helm Chart and the Superset Helm Chart. We don't use the Superset workers extensively, but it would be a good addition to have autoscaling values for them too.
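Whatever values end up being taken from those charts, it may help to sanity-check each entry before registering it. The sketch below is only an illustration: `memory_to_bytes` and `check_autoscaling_entry` are hypothetical helpers, not part of tutor-contrib-pod-autoscaling, and the checks assume the keys shown in the example above.

```python
# Hypothetical helpers for illustration; not part of tutor-contrib-pod-autoscaling.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def memory_to_bytes(quantity):
    """Convert a Kubernetes-style memory string such as "300Mi" to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: treat as a plain byte count


def check_autoscaling_entry(name, entry):
    """Return a list of human-readable problems found in one service entry."""
    problems = []
    if entry["cpu_request"] > entry["cpu_limit"]:
        problems.append(f"{name}: cpu_request exceeds cpu_limit")
    if memory_to_bytes(entry["memory_request"]) > memory_to_bytes(entry["memory_limit"]):
        problems.append(f"{name}: memory_request exceeds memory_limit")
    if entry["min_replicas"] > entry["max_replicas"]:
        problems.append(f"{name}: min_replicas exceeds max_replicas")
    if entry["enable_hpa"] and not (entry["avg_cpu"] or entry["avg_memory"]):
        problems.append(f"{name}: HPA enabled but no avg_cpu or avg_memory target")
    return problems


# Same values as the "ralph" entry in the example above.
ralph = {
    "enable_hpa": True,
    "memory_request": "300Mi",
    "cpu_request": 0.25,
    "memory_limit": "1200Mi",
    "cpu_limit": 1,
    "min_replicas": 1,
    "max_replicas": 10,
    "avg_cpu": 300,
    "avg_memory": "",
    "enable_vpa": False,
}

print(check_autoscaling_entry("ralph", ralph))  # prints []
```

A check like this could run inside the `AUTOSCALING_CONFIG` callback before returning the config, so a typo in a request or limit is caught before anything is rendered.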
The default Celery workers run with a process pool, which assumes that all tasks are CPU-intensive. Aspects tasks, however, are mainly I/O-bound: each one performs a call or a set of calls to Redis (for batching) or to Ralph (which in turn calls ClickHouse) and is CPU-idle most of the time. At edunext, we have developed a Tutor Celery plugin to manage multiple Celery queues. With it, we tested switching the default LMS worker deployment to a gevent pool, which uses lightweight threads, with concurrency set to 100. This significantly improved task performance.
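To illustrate why higher concurrency helps tasks that mostly wait, here is a minimal, self-contained sketch. It is not the Celery or gevent setup itself: it uses a standard-library thread pool as a stand-in for gevent's lightweight threads, and `fake_io_task` simulates a call to Redis or Ralph with a short sleep. The task count and sleep duration are arbitrary assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(_):
    """Stand-in for a task that waits on Redis or Ralph; sleeping uses no CPU."""
    time.sleep(0.05)
    return True


def run_batch(task_count, concurrency):
    """Run task_count fake tasks at the given concurrency; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(fake_io_task, range(task_count)))
    assert all(results)
    return time.perf_counter() - start


# One worker processes the tasks back to back; twenty workers overlap the waits.
serial = run_batch(task_count=20, concurrency=1)
parallel = run_batch(task_count=20, concurrency=20)
print(f"concurrency=1: {serial:.2f}s, concurrency=20: {parallel:.2f}s")
```

Because the tasks spend their time waiting rather than computing, the batch with overlapping waits finishes in roughly the time of a single task. On an actual Celery worker, the corresponding switch would be the worker's `--pool gevent` and `--concurrency 100` options.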
The plan would be to add gevent as a dependency of edx-platform.
Support for the ClickHouse operator will be added to Harmony, along with examples and documentation for running Aspects in production.
To support scalable deployments of the Aspects infrastructure, we would like to add the EduNEXT production Helm charts to the Harmony project. Specifically, these would support: