openedx / openedx-aspects

Aspects - Analytics for Open edX
Apache License 2.0

Add helm charts to Harmony #116

Open bmtcril opened 8 months ago

bmtcril commented 8 months ago

To support scalable deployments of the Aspects infrastructure, we would like to add the EduNEXT production Helm charts to the Harmony project. Specifically, these would support:

Ian2012 commented 1 month ago

Autoscaling

Autoscaling can be implemented using tutor-contrib-pod-autoscaling:

from tutorpod_autoscaling.hooks import AUTOSCALING_CONFIG

@AUTOSCALING_CONFIG.add()
def _add_my_autoscaling(autoscaling_config):
    # Register autoscaling settings for the Ralph and Superset deployments.
    autoscaling_config["ralph"] = {
        "enable_hpa": True,
        "memory_request": "300Mi",
        "cpu_request": 0.25,
        "memory_limit": "1200Mi",
        "cpu_limit": 1,
        "min_replicas": 1,
        "max_replicas": 10,
        "avg_cpu": 300,
        "avg_memory": "",
        "enable_vpa": False,
    }
    autoscaling_config["superset"] = {
        "enable_hpa": True,
        "memory_request": "300Mi",
        "cpu_request": 0.25,
        "memory_limit": "1200Mi",
        "cpu_limit": 1,
        "min_replicas": 1,
        "max_replicas": 10,
        "avg_cpu": 300,
        "avg_memory": "",
        "enable_vpa": False,
    }
    return autoscaling_config

For the actual values, we can reference the Ralph Helm Chart and the Superset Helm Chart. We don't use the Superset workers extensively, but it would be good to have autoscaling values for them too.
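To help reason about min_replicas, max_replicas and the target metric when picking values, here is a minimal sketch of the replica calculation documented for the Kubernetes HorizontalPodAutoscaler. It assumes the plugin's settings map onto a standard HPA. The function name and the sample numbers are illustrative and not part of the plugin, and the sketch ignores the HPA's tolerance and stabilization behaviour.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Illustrative helper: approximate the HPA scaling rule.

    desired = ceil(current_replicas * current_metric / target_metric),
    clamped to the [min_replicas, max_replicas] range.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical readings: 2 pods whose average metric is 1.5x the target.
print(desired_replicas(2, 450, 300, min_replicas=1, max_replicas=10))
```

With a rule like this, max_replicas caps how far a load spike can scale the deployment, so it is worth sizing it against what the cluster can actually schedule.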

Celery

By default, Celery workers run with a process pool, which assumes that all tasks are CPU-intensive. Aspects tasks, however, are mainly I/O-bound: they make one or more calls to Redis (for batching) or to Ralph (which in turn calls ClickHouse), and the CPU is idle most of the time. At edunext, we have developed a Tutor Celery plugin to manage multiple Celery queues. With it, we tested switching the default LMS worker deployment to a gevent pool, which uses lightweight threads, with concurrency set to 100. This significantly improved task performance.
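As a rough illustration of why a high-concurrency pool helps with I/O-bound work, the sketch below compares running simulated tasks one at a time against overlapping them. It uses the standard library's ThreadPoolExecutor as a stand-in for gevent greenlets, and time.sleep as a stand-in for a network call to Redis or Ralph. All names here are hypothetical and nothing in it is Celery or Aspects code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(delay):
    """Hypothetical task that only waits, like a call to Redis or Ralph."""
    time.sleep(delay)  # CPU is idle while "waiting on the network"
    return delay


def run_sequential(n_tasks, delay):
    """Run tasks one after another, as a single worker process would."""
    start = time.perf_counter()
    for _ in range(n_tasks):
        fake_io_task(delay)
    return time.perf_counter() - start


def run_concurrent(n_tasks, concurrency, delay):
    """Overlap the waits using a pool of lightweight workers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(fake_io_task, [delay] * n_tasks))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"sequential: {run_sequential(20, 0.02):.2f}s")
    print(f"concurrent: {run_concurrent(20, 20, 0.02):.2f}s")
```

Because each task spends its time waiting rather than computing, the concurrent run finishes in roughly the time of one task instead of the sum of all of them. The same reasoning is what motivates a gevent pool for these workers.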

The plan would be:

ClickHouse

Support for the ClickHouse operator will be added to Harmony, along with examples and documentation for running it in production with Aspects.