openclimatefix / ocf-infrastructure

Infrastructure code for OCF's cloud environments

Which orchestrator to use #265

Closed peterdudfield closed 1 year ago

peterdudfield commented 1 year ago

We currently use AWS ECS to orchestrate our tasks, but we're thinking about upgrading. A quick diagram of our setup is roughly here:

| Tool | UI | Triggers | Managed service cost | Self-run cost | Comments | Next steps |
|---|---|---|---|---|---|---|
| ECS | No | No | Free | Free | Simple, but can't trigger other ECS tasks to run | |
| AWS SNS and SQS | No | Yes | Cheap | No | Could use other message services, as we don't want to make the code AWS-specific | |
| Airflow | Yes, but not pretty | Yes | $400 | ~$30 | Slightly tricky to run locally | See if we can run it on ELB |
| Dagster | Yes, pretty | Yes | $0.04 / compute minute | ~$30 | Can't trigger ECS tasks, so would have to upgrade to Kubernetes (+$70 AWS / +$0 GCP) | Look at making an ECS task operator |
| Prefect | Yes, but so far can't figure out how to kick off runs/backfills | | $450 | ~$30 | | Figure out how to trigger tasks on ECS |
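For context on the "Triggers" column: at the lowest level, any of these tools could kick off an ECS task through the AWS API. A minimal sketch using boto3's `run_task`, where the cluster, task definition, and subnet names are all hypothetical placeholders:

```python
# Minimal sketch: trigger an existing ECS task definition from Python via boto3.
# All AWS resource names below are hypothetical placeholders.
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

response = ecs.run_task(
    cluster="ocf-cluster",             # hypothetical cluster name
    taskDefinition="forecast-task",    # hypothetical task definition
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],  # hypothetical subnet
            "assignPublicIp": "ENABLED",
        }
    },
)

# The call returns immediately; the task runs asynchronously on ECS.
print("Started task:", response["tasks"][0]["taskArn"])
```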

Interesting questions include

peterdudfield commented 1 year ago

@devsjc @JackKelly just to get thoughts on the paper

JackKelly commented 1 year ago

I'll leave you guys to decide what's best for production - you guys are the experts there!

FWIW, when I do my ML research, I'm planning to use Dagster on-premises to manage my whole "ML data preparation" pipeline. Some advantages I hope to get from using Dagster include:

On the topic of using Dagster on ECS, I'm sure you've seen this already but, if not, here's a Dagster docs page on Launching runs on ECS.

peterdudfield commented 1 year ago

Thanks @JackKelly, yes, those are really useful, particularly the ML on-premises points.

Yes, we know that Dagster can run on ECS, so it can run various bits of Python code as ECS tasks, but unfortunately Dagster can't trigger Docker containers on ECS - https://github.com/dagster-io/dagster/issues/6362
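In the meantime, an "ECS task operator" could be hand-rolled as a Dagster op that calls the AWS API directly. A minimal sketch, assuming boto3 credentials are configured; the cluster and task definition names are hypothetical:

```python
# Sketch of a hand-rolled Dagster "ECS task op": start a task, wait for it to
# stop, and fail the op on a non-zero exit code. Resource names are hypothetical.
import boto3
from dagster import job, op


@op
def run_forecast_on_ecs(context) -> None:
    ecs = boto3.client("ecs")
    response = ecs.run_task(
        cluster="ocf-cluster",           # hypothetical
        taskDefinition="forecast-task",  # hypothetical
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],  # hypothetical
                "assignPublicIp": "ENABLED",
            }
        },
    )
    task_arn = response["tasks"][0]["taskArn"]
    context.log.info(f"Started ECS task {task_arn}")

    # Block until the container stops, then surface its exit code to Dagster.
    ecs.get_waiter("tasks_stopped").wait(cluster="ocf-cluster", tasks=[task_arn])
    described = ecs.describe_tasks(cluster="ocf-cluster", tasks=[task_arn])
    exit_code = described["tasks"][0]["containers"][0].get("exitCode")
    if exit_code != 0:
        raise RuntimeError(f"ECS task failed with exit code {exit_code}")


@job
def forecast_pipeline():
    run_forecast_on_ecs()
```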

peterdudfield commented 1 year ago

[Screenshot, 2023-06-27 09:24]

Found a way to re-run tasks in Prefect. I've not yet been successful in triggering tasks on ECS.

peterdudfield commented 1 year ago

I had to set `task_run["overrides"] = {}` to stop another container in the task definition from being created.

peterdudfield commented 1 year ago

I used this to get a docker compose working for Prefect - https://github.com/flavienbwk/prefect-docker-compose/tree/main
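As a smoke test for that setup, a minimal Prefect 2 flow can be run against the compose stack. Just a sketch; the flow and task names are hypothetical:

```python
# Minimal Prefect 2 flow to smoke-test a local Prefect deployment.
from prefect import flow, task


@task
def say_hello(name: str) -> str:
    return f"Hello, {name}!"


@flow
def hello_flow() -> None:
    print(say_hello("OCF"))


if __name__ == "__main__":
    hello_flow()
```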

peterdudfield commented 1 year ago

Got Airflow running locally with:

```yaml
---
version: '3.4'

x-common:
  &common
  build: .
  user: "${AIRFLOW_UID}:0"
  env_file:
    - .env
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    - /var/run/docker.sock:/var/run/docker.sock

x-depends-on:
  &depends-on
  depends_on:
    postgres:
      condition: service_healthy
    airflow-init:
      condition: service_completed_successfully

services:
  postgres:
    image: postgres:13
    container_name: postgres
    ports:
      - "5434:5432"
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    env_file:
      - .env

  scheduler:
    # merge both extension fields; repeating the '<<' key is rejected by strict YAML parsers
    <<: [*common, *depends-on]
    container_name: airflow-scheduler
    command: scheduler
    restart: on-failure
    ports:
      - "8793:8793"

  webserver:
    <<: [*common, *depends-on]
    container_name: airflow-webserver
    restart: always
    command: webserver
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 30s
      retries: 5

  airflow-init:
    <<: *common
    container_name: airflow-init
    entrypoint: /bin/bash
    command:
      - -c
      - |
        mkdir -p ./sources/logs ./sources/dags ./sources/plugins
        chown -R "${AIRFLOW_UID}:0" ./sources/{logs,dags,plugins}
        exec /entrypoint airflow version
```
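Since this mounts /var/run/docker.sock into the Airflow containers, DAGs can start sibling containers with the Docker provider's DockerOperator. A minimal sketch (the image and DAG id are hypothetical, and the apache-airflow-providers-docker package must be installed):

```python
# Sketch of a DAG that launches a sibling container via the docker socket
# mounted in the compose file above. Image and DAG id are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="docker_smoke_test",
    start_date=datetime(2023, 6, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_container = DockerOperator(
        task_id="run_container",
        image="python:3.10-slim",                 # hypothetical image
        command="python -c \"print('hello from a container')\"",
        docker_url="unix://var/run/docker.sock",  # the socket mounted above
    )
```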
jrstats commented 1 year ago

Hi Peter, I work in the data team at Ember and have been playing around with Dagster for a couple of months. These are a collection of my thoughts:

Positives:

Some negatives:

Overall, I like it so far, and we're going to be trialling a few data pipelines on it over the next couple of weeks.

peterdudfield commented 1 year ago

Thanks @jrstats, really appreciate your input. How did you deploy Dagster on GCP? We looked at some managed services and found them a bit too expensive for what we need, but we're exploring deploying from a docker compose file.

JackKelly commented 1 year ago

Awesome work!

On the topic of triggering ECS from the data pipeline orchestration tool...

Just curious: Is there a specific reason why we want to run each pipeline step in a separate container on ECS? Rather than, say, running all steps on the same VM, with each step running either in separate processes or in separate containers? Sorry if this is a dumb question!

peterdudfield commented 1 year ago

> Awesome work!
>
> On the topic of triggering ECS from the data pipeline orchestration tool...
>
> Just curious: Is there a specific reason why we want to run each pipeline step in a separate container on ECS? Rather than, say, running all steps on the same VM, with each step running either in separate processes or in separate containers? Sorry if this is a dumb question!

We totally could run it all together. I think the answer is modularity and scalability. Of course, we can still be modular and scalable and run it all on one VM. Running it in separate Docker containers on ECS makes it:

jrstats commented 1 year ago

> Thanks @jrstats, really appreciate your input. How did you deploy Dagster on GCP? We looked at some managed services and found them a bit too expensive for what we need, but we're exploring deploying from a docker compose file.

Yeah, we used a docker compose approach. I found this repo and accompanying article really useful for this.

Similar to that article, we are using a GitHub Action to build the containers and push them to Google Artifact Registry. The next step of the action is to SSH into a VM that we have set up, pull the containers there, and use docker compose to start them up.

peterdudfield commented 1 year ago

Thanks @jrstats

General info: I managed to write a docker compose file and deploy it to AWS ELB using Terraform today. A few bugs still to work out, but some progress.

Docker compose file:

version: "3"

services:
  # TODO remove and use RDS
  postgres:
    image: postgres:13
    container_name: postgres
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    ports:
      - "5434:5432"

#
  scheduler:
    depends_on:
      - "postgres"
      - "airflowinit"
    image: apache/airflow:2.6.2
    container_name: airflow-scheduler
    command: scheduler
    restart: on-failure
    ports:
      - "8793:8793"
    environment:
      AIRFLOW__CORE__FERNET_KEY: "UKMzEm3yIuFYEq1y3-2FxPNWSVwRASpahmQ9kQfEr8E="
      AIRFLOW__CORE__EXECUTOR: "LocalExecutor"
      AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: "True"
      AIRFLOW__CORE__LOAD_EXAMPLES: "False"
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: "postgresql+psycopg2://airflow:airflow@postgres/airflow"

  webserver:
    image: apache/airflow:2.6.2
    container_name: airflow-webserver
    command: webserver -p 80
    depends_on:
      - "postgres"
      - "airflowinit"
    ports:
      - "80:80"
    restart: always
    environment:
      AIRFLOW__CORE__FERNET_KEY: "UKMzEm3yIuFYEq1y3-2FxPNWSVwRASpahmQ9kQfEr8E="
      AIRFLOW__CORE__EXECUTOR: "LocalExecutor"
      AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: "True"
      AIRFLOW__CORE__LOAD_EXAMPLES: "False"
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: "postgresql+psycopg2://airflow:airflow@postgres/airflow"

  airflowinit:
    image: apache/airflow:2.6.2
    depends_on: ["postgres"]
    environment:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: "postgresql+psycopg2://airflow:airflow@postgres/airflow"
      _AIRFLOW_DB_UPGRADE: 'True'
      _AIRFLOW_WWW_USER_CREATE: 'True'
      _AIRFLOW_WWW_USER_USERNAME: 'airflow'
      _AIRFLOW_WWW_USER_PASSWORD: 'airflow'
    command: >
      bash -c "pip install apache-airflow[amazon]
      && mkdir -p ./sources/logs ./sources/dags ./sources/plugins
      && airflow db init"