@devsjc @JackKelly just to get thoughts on the paper
I'll leave you guys to decide what's best for production - you guys are the experts there!
FWIW, when I do my ML research, I'm planning to use Dagster on-premises to manage my whole "ML data preparation" pipeline; there are several advantages I'm hoping to get from using Dagster.
On the topic of using Dagster on ECS, I'm sure you've seen this already but, if not, here's a Dagster docs page on Launching runs on ECS.
Thanks @JackKelly, yeah, those are really useful, particularly the on-premises ML bits.
Yeah, we know that Dagster can run on ECS, i.e. run various bits of Python code as ECS tasks, but unfortunately Dagster can't trigger docker containers on ECS - https://github.com/dagster-io/dagster/issues/6362
Found a way to re-run tasks on Prefect, though I've not been successful yet in kicking tasks off in ECS. I had to set

`task_run["overrides"] = {}`

to stop another container in the task definition being created.
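For reference, here's a minimal sketch of what that re-run logic can look like using plain boto3 against the ECS API; the surrounding code isn't shown in this thread, so the cluster name, task ARN, and subnet below are all placeholders:

```python
# Hypothetical sketch: re-run an ECS task from a previous task's
# definition, clearing the overrides so ECS doesn't add an extra
# container spec on top of the task definition.
import boto3

ecs = boto3.client("ecs")

# Look up the task definition used by the previous (e.g. failed) task.
previous = ecs.describe_tasks(
    cluster="my-cluster",  # placeholder cluster name
    tasks=["arn:aws:ecs:eu-west-1:123456789012:task/my-cluster/abc123"],  # placeholder ARN
)
task_definition = previous["tasks"][0]["taskDefinitionArn"]

# Build the RunTask request as a dict of keyword arguments.
task_run = {
    "cluster": "my-cluster",
    "taskDefinition": task_definition,
    "launchType": "FARGATE",
    "count": 1,
    "networkConfiguration": {
        "awsvpcConfiguration": {"subnets": ["subnet-0123456789abcdef0"]},  # placeholder
    },
}

# The key line from above: an empty overrides dict stops another
# container being added to the task definition at run time.
task_run["overrides"] = {}

response = ecs.run_task(**task_run)
print(response["tasks"][0]["taskArn"])
```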
I used this to get a Docker Compose setup working for Prefect: https://github.com/flavienbwk/prefect-docker-compose/tree/main
Got Airflow running locally with this docker-compose file:
```yaml
---
version: '3.4'

x-common:
  &common
  build: .
  user: "${AIRFLOW_UID}:0"
  env_file:
    - .env
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    - /var/run/docker.sock:/var/run/docker.sock

x-depends-on:
  &depends-on
  depends_on:
    postgres:
      condition: service_healthy
    airflow-init:
      condition: service_completed_successfully

services:
  postgres:
    image: postgres:13
    container_name: postgres
    ports:
      - "5434:5432"
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    env_file:
      - .env

  scheduler:
    <<: *common
    <<: *depends-on
    container_name: airflow-scheduler
    command: scheduler
    restart: on-failure
    ports:
      - "8793:8793"

  webserver:
    <<: *common
    <<: *depends-on
    container_name: airflow-webserver
    restart: always
    command: webserver
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 30s
      retries: 5

  airflow-init:
    <<: *common
    container_name: airflow-init
    entrypoint: /bin/bash
    command:
      - -c
      - |
        mkdir -p ./sources/logs ./sources/dags ./sources/plugins
        chown -R "${AIRFLOW_UID}:0" ./sources/{logs,dags,plugins}
        exec /entrypoint airflow version
```
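Since that compose file mounts `/var/run/docker.sock` into the Airflow containers, DAGs can start sibling containers on the host. Here's a minimal sketch of what that looks like, assuming the `apache-airflow-providers-docker` package is installed; the image name and command are placeholders:

```python
# Hypothetical DAG: uses the mounted docker socket to run a pipeline
# step as a sibling container on the host.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="docker_step_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,  # trigger manually while testing
    catchup=False,
) as dag:
    run_step = DockerOperator(
        task_id="run_step",
        image="my-org/my-pipeline-step:latest",  # placeholder image
        command="python main.py",  # placeholder command
        docker_url="unix://var/run/docker.sock",  # the socket mounted above
    )
```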
Hi Peter, I work in the data team at Ember and have been playing around with Dagster for a couple of months. These are a collection of my thoughts:
Positives:
Some negatives:
Overall, I like it so far, and we're going to be trialling a few data pipelines on it over the next couple of weeks.
Thanks @jrstats, really appreciate your input. How did you deploy Dagster on GCP? We have looked at some managed services and found them a bit too expensive for what we need, but are exploring deploying from a docker-compose file.
Awesome work!
On the topic of triggering ECS from the data pipeline orchestration tool...
Just curious: Is there a specific reason why we want to run each pipeline step in a separate container on ECS? Rather than, say, running all steps on the same VM, with each step running either in separate processes or in separate containers? Sorry if this is a dumb question!
> Awesome work!
>
> On the topic of triggering ECS from the data pipeline orchestration tool...
>
> Just curious: Is there a specific reason why we want to run each pipeline step in a separate container on ECS? Rather than, say, running all steps on the same VM, with each step running either in separate processes or in separate containers? Sorry if this is a dumb question!
We totally could run it all together. I think the answer is *modular* and *scalable*. Of course, we can still be modular and scalable and run it all on one VM, but running it in separate docker containers on ECS makes it easier to keep each step decoupled and to scale them independently.
> Thanks @jrstats, really appreciate your input. How did you deploy Dagster on GCP? We have looked at some managed services and found them a bit too expensive for what we need, but are exploring deploying from a docker-compose file.
Yeah, we used a docker compose approach. I found this repo and accompanying article really useful for this.
Similar to that article, we are using a GitHub Action to build the containers and push them to Google Artifact Registry. The next step of the action is to SSH into a VM that we have set up, pull the containers onto it, and use docker compose to start them up.
Thanks @jrstats
General info: I managed to write a docker-compose file and deploy it to AWS ELB using Terraform today. A few bugs still to work out, but some progress.
The docker-compose file:
```yaml
version: "3"
services:
  # TODO: remove and use RDS
  postgres:
    image: postgres:13
    container_name: postgres
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    ports:
      - "5434:5432"

  scheduler:
    depends_on:
      - "postgres"
      - "airflowinit"
    image: apache/airflow:2.6.2
    container_name: airflow-scheduler
    command: scheduler
    restart: on-failure
    ports:
      - "8793:8793"
    environment:
      AIRFLOW__CORE__FERNET_KEY: "UKMzEm3yIuFYEq1y3-2FxPNWSVwRASpahmQ9kQfEr8E="
      AIRFLOW__CORE__EXECUTOR: "LocalExecutor"
      AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: "True"
      AIRFLOW__CORE__LOAD_EXAMPLES: "False"
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: "postgresql+psycopg2://airflow:airflow@postgres/airflow"

  webserver:
    image: apache/airflow:2.6.2
    container_name: airflow-webserver
    command: webserver -p 80
    depends_on:
      - "postgres"
      - "airflowinit"
    ports:
      - "80:80"
    restart: always
    environment:
      AIRFLOW__CORE__FERNET_KEY: "UKMzEm3yIuFYEq1y3-2FxPNWSVwRASpahmQ9kQfEr8E="
      AIRFLOW__CORE__EXECUTOR: "LocalExecutor"
      AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: "True"
      AIRFLOW__CORE__LOAD_EXAMPLES: "False"
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: "postgresql+psycopg2://airflow:airflow@postgres/airflow"

  airflowinit:
    image: apache/airflow:2.6.2
    depends_on: ["postgres"]
    environment:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: "postgresql+psycopg2://airflow:airflow@postgres/airflow"
      _AIRFLOW_DB_UPGRADE: 'True'
      _AIRFLOW_WWW_USER_CREATE: 'True'
      _AIRFLOW_WWW_USER_USERNAME: 'airflow'
      _AIRFLOW_WWW_USER_PASSWORD: 'airflow'
    command: >
      bash -c "pip install apache-airflow[amazon]
      && mkdir -p ./sources/logs ./sources/dags ./sources/plugins
      && airflow db init"
```
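The `pip install apache-airflow[amazon]` in the init service suggests the follow-on step of triggering ECS tasks from a DAG. Here's a minimal sketch using the Amazon provider's `EcsRunTaskOperator`; the cluster, task definition, and subnet are placeholders, not our actual setup:

```python
# Hypothetical DAG: kicks off an existing ECS task definition on Fargate.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(
    dag_id="ecs_task_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    run_consumer = EcsRunTaskOperator(
        task_id="run_consumer",
        cluster="my-cluster",  # placeholder cluster name
        task_definition="my-task-definition",  # placeholder task definition
        launch_type="FARGATE",
        overrides={"containerOverrides": []},  # no per-run overrides
        network_configuration={
            "awsvpcConfiguration": {"subnets": ["subnet-0123456789abcdef0"]},  # placeholder
        },
    )
```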
We currently use AWS ECS to orchestrate our tasks, but are just thinking about upgrading. A quick diagram of our setup is roughly here
Interesting questions include