meltano / files-airflow

MIT License

Task chain rather than independent tasks #20

Closed pandemicsyn closed 2 years ago

pandemicsyn commented 2 years ago

After yesterday's conversation and the confusion about tasks, schedules, jobs, and Airflow, I wanted to make sure that we're actually generating and representing the tasks in Airflow correctly. I'm second-guessing what we have a bit.

Today we create one DAG with independent tasks, as implemented in: https://github.com/meltano/files-airflow/pull/18

So given a yaml like:

jobs:
- name: g-to-p-job
  tasks:
  - tap-gitlab target-postgres
  - tap-gitlab target-jsonl

That yields something like:

Screen Shot 2022-06-09 at 10 48 28 AM

Are we sure we don't want the tasks to be linked, e.g. task 1 depends on task 0 upstream, as in the example below?

Screen Shot 2022-06-09 at 10 43 40 AM
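
The difference between the two screenshots can be sketched in plain Python, without importing Airflow. The helper names `task_ids` and `chain_edges` and the `<job>_task<index>` ID scheme are hypothetical and not taken from files-airflow. In a real DAG, each edge would be declared with Airflow's `upstream >> downstream` operator.

```python
def task_ids(job):
    """Return one hypothetical task ID per entry in the job's task list."""
    return [f"{job['name']}_task{i}" for i, _ in enumerate(job["tasks"])]


def chain_edges(ids):
    """Pair each task with its successor so task N+1 waits on task N."""
    return list(zip(ids, ids[1:]))


# The job from the YAML above, written as a plain dict.
job = {
    "name": "g-to-p-job",
    "tasks": ["tap-gitlab target-postgres", "tap-gitlab target-jsonl"],
}

ids = task_ids(job)
independent = []            # behaviour described above: no dependencies between tasks
chained = chain_edges(ids)  # proposed behaviour: a linear chain

for upstream, downstream in chained:
    # In an Airflow DAG file this line would be `upstream >> downstream`.
    print(f"{upstream} >> {downstream}")
```

With a single-task job, `chain_edges` returns an empty list, so the chained and independent layouts coincide.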

pandemicsyn commented 2 years ago

Apologies for logging this as a draft PR rather than an issue, but I wanted to link the code change as well in case it helps clarify.

pnadolny13 commented 2 years ago

@pandemicsyn I was wondering this as well during the demo. If they're meant to run in order like Taylor described, i.e. splitting a single command into tasks, then I'd expect them to have dependencies in Airflow like in your second screenshot. Otherwise, for tap-csv target-postgres dbt:run that gets split into the YAML below, dbt:run could potentially run at the same time as the EL, or before it completes:

jobs:
- name: g-to-p-job
  tasks:
  - tap-gitlab target-postgres
  - dbt:run

tayloramurphy commented 2 years ago

@pandemicsyn great callout, and I'd made a note to follow up about this. Yes, I would expect them to be sequential like you have them in the second picture. I think we would eventually support the parallel scenario by having nested arrays.
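
One possible reading of the nested-array idea, offered purely as an assumption since no syntax is specified in this thread: each top-level entry is a sequential step, and a nested list groups tasks that may run in parallel within that step. The function name `step_edges` and the example task list are hypothetical.

```python
def step_edges(steps):
    """Return (upstream, downstream) pairs between consecutive steps.

    A step is either a single task string or a list of tasks that may
    run in parallel. Every task in a step waits on every task in the
    previous step.
    """
    groups = [s if isinstance(s, list) else [s] for s in steps]
    edges = []
    for prev, curr in zip(groups, groups[1:]):
        edges.extend((up, down) for up in prev for down in curr)
    return edges


# Hypothetical task list: two EL runs in parallel, then dbt.
steps = [
    ["tap-gitlab target-postgres", "tap-csv target-postgres"],
    "dbt:run",
]
edges = step_edges(steps)
```

A flat list with no nesting reduces to the plain sequential chain agreed on here.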

pandemicsyn commented 2 years ago

Perfect, sounds like we're all in agreement! I'm gonna go ahead and merge this :shipit: