villasv / aws-airflow-stack

Turbine: the bare metals that gets you Airflow
https://victor.villas/aws-airflow-stack/
MIT License

upstream multiple individual dags #165

Closed: RohitJones closed this issue 4 years ago

RohitJones commented 4 years ago

Hi, I'm currently testing out your amazing work to see if it would be a good fit for my organization. In the readme, you've mentioned upstreaming the whole /airflow directory; I would like to upstream individual DAGs instead, using a CI/CD pipeline.

I've tried to modify each repo's appspec.yml file and change the destination. For example:

repo_a's appspec.yml:

files:
  - source: /a_dag.py
    destination: /airflow/dags/a_dag.py

repo_b's appspec.yml:

files:
  - source: /b_dag.py
    destination: /airflow/dags/b_dag.py

The problem I'm facing is that only the most recent deployment remains on the servers. Say I deploy repo_a first, followed by repo_b; then in the airflow/dags directory only b_dag.py is present, and the previous deployment is deleted.

I've also tried changing the aws deploy argument --file-exists-behavior from OVERWRITE to RETAIN, but there seems to be no difference.

Could you give some insight into this behavior?

villasv commented 4 years ago

Hi @RohitJones. This is a limitation in how CodeDeploy works:

CodeDeploy uses the underlying deployment group ID and AppSpec file to remove all of the files it installed in the previous successful deployment.

My default suggestion is to always deploy all files, even if most of them haven't changed. This is good practice because each deployment is completely self-contained. For example, when a new EC2 worker is added to the workers autoscaling group, it's going to receive only the latest deployment bundle, not all of them sequentially. This means that if your latest deployment contained only a_dag.py, that's all this worker is going to get.
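For reference, a minimal sketch of what a single, self-contained appspec.yml could look like, shipping the whole DAG folder on every deployment (the /dags source path is just an assumption about your repository layout; adjust it to match your repo):

version: 0.0
os: linux
files:
  - source: /dags
    destination: /airflow/dags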

If for some reason you really need to make this work, you can try to use CodeDeploy hooks (like BeforeInstall) to move existing files to a temporary location, and another hook to move them back after the deployment finishes installing. I strongly recommend against this, though: it doesn't solve the problem I mentioned earlier about new workers receiving only a portion of the DAGs.
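For completeness, that hook-based workaround would look roughly like this in the appspec.yml; the stash_dags.sh and restore_dags.sh script names are hypothetical placeholders for scripts that copy /airflow/dags to a temporary location and back:

version: 0.0
os: linux
files:
  - source: /a_dag.py
    destination: /airflow/dags/a_dag.py
hooks:
  BeforeInstall:
    # hypothetical script: copy the current contents of /airflow/dags somewhere safe
    - location: scripts/stash_dags.sh
      timeout: 60
      runas: root
  AfterInstall:
    # hypothetical script: merge the stashed DAGs back into /airflow/dags
    - location: scripts/restore_dags.sh
      timeout: 60
      runas: root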

If your CI/CD pipeline is already split up and packaging all DAGs per deployment is not feasible, I suggest changing the stack to create multiple CodeDeploy Applications (you can still use the same CodeDeploy DeploymentGroup this stack creates). Then you can deploy each set of DAGs as a different application and have more control over how files are delivered.
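As a rough CloudFormation sketch of that idea (the resource names below are hypothetical, not the ones this stack actually uses), each repository gets its own CodeDeploy Application; since a deployment group belongs to a single application, each application also gets a deployment group configured like the existing one and pointing at the same worker Auto Scaling group:

RepoADeploymentApplication:
  Type: AWS::CodeDeploy::Application
  Properties:
    ComputePlatform: Server

RepoADeploymentGroup:
  Type: AWS::CodeDeploy::DeploymentGroup
  Properties:
    ApplicationName: !Ref RepoADeploymentApplication
    ServiceRoleArn: !GetAtt CodeDeployServiceRole.Arn   # assumed name for the stack's existing CodeDeploy role
    AutoScalingGroups:
      - !Ref WorkerAutoScalingGroup                     # assumed name for the stack's worker ASG

Each pipeline would then call aws deploy create-deployment with its own --application-name, so one repo's deployment never removes the files installed by another repo's application.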

Let me know how this goes :-)

villasv commented 4 years ago

I'll be closing this for housekeeping purposes, but feel free to revive the issue if you want to discuss further. If others share the same question, it might be worth adding to the project documentation how to accomplish this.