gocardless / airflow-dbt

Apache Airflow integration for dbt
https://pypi.org/project/airflow-dbt/
MIT License
392 stars · 62 forks

Use Airflow's own connections in dbt profiles #39

Open · akerone opened this issue 3 years ago

akerone commented 3 years ago

It would be a great feature if we could use Airflow's own "Connections" section instead of having to provide usernames/passwords/keyfiles in profiles.yml.

I'm thinking specifically of using dbt in Google Cloud Composer, which provides a "bigquery_default" connection that other Operators can use. This is just an example, and it could definitely apply to any Airflow installation.

It would also help with maintaining the DRY principle, by not having to provide the same credentials in two separate places (Airflow connections and profiles.yml).
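
For illustration, a rough sketch of how that mapping could look today, outside of airflow-dbt: a small helper that reads an Airflow connection and exposes its fields as environment variables for profiles.yml to pick up with env_var(). The connection id, variable names, and import path below are assumptions, not part of this project.

    # Hypothetical helper (not part of airflow-dbt): map an Airflow connection
    # onto env vars that a profiles.yml can read via env_var().
    import os

    from airflow.hooks.base import BaseHook  # airflow.hooks.base_hook on Airflow 1.10

    def export_connection_as_dbt_env(conn_id='bigquery_default'):
        conn = BaseHook.get_connection(conn_id)
        # login, password and extra are standard Connection attributes
        os.environ['DBT_USER'] = conn.login or ''
        os.environ['DBT_PASSWORD'] = conn.password or ''
        os.environ['DBT_KEYFILE_JSON'] = conn.extra or ''  # e.g. a SA key stored in extras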

dinigo commented 2 years ago

They can all be abstracted through environment variables. You can add those at provision time (when the instance is created, e.g. in Kubernetes, docker-compose, .profile, or GCP Composer). Currently airflow-dbt does not support passing env variables to the process running the command at runtime, so they have to be defined beforehand.

    profile:
      target: prod
      outputs:
        prod:
          type: postgres
          host: 127.0.0.1
          # IMPORTANT: Make sure to quote the entire Jinja string here
          user: "{{ env_var('DBT_USER') }}"
          password: "{{ env_var('DBT_PASSWORD') }}"

As this is such a common practice, I have added this functionality by default in PR #45. After merging my PR you will be able to use those by declaring the env dictionary. For example:

    DbtTestOperator(
        task_id='run_dbt',
        project_dir=abspath('./jaffle-shop'),
        profiles_dir=abspath('.'),
        dbt_bin='/home/airflow/.local/bin/dbt',
        env={
            'DBT_USER': 'test-user',
            'DBT_PASSWORD': 'test-password',
        },
    )

Or the more common approach of using a service account mounted as a secret in the container:

    DbtRunOperator(
        task_id='run_dbt',
        project_dir=abspath('./jaffle-shop'),
        profiles_dir=abspath('.'),
        dbt_bin='/home/airflow/.local/bin/dbt',
        env={
            'GOOGLE_APPLICATION_CREDENTIALS': '/run/secrets/gcp_credentials.json',
        },
    )

However, dbt authentication is far too varied to declare in a connection (other than in the extras... maybe...). But that's just my opinion anyway.

akerone commented 2 years ago

Thanks for the comment. Env variables can definitely help in a lot of use cases, and they probably bump this issue down to a much lower priority.

> However, dbt authentication is far too varied to declare in a connection (other than in the extras... maybe...). But that's just my opinion anyway.

In Google's Cloud Composer, Airflow is deployed with a set of preexisting connections (including one for its own GCP project, with a service account key in the extras section), so it would be great if profiles.yml could be built from one of those connections rather than having to specify the env in each DAG we deploy.
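
For what it's worth, a hedged sketch of what that could look like: a small pre-step that pulls the service account key out of the connection's extras and writes it to a keyfile the profile (or GOOGLE_APPLICATION_CREDENTIALS) can point at. The connection id, the extras field holding the key, and the output path are assumptions and depend on the Airflow/provider version.

    # Hypothetical pre-step (not part of airflow-dbt): materialise the service
    # account key stored in a Composer-provided connection's extras so dbt's
    # BigQuery profile or GOOGLE_APPLICATION_CREDENTIALS can reference it.
    import json

    from airflow.hooks.base import BaseHook

    def write_keyfile_from_connection(conn_id='bigquery_default',
                                      path='/tmp/dbt_keyfile.json'):
        extras = BaseHook.get_connection(conn_id).extra_dejson
        # the exact extras key varies with the Airflow / Google provider version
        keyfile = extras.get('keyfile_dict') or extras.get(
            'extra__google_cloud_platform__keyfile_dict')
        with open(path, 'w') as f:
            f.write(keyfile if isinstance(keyfile, str) else json.dumps(keyfile))
        return path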