meltano / meltano


Add `connection` as first-class construct in `meltano.yml` #3276

Open MeltyBot opened 2 years ago

MeltyBot commented 2 years ago

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/3353

Originally created by @kgpayne on 2022-03-28 16:40:45


Problem to solve

As a user of Meltano, I would like a place in meltano.yml to specify Connection config objects that can be referenced and reused by all of the plugins I add to my Meltano project.

Target audience

Engineers/developers using Meltano with tools that themselves have native constructs for connections, such as dbt, Airflow, Great Expectations and Superset.

Further details

Connections are a ubiquitous construct across the tools of the data platform, yet in Meltano connection details must still be expressed individually for each plugin as plain config or environment variables. This is not only extra work for users of Meltano; it also adds unnecessary manual steps in the integration path between plugins accessing the same resources (databases, warehouses, SaaS endpoints, etc.).
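
For illustration, this is roughly what the duplication looks like in a project today (a hypothetical sketch; the plugin names and setting names are only indicative of the pipelinewise Snowflake loader and the dbt transformer, not taken from an actual project):

# meltano.yml (today, hypothetical sketch)
plugins:
  loaders:
  - name: target-snowflake
    config:
      account: rtXXXXX.eu-central-1
      user: MELTANO
      password: ${SNOWFLAKE_PASSWORD}
      dbname: ANALYTICS
      warehouse: LOADING
  transformers:
  - name: dbt-snowflake
    config:
      account: rtXXXXX.eu-central-1    # duplicated
      user: MELTANO                    # duplicated
      password: ${SNOWFLAKE_PASSWORD}  # duplicated
      database: ANALYTICS              # same value, different setting name
      warehouse: TRANSFORMING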

For example, in the case of Snowflake:

From the above it is clear that configuration is commonly expressed in two forms, with tools supporting either or both formats (examples from Airflow's connection docs):
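
Purely as an illustration of those two shapes (not taken verbatim from the Airflow docs), a connection is typically given either as a single URI/DSN string or as a mapping of individual fields:

# Shape 1: a single URI/DSN string
connection_uri: postgresql://meltano:${POSTGRES_PASSWORD}@db.example.com:5432/warehouse

# Shape 2: individual fields
connection:
  user: meltano
  password: ${POSTGRES_PASSWORD}
  host: db.example.com
  port: 5432
  database: warehouse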

Whilst it is common for tools to support the specification of a 'default' connection, the richness of User/Role/Group functionality in databases and warehouses for controlling access and for cost management (e.g. monitoring Snowflake credit consumption based on per-tool or per-pipeline User/Role definitions) means that most tools will define or consume several connections depending on the execution context. As a result, a single Project will typically have many more connection definitions than underlying warehouse and database resources.

Interestingly, we already handle connections via a ConnectionService in meltano.core (for use in the older Model functionality); however, as noted in the comments, it has too detailed an understanding of the expected format of each connection type.

Proposal

Expected outcomes:

To achieve these two outcomes, we could:

  1. Support connection definitions in discovery.yml.
# discovery.yml
connections:
  - name: snowflake
    variant: meltano # for cases where there really is a wide variety in ways to configure a particular system, multiple variants could possibly be used?
    dialect: snowflake # as already exists for loaders https://gitlab.com/meltano/meltano/-/blob/master/schema/meltano.schema.json#L421
    settings:
      # supporting all the values listed in the Snowflake docs,
      # with descriptions and kinds (omitted for brevity)
      # https://docs.snowflake.com/en/user-guide/python-connector-api.html#connect
      - name: user
      - name: password
      - name: host
      - name: port
      - name: database
      - name: account
      - name: warehouse
      ...
  - name: postgres
    variant: meltano
    dialect: postgres
    settings:
      # supporting all the values listed in the Postgres/sqlalchemy docs,
      # with descriptions and kinds (omitted for brevity)
      - name: user
      - name: password
      - name: host
      - name: port
      - name: database
  2. Support configuration of connections in meltano.yml, including a default per connection dialect. The default could be implied, I guess, in cases where there is only one connection.
# meltano.yml
version: 1
plugins:
  connections:
  - name: postgres
    default: true
    config:
      user: meltano
      password: ${POSTGRES_MELTANO_PASSWORD}
      host: db.example.com
      port: 5432
      database: meltano

  - name: snowflake
    default: true
    config:
      user: "MELTANO"
      password: ${SNOWFLAKE_MELTANO_PASSWORD}
      host: snowflake.example
      account: account_name
      database: snow-db
      ...
  - name: snowflake-dbt
    inherits_from: snowflake
    config:
      user: "DBT"
      password: ${SNOWFLAKE_DBT_PASSWORD}
      # custom settings still supported as normal
      wacky_auth_setting: smoke_signals  # becomes CONNECTION_SNOWFLAKE_WACKY_AUTH_SETTING

  extractors:
  - name: tap-postgres

  loaders:
  - name: target-snowflake

  transformers:
  - name: dbt-snowflake
  3. Support the injection of connection details, without requiring the user to specify them explicitly via refs, using either env_aliases, aliases, or the default value field in discovery.yml, with env vars of the form CONNECTION_<dialect>_<setting name>. E.g.:
# discovery.yml
extractors:
  - name: tap-postgres
    label: PostgreSQL
    description: PostgreSQL database extractor
    namespace: tap_postgres
    variants:
    - name: transferwise
      docs: https://hub.meltano.com/extractors/postgres.html
      repo: https://github.com/transferwise/pipelinewise-tap-postgres
      pip_url: pipelinewise-tap-postgres
      capabilities:
        - catalog
        - discover
        - state
      settings_group_validation:
        - ['host', 'port', 'user', 'password', 'dbname']
      settings:
        - name: host
          value: localhost
          description: PostgreSQL host
          env_aliases: [CONNECTION_POSTGRES_HOST]
        - name: port
          kind: integer
          value: 5432
          description: PostgreSQL port
          env_aliases: [CONNECTION_POSTGRES_PORT]
        - name: user
          description: PostgreSQL user
          env_aliases: [CONNECTION_POSTGRES_USER]
        - name: password
          kind: password
          description: PostgreSQL password
          env_aliases: [CONNECTION_POSTGRES_PASSWORD]
        - name: dbname
          description: PostgreSQL database name
          env_aliases: [CONNECTION_POSTGRES_DATABASE]
        ...

loaders:
  - name: target-snowflake
    label: Snowflake
    description: Snowflake database loader
    namespace: target_snowflake
    dialect: snowflake  # I did not know this already existed 😅
    target_schema: $TARGET_SNOWFLAKE_SCHEMA
    variants:
      - name: transferwise
        docs: https://hub.meltano.com/loaders/snowflake.html
        repo: https://github.com/transferwise/pipelinewise-target-snowflake
        pip_url: pipelinewise-target-snowflake
        settings_group_validation:
          - ['account', 'dbname', 'user', 'password', 'warehouse', 'file_format', 'default_target_schema']
        settings:
          - name: account
            label: Account
            # previous values left for backwards compatibility
            env_aliases: [SF_ACCOUNT, CONNECTION_SNOWFLAKE_ACCOUNT]
            description: Snowflake account name (i.e. rtXXXXX.eu-central-1)
            placeholder: E.g. rtXXXXX.eu-central-1
          - name: dbname
            label: DB Name
            aliases: [database]
            # previous values left for backwards compatibility
            env_aliases: [TARGET_SNOWFLAKE_DATABASE, SF_DATABASE, CONNECTION_SNOWFLAKE_DATABASE] 
            description: Snowflake Database name
          ...
# dbt profiles
config-version: 2
version: 2
config:
  send_anonymous_usage_stats: True
  use_colors: True
meltano:
  target: snowflake
  outputs:
    snowflake:
      type: snowflake
      threads: 4
      account: "{{ env_var('CONNECTION_SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('CONNECTION_SNOWFLAKE_USER') }}"
      password: "{{ env_var('CONNECTION_SNOWFLAKE_PASSWORD') }}"
      role: "{{ env_var('CONNECTION_SNOWFLAKE_ROLE') }}"
      database: "{{ env_var('CONNECTION_SNOWFLAKE_DATABASE') }}"
      warehouse: "{{ env_var('CONNECTION_SNOWFLAKE_WAREHOUSE') }}"
      schema: "DEFAULT"
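
Any other tool that reads environment variables could consume the same injected CONNECTION_SNOWFLAKE_* values. For example, a Great Expectations datasource could point at them (a rough sketch only: the exact datasource schema depends on the Great Expectations version, and the field names here are assumptions):

# great_expectations.yml (sketch)
datasources:
  warehouse:
    class_name: SqlAlchemyDatasource
    credentials:
      drivername: snowflake
      username: ${CONNECTION_SNOWFLAKE_USER}
      password: ${CONNECTION_SNOWFLAKE_PASSWORD}
      host: ${CONNECTION_SNOWFLAKE_ACCOUNT}
      database: ${CONNECTION_SNOWFLAKE_DATABASE}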
  4. Support the selection of a named connection at the plugin level, or at run time. I think the easiest way to control which connection populates the CONNECTION_* env vars for each plugin's context is via plugin extras:
# meltano.yml
environments:
- name: dev
  config:
    plugins:
      transformers:
      - name: dbt-snowflake
        config:
          ...
        # new `connection` plugin extra
        connection_name: snowflake-dbt # this uses the `snowflake-dbt` connection from 2. above to populate the `CONNECTION_SNOWFLAKE_*` env vars for this plugin in this environment

Plugin extras can also be set at runtime using environment variables (as per the docs):

TARGET_SNOWFLAKE__CONNECTION_NAME=snowflake-dbt meltano run tap-gitlab target-snowflake dbt-snowflake:run
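
The same extra could presumably also be set directly on the plugin definition in meltano.yml, outside of any environment (a sketch, assuming the connection_name extra proposed above):

# meltano.yml
plugins:
  transformers:
  - name: dbt-snowflake
    connection_name: snowflake-dbt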

Users can also always fall back to configuring the individual settings manually in the plugin config or via an environment:

# meltano.yml
environments:
- name: dev
  config:
    plugins:
      loaders:
      - name: target-snowflake
        config:
          user: "MY_CUSTOM_USER" # overrides CONNECTION_SNOWFLAKE_USER from the default connection in this environment

What does success look like, and how can we measure that?

Reasons Not to Build

Regression test

(Ensure the feature doesn't cause any regressions)

Links / references

MeltyBot commented 2 years ago

View 11 previous comments from the original issue on GitLab

sbalnojan commented 2 years ago

I'm also interested in the idea of "Connections" from the perspective of https://github.com/meltano/meltano/issues/2549 (jupyter integration):

Adding abstraction layers builds out the actual platform effect of Meltano: if I define a connection once, every single plugin could auto-detect connections and their types, and preload a dozen things. Examples that come to mind:

This would provide a significant advantage over using the standalone versions of these tools.

stale[bot] commented 1 year ago

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.