meltano / meltano


Add `connection` as first-class construct in `meltano.yml` #3276

Open MeltyBot opened 2 years ago

MeltyBot commented 2 years ago

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/3353

Originally created by @kgpayne on 2022-03-28 16:40:45


Problem to solve

As a user of Meltano, I would like a place in meltano.yml to specify Connection config objects that can be referenced and reused by all of the plugins I add to my Meltano project.

Target audience

Engineers/developers using Meltano with tools that themselves have native constructs for connections, such as dbt, Airflow, Great Expectations and Superset.

Further details

Connections are a ubiquitous construct across the tools of the data platform, yet in Meltano connection details must still be expressed individually for each plugin as plain config or environment variables. This is not only extra work for users of Meltano; it also adds unnecessary manual steps in the integration path between plugins accessing the same resources (databases, warehouses, SaaS endpoints, etc.).
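
For illustration, this is roughly what the duplication looks like in a project today (a hypothetical sketch; the plugin names and setting names are only indicative of the pipelinewise Snowflake loader and the dbt transformer, not taken from an actual project):

# meltano.yml (today, hypothetical sketch)
plugins:
  loaders:
  - name: target-snowflake
    config:
      account: rtXXXXX.eu-central-1
      user: MELTANO
      password: ${SNOWFLAKE_PASSWORD}
      dbname: ANALYTICS
      warehouse: LOADING
  transformers:
  - name: dbt-snowflake
    config:
      account: rtXXXXX.eu-central-1    # duplicated
      user: MELTANO                    # duplicated
      password: ${SNOWFLAKE_PASSWORD}  # duplicated
      database: ANALYTICS              # same value, different setting name
      warehouse: TRANSFORMING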

For example, in the case of Snowflake:

From the above it is clear that configuration is commonly expressed in two forms, with tools supporting either or both formats (examples from Airflow's connection docs):
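
Purely as an illustration of those two shapes (not taken verbatim from the Airflow docs), a connection is typically given either as a single URI/DSN string or as a mapping of individual fields:

# Shape 1: a single URI/DSN string
connection_uri: postgresql://meltano:${POSTGRES_PASSWORD}@db.example.com:5432/warehouse

# Shape 2: individual fields
connection:
  user: meltano
  password: ${POSTGRES_PASSWORD}
  host: db.example.com
  port: 5432
  database: warehouse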

Whilst it is common for tools to support the specification of a 'default' connection, the richness of User/Role/Group functionality in databases and warehouses for controlling access and for cost management (e.g. monitoring Snowflake credit consumption based on per-tool or per-pipeline User/Role definitions) means that most tools will define or consume several connections depending on the execution context. As a result, a single Project will typically have many more connection definitions than underlying warehouse and database resources.

Interestingly, we already handle connections via a ConnectionService in meltano.core (for use in the older Model functionality); however, as noted in the comments, it has too detailed an understanding of the expected format of each connection type.

Proposal

Expected outcomes:

To achieve these two outcomes, we could:

  1. Support connection definitions in discovery.yml.
# discovery.yml
connections:
  - name: snowflake
    variant: meltano # for cases where there really is a wide variety in ways to configure a particular system, multiple variants could possibly be used?
    dialect: snowflake # as already exists for loaders https://gitlab.com/meltano/meltano/-/blob/master/schema/meltano.schema.json#L421
    settings:
      # supporting all the values listed in the Snowflake docs,
      # with descriptions and kinds (omitted for brevity)
      # https://docs.snowflake.com/en/user-guide/python-connector-api.html#connect
      - name: user
      - name: password
      - name: host
      - name: port
      - name: database
      - name: account
      - name: warehouse
      ...
  - name: postgres
    variant: meltano
    dialect: postgres
    settings:
      # supporting all the values listed in the Postgres/sqlalchemy docs,
      # with descriptions and kinds (omitted for brevity)
      - name: user
      - name: password
      - name: host
      - name: port
      - name: database
  2. Support configuration of connections in meltano.yml, including a default per connection dialect. The default could be implied, I guess, in cases where there is only one connection.
# meltano.yml
version: 1
plugins:
  connections:
  - name: postgres
    default: true
    config:
      user: meltano
      password: ${POSTGRES_MELTANO_PASSWORD}
      host: db.example.com
      port: 5432
      database: meltano

  - name: snowflake
    default: true
    config:
      user: "MELTANO"
      password: ${SNOWFLAKE_MELTANO_PASSWORD}
      host: snowflake.example
      account: account_name
      database: snow-db
      ...
  - name: snowflake-dbt
    inherits_from: snowflake
    config:
      user: "DBT"
      password: ${SNOWFLAKE_DBT_PASSWORD}
      # custom settings still supported as normal
      wacky_auth_setting: smoke_signals  # becomes CONNECTION_SNOWFLAKE_WACKY_AUTH_SETTING

  extractors:
  - name: tap-postgres

  loaders:
  - name: target-snowflake

  transformers:
  - name: dbt-snowflake
  3. Support the injection of connection details, without requiring the user to specify them explicitly via refs, using either env_aliases, aliases, or the default value field in discovery.yml, with env vars of the form CONNECTION_<dialect>_<setting name>. E.g.:
# discovery.yml
extractors:
  - name: tap-postgres
    label: PostgreSQL
    description: PostgreSQL database extractor
    namespace: tap_postgres
    variants:
    - name: transferwise
      docs: https://hub.meltano.com/extractors/postgres.html
      repo: https://github.com/transferwise/pipelinewise-tap-postgres
      pip_url: pipelinewise-tap-postgres
      capabilities:
        - catalog
        - discover
        - state
      settings_group_validation:
        - ['host', 'port', 'user', 'password', 'dbname']
      settings:
        - name: host
          value: localhost
          description: PostgreSQL host
          env_aliases: [CONNECTION_POSTGRES_HOST]
        - name: port
          kind: integer
          value: 5432
          description: PostgreSQL port
          env_aliases: [CONNECTION_POSTGRES_PORT]
        - name: user
          description: PostgreSQL user
          env_aliases: [CONNECTION_POSTGRES_USER]
        - name: password
          kind: password
          description: PostgreSQL password
          env_aliases: [CONNECTION_POSTGRES_PASSWORD]
        - name: dbname
          description: PostgreSQL database name
          env_aliases: [CONNECTION_POSTGRES_DATABASE]
        ...

loaders:
  - name: target-snowflake
    label: Snowflake
    description: Snowflake database loader
    namespace: target_snowflake
    dialect: snowflake  # I did not know this already existed 😅
    target_schema: $TARGET_SNOWFLAKE_SCHEMA
    variants:
      - name: transferwise
        docs: https://hub.meltano.com/loaders/snowflake.html
        repo: https://github.com/transferwise/pipelinewise-target-snowflake
        pip_url: pipelinewise-target-snowflake
        settings_group_validation:
          - ['account', 'dbname', 'user', 'password', 'warehouse', 'file_format', 'default_target_schema']
        settings:
          - name: account
            label: Account
            # previous values left for backwards compatibility
            env_aliases: [SF_ACCOUNT, CONNECTION_SNOWFLAKE_ACCOUNT]
            description: Snowflake account name (i.e. rtXXXXX.eu-central-1)
            placeholder: E.g. rtXXXXX.eu-central-1
          - name: dbname
            label: DB Name
            aliases: [database]
            # previous values left for backwards compatibility
            env_aliases: [TARGET_SNOWFLAKE_DATABASE, SF_DATABASE, CONNECTION_SNOWFLAKE_DATABASE] 
            description: Snowflake Database name
          ...
# dbt profiles
config-version: 2
version: 2
config:
  send_anonymous_usage_stats: True
  use_colors: True
meltano:
  target: snowflake
  outputs:
    snowflake:
      type: snowflake
      threads: 4
      account: "{{ env_var('CONNECTION_SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('CONNECTION_SNOWFLAKE_USER') }}"
      password: "{{ env_var('CONNECTION_SNOWFLAKE_PASSWORD') }}"
      role: "{{ env_var('CONNECTION_SNOWFLAKE_ROLE') }}"
      database: "{{ env_var('CONNECTION_SNOWFLAKE_DATABASE') }}"
      warehouse: "{{ env_var('CONNECTION_SNOWFLAKE_WAREHOUSE') }}"
      schema: "DEFAULT"
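
Any other tool that reads environment variables could consume the same injected CONNECTION_SNOWFLAKE_* values. For example, a Great Expectations datasource could point at them (a rough sketch only: the exact datasource schema depends on the Great Expectations version, and the field names here are assumptions):

# great_expectations.yml (sketch)
datasources:
  warehouse:
    class_name: SqlAlchemyDatasource
    credentials:
      drivername: snowflake
      username: ${CONNECTION_SNOWFLAKE_USER}
      password: ${CONNECTION_SNOWFLAKE_PASSWORD}
      host: ${CONNECTION_SNOWFLAKE_ACCOUNT}
      database: ${CONNECTION_SNOWFLAKE_DATABASE}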
  4. Support the selection of a named connection at the plugin level, or at run time. I think the easiest way to control which connection populates the CONNECTION_* env vars for each plugin's context is via plugin extras:
# meltano.yml
environments:
- name: dev
  config:
    plugins:
      transformers:
      - name: dbt-snowflake
        config:
          ...
        # new `connection` plugin extra
        connection_name: snowflake-dbt # this uses the `snowflake-dbt` connection from 2. above to populate the `CONNECTION_SNOWFLAKE_*` env vars for this plugin in this environment

Plugin extras can also be set at runtime using environment variables (as per the docs):

TARGET_SNOWFLAKE__CONNECTION_NAME=snowflake-dbt meltano run tap-gitlab target-snowflake dbt-snowflake:run
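
The same extra could presumably also be set directly on the plugin definition in meltano.yml, outside of any environment (a sketch, assuming the connection_name extra proposed above):

# meltano.yml
plugins:
  transformers:
  - name: dbt-snowflake
    connection_name: snowflake-dbt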

Users can also always fall back to configuring the individual settings manually in the plugin config or via an environment:

# meltano.yml
environments:
- name: dev
  config:
    plugins:
      loaders:
      - name: target-snowflake
        config:
          user: "MY_CUSTOM_USER" # overrides CONNECTION_SNOWFLAKE_USER from the default connection in this environment

What does success look like, and how can we measure that?

Reasons Not to Build

Regression test

(Ensure the feature doesn't cause any regressions)

Links / references

MeltyBot commented 2 years ago

View 11 previous comments from the original issue on GitLab

sbalnojan commented 2 years ago

I'm also interested in the idea of "Connections" from the perspective of https://github.com/meltano/meltano/issues/2549 (jupyter integration):

Adding abstraction layers builds out the actual platform effect of Meltano: if I define a connection once, every single plugin could auto-detect connections and their types, and preload a dozen things. Examples that come to mind:

This would provide a significant advantage over using the standalone versions of these tools.

stale[bot] commented 1 year ago

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.