meltano / meltano

Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
https://meltano.com/
MIT License
1.81k stars 159 forks source link

Bug: Null config values don't get passed to plugins, preventing Mappers from working as expected #6382

Open visch opened 2 years ago

visch commented 2 years ago

Trying to replicate the behavior here

https://sdk.meltano.com/en/latest/stream_maps.html

{
    "stream_maps": {
        "customers": {
            // exclude these since we're capturing them in the pii stream
            "email": null,
            "full_name": null
        },
        "customers_pii": {
            "__source__": "customers",
            // include just the PII and the customer_id
            "customer_id": "customer_id",
            "email": "email",
            "full_name": "full_name",
            // exclude anything not declared
            "__else__": null,
        },
    },
}

So if you have a meltano.yml like

config:
  stream_maps:
    customers_pii:
      "email": "email"
      __else__: null

And then run meltano invoke --dump=config tap-name the else token will not exist.

It's not clear that this is a bug as I did some digging here and as a part of https://gitlab.com/meltano/meltano/-/issues/2334 (more specifically https://gitlab.com/meltano/meltano/-/merge_requests/1898 ) it looks like passing null values was removed as it was causing a different issue

UPDATE (AJ, 2022-08-30)

As noted below in https://github.com/meltano/meltano/issues/6382#issuecomment-1208353612, the bug is that Meltano excludes/removes config input values if they are null.

I can confirm this is still the behavior

visch@visch-ubuntu:~/git/tap-abc$ meltano --version
meltano, version 2.4.0
      config:
        stream_maps:
          customers_pii:
            "email": "email"
            __else__: null
visch@visch-ubuntu:~/git/tap-abc$ meltano invoke --dump=config tap-abc 
{
  "stream_maps": {
    "customers_pii": {
      "email": "email"
    }
 }
tayloramurphy commented 2 years ago

@visch can you confirm this is still a bug? @edgarrmondragon have you seen this?

edgarrmondragon commented 2 years ago

@visch can you confirm this is still a bug? @edgarrmondragon have you seen this?

@tayloramurphy One solution might be to implement what Douwe proposed in the old GitLab issue:

  1. Have Meltano not set oauth_credentials if none of its subproperties are none-null

It seems like there's some conflict between how different plugins treat null config values. For some, like the SDK in its stream_maps, null is a value with special meaning, while for others there's no distinction between null and missing.

Douwe's comment on the old GitLab issue:

This issue has come up again in #2368 (closed), where https://github.com/datamill-co/target-snowflake's target_s3 setting should either be unset, null, an empty object {}, or an object containing nested aws_access_key_id, aws_secret_access_key, bucket and key_prefix properties with non-null values.

Right now, Meltano would generate a nested object with those 4 keys and null values if none of the target_s3.* settings are set, instead of omitting the target_s3 key entirely or giving it a null value.

Our best bet appears to be to omit keys with null values from config.json entirely, so that there will be no target_s3.* keys to unflatten into a target_s3: {...} object at all if none of the nested properties have values.

tayloramurphy commented 2 years ago

@edgarrmondragon reading through that PR though it seems like Meltano drops config where it is null b/c a couple of plugins check for existence of the key and not for the value. This seems problematic since it means SDK-build taps can't really use a feature of streams maps with Meltano since it will get ignored.

What's a possible fix here? Should we put in a check on the stream maps keys specifically to not drop the key? This seems like a big deal as it limits the utility of stream maps with Meltano.

@aaronsteers would love your thoughts here.

visch commented 2 years ago

@visch can you confirm this is still a bug? @edgarrmondragon have you seen this?

I can confirm this is still the behavior

visch@visch-ubuntu:~/git/tap-abc$ meltano --version
meltano, version 2.4.0
      config:
        stream_maps:
          customers_pii:
            "email": "email"
            __else__: null
visch@visch-ubuntu:~/git/tap-abc$ meltano invoke --dump=config tap-abc 
{
  "stream_maps": {
    "customers_pii": {
      "email": "email"
    }
 }
visch commented 2 years ago

I haven't taken a deep dive here into the solution but my gut feeling is https://gitlab.com/meltano/meltano/-/issues/2334 was caused by discovery already containing the settings. If discovery already contains the setting then you have to make a choice between returning a null key/value or not having the config.

If my assumption above is the case then I think it makes sense to not return a key/value pair when a setting is defined but there's no config provided, and to return a null when someone explicitly provides a null for that setting (this doesn't happen today as this is the bug we're talking about here)

I'm trying to use the terms config/setting as setting being the "spec" (basically a JSON Schema) for the .config.json for the tap/target as defined by Meltano, and config as the value provided by a config value via (env variables, .env file, meltano.yml, db) while settings shouldn't return anything unless there's a specific default value given. I don't' have better words for this right now. https://docs.meltano.com/guide/configuration#configuration-layers references config , setting, properties maybe we should define these words more clearly?

tayloramurphy commented 2 years ago

@edgarrmondragon @aaronsteers ping on this to get your take. Seems like we may want to have some configurable behavior on whether or not to drop nulls. Could this be an option for capabilities?

aaronsteers commented 2 years ago

@tayloramurphy and @edgarrmondragon - I don't see a great fix for this. The capabilities workaround could work, but it feels pretty hacky and bespoke imo. What I don't have good visiblity into is where Meltano may be relying on this behavior of dropping values or excluding them based on a null value, and how pervasive that is in the invocation and configuration codebase.

I see 2 or 3 options that I'll list below: (a) use a feature flag to disable this 'remove-if-null' behavior in Meltano, (b) hard code this just for meltano-map-transform, or (c) change the SDK convention to not use nulls. I think I lean slightly towards C, but that would be slow to implement, and requires us to agree on a new spec. B is a viable short-term fix - but I don't know the code well enough to know if it is a larger lift that seems easy but actually isn't. (For instance, the code evaluating settings may or may not have access to the plugin name. I like "A" the best because it cleans up the codebase, but I have no idea how much will break when we make that change - or if there are legitimate challenges with not cleaning up null values in our configsets. (E.g. 'null' could be a convention meaning 'use the plugin's native default', which is translated into 'drop the setting'.)

Option A - Feature flag.

Re the option of a feature flag, it would otherwise be hard to measure the impact in the wild. In the long run, though, that gives us a clean and clear path to change Meltano's default behavior if and when we are in a place to do so.

Option B - Hard-coded logic for meltano-map-transform

This would most likely be a temporary workaround while we iterate on either option A or C or both. I like this slightly better than the capabilities approach because it feels more direct and doesn't require us changing plugin definitions.

Option C - Change the SDK

If changing the SDK convention, we could add something equivalent to the null value in the SDK mapping/mapper capability that just allows a non-null expression to be equivalent to how we are using null today. For instance: we could accept something like'NULL', remove(), exclude() or '__' - or we could create a __exclude__ dunder operator that replaces the existing null assignments.

So this sample...

{
    "stream_maps": {
        "customers": {
            // exclude these since we're capturing them in the pii stream
            "email": null,
            "full_name": null
        },
        "customers_pii": {
            "__source__": "customers",
            // include just the PII and the customer_id
            "customer_id": "customer_id",
            "email": "email",
            "full_name": "full_name",
            // exclude anything not declared
            "__else__": null,
        },
    },
}

Would become something like:

{
    "stream_maps": {
        "customers": {
            // exclude these since we're capturing them in the pii stream
            "__exclude__": ["email", "full_name"]
        },
        "customers_pii": {
            "__source__": "customers",
            // include just the PII and the customer_id
            "customer_id": "customer_id",
            "email": "email",
            "full_name": "full_name",
            // exclude anything not declared
            "__exclude_unmapped__": true
        },
    },
}
tayloramurphy commented 2 years ago

@aaronsteers is there potentially a fourth option where we have a check per plugin on whether or not an env_var is set? I'm thinking:

environments:
  - name: prod
    config:
      plugins:
        extractors:
          - name: tap-github
            env:
              NULL_KEEP: true

this would keep nulls for tap-github but if I had it for

environments:
  - name: dev
    env:
      NULL_KEEP: true

it would keep it for everything?

I propose this b/c my only hesitation with the feature flag option is that it's global and I can see users wanting to configure this behavior for specific plugins.

Long-term I like updating the SDK to use something other than null.

aaronsteers commented 2 years ago

@tayloramurphy - I'm not sure honestly. We'd have to dig in deeper to know how this would actually handle.

For all the trouble this creates in the Meltano codebase, and potential complexities/risks it raises, I'm inclined to just add 'NULL' as an alias for null in the SDK - and then instruct Meltano users to send an allcaps string while we work on our longterm solve...

Logged:

visch commented 1 year ago

Just verified that this doesn't work

visch@visch-ubuntu:~/git/meltano-projects/test_null_key_with_tap$ meltano invoke --dump=config tap-csv
2022-10-13T13:25:07.971681Z [info     ] Environment 'dev' is active
{
  "stream_maps": {
    "customers_pii": {
      "email": "email"
    }
  }
}
visch@visch-ubuntu:~/git/meltano-projects/test_null_key_with_tap$ cat meltano.yml
version: 1
default_environment: dev
environments:
- name: dev
- name: staging
- name: prod
project_id: 8339108a-ba30-48ae-9dcc-8c50702ff803
plugins:
  extractors:
  - name: tap-csv
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-csv.git
    config:
      stream_maps:
        customers_pii:
          "email": "email"
          __else__: null

visch@visch-ubuntu:~/git/meltano-projects/test_null_key_with_tap$ meltano --version
meltano, version 2.7.2

Maybe it works with mappers, but not with taps?

visch commented 1 year ago

Verified that this works with mappers but not taps (Really hard to just get the config for mappers but here we go)

meltano.yml

version: 1
default_environment: dev
environments:
- name: dev
- name: staging
- name: prod
project_id: 8339108a-ba30-48ae-9dcc-8c50702ff803
plugins:
  extractors:
  - name: tap-csv
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-csv.git
    config:
      stream_maps:
        customers_pii:
          email: email
          __else__:
      files:
        - entity: "in"
          path: "in.csv"
          keys: ["id"]
          delimiter: ","
  loaders:
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl
  mappers:
  - name: transform-field
    variant: transferwise
    pip_url: pipelinewise-transform-field
    executable: transform-field
    mappings:
    - name: pipedrive-exclude-custom-fields
      config:
        stream_maps:
          activities:
            id: id
            deal_title: deal_title
            __else__:
        transformations:
          - field_id: "commit"
            tap_stream_name: "commits"
            field_paths: ["author/email", "committer/email"]
            type: "HASH"

meltano run tap-csv pipedrive-exclude-custom-fields target-jsonl

(monkey patch in a sleep in target-jsonl)

}visch@visch-ubuntu:~/git/meltano-projects/test_null_key_with_tap/.meltano/run/transform-field$ ls
mapper.90aa6695-2412-4909-975c-f1901713d331.config.json
visch@visch-ubuntu:~/git/meltano-projects/test_null_key_with_tap/.meltano/run/transform-field$ cat mapper.90aa6695-2412-4909-975c-f1901713d331.config.json
{
  "stream_maps": {
    "activities": {
      "id": "id",
      "deal_title": "deal_title",
      "__else__": null
    }
  },
  "transformations": [
    {
      "field_id": "commit",
      "tap_stream_name": "commits",
      "field_paths": [
        "author/email",
        "committer/email"
      ],
      "type": "HASH"
    }
  ]
aaronsteers commented 1 year ago

@tayloramurphy, @sbalnojan

Now that the below merged, I think we can lower priority on this issue. Meltano will still purge null values from mappers' config, but users may now instead send __NULL__ to provide that null behavior.

In the long run, it's still worth trying to unwind that null handling behavior, but at a lower priority. (Removed from board.)

@sbalnojan - I think you've been avoiding the meltano-map-transforms plugin overall in tutorials for this and other reasons, but if any existing references exist, it may also worth updating tutorials/instructions, if/where applicable.

sbalnojan commented 1 year ago

@aaronsteers yes that's one reason, but the main reason is missing documentation https://github.com/meltano/meltano/issues/6412#issuecomment-1277792107.

aaronsteers commented 1 year ago

@sbalnojan: understood. This is not as of now prioritized and I'm not clear if (once prioritized) it would land in DevRel or Engineering:

Is this something you'd feel comfortable picking up, or are there additional items that Engineering can/should pick up to unblock/facilitate that effort?

To the point of this here issue, my point is only that the completion of https://github.com/meltano/sdk/pull/1262 reduces the relative priority of this issue.

pnadolny13 commented 1 year ago

@tayloramurphy logging here that we had another user (and myself too) get stuck on this bug in slack today https://meltano.slack.com/archives/C01TCRBBJD7/p1671134681823139.

From my view mappers and stream maps are a differentiating meltano feature that gives flexibility and allows users to hack around broken taps/targets, especially since we are constantly recommending it in slack. It feels like we're very close to something super powerful if it was cleaned up and documented.

After skimming this issue I see that its a challenge to fix in meltano but it is really a meltano bug 😄 . It helps if its fixed in the SDK but all taps/targets need to be upgraded to get support back, even though it already works today and meltano's bug causes the null values to disappear.

stale[bot] commented 1 year ago

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

edgarrmondragon commented 5 months ago

The SDK docs mention:

To sync the stream as if it did not contain a primary key, simply set __key_properties__ to null.

We may want to update that: https://github.com/meltano/sdk/issues/2420