Open visch opened 2 years ago
@visch can you confirm this is still a bug? @edgarrmondragon have you seen this?
@visch can you confirm this is still a bug? @edgarrmondragon have you seen this?
@tayloramurphy One solution might be to implement what Douwe proposed in the old GitLab issue:
- Have Meltano not set
oauth_credentials
if none of its subproperties are none-null
It seems like there's some conflict between how different plugins treat null
config values. For some, like the SDK in its stream_maps
, null
is a value with special meaning, while for others there's no distinction between null and missing.
Douwe's comment on the old GitLab issue:
This issue has come up again in #2368 (closed), where https://github.com/datamill-co/target-snowflake's
target_s3
setting should either be unset,null
, an empty object{}
, or an object containing nestedaws_access_key_id
,aws_secret_access_key
,bucket
andkey_prefix
properties with non-null
values.Right now, Meltano would generate a nested object with those 4 keys and
null
values if none of thetarget_s3.*
settings are set, instead of omitting thetarget_s3
key entirely or giving it anull
value.Our best bet appears to be to omit keys with
null
values fromconfig.json
entirely, so that there will be notarget_s3.*
keys to unflatten into atarget_s3: {...}
object at all if none of the nested properties have values.
@edgarrmondragon reading through that PR though it seems like Meltano drops config where it is null
b/c a couple of plugins check for existence of the key and not for the value. This seems problematic since it means SDK-build taps can't really use a feature of streams maps with Meltano since it will get ignored.
What's a possible fix here? Should we put in a check on the stream maps keys specifically to not drop the key? This seems like a big deal as it limits the utility of stream maps with Meltano.
@aaronsteers would love your thoughts here.
@visch can you confirm this is still a bug? @edgarrmondragon have you seen this?
I can confirm this is still the behavior
visch@visch-ubuntu:~/git/tap-abc$ meltano --version
meltano, version 2.4.0
config:
stream_maps:
customers_pii:
"email": "email"
__else__: null
visch@visch-ubuntu:~/git/tap-abc$ meltano invoke --dump=config tap-abc
{
"stream_maps": {
"customers_pii": {
"email": "email"
}
}
I haven't taken a deep dive here into the solution but my gut feeling is https://gitlab.com/meltano/meltano/-/issues/2334 was caused by discovery already containing the settings. If discovery already contains the setting then you have to make a choice between returning a null key/value or not having the config.
If my assumption above is the case then I think it makes sense to not return a key/value pair when a setting
is defined but there's no config
provided, and to return a null when someone explicitly provides a null for that setting (this doesn't happen today as this is the bug we're talking about here)
I'm trying to use the terms config/setting as setting
being the "spec" (basically a JSON Schema) for the .config.json
for the tap/target as defined by Meltano, and config
as the value provided by a config
value via (env variables, .env file, meltano.yml, db) while settings
shouldn't return anything unless there's a specific default value given. I don't' have better words for this right now. https://docs.meltano.com/guide/configuration#configuration-layers references config
, setting
, properties
maybe we should define these words more clearly?
@edgarrmondragon @aaronsteers ping on this to get your take. Seems like we may want to have some configurable behavior on whether or not to drop nulls. Could this be an option for capabilities
?
@tayloramurphy and @edgarrmondragon - I don't see a great fix for this. The capabilities
workaround could work, but it feels pretty hacky and bespoke imo. What I don't have good visiblity into is where Meltano may be relying on this behavior of dropping values or excluding them based on a null value, and how pervasive that is in the invocation and configuration codebase.
I see 2 or 3 options that I'll list below: (a) use a feature flag to disable this 'remove-if-null' behavior in Meltano, (b) hard code this just for meltano-map-transform
, or (c) change the SDK convention to not use nulls. I think I lean slightly towards C, but that would be slow to implement, and requires us to agree on a new spec. B is a viable short-term fix - but I don't know the code well enough to know if it is a larger lift that seems easy but actually isn't. (For instance, the code evaluating settings may or may not have access to the plugin name. I like "A" the best because it cleans up the codebase, but I have no idea how much will break when we make that change - or if there are legitimate challenges with not cleaning up null values in our configsets. (E.g. 'null' could be a convention meaning 'use the plugin's native default', which is translated into 'drop the setting'.)
Re the option of a feature flag, it would otherwise be hard to measure the impact in the wild. In the long run, though, that gives us a clean and clear path to change Meltano's default behavior if and when we are in a place to do so.
meltano-map-transform
This would most likely be a temporary workaround while we iterate on either option A or C or both. I like this slightly better than the capabilities
approach because it feels more direct and doesn't require us changing plugin definitions.
If changing the SDK convention, we could add something equivalent to the null value in the SDK mapping/mapper capability that just allows a non-null expression to be equivalent to how we are using null
today. For instance: we could accept something like'NULL'
, remove()
, exclude()
or '__'
- or we could create a __exclude__
dunder operator that replaces the existing null
assignments.
So this sample...
{
"stream_maps": {
"customers": {
// exclude these since we're capturing them in the pii stream
"email": null,
"full_name": null
},
"customers_pii": {
"__source__": "customers",
// include just the PII and the customer_id
"customer_id": "customer_id",
"email": "email",
"full_name": "full_name",
// exclude anything not declared
"__else__": null,
},
},
}
Would become something like:
{
"stream_maps": {
"customers": {
// exclude these since we're capturing them in the pii stream
"__exclude__": ["email", "full_name"]
},
"customers_pii": {
"__source__": "customers",
// include just the PII and the customer_id
"customer_id": "customer_id",
"email": "email",
"full_name": "full_name",
// exclude anything not declared
"__exclude_unmapped__": true
},
},
}
@aaronsteers is there potentially a fourth option where we have a check per plugin on whether or not an env_var is set? I'm thinking:
environments:
- name: prod
config:
plugins:
extractors:
- name: tap-github
env:
NULL_KEEP: true
this would keep nulls for tap-github but if I had it for
environments:
- name: dev
env:
NULL_KEEP: true
it would keep it for everything?
I propose this b/c my only hesitation with the feature flag option is that it's global and I can see users wanting to configure this behavior for specific plugins.
Long-term I like updating the SDK to use something other than null
.
@tayloramurphy - I'm not sure honestly. We'd have to dig in deeper to know how this would actually handle.
For all the trouble this creates in the Meltano codebase, and potential complexities/risks it raises, I'm inclined to just add 'NULL'
as an alias for null
in the SDK - and then instruct Meltano users to send an allcaps string while we work on our longterm solve...
Logged:
Just verified that this doesn't work
visch@visch-ubuntu:~/git/meltano-projects/test_null_key_with_tap$ meltano invoke --dump=config tap-csv
2022-10-13T13:25:07.971681Z [info ] Environment 'dev' is active
{
"stream_maps": {
"customers_pii": {
"email": "email"
}
}
}
visch@visch-ubuntu:~/git/meltano-projects/test_null_key_with_tap$ cat meltano.yml
version: 1
default_environment: dev
environments:
- name: dev
- name: staging
- name: prod
project_id: 8339108a-ba30-48ae-9dcc-8c50702ff803
plugins:
extractors:
- name: tap-csv
variant: meltanolabs
pip_url: git+https://github.com/MeltanoLabs/tap-csv.git
config:
stream_maps:
customers_pii:
"email": "email"
__else__: null
visch@visch-ubuntu:~/git/meltano-projects/test_null_key_with_tap$ meltano --version
meltano, version 2.7.2
Maybe it works with mappers, but not with taps?
Verified that this works with mappers but not taps (Really hard to just get the config for mappers but here we go)
meltano.yml
version: 1
default_environment: dev
environments:
- name: dev
- name: staging
- name: prod
project_id: 8339108a-ba30-48ae-9dcc-8c50702ff803
plugins:
extractors:
- name: tap-csv
variant: meltanolabs
pip_url: git+https://github.com/MeltanoLabs/tap-csv.git
config:
stream_maps:
customers_pii:
email: email
__else__:
files:
- entity: "in"
path: "in.csv"
keys: ["id"]
delimiter: ","
loaders:
- name: target-jsonl
variant: andyh1203
pip_url: target-jsonl
mappers:
- name: transform-field
variant: transferwise
pip_url: pipelinewise-transform-field
executable: transform-field
mappings:
- name: pipedrive-exclude-custom-fields
config:
stream_maps:
activities:
id: id
deal_title: deal_title
__else__:
transformations:
- field_id: "commit"
tap_stream_name: "commits"
field_paths: ["author/email", "committer/email"]
type: "HASH"
meltano run tap-csv pipedrive-exclude-custom-fields target-jsonl
(monkey patch in a sleep in target-jsonl)
}visch@visch-ubuntu:~/git/meltano-projects/test_null_key_with_tap/.meltano/run/transform-field$ ls
mapper.90aa6695-2412-4909-975c-f1901713d331.config.json
visch@visch-ubuntu:~/git/meltano-projects/test_null_key_with_tap/.meltano/run/transform-field$ cat mapper.90aa6695-2412-4909-975c-f1901713d331.config.json
{
"stream_maps": {
"activities": {
"id": "id",
"deal_title": "deal_title",
"__else__": null
}
},
"transformations": [
{
"field_id": "commit",
"tap_stream_name": "commits",
"field_paths": [
"author/email",
"committer/email"
],
"type": "HASH"
}
]
@tayloramurphy, @sbalnojan
Now that the below merged, I think we can lower priority on this issue. Meltano will still purge null values from mappers' config, but users may now instead send __NULL__
to provide that null
behavior.
In the long run, it's still worth trying to unwind that null
handling behavior, but at a lower priority. (Removed from board.)
@sbalnojan - I think you've been avoiding the meltano-map-transforms
plugin overall in tutorials for this and other reasons, but if any existing references exist, it may also worth updating tutorials/instructions, if/where applicable.
@aaronsteers yes that's one reason, but the main reason is missing documentation https://github.com/meltano/meltano/issues/6412#issuecomment-1277792107.
@sbalnojan: understood. This is not as of now prioritized and I'm not clear if (once prioritized) it would land in DevRel or Engineering:
Is this something you'd feel comfortable picking up, or are there additional items that Engineering can/should pick up to unblock/facilitate that effort?
To the point of this here issue, my point is only that the completion of https://github.com/meltano/sdk/pull/1262 reduces the relative priority of this issue.
@tayloramurphy logging here that we had another user (and myself too) get stuck on this bug in slack today https://meltano.slack.com/archives/C01TCRBBJD7/p1671134681823139.
From my view mappers and stream maps are a differentiating meltano feature that gives flexibility and allows users to hack around broken taps/targets, especially since we are constantly recommending it in slack. It feels like we're very close to something super powerful if it was cleaned up and documented.
After skimming this issue I see that its a challenge to fix in meltano but it is really a meltano bug 😄 . It helps if its fixed in the SDK but all taps/targets need to be upgraded to get support back, even though it already works today and meltano's bug causes the null values to disappear.
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen
label, or request that it be added.
To sync the stream as if it did not contain a primary key, simply set
__key_properties__
tonull
.
We may want to update that: https://github.com/meltano/sdk/issues/2420
Trying to replicate the behavior here
https://sdk.meltano.com/en/latest/stream_maps.html
So if you have a meltano.yml like
And then run
meltano invoke --dump=config tap-name
the else token will not exist.It's not clear that this is a bug as I did some digging here and as a part of https://gitlab.com/meltano/meltano/-/issues/2334 (more specifically https://gitlab.com/meltano/meltano/-/merge_requests/1898 ) it looks like passing null values was removed as it was causing a different issue
UPDATE (AJ, 2022-08-30)
As noted below in https://github.com/meltano/meltano/issues/6382#issuecomment-1208353612, the bug is that Meltano excludes/removes config input values if they are
null
.