Open aaronsteers opened 1 year ago
@aaronsteers this is great 🙌 From a Meltano perspective, I think it would be preferable to maintain syntax parity between 'filter' (ignore/include) features in the SDK and those produced/expected by `--select` and `--exclude`. This will both limit the opportunity for user-error type bugs (e.g. two devs in the same Meltano Project using two approaches with two different syntaxes to try and achieve the same ultimate selection) and allow us to "push down" selection/exclude criteria directly from Meltano (possibly with env var expansion) into `config.json` for the Tap to apply during discovery/run.
E.g. accepting the patterns as produced from the `--select`/`--exclude` commands docs, allowing users to configure pre-discovery and post-discovery filtering in one place and in one syntax 🤯
Enabled patterns:
tags.*
commits.id
commits.project_id
commits.created_at
commits.author_name
commits.message
!*.*_url
Does that make sense?
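
For illustration, here is a minimal sketch of how a list in that syntax could be evaluated against `stream.property` paths. The helper name `is_selected` is hypothetical, the matching uses Python's standard `fnmatch` module, and the rule that a `!` pattern always wins over an include is an assumption about the intended semantics, not a statement of what Meltano or the SDK does internally.

```python
from fnmatch import fnmatchcase


def is_selected(path: str, patterns: list[str]) -> bool:
    """Return True if `path` matches an include pattern and no exclude pattern.

    Patterns prefixed with "!" are treated as excludes (assumed to take
    precedence over includes, regardless of their position in the list).
    """
    includes = [p for p in patterns if not p.startswith("!")]
    excludes = [p[1:] for p in patterns if p.startswith("!")]
    if any(fnmatchcase(path, p) for p in excludes):
        return False
    return any(fnmatchcase(path, p) for p in includes)


# A subset of the patterns listed above.
patterns = ["tags.*", "commits.id", "commits.message", "!*.*_url"]

checks = {
    path: is_selected(path, patterns)
    for path in ["tags.name", "tags.web_url", "commits.id", "commits.parent_ids"]
}
```

With these patterns, `tags.name` and `commits.id` are kept, while `tags.web_url` (excluded by `!*.*_url`) and `commits.parent_ids` (never included) are not.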
@kgpayne - Agreed: if we can use the same syntax as `--exclude` uses in Meltano today, then it could be a passthrough to the SDK if the tap has `ignore-patterns` (or similar) as a capability.
The passthrough is also a more performant implementation, exactly because it short-circuits the discovery process on those streams when supported, also reducing the size of the generated catalog.
@aaronsteers great 👏 If we follow the Meltano select convention, I think your examples in the issue description reverse 🤔 I.e. "*" would become "select all" and `!users` and `!customers` would mean "exclude users and customers", as they are in Meltano? Just checking I am still following.
On naming, would it make sense to call these more generically `filter-patterns` or similar in the SDK config, as they support both include and exclude (via negation)?
> @aaronsteers great 👏 If we follow the Meltano select convention, I think your examples in the issue description reverse 🤔 I.e. "*" would become "select all" and `!users` and `!customers` would mean "exclude users and customers", as they are in Meltano? Just checking I am still following.
Sorry. I did not mean to suggest to use `--select` and `--exclude`, and I still believe as I mentioned in my writeup that we should make this a new `ignore` feature and not conflated with select/deselect.
My point was just that we can use the syntax of the rules, but applied to `ignore`.
@aaronsteers ah, ok. Thanks for clarifying.
How do you see the common ask of "limit discovery to selected streams" working with `ignore`? With full separation between selection and ignore pattern syntaxes (with ignore being the opposite of select), would it not be the case that a user would first have to use `meltano select tap-example users "*"` to select a stream and then additionally `meltano config tap-example ignore "!users.*"` then `meltano config tap-example ignore "*"` to ensure only the `users` stream is discovered? Repeated for each selection and remembering to negate their selection for `ignore`?
I agree that `ignore` patterns as described are different in when they apply (`ignore` is applied pre-discovery in the SDK, and `select` applies post-discovery in Meltano on `catalog.json`) but what the patterns refer to is the same - included or excluded streams and stream properties.
So by 'push down' I imagined that the select patterns, supported in the same format by both Meltano and the SDK, could be injected verbatim from the `meltano.yml` `select` extra into the new setting in `config.json` to achieve the "limit discovery to selected streams" use case. This behaviour would be a feature of Meltano (behind a config extra flag in Meltano, e.g. `limit_discovery_to_selection`) and would be a merge with any other patterns already defined in config directly, to allow users to leverage the other capabilities you mentioned. This would mean, for the common case, selection defined once (one place and in one format) then applied twice - pre and post discovery - with the option of configuring additional pre-discovery rules (including broad ignores) as needed.
Does that make sense? Maybe "limit discovery to selected streams" isn't a perfect fit for `ignore`?
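
To make the 'push down' idea concrete, a rough sketch of the merge step might look like the following. Everything here is hypothetical: the function name `push_down_selection`, the setting name `filter_patterns`, and the choice to append select patterns after any patterns already present in config are illustrative assumptions, not existing Meltano or SDK behaviour.

```python
def push_down_selection(
    config: dict,
    select: list[str],
    limit_discovery_to_selection: bool,
    setting: str = "filter_patterns",  # hypothetical setting name
) -> dict:
    """Return a copy of `config` with `select` patterns merged into `setting`.

    Patterns already defined directly in config are kept first; select
    patterns are appended verbatim, skipping exact duplicates.
    """
    if not limit_discovery_to_selection:
        return dict(config)
    merged = list(config.get(setting, []))
    for pattern in select:
        if pattern not in merged:
            merged.append(pattern)
    return {**config, setting: merged}


tap_config = {"api_url": "https://example.invalid", "filter_patterns": ["!*.*_url"]}
select_extra = ["users.*", "!*.*_url"]

pushed = push_down_selection(tap_config, select_extra, limit_discovery_to_selection=True)
```

Here `pushed["filter_patterns"]` ends up as `["!*.*_url", "users.*"]`, and the original `tap_config` is left unmodified.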
> With full separation between selection and ignore pattern syntaxes (with ignore being the opposite of select), would it not be the case that a user would first have to use `meltano select tap-example users "*"` to select a stream and then additionally `meltano config tap-example ignore "!users.*"` then `meltano config tap-example ignore "*"` to ensure only the `users` stream is discovered? Repeated for each selection and remembering to negate their selection for `ignore`?
I don't think it's as duplicative as that - specifying the rule either in select or in exclude should be sufficient. I don't think you need anything in the select rules except `'*'` - and that is only needed if the tap does not select its streams by default. So, the ignore rule would just be `['*', '!users']` - and to add one more table you'd expand it with one additional item: `['*', '!users', '!customers']`. If we're choosing here to really lock down the `ignore` rule very strictly, a simple pairing with select of `'*'` should be fully sufficient. Again, this isn't really the main use case for ignore, since `select` is a better match for this use case. Specifically, you'd want to use `select` in this case because you want to see that 'users' and 'orders' are selected, but not 'addresses'. If you use ignore instead of select, then you lose visibility to `addresses` being present and deselected in the source.
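
The visibility difference can be sketched roughly as follows. This assumes `.gitignore`-style evaluation (rules applied in order, last match wins, `!` re-includes); the helper `is_ignored` and the simplified catalog entries are hypothetical and only meant to show the shape of the two outcomes.

```python
from fnmatch import fnmatchcase


def is_ignored(stream_name: str, rules: list[str]) -> bool:
    """Apply rules in order; the last matching rule decides (gitignore-like)."""
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if fnmatchcase(stream_name, pattern):
            ignored = not negated
    return ignored


source_streams = ["users", "orders", "addresses"]

# Approach 1: ignore everything except 'users' and 'orders'.
ignore_rules = ["*", "!users", "!orders"]
catalog_via_ignore = [
    {"stream": s, "selected": True}
    for s in source_streams
    if not is_ignored(s, ignore_rules)
]

# Approach 2: plain selection; 'addresses' stays visible but deselected.
selected = {"users", "orders"}
catalog_via_select = [
    {"stream": s, "selected": s in selected} for s in source_streams
]
```

In the first catalog `addresses` is simply absent; in the second it is present with `selected` set to `False`, which is the visibility that would be lost by using ignore for this purpose.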
A better-matched use case would be if we have a tap with a three-part stream name containing `'<db>-<schema>-<table>'`, and the source has three databases: 'prod', 'test', and 'cicd'. We want to make sure when we select the 'users' table, we're only getting the version from prod. Essentially we want to treat the tap as if it did not contain the `test` or `cicd` databases at all. It is not that we want to select or deselect those other databases, but we want to basically insulate ourselves from them and pretend they don't exist. So an ignore rule like `['test-*', 'cicd-*']` or inversely `['*', '!prod']` will have the effect of making it look like the tap only knows about data from 'prod', even if the tap itself is able to view and extract from all three databases.
That's very similar to the default behavior for excluding `information_schema` - not only do we want to save time from having to scan it, but we want to pretend like those tables just don't exist at all for the purposes of our catalog construction.
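
As a small sketch of the three-database case, assuming glob matching via Python's `fnmatch` and made-up stream names: the tap would drop any stream matching an ignore pattern before it ever reaches the catalog. Under plain `fnmatch` semantics the inverse form would presumably need a trailing wildcard (e.g. `['*', '!prod-*']`); how a bare `!prod` should match is one of the open parsing questions.

```python
from fnmatch import fnmatchcase

# Hypothetical '<db>-<schema>-<table>' stream names the tap can see.
all_streams = [
    "prod-public-users",
    "prod-public-orders",
    "test-public-users",
    "cicd-public-users",
]

ignore_patterns = ["test-*", "cicd-*"]

# Streams matching any ignore pattern are treated as if they did not exist.
visible_streams = [
    name
    for name in all_streams
    if not any(fnmatchcase(name, pattern) for pattern in ignore_patterns)
]
```

Only the two `prod-` streams remain in `visible_streams`, so later selection rules such as "select the users table" can only ever resolve to the prod copy.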
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the `evergreen` label, or request that it be added.
Still relevant, in so far as it relates to per-stream config (#1350).
Feature scope
Taps (catalog, state, stream maps, etc.)
Description
This proposal would introduce a standard tap config option like `ignored_patterns` or `ignored_streams`, or just `ignore`, which would accept glob-like input similar to `.gitignore`. This would operate similar to `--exclude` in Meltano as the first-order, highest priority (de)selection logic.
While this technically affects "selection" and "deselection", it actually would operate differently from both, and so we should avoid conflating them in discussion.
Like (de)selection logic:
- If I ignore the `user` table or deselect it, either way it will not be synced to the target.
- If I set `"ignore": [ "*", "!users", "!customers" ]`, that is logically equivalent to deselecting all tables except 'users' and 'customers'. (Same as `.gitignore` convention.)
Unlike (de)selection logic:
- If I set `"ignore": [ "addresses.*", "*.*email*" ]`, then I can be 100% sure that no selection logic will later be introduced that pulls in any tables starting with "addresses*" or any columns containing the text "email". (Those physically would not be in the catalog to be selected.)
- If I set `"ignore": [ "information_schema-*" ]`, then my tap doesn't need to waste time analyzing any tables within `information_schema`.
A few nice things about accepting patterns and phrasing in the negative:
- ... `selection`/`deselection` logic. That logic still functions exactly according to Singer Spec.
- ... `ignore` logic and save time during discovery - while also reducing the size of the generated catalog artifact.
When to use `ignore` instead of `selection`.
Challenges or reasons not to build
The biggest challenge is that there is not an obvious parser or glob pattern for stream and property ignore rules to follow. The easiest path would be to mimic the glob expressions that Meltano uses today for `--select` and `--exclude`. But escaping is always something to consider, and there may be other alternatives out there based on jsonpath or similar, which are more standards-based, even if less inherently readable.
Another challenge is that by removing streams and properties from the catalog entirely, we miss an opportunity to document what exclusions have taken place. We could mitigate this by adding some annotations to the catalog, such as a top-level `"ignored_streams": ["stream-a", "stream-b"]` and a stream-level `"ignored_properties": ["property-a", "property-b"]`.
Related to:
1234
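
As a rough illustration of the annotation idea from the challenges section, the sketch below removes ignored streams and properties from a catalog-like dict while recording what was dropped. It is only a sketch under stated assumptions: the function name `apply_ignore_with_annotations` is hypothetical, patterns without a `.` are treated as stream patterns and patterns with a `.` as `stream.property` patterns, stream names are assumed not to contain `.`, and `!` negation is left out for brevity.

```python
from fnmatch import fnmatchcase


def apply_ignore_with_annotations(catalog: dict, patterns: list[str]) -> dict:
    """Drop ignored streams/properties and annotate the catalog with them."""
    stream_globs = [p for p in patterns if "." not in p]
    prop_globs = [tuple(p.split(".", 1)) for p in patterns if "." in p]

    result = {**catalog, "streams": [], "ignored_streams": []}
    for stream in catalog["streams"]:
        name = stream["tap_stream_id"]
        if any(fnmatchcase(name, g) for g in stream_globs):
            result["ignored_streams"].append(name)
            continue

        props = stream["schema"]["properties"]
        dropped = [
            prop
            for prop in props
            if any(
                fnmatchcase(name, sg) and fnmatchcase(prop, pg)
                for sg, pg in prop_globs
            )
        ]
        kept = {k: v for k, v in props.items() if k not in dropped}
        new_stream = {**stream, "schema": {**stream["schema"], "properties": kept}}
        if dropped:
            new_stream["ignored_properties"] = dropped
        result["streams"].append(new_stream)
    return result


catalog = {
    "streams": [
        {
            "tap_stream_id": "users",
            "schema": {"properties": {"id": {}, "email": {}, "name": {}}},
        },
        {
            "tap_stream_id": "information_schema-tables",
            "schema": {"properties": {"table_name": {}}},
        },
    ]
}

annotated = apply_ignore_with_annotations(
    catalog, ["information_schema-*", "*.*email*"]
)
```

In this example the `information_schema-tables` stream is listed under the top-level `ignored_streams`, and the `users` stream keeps `id` and `name` while recording `email` under `ignored_properties`.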