Feature: Accept `ignore` glob patterns as standard tap config

aaronsteers commented 1 year ago

Feature scope

Taps (catalog, state, stream maps, etc.)

Description

This proposal would introduce a standard tap config option lik ignored_patterns or ignored_streams, or just ignore, which would accept glob-like input similar to .gitignore. This would operate similar to --exclude in Meltano as the first-order, highest priority (de)selection logic.

While this technically affects "selection" and "deselection", it actually would operate differently from both, and so we should avoid conflating them in discussion.

Like (de)selection logic:

Ignored streams and properties would not be replicated downstream.
- E.g., whether I "ignore" the user table or deselect it, either way it will not by synced to the target.
Ignore rules can also be inverted.
- E.g. If I set "ignore": [ "*", "!users", "!customers" ], that is logically equivalent to deselecting all tables except 'users' and 'customers'. (Same as .gitignore convention.)

Unlike (de)selection logic:

Ignore rules always have higher precedence over selection.
- E.g. If I set "ignore": [ "addresses.*", "*.*email*" ], then I can be 100% sure that no selection logic will later be introduced that pulls in any tables starting with "addresses*" or any columns containing the text "email". (Those physically would not be in the catalog to be selected.)
Ignore rules apply during discovery.
- E.g. If I set "ignore": [ "information_schema-*" ], then my tap doesn't need to waste time analyzing any tables within information_schema.

A few nice things about accepting patterns and phrasing in the negative:

We're not overriding or changing selection/deselection logic. That logic still functions exactly according to Singer Spec.
Catalog output could take into account the ignore logic and save time during discovery - while also reducing the size of the generated catalog artifact.

When to use ignore instead of selection.

Use Selection	Use Ignore
When you want to select or deselect specific streams or properties.	When you want to exclude large group of streams or properties from being eligible for selection. (`information_schema-,`.email_address`, `broken_table.*`)
To deselect streams and large data sources you don't need - with the understanding you can add them back easily in the future.	To systematically exclude streams you would never need or want included.
To leave record of what is deselected in the catalog.	To suppress the streams or fields from the catalog entirely.
To tell the tap to include fields that may not be included by default for that tap.	To tell the tap to ignore fields and tables it otherwise would try to scan.

Challenges or reasons not to build

The biggest challenge is that there is not an obvious parser or glob pattern for stream and property ignore rules to follow. The easiest path would be to mimic the glob expressions that Meltano uses today for --select and --exclude. But escaping is always something to consider, and there may be other alternatives out there based on jsonpath or similar, which are more standards-based, even if less inherintly readable.

Another challenge is that by removing streams and properties from the catalog entirely, we miss an opportunity to document what exclusions have taken place. We could mitigate this by adding some annotations within to the catalog, such as a top-level "ignored_streams": ["stream-a", "stream-b"] and a stream-level "ignored_properties": ["property-a", "property-b"].

Related to:

1234

kgpayne commented 1 year ago

@aaronsteers this is great 🙌 From a Meltano perspective, I think it would be preferable to maintain syntax parity between 'filter' (ignore/include) features in the SDK and those produced/expected by --select and --exclude. This will both limit the opportunity for user-error type bugs (e.g. two devs in the same Meltano Project using two approaches with two different syntaxes to try and achieve the same ultimate selection) and allow us to "push down" selection/exclude criteria directly from Meltano (possibly with env var expansion) into config.json for the Tap to apply during discovery/run.

E.g. accepting the patterns as produced from the --select/--exclude commands docs, allowing users to configure pre-discovery and post-discovery filtering in one place and in one syntax 🤯

Enabled patterns:
    tags.*
    commits.id
    commits.project_id
    commits.created_at
    commits.author_name
    commits.message
    !*.*_url

Does that make sense?

aaronsteers commented 1 year ago

@kgpayne - Agreed: if we can use the same syntax as --exclude uses in Meltano today, then it could be a passthrough to the SDK if the tap has ignore-patterns (or similar) as a capability.

The passthrough is also a more performant implementation, exactly because it short-circuits the discovery process on those streams when supported, also reducing the size of the generated catalog.

kgpayne commented 1 year ago

@aaronsteers great 👏 If we follow the Meltano select convention, I think your examples in the issue description reverse 🤔 I.e. "*" would become "select all" and !users and !customers would mean "exclude users and customers", as they are in Meltano? Just checking I am still following.

On naming, would it make sense to call these more generically filter-patterns or similar in the SDK config, as they support both include and exclude (via negation)?

aaronsteers commented 1 year ago

@aaronsteers great 👏 If we follow the Meltano select convention, I think your examples in the issue description reverse 🤔 I.e. "*" would become "select all" and !users and !customers would mean "exclude users and customers", as they are in Meltano? Just checking I am still following.

Sorry. I did not mean to suggest to use --select and --exclude, and I still believe as I mentioned in my writeup that we should make this a new ignore feature and not conflated with select/deselect.

My point was just that we can use the syntax of the rules, but applied to ignore.

kgpayne commented 1 year ago

@aaronsteers ah, ok. Thanks for clarifying.

How do you see the common ask of "limit discovery to selected streams" working with ignore? With full separation between selection and ignore pattern syntaxes (with ignore being the opposite of select), would it not be the case that a user would first have to use meltano select tap-example users "*" to select a stream and then additionally meltano config tap-example ignore "!users.*" then meltano config tap-example ignore "*" to ensure only the users stream is discovered? Repeated for each selection and remembering to negate their selection for ignore?

I agree that ignore patterns as described are different in when they apply (ignore is applied pre-discovery in the SDK, and select applies post-discovery in Meltano on catalog.json) but what the patterns refer to is the same - included or excluded streams and stream properties.

So by 'push down' I imagined that the select patterns, supported in the same format by both Meltano and the SDK, could be injected verbatim from meltano.yml select extra into the new setting in config.json to achieve the "limit discovery to selected streams" use case. This behaviour would be a feature of Meltano (behind a config extra flag in meltano (e.g. limit_discovery_to_selection) and would be a merge with any other patterns already defined in config directly, to allow users to leverage the other capabilities you mentioned. This would mean, for the common case, selection defined once (one place and in one format) then applied twice - pre and post discovery - with the option of configuring additional pre-discovery rules (including broad ignores) as needed.

Does that make sense? Maybe "limit discovery to selected streams" isn't a perfect fit for ignore?

aaronsteers commented 1 year ago

With full separation between selection and ignore pattern syntaxes (with ignore being the opposite of select), would it not be the case that a user would first have to use meltano select tap-example users "*" to select a stream and then additionally meltano config tap-example ignore "!users.*" then meltano config tap-example ignore "*" to ensure only the users stream is discovered? Repeated for each selection and remembering to negate their selection for ignore?

I don't think it's as duplicative as that - specifying the rule either in select or in exclude should be sufficient. I don't think you need anything in the select rules except '' - and that is only needed if the tap does not select its streams by default. So, the ignore rule would just be `['', '!users']- and to add one more table you'd expand it with one additional item['', '!users', '!customers']. If we're choosing here to really lock down theignorerule very strictly, a simple pairing with select of''should be fully sufficient. Again, this isn't really the main use case for ignore, sinceselectis a better match for this use case. Specifically, you'd want to useselectin this case because you want to see that 'users' and 'orders' are selected, but not 'addresses'. If you use ignore instead of select, then you lose visibility toaddresses` being present and deselected in the source.

A better-matched use case would be if we have a tap with a three-part stream name containing '<db>-<schema>-<table>', and the source has three databases: 'prod', 'test', and 'cicd'. We want to make sure when we select the 'users' table, we're only getting the version from prod. Essentially we want to treat the tap as if it did not contain the test or cicd databases at all. It is not that we want to select or deselect those other databases, but we want to basically insulate ourselves from them and pretend they don't exist. So an ignore rule like ['test-*', 'cicd-*'] or inversely ['*', '!prod'] will have the effect of making it look like the tap only knows about data from 'prod', even if the tap itself is able to view and extract from all three databases.

That's very similar to the default behavior for excluding information_schema - not only do we want to save time from having to scan it, but we want to pretend like those tables just don't exist at all for the purposes of our catalog construction.

stale[bot] commented 1 year ago

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

kgpayne commented 1 year ago

Still relevant, in so far as it relates to per-stream config (#1350).

stale[bot] commented 2 weeks ago

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

meltano / sdk