singer-io / tap-adwords

GNU Affero General Public License v3.0
29 stars 37 forks source link

Call to Action: Missing Static Schema Fields #49

Open dmosorast opened 5 years ago

dmosorast commented 5 years ago

TLDR; The static schemas in this tap have aged, and seem to be missing a lot of (potentially important data). If you are missing data, this is how you can verify your change is working before submitting a PR.

The Issue

Singer-python logs fields not present in the schema as they are transformed in this sort of fashion:

2019-06-21 21:01:14,459Z    tap - WARNING Removed 10 paths during transforms:
2019-06-21 21:01:14,459Z    tap -       advertisingChannelSubType
2019-06-21 21:01:14,459Z    tap -       biddingStrategyConfiguration
2019-06-21 21:01:14,460Z    tap -       budget
2019-06-21 21:01:14,460Z    tap -       campaignGroupId
2019-06-21 21:01:14,460Z    tap -       finalUrlSuffix
2019-06-21 21:01:14,460Z    tap -       forwardCompatibilityMap
2019-06-21 21:01:14,460Z    tap -       selectiveOptimization
2019-06-21 21:01:14,460Z    tap -       trackingUrlTemplate
2019-06-21 21:01:14,461Z    tap - WARNING Removed paths list: ['advertisingChannelSubType', 'biddingStrategyConfiguration', 'budget', 'campaignGroupId', 'finalUrlSuffix', 'forwardCompatibilityMap', 'selectiveOptimization', 'trackingUrlTemplate']

This is done to ensure the quality of the data from the source by requiring an explicit data typing to be specified in the schema. Data typing changes are challenging for ETL, since the destination would need to be backfilled with historical data that is typed correctly.

The Ask

It's challenging to find representative examples for every field that can come through this tap, due to the nature of ads. So, if you see fields removed in this fashion and want to get the data, please submit a PR with the information below included, so that we can verify the correctness and successfully merge them.

The Process

A schema change is deceptively simple (given that it's just a JSON schema), but this can cause real problems down the line if not verified correctly. Including this information with a PR helps us verify that the data is correct using the tools Singer has to offer.

  1. Make the change to the schema and its metadata (the AdWords documentation on the service being modified can be a good guide for a rough start). Here is an example change set.
  2. Run the tap using target-stitch in verify mode, without a state file. (e.g., /path/to/env/tap-adwords --config config.json --catalog catalog.json | /path/to/env/target-stitch --config config.json --dry-run -o target-output.json).
    • The --dry-run flag will add local schema validation while not posting to Stitch and -o will save the records locally to the specified file for manual inspection.
    • Not using a state file will ensure that enough data has passed through to catch any issues with inconsistent data types.
    • To select the stream you want, you can use this tool with the output of /path/to/env/tap-adwords --config config.json --discover
  3. Open a PR with the Schema change on this repo with the following information:
    • A log snippet (~20 lines) with a "Removed paths" log line with and without the field removed
    • A log snippet with a "Batch is valid" message from the target (e.g., 2019-06-21 21:01:13,527Z target - INFO ad_groups (2345): Batch is valid
    • (Optional) A redacted RECORD message from the output file containing non-null data for the intended field(s). For information security's sake, do not post clear data or PII on Github. Adwords will frequently return null in the event of a lack of data, so seeing actual data come through is a HUGE help.

This is a new idea, so the requirements might change as we learn what is needed to ensure correctness, but this should be a good start. Thank you for helping us improve SInger

auyer commented 4 years ago

Hi @dmosorast , I noticed the help wanted flag in this issue. Maybe you could add a hacktoberfest tag also to attract more contributors !