meltano / sdk

Write 70% less code by using the SDK to build custom extractors and loaders that adhere to the Singer standard: https://sdk.meltano.com
https://sdk.meltano.com
Apache License 2.0
94 stars 68 forks source link

Option in SQL Targets to coerce types based on observed record shape #1561

Open aaronsteers opened 1 year ago

aaronsteers commented 1 year ago

This would be a simple-to-use option for end users who are trying to deal with "rogue" or incorrect type declarations in the upstream tap.

For instance, if the tap incorrectly defines one of its fields as an integer, but it receives a string, then we could give the user an option of auto-expanding the data type to be inclusive of the declared type and also the observed type. Since a string column can hold integers as well as strings, expanding the data type to a string type will allow the load to complete successfully.

Implementation wise, if built within the tap or mapper layer, this normally would result in a new SCHEMA message being emitted upon observance of a record that does not fit the declared schema. However, if built in the target, there's no need to emit a SCHEMA message. Instead, the Sink class per batch would expand data type negotiation to be inclusive of (1) declared type, (2) target column's already existing type, and (2) observed data type in the records. Currently this negotiation exists but it only considers the first two factors.

Note:

cc @radbrt

radbrt commented 1 year ago

From what I can tell, the SDK does type validation by default. An input stream where integer columns contain a string, and corresponding error message:

{"type": "SCHEMA", "stream": "badges2", "schema": {"type": "object", "properties": {"id": {"type": ["integer"]}, "user_id": {"type": ["null", "string"], "maxLength": 64}, "name": {"type": ["null", "string"], "maxLength": 64}, "class": {"type": ["null", "integer"]}, "tag_based": {"type": ["null", "boolean"]}}}, "key_properties": ["id"], "bookmark_properties": ["id"]}
{"type": "STATE", "value": {"bookmarks": {"badges2": {"last_replication_method": "INCREMENTAL", "replication_key": "id", "version": 1667747583387}}, "currently_syncing": "badges"}}
{"type": "ACTIVATE_VERSION", "stream": "badges2", "version": 1667747583387}
{"type": "RECORD", "stream": "badges2", "record": {"class": 2, "id": 1, "name": "e9dd95f26c8b24ab973d3f8f", "tag_based": null, "user_id": "9C"}, "version": 1667747583387, "time_extracted": "2022-11-06T15:13:03.387853Z"}
{"type": "RECORD", "stream": "badges2", "record": {"class": "2", "id": 2, "name": "984fa1edcd96cc55224124c0", "tag_based": null, "user_id": "40"}, "version": 1667747583387, "time_extracted": "2022-11-06T15:13:03.387853Z"}
{"type": "RECORD", "stream": "badges2", "record": {"class": "D", "id": 3, "name": "cc3453e38b2371bffcae9aa0", "tag_based": null, "user_id": "33"}, "version": 1667747583387, "time_extracted": "2022-11-06T15:13:03.387853Z"}
{"type": "STATE", "value": {"bookmarks": {"badges2": {"last_replication_method": "INCREMENTAL", "replication_key": "id", "version": 1667747583387, "replication_key_value": 3}}, "currently_syncing": null}}

The error message:

2023-03-29 19:23:39,982 Target 'target-oracle' is listening for input from tap.
2023-03-29 19:23:39,982 Initializing 'target-oracle' target sink...
2023-03-29 19:23:39,983 Initializing target sink for stream 'badges2'...
Traceback (most recent call last):
  File "/Users/radbrt/miniconda3/envs/target-oracle/bin/target-oracle", line 8, in <module>
    sys.exit(TargetOracle.cli())
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/target_base.py", line 564, in cli
    target.listen(file_input)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/io_base.py", line 34, in listen
    self._process_lines(file_input)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/target_base.py", line 265, in _process_lines
    counter = super()._process_lines(file_input)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/io_base.py", line 81, in _process_lines
    self._process_record_message(line_dict)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/target_base.py", line 308, in _process_record_message
    sink._validate_and_parse(transformed_record)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/sinks/core.py", line 302, in _validate_and_parse
    self._validator.validate(record)
  File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/jsonschema/validators.py", line 310, in validate
    raise error
jsonschema.exceptions.ValidationError: '2' is not of type 'null', 'integer'

Failed validating 'type' in schema['properties']['class']:
    {'type': ['null', 'integer']}

On instance['class']:
    '2'

I hope to add functionality, and a setting tentatively named types_dont_fail_me_now: true, that allows the target to alter the column type of the offending column.

The pragmatic approach would be to alter the column to varchar (possibly via add column/drop column), and coerce the input, but we could conceivably make a more complex ruleset of int -> bigint -> number -> varchar -> clob.

The other half of the solution must be that during the validation of the target schema (typically during subsequent runs), so that when it finds varchar in the target column it will accept the target type and keep coercing the integer (or whatever) to varchar.

Type checking brings a performance hit, and many users might prefer if the target fails when something is wrong (so it must be a configurable setting). And this would not be feasible to do during batch loads.

BuzzCutNorman commented 1 year ago

I have run into the JSON Validation errors with a couple of data type. @edgarrmondragon tracked down the root cause of the issue and created a draft PR that contains a possible way for developers to control the JSON Validation settings. Here is the link to the PR just in case it is helpful for this conversation.

feat: Support custom JSON schema validation and string format checkers in targets

1471

stale[bot] commented 1 year ago

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

tayloramurphy commented 1 year ago

Still relevant right @edgarrmondragon ?

edgarrmondragon commented 1 year ago

@tayloramurphy yeah this'd be nice to have