Open aaronsteers opened 1 year ago
From what I can tell, the SDK does type validation by default. An input stream where integer columns contain a string, and corresponding error message:
{"type": "SCHEMA", "stream": "badges2", "schema": {"type": "object", "properties": {"id": {"type": ["integer"]}, "user_id": {"type": ["null", "string"], "maxLength": 64}, "name": {"type": ["null", "string"], "maxLength": 64}, "class": {"type": ["null", "integer"]}, "tag_based": {"type": ["null", "boolean"]}}}, "key_properties": ["id"], "bookmark_properties": ["id"]}
{"type": "STATE", "value": {"bookmarks": {"badges2": {"last_replication_method": "INCREMENTAL", "replication_key": "id", "version": 1667747583387}}, "currently_syncing": "badges"}}
{"type": "ACTIVATE_VERSION", "stream": "badges2", "version": 1667747583387}
{"type": "RECORD", "stream": "badges2", "record": {"class": 2, "id": 1, "name": "e9dd95f26c8b24ab973d3f8f", "tag_based": null, "user_id": "9C"}, "version": 1667747583387, "time_extracted": "2022-11-06T15:13:03.387853Z"}
{"type": "RECORD", "stream": "badges2", "record": {"class": "2", "id": 2, "name": "984fa1edcd96cc55224124c0", "tag_based": null, "user_id": "40"}, "version": 1667747583387, "time_extracted": "2022-11-06T15:13:03.387853Z"}
{"type": "RECORD", "stream": "badges2", "record": {"class": "D", "id": 3, "name": "cc3453e38b2371bffcae9aa0", "tag_based": null, "user_id": "33"}, "version": 1667747583387, "time_extracted": "2022-11-06T15:13:03.387853Z"}
{"type": "STATE", "value": {"bookmarks": {"badges2": {"last_replication_method": "INCREMENTAL", "replication_key": "id", "version": 1667747583387, "replication_key_value": 3}}, "currently_syncing": null}}
The error message:
2023-03-29 19:23:39,982 Target 'target-oracle' is listening for input from tap.
2023-03-29 19:23:39,982 Initializing 'target-oracle' target sink...
2023-03-29 19:23:39,983 Initializing target sink for stream 'badges2'...
Traceback (most recent call last):
File "/Users/radbrt/miniconda3/envs/target-oracle/bin/target-oracle", line 8, in <module>
sys.exit(TargetOracle.cli())
File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/target_base.py", line 564, in cli
target.listen(file_input)
File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/io_base.py", line 34, in listen
self._process_lines(file_input)
File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/target_base.py", line 265, in _process_lines
counter = super()._process_lines(file_input)
File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/io_base.py", line 81, in _process_lines
self._process_record_message(line_dict)
File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/target_base.py", line 308, in _process_record_message
sink._validate_and_parse(transformed_record)
File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/singer_sdk/sinks/core.py", line 302, in _validate_and_parse
self._validator.validate(record)
File "/Users/radbrt/miniconda3/envs/target-oracle/lib/python3.10/site-packages/jsonschema/validators.py", line 310, in validate
raise error
jsonschema.exceptions.ValidationError: '2' is not of type 'null', 'integer'
Failed validating 'type' in schema['properties']['class']:
{'type': ['null', 'integer']}
On instance['class']:
'2'
I hope to add functionality, and a setting tentatively named types_dont_fail_me_now: true
, that allows the target to alter the column type of the offending column.
The pragmatic approach would be to alter the column to varchar (possibly via add column/drop column), and coerce the input, but we could conceivably make a more complex ruleset of int -> bigint -> number -> varchar -> clob.
The other half of the solution must be that during the validation of the target schema (typically during subsequent runs), so that when it finds varchar in the target column it will accept the target type and keep coercing the integer (or whatever) to varchar.
Type checking brings a performance hit, and many users might prefer if the target fails when something is wrong (so it must be a configurable setting). And this would not be feasible to do during batch loads.
I have run into the JSON Validation errors with a couple of data type. @edgarrmondragon tracked down the root cause of the issue and created a draft PR that contains a possible way for developers to control the JSON Validation settings. Here is the link to the PR just in case it is helpful for this conversation.
feat: Support custom JSON schema validation and string format checkers in targets
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen
label, or request that it be added.
Still relevant right @edgarrmondragon ?
@tayloramurphy yeah this'd be nice to have
This would be a simple-to-use option for end users who are trying to deal with "rogue" or incorrect type declarations in the upstream tap.
For instance, if the tap incorrectly defines one of its fields as an integer, but it receives a string, then we could give the user an option of auto-expanding the data type to be inclusive of the declared type and also the observed type. Since a string column can hold integers as well as strings, expanding the data type to a string type will allow the load to complete successfully.
Implementation wise, if built within the tap or mapper layer, this normally would result in a new
SCHEMA
message being emitted upon observance of a record that does not fit the declared schema. However, if built in the target, there's no need to emit a SCHEMA message. Instead, theSink
class per batch would expand data type negotiation to be inclusive of (1) declared type, (2) target column's already existing type, and (2) observed data type in the records. Currently this negotiation exists but it only considers the first two factors.Note:
cc @radbrt