meltano / sdk

Write 70% less code by using the SDK to build custom extractors and loaders that adhere to the Singer standard: https://sdk.meltano.com
https://sdk.meltano.com
Apache License 2.0
94 stars 68 forks source link

bug: String `format` is not validated in configuration #1451

Open edgarrmondragon opened 1 year ago

edgarrmondragon commented 1 year ago

Singer SDK Version

0.21.0

Python Version

3.10

Bug scope

Configuration (settings parsing, validation, etc.)

Operating System

Any

Description

A config schema like

# samples/config_format/tap.py

from singer_sdk import Tap

class MyTap(Tap):
    name = "my-tap"
    config_jsonschema = {
        "type": "object",
        "properties": {
            "start_date": {
                "type": "string",
                "format": "date-time",
                "description": "The date and time to start syncing data from.",
            },
        },
    }

    def discover_streams(self):
        return []

if __name__ == "__main__":
    MyTap().cli()

doesn't result in a validation error when passed a string in an incorrect format

$ MY_TAP_START_DATE='notadate' poetry run python samples/config_format/tap.py --config ENV
2023-02-22 18:28:02,960 | INFO     | my-tap               | my-tap v[could not be detected], Meltano SDK v0.21.0
2023-02-22 18:28:02,960 | INFO     | my-tap               | Parsing env var for settings config...
2023-02-22 18:28:02,961 | INFO     | singer_sdk.configuration._dict_config | Parsing 'start_date' config from env variable 'MY_TAP_START_DATE'.
2023-02-22 18:28:02,961 | INFO     | my-tap               | Config validation passed with 0 warnings.

That means that the bad string is caught too late, when the start_date needs to be parsed for incremental replication.

Backwards compatibility

We could just enable format checks by updating the validator instance in

https://github.com/meltano/sdk/blob/774a17664a139b966de5c80ed69aab2c161e2752/singer_sdk/plugin_base.py#L237-L238

The problem with doing that, even if we don't add the format extras, is that we'd break for example any users passing 2021-01-01 as an instance of date-time.

One option would be to make singer_sdk.typing.DateTimeType be dumped as a union:

{
  "anyOf": [
    {"type": "string", "format": "date-time"},
    {"type": "string", "format": "date"}
  ]
}

Code

No response

aaronsteers commented 1 year ago

@edgarrmondragon - I hand't really thought through this before you logged here. Regarding failing early on config inputs, I'm 100% in agreement that this makes sense. A few caveats below on implementation challenges and backwards compatibility.

  1. I confirmed that date-time format fails validation of date-only values here: https://www.jsonschemavalidator.net/s/NpWURgml
  2. That said, many/most targets can (and do) ingest date-only values into date-time typed properties and nodes. Some simple targets may also have difficult parsing anyOf when constructing the target table.
  3. I'd be inclined to add a new DateOrDateTypeType type for use in applications like the canonical start_date, possibly defined using a new AnyOfType helper, as in Property("start_date", type=AnyOfType(DateType, DataTypeType, required=false).
    • This approach has a benefit of adding the desired "config validation" behavior (with convenient shorthand), while not breaking any sync behaviors that may be working fine today.
  4. I'm also in favor of always validating format constraints of config, just not for record data.

Regarding config validation versus record

If a tap is upgraded to the latest version of the SDK, the added check could cause certain config inputs to fail - but this would happen presumably at a time when the user is upgrading the tap or target on their side. Whether expecting to have to retest or not, a user could update 2022-01-01 to 2022-01-01T00:00:00 or similar.

Because our config declarations do not need to be instantiated by the target as columns, I don't worry about anyOf being handled by targets.

Re: start_date config specifically

If we're focusing just about the start_date config input, I think the best path is to allow either format; I think many Singer getting started guides even give samples with date-only sample values - and it would not be good for these examples to fail.

edgarrmondragon commented 1 year ago

2. That said, many/most targets can (and do) ingest date-only values into date-time typed properties and nodes. Some simple targets may also have difficult parsing anyOf when constructing the target table.

@aaronsteers Agreed. Dumping DateTimeType as an anyOf would be problematic for most targets.


I'm also in favor of always validating format constraints of config, just not for record data.

Me too, but my only concern with that would be packages that neither pin a specific SDK version, nor have an upper bound set to a existing release. We also don't enforce that best practice in our templates:

https://github.com/meltano/sdk/blob/d3588829a92b417af7f44e800787a508a75000c7/cookiecutter/tap-template/%7B%7Bcookiecutter.tap_id%7D%7D/pyproject.toml#L24

(might be good to revisit https://gitlab.com/meltano/sdk/-/merge_requests/208)

In those cases, a simple meltano install could break taps at the config validation stage in the next sync.


If we're focusing just about the start_date config input, I think the best path is to allow either format; I think many Singer getting started guides even give samples with date-only sample values - and it would not be good for these examples to fail.

Yeah, that's what motivated my anyOf proposal for DateTimeType above but I had not considered that would cause issues when used in stream schemas. Do you think allowing both formats without forcing users to update to Property("start_date", type=AnyOfType(DateType, DataTypeType, required=false) is a requirement?

edgarrmondragon commented 1 year ago

See https://github.com/meltano/sdk/issues/1457 for a related issue in record data validation. It seems like we're already validating format:

https://github.com/meltano/sdk/blob/253851e005a92549684959f4f6150abc0617264b/singer_sdk/sinks/core.py#L83

stale[bot] commented 1 year ago

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

edgarrmondragon commented 1 year ago

Still relevant

stale[bot] commented 1 month ago

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.