meltano / sdk

Write 70% less code by using the SDK to build custom extractors and loaders that adhere to the Singer standard: https://sdk.meltano.com
https://sdk.meltano.com
Apache License 2.0
87 stars 63 forks source link

feat: check for and enforce RFC3339 `start_date` input format #1792

Open kgpayne opened 1 year ago

kgpayne commented 1 year ago

Discussed in https://github.com/meltano/sdk/discussions/855

Originally posted by **aaronsteers** July 23, 2022 I read "should" as a hard requirement and I have some strong opinions as of now about allowing integer datetime values for the standard `start_date` config input. I did not know we had any cases in the wild already where start_date was epoch/unix timestamp based and my preference is to unwind these or implement the custom fixes/overrides there in those taps. Again, because `start_date` format is actually dictated by the spec (according to my reading, at least), orchestrators and users would reasonably expect to be able to send a consistent stringified date value to all taps in a project. If we diverge on this point and encourage tap developers to diverge, then we break assumptions in a way that will create silent failures and buggy behavior for our end users. For instance, trying to do an alpha-based greater-than-or-less-than comparison between a stringified integer and a stringified datetime will not fail but it will more likely just produce a wrong result. Many non-SDK taps which we cannot control are simply running a naive string comparison, which would work for the suggested standard and fail silently when mixed and matched with an epoch/integer format. I'm not going to say tap developers can't go against the standard, but I think the SDK should keep with the safe prescriptive guidance here, along a path that provides the most consistent experience for users. _Sidebar: in a past life my own team's CI/CD environments calculated dynamic date values of "yesterday" or "48 hours ago" for faster EL tests and we sent the same input to _all_ taps in our project. I also have return a stack overflow answer prescribing the same and giving sample code to create an environment variable containing the correct string format using Linux or Mac. In theory, this is 100% safe according to spec but our solution would have broken for any cases where the expected input format differed across taps._ I think there's value in preserving ability to have a "project-wide" and Singer-wide filtering capability using a consistent convention - again for users and also for orchestrators. Last point, there may be intuitions to tightly couple the tap config interface to the upstream API's behavior and I would strongly discourage us from that approach. For one, it implies that the API behaviors are monolithic, where in fact each stream in a tap (or even future-added streams to the tap) is allowed to have a completely different internal data type expectation. By keeping with a standard recommendation on the config/UX side, we keep the external user interface the same regardless of internals of the upstream APIs. Secondly, the tap users (and downstream targets!) should be reasonably isolated from internal storage preferences of the distinct quirks and idioms of every upstream API. If something should arrive and be stored as a datetime, we should send it to the target as such. Lastly, to be clear, I'm not saying the standard stringified date format is better or more preferable to epoch or Unix time - but given that the choice is already made, and given there's a high cost of supporting different taps on different standards, I do not think we should accommodate or support a divergence in this area.
kgpayne commented 1 year ago

Related to #759 and #1734

edgarrmondragon commented 5 months ago

Might be indirectly addressed by #2132