Open MeltyBot opened 3 years ago
@pnadolny13 - FYI, I updated the description above to include a proposed spec and I've add the Accepting Pull Requests
label, and I've added to our Office Hours 'Up Next' list.
Related to:
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen
label, or request that it be added.
Still relevant
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen
label, or request that it be added.
I see the 👎 in the bot's post so I'll reopen 😅 . Feel free to comment with how this would benefit your workflow or even better, some implementation ideas :)
Possible Spec
Taps can accept a new
deduplication_mode
config option (or similar), which allows a few possible options:deduplication_mode=DEFAULT
- (Default.) Same as today, records do not pass through a dedupe handler.deduplication_mode=INCREMENTAL
- All incremental streams will be deduped. This is the most efficient balance, because only 'ties' need to be tracked.deduplication_mode=FULL_TABLE
- All streams, including incremental and full table streams will be deduped. This is the most expensive option, because all record keys must be hashed. This will break many taps if the volume is large, but it could work extremely well in cases where the number of records is <10K, such as spreadsheet applications or message delivery applications.Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/162
Originally created by @aaronsteers on 2021-07-06 18:51:05
Background / Discussion
As part of the Singer spec, and also as a best practice to avoid records being skipped, taps are implemented with a
greater-than-or-equal-to
comparison against the replication key value.The "or equal to" part of the comparison is confusing to new users but the reason for it is important: to ensure record "ties" are never omitted from the final target.
Existing mitigations
The current solution for this problem is (1) use primary key upserts on the target side, which will naturally solve the duplication problem or (2) use a solution like dbt to remove duplicates downstream.
Proposed improvement
record_hashes_seen
property (or similar) as a list of record hashes having the same value as the max replication key value.record_hashes_seen
, we can omit that record and not send it downstream.record_hashes_seen
. This would not be a cumulative list of all records, only the latest ties by replication key.Since the SDK entirely handles the State implementation for SDK-based taps, we have an opportunity to build this as a more robust solution across all taps using the SDK.
Best reasons not to build
The reasons not to build this are (1) performance, (2) complexity, and (3) scalability.
The complexity argument can be mitigated by the fact that these are all aspects managed entirely in the SDK, so developers and users can in general completely ignore these internal state treatments.
Regarding performance, the hashing of a record should be able to be performed very quickly, and then it is just a matter of tuning the caching and variable comparison logic. This should be tunable to reach satisfactory performance, but if not, we could also mitigate by enabling as an optional setting, such as
dedupe_incremental_streams
(bool), or similar.Regarding scalability, this solution should scale fine assuming a small number "ties" by replication key value. If the number of ties is in the thousands or larger, however, this could have adverse affects on the stability of STATE messages. An option to disable the behavior via settings (as described in the previous paragraph, could prove useful for this as well. That said, presumably the larger the number of ties (within reason), the higher the value of this deduplication capability.
Regarding adherance to Singer Spec
To my knowledge, this implementation would still adhere to the spec since (1) the STATE behavior is entirely up to the tap to control, (2) we still accomplish
>=
logic for replication keys, (3) we still guarantee that every record will be sent at least once.