snowplow / dbt-snowplow-utils

Snowplow utility functions to be used in conjunction with the snowplow-web dbt package.

Add buffer to derived_tstamp partitioned bigquery filtering #173

Closed agnessnowplow closed 2 months ago

agnessnowplow commented 2 months ago

Description

BigQuery users who partition on derived_tstamp (snowplow__derived_tstamp_partitioned: true) need an additional buffer on the lower_limit filter that is applied to derived_tstamp when the base_events_this_run table is created. When events are sent late (e.g. dvce_created_tstamp and dvce_sent_tstamp differ significantly), the minimum and maximum limits of a given run can exclude some of the earlier events in a session. The session is then not reprocessed as a whole in a later run, as it should be, which causes a range of data issues.
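
To illustrate the problem, here is a minimal Python sketch of the filtering behaviour described above. It is not the package's SQL: the function name `filter_events`, the `buffer` parameter, and the sample timestamps are all hypothetical and exist only to show how a late-sent event can fall below the lower limit and how a buffer would bring it back in.

```python
from datetime import datetime, timedelta


def filter_events(events, lower_limit, upper_limit, buffer=timedelta(0)):
    """Keep events whose derived_tstamp lies in [lower_limit - buffer, upper_limit].

    Hypothetical stand-in for the derived_tstamp partition filter; the real
    filter is written in SQL inside the dbt package.
    """
    return [
        e for e in events
        if lower_limit - buffer <= e["derived_tstamp"] <= upper_limit
    ]


# Assumed run limits for this example.
lower = datetime(2023, 1, 2, 0, 0)
upper = datetime(2023, 1, 2, 6, 0)

# Two events from the same session. The second was sent late, so its
# derived_tstamp falls before the lower limit of the run.
events = [
    {"event_id": "on_time", "derived_tstamp": datetime(2023, 1, 2, 1, 0)},
    {"event_id": "sent_late", "derived_tstamp": datetime(2023, 1, 1, 22, 0)},
]

without_buffer = [e["event_id"] for e in filter_events(events, lower, upper)]
with_buffer = [
    e["event_id"]
    for e in filter_events(events, lower, upper, buffer=timedelta(hours=3))
]

print(without_buffer)
print(with_buffer)
```

Without the buffer only part of the session is selected; with a 3-hour buffer (an arbitrary value chosen for the example) both events are kept.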

agnessnowplow commented 2 months ago

Putting this back into draft, as keeping it as a default could add significant cost over time. Let's investigate alternative approaches. Given how rare the issue is, handling it as and when it arises may be the better option: reprocessing events with a larger lookback window should in theory unblock users (e.g. a one-off run with a 3-day lookback window, set via snowplow__lookback_window_hours).
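
As a rough sketch of what such a one-off run could look like, the Python snippet below only builds the dbt command line; it does not execute anything. The helper name `one_off_lookback_command` and the `snowplow_web` selector are assumptions made for illustration, and 3 days is simply converted to 72 hours for snowplow__lookback_window_hours.

```python
import json
import shlex


def one_off_lookback_command(days, package="snowplow_web"):
    """Build a dbt command that overrides the lookback window for one run.

    Hypothetical helper: the package selector is an assumption and should be
    adjusted to whatever is actually being run in the project.
    """
    hours = days * 24  # the variable is expressed in hours
    dbt_vars = {"snowplow__lookback_window_hours": hours}
    # dbt accepts a YAML/JSON string for --vars, so JSON is a safe encoding.
    return ["dbt", "run", "--select", package, "--vars", json.dumps(dbt_vars)]


cmd = one_off_lookback_command(3)
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Because the override is passed on the command line, the project's default lookback window stays unchanged for regular runs.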