snowplow / data-models

⚠️ MAINTENANCE-ONLY MODE: Snowplow maintained SQL data models for working with Snowplow web and mobile behavioral data.
https://docs.snowplowanalytics.com/docs/modeling-your-data/
41 stars 12 forks source link

Break Ties when joining on timestamps #106

Open colmsnowplow opened 3 years ago

colmsnowplow commented 3 years ago

An edge case has come up where two things conspired to produce duplicates in the model:

  1. Some users' data have different session_ids at the same time (we think because of a race condition between tabs)
  2. Latency between device and collector results in exactly the same derived_tstamp for these events

These two factors mean that a user has two sessions with exactly the same start_tstamp, but different domain_sessionid's - and these happen to be the first tstamps for the user.

This produces duplicates in the users table when we join on start_tstamp: https://github.com/snowplow/data-models/blob/a38e76b429081cb517f91867ff0206a2f701c4a2/web/v1/redshift/sql-runner/sql/standard/04-users/01-main/06-users.sql#L86

The same issue may exist when we join on end_tstamp for aggregates.

This seems very rare, but we should introduce some means of breaking a tie in the case where derived_tstamps happen to evaluate to exactly the same thing.

event 1:
    "collector_tstamp": "2021-06-25 16:51:55.559 UTC",
    "dvce_sent_tstamp": "2021-06-25 16:51:54.919 UTC",
    "dvce_created_tstamp": "2021-06-25 16:51:54.911 UTC",
    "derived_tstamp": "2021-06-25 16:51:55.551 UTC",

event 2:
    "collector_tstamp": "2021-06-25 16:51:55.557 UTC",
    "dvce_sent_tstamp": "2021-06-25 16:51:55.077 UTC",
    "dvce_created_tstamp": "2021-06-25 16:51:55.071 UTC",
    "derived_tstamp": "2021-06-25 16:51:55.551 UTC",

(first reported on ZD ticket 27522)

awoehrl commented 3 years ago

Hi,

just to add another data point: I just ran the model on 100 days of our data and this happened with two user_ids. Both have two session ids, but exactly the same start_time.

Cheers

bill-warner commented 3 years ago

Raised the same issue on the dataform model. I have included a proposed solution there which I think we should also be able to use on the SQL runner version.

https://github.com/snowplow-incubator/dataform-data-models/issues/10