snowplow-archive / dataform-data-models

Snowplow Incubator project for Dataform SQL data models for working with Snowplow data. Supports BigQuery only.
Apache License 2.0

Solve for multiple 'first sessions' with the same start_tstamp per user #10

Open bill-warner opened 3 years ago

bill-warner commented 3 years ago

Issue

In users_this_run we join users_aggregates and users_sessions_this_run on domain_userid and start_tstamp. Joining on start_tstamp is meant to pull info from each user's first session:

  FROM {{.scratch_schema}}.users_aggregates{{.entropy}} AS b

  INNER JOIN {{.scratch_schema}}.users_sessions_this_run{{.entropy}} AS a
    ON a.domain_userid = b.domain_userid
    AND a.start_tstamp = b.start_tstamp

In rare cases, however, a user can have multiple sessions that share the same start_tstamp, and that timestamp is also their earliest. Each of those sessions then matches the join, which can result in duplicate domain_userids in the users_this_run table.
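
The duplication can be reproduced on toy data. The sketch below is illustrative only: it uses SQLite in place of BigQuery, drops the {{.scratch_schema}} and {{.entropy}} templating, and assumes that users_aggregates holds the minimum start_tstamp per domain_userid. The domain_sessionid column and all sample values are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for BigQuery; table names mirror the issue,
# but the schemas and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users_sessions_this_run (
  domain_userid    TEXT,
  domain_sessionid TEXT,   -- hypothetical column, used to tell sessions apart
  start_tstamp     TEXT
);
CREATE TABLE users_aggregates (
  domain_userid TEXT,
  start_tstamp  TEXT       -- assumed: earliest session start per user
);

-- u1 has two sessions sharing the earliest start_tstamp; u2 has one session.
INSERT INTO users_sessions_this_run VALUES
  ('u1', 's1', '2021-01-01 00:00:00'),
  ('u1', 's2', '2021-01-01 00:00:00'),
  ('u1', 's3', '2021-01-02 00:00:00'),
  ('u2', 's4', '2021-01-01 12:00:00');

INSERT INTO users_aggregates
SELECT domain_userid, MIN(start_tstamp)
FROM users_sessions_this_run
GROUP BY domain_userid;
""")

# Same join condition as in users_this_run, counting output rows per user.
rows = conn.execute("""
SELECT a.domain_userid, COUNT(*) AS n_rows
FROM users_aggregates AS b
INNER JOIN users_sessions_this_run AS a
  ON a.domain_userid = b.domain_userid
  AND a.start_tstamp = b.start_tstamp
GROUP BY a.domain_userid
ORDER BY a.domain_userid
""").fetchall()

print(rows)  # [('u1', 2), ('u2', 1)]
```

Here u1 appears twice in the join output because both s1 and s2 match the earliest start_tstamp, while u2 appears once as intended.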

Proposed Fix