In users_this_run we join together users_aggregates and users_sessions_this_run on start_tstamp. By joining on start_tstamp we attempt to pull info from the first session per user:
FROM {{.scratch_schema}}.users_aggregates{{.entropy}} AS b
INNER JOIN {{.scratch_schema}}.users_sessions_this_run{{.entropy}} AS a
ON a.domain_userid = b.domain_userid
AND a.start_tstamp = b.start_tstamp
There are rare cases however where a user can have multiple sessions with the same start_tstamp, which also happens to be their first session. This can result in duplicate domain_userids in the users_this_run table.
Proposed Fix
In 03-users-sessions-this-run step, join in start_tstamp from users_userids_this_run giving user_start_tstamp.
In 04-users-aggs step, add first_domain_sessionid. Use some arbitary selection if multiple first sessions with same start_tstamp i.e. max(case when start_tstamp = user_start_tstamp then domain_sessionid end) as first_domain_sessionid
In 06-users step, join a to b using first_domain_sessionid = domain_session_id, rather than start_tstamp to get first session details.
Issue
In
users_this_run
we join togetherusers_aggregates
andusers_sessions_this_run
onstart_tstamp
. By joining onstart_tstamp
we attempt to pull info from the first session per user:There are rare cases however where a user can have multiple sessions with the same
start_tstamp
, which also happens to be their first session. This can result in duplicatedomain_userid
s in theusers_this_run
table.Proposed Fix
03-users-sessions-this-run
step, join instart_tstamp
fromusers_userids_this_run
givinguser_start_tstamp
.04-users-aggs
step, addfirst_domain_sessionid
. Use some arbitary selection if multiple first sessions with samestart_tstamp
i.e.max(case when start_tstamp = user_start_tstamp then domain_sessionid end) as first_domain_sessionid
06-users
step, join a to b usingfirst_domain_sessionid
=domain_session_id
, rather thanstart_tstamp
to get first session details.