Open colmsnowplow opened 3 years ago
Hi,
just to add another data point: I just ran the model on 100 days of our data and this happened with two user_ids. Both have two session ids, but exactly the same start_time.
Cheers
Raised the same issue on the dataform model. I have included a proposed solution there which I think we should also be able to use on the SQL runner version.
https://github.com/snowplow-incubator/dataform-data-models/issues/10
An edge case has come up where two things conspired to produce duplicates in the model:
These two factors mean that a user has two sessions with exactly the same start_tstamp, but different domain_sessionid's - and these happen to be the first tstamps for the user.
This produces duplicates in the users table when we join on start_tstamp: https://github.com/snowplow/data-models/blob/a38e76b429081cb517f91867ff0206a2f701c4a2/web/v1/redshift/sql-runner/sql/standard/04-users/01-main/06-users.sql#L86
The same issue may exist when we join on end_tstamp for aggregates.
This seems very rare, but we should introduce some means of breaking a tie in the case where derived_tstamps happen to evaluate to exactly the same thing.
(first reported on ZD ticket 27522)