snowplow / snowplow-javascript-tracker

Snowplow event tracker for client-side and server-side JavaScript. Add analytics to your websites, web apps and servers.
http://snowplowanalytics.com
BSD 3-Clause "New" or "Revised" License
543 stars 220 forks source link

Investigate situations where the domain_sessionid changes but the combination of domain_sessionidx and domain_userid does not #451

Open yalisassoon opened 8 years ago

yalisassoon commented 8 years ago

We've seen this happen on a number of different data (each time occasionally). It would be good to understand why.

bogaert commented 8 years ago

I ran a first test:

WITH duid_blacklist AS ( -- blacklist domain_userid with duplicates
  SELECT DISTINCT domain_userid FROM atomic.events
  WHERE event_id IN (SELECT event_id FROM (SELECT event_id, COUNT(*) FROM atomic.events WHERE v_tracker LIKE 'js-2.6.0%' GROUP BY 1) WHERE count > 1)
)

SELECT comparison, COUNT(*) FROM (
  SELECT
    domain_userid,
    COUNT(DISTINCT domain_sessionidx) AS idx,
    COUNT(DISTINCT domain_sessionid) AS id,
    COUNT(DISTINCT domain_sessionidx) = COUNT(DISTINCT domain_sessionid) AS comparison
  FROM atomic.events
  WHERE v_tracker LIKE 'js-2.6.0%' -- restrict to a single tracker
  AND domain_userid NOT IN (SELECT domain_userid FROM duid_blacklist)
  GROUP BY 1
) GROUP BY 1;

First results show a difference between domain_sessionidx and domain_sessionid in about 0.1-0.3% of sessions.

bogaert commented 8 years ago

I updated the SQL:


WITH duid_blacklist AS ( -- blacklist domain_userid with duplicates
  SELECT DISTINCT domain_userid FROM atomic.events
  WHERE event_id IN (SELECT event_id FROM (SELECT event_id, COUNT(*) FROM atomic.events WHERE v_tracker LIKE 'js-2.6.0%' GROUP BY 1) WHERE count > 1)
)

SELECT c1, c2, COUNT(*) FROM (
  SELECT
    domain_userid,
    COUNT(DISTINCT domain_sessionidx) AS idx,
    COUNT(DISTINCT domain_sessionid) AS id,
    COUNT(DISTINCT domain_sessionidx) > COUNT(DISTINCT domain_sessionid) AS c1,
    COUNT(DISTINCT domain_sessionidx) < COUNT(DISTINCT domain_sessionid) AS c2
  FROM atomic.events
  WHERE v_tracker LIKE 'js-2.6.0%' -- restrict to a single tracker
  AND domain_userid NOT IN (SELECT domain_userid FROM duid_blacklist)
  GROUP BY 1
) GROUP BY 1,2;

The most common case appears to be domain_sessionid changes even though domain_sessionidx doesn't increase. The reverse also happens but not as often (2-3 orders of magnitude less frequent).

bogaert commented 8 years ago

A last note: it's sometimes more frequent than the 0.1-0.3% mentioned before (one dataset had 1-3% of sessions affected).

fblundun commented 8 years ago

One possible way this can happen even when cookies appear to be enabled:

A user navigates to the page They open another tab in the same domain They disable cookies (this isn't reported by the tracker due to #468) They leave for a while They come back to the first tab, starting a new session and incrementing that tab's session count They go to the second tab, starting a new session in that tab too. The tabs will have the same session count but different session IDs (since they cannot communicate by cookies).

This does require a user to turn off 1st-party cookies whilst browsing the site, which does seem too unlikely to account for 1-3% of sessions.

fblundun commented 8 years ago

Pushing back...

yalisassoon commented 8 years ago

Some interesting additional info from one of our users here: https://snowplow.zendesk.com/agent/tickets/2436

They found cases where the domain_sessionidx would suddenly go down e.g. from 30 to 10. When this happens the domain_sessionid changes (so it registers a new sesssion) but it you don't get a new unique domain_userid || domain_sessionidx combination.

@bogaert was going to see if we see a similar pattern in other Snowplow data sets.

bogaert commented 8 years ago

We have seen this happen before.

chuwy commented 7 years ago

Pushing back.