mozilla / bigquery-etl

Bigquery ETL
https://mozilla.github.io/bigquery-etl
Mozilla Public License 2.0
251 stars 98 forks source link

Apparent inconsistency between GUD and clients_last_seen #327

Closed jmccrosky closed 4 years ago

jmccrosky commented 5 years ago

It seems that GUD disagrees with clients_last_seen in the number of new desktop profiles created on 2019-8-1 (and probably other dates).

I get 805522 from:

select
  COUNT(DISTINCT client_id)
from
  telemetry.clients_last_seen
where
  CAST(SAFE.LOG(days_created_profile_bits & -days_created_profile_bits, 2) AS INT64) = 6
  AND submission_date = DATE "2019-08-01"

But I get 852688 from GUD here.

It's entirely possible I'm making some mistake with my query, but was hoping someone could help investigate. Thanks!

jklukas commented 5 years ago

Looks like this is my mistake. I believe your 805522 number is what GUD should be displaying, but there's a logic error in https://github.com/mozilla/bigquery-etl/blob/99fe0dfd9e38b98edaad63f0ce3cf62b496bd09c/sql/telemetry/smoot_usage_desktop_raw_v1/query.sql#L72-L74

So what we're currently showing for "New Firefox Desktop Profile Created" is the number of clients who report a ping on the given day that lists that same day as the profile created date. We should instead be reporting this metric on a 6-day delay.

The same logic applies to nondesktop and fxa data, so we should make changes there too. Although, the difference may be significantly less pronounced there since I'm not aware we see the same desktop effect of spurious profile creation dates.

This will require a bit of reconfiguration in the queries and we'll need to do a full reprocessing of the smootusage* tables. Luckily, no change is needed for clients_last_seen, which would be much more expensive to recompute.

@jmccrosky - Any concerns with moving forward with this as soon as is convenient? Do you think this warrants communication to current users of GUD to let them know about the change?

jmccrosky commented 5 years ago

Actually, I think there's a problem with the numbers from that table. Given the expected "weekend effect", they seem to be off by a day. I confirmed that the expected behaviour is a drop on Saturdays and Sundays using my prototype (https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/170847/).

Can someone look for a date "off by one" error in the ETL that generates new profile data for clients_last_seen?

jklukas commented 5 years ago

I see the expected weekend effect if I look at counts where (days_since_created_profile = 0):

SELECT
  submission_date,
  EXTRACT(DAYOFWEEK
  FROM
    submission_date) IN (1,
    7) AS is_weekend,
  COUNT(*) AS c
FROM
  `moz-fx-data-derived-datasets.telemetry.clients_last_seen_v1`
WHERE
  days_since_created_profile = 0
  AND submission_date >= '2019-08-01'
GROUP BY
  1
ORDER BY
  1
Row submission_date is_weekend c  
1 2019-08-01 false 852688  
2 2019-08-02 false 773493  
3 2019-08-03 true 534910  
4 2019-08-04 true 463680  
5 2019-08-05 false 851325  
6 2019-08-06 false 862108  
7 2019-08-07 false 851597  
8 2019-08-08 false 839969  
9 2019-08-09 false 776402  
10 2019-08-10 true 544727  
11 2019-08-11 true 455472  
12 2019-08-12 false 826935  
13 2019-08-13 false 870094  
14 2019-08-14 false 872843  
15 2019-08-15 false 812455  
16 2019-08-16 false 828779  
17 2019-08-17 true 569093  
18 2019-08-18 true 501485  
19 2019-08-19 false 972351  
20 2019-08-20 false 982604  
21 2019-08-21 false 973400  
22 2019-08-22 false 950983  
23 2019-08-23 false 866536  
24 2019-08-24 true 555759  
25 2019-08-25 true 495164  
26 2019-08-26 false 1018870  
27 2019-08-27 false 1037675  
28 2019-08-28 false 1018632  
29 2019-08-29 false 1005000  
30 2019-08-30 false 920210  
31 2019-08-31 true 574771  
32 2019-09-01 true 517167  
33 2019-09-02 false 940369  
34 2019-09-03 false 1126128  
35 2019-09-04 false 1063404
jklukas commented 5 years ago

Similarly, if I use days_since_created_profile = 6 and shift backwards by 6 days, I get the expected weekend effect:

WITH
  shifted AS (
  SELECT
    DATE_SUB(submission_date, INTERVAL 6 day) AS submission_date
  FROM
    `moz-fx-data-derived-datasets.telemetry.clients_last_seen_v1`
  WHERE
    days_since_created_profile = 6 )
SELECT
  submission_date,
  EXTRACT(DAYOFWEEK
  FROM
    submission_date) IN (1,
    7) AS is_weekend,
  COUNT(*) AS c
FROM
  shifted
WHERE
  submission_date >= '2019-08-01'
GROUP BY
  1
ORDER BY
  1
Row submission_date is_weekend c  
1 2019-08-01 false 985825  
2 2019-08-02 false 892575  
3 2019-08-03 true 617568  
4 2019-08-04 true 533760  
5 2019-08-05 false 987805  
6 2019-08-06 false 995771  
7 2019-08-07 false 983021  
8 2019-08-08 false 965753  
9 2019-08-09 false 892578  
10 2019-08-10 true 643562  
11 2019-08-11 true 520666  
12 2019-08-12 false 952138  
13 2019-08-13 false 1001592  
14 2019-08-14 false 1003012  
15 2019-08-15 false 934988  
16 2019-08-16 false 952240  
17 2019-08-17 true 653248  
18 2019-08-18 true 575312  
19 2019-08-19 false 1125524  
20 2019-08-20 false 1132724  
21 2019-08-21 false 1121855  
22 2019-08-22 false 1093120  
23 2019-08-23 false 993706  
24 2019-08-24 true 639000  
25 2019-08-25 true 568050  
26 2019-08-26 false 1175914  
27 2019-08-27 false 1190309  
28 2019-08-28 false 1166838  
29 2019-08-29 false 1150169
jklukas commented 5 years ago

So I think the day of week shifting problem is likely that you're looking at days_since_created_profile = 6, but you aren't shifting back to have the displayed date represent the day on which the profile was created.