mozilla / glam

Mozilla's primary interactive dashboard for examining the distribution of telemetry values.
https://glam.telemetry.mozilla.org
Mozilla Public License 2.0

`startup_timeline_framework_start` probe is missing data #1141

Closed mcomella closed 8 months ago

mcomella commented 3 years ago

The GLAM page for the startup_timeline_framework_start probe – https://glam.telemetry.mozilla.org/fenix/probe/startup_timeline_framework_start/explore?ping_type=*&ref= – displays "404: No data found for the selected dimensions." However, this probe should have data – I wrote queries for it 3 months ago.

mcomella commented 3 years ago

If it's helpful, there are several other probes that arrive in the same ping:

acmiyaguchi commented 3 years ago

Thank you for filing this! It turns out that the ETL is missing timestamps during the initial stages. I may be able to turn this on within the next week, but backfilling these aggregates is a challenge (at least on release). I also filed https://github.com/mozilla/bigquery-etl/issues/1729 in bigquery-etl.

acmiyaguchi commented 3 years ago

I've dug into this even deeper -- the primary reason that data within the startup-timeline ping is missing is that the ping does not have a client id. This causes issues with aggregation because GLAM assumes that every metric is attached to a client id in order to perform client-level normalization before aggregating. Rows without client ids are currently being filtered out.

**A closer look at the generated CTE**

Here's the initial SQL for extracting data from the Fenix tables. Note the `WHERE` predicate filters on client ids. Therefore, we will always be missing metrics for the startup-timeline.

```sql
-- Query generated by: python3 -m bigquery_etl.glam.clients_daily_scalar_aggregates --source-table org_mozilla_fenix_stable.startup_timeline_v1
WITH extracted AS (
  SELECT
    *,
    DATE(submission_timestamp) AS submission_date,
    client_info.client_id,
    "startup-timeline" AS ping_type,
    COALESCE(
      SAFE_CAST(SPLIT(client_info.app_display_version, '.')[OFFSET(0)] AS INT64),
      0
    ) AS app_version,
    client_info.os AS os,
    client_info.app_build AS app_build_id,
    client_info.app_channel AS channel
  FROM
    `moz-fx-data-shared-prod.org_mozilla_fenix_stable.startup_timeline_v1`
  WHERE
    DATE(submission_timestamp) = @submission_date
    AND client_info.client_id IS NOT NULL
),
...
```

Some options, roughly in order of effort (see the sketch after this list):

- One hack would be to assign client-id-less pings a default value so they are all aggregated into a single row.
- Another would be to use the document id as the client id.
- A better solution would be to assign a random client id drawn from a fixed set.
- The best solution would be to enforce a client id for consistent aggregates (although there is an argument to be made for lean data practices and not collecting a client id).
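To make the first option concrete, here's a minimal sketch of how the extraction CTE above could stop dropping these rows. It assumes the same source table as the generated query; the `"missing-client-id"` constant is a hypothetical placeholder, not anything the real bigquery-etl code uses.

```sql
-- Purely illustrative: the "default value" hack applied to the extraction CTE.
SELECT
  DATE(submission_timestamp) AS submission_date,
  -- Collapse all pings without a client id onto a single synthetic client,
  -- instead of dropping them in the WHERE clause.
  COALESCE(client_info.client_id, "missing-client-id") AS client_id,
  "startup-timeline" AS ping_type
FROM
  `moz-fx-data-shared-prod.org_mozilla_fenix_stable.startup_timeline_v1`
WHERE
  DATE(submission_timestamp) = @submission_date
```

The obvious downside is that every such ping lands on a single "client", which is what the implications below are about.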

**Implications of imputing a client id**

First, we'd want to ensure that we'd still get statistically correct values. We effectively perform [additive smoothing](https://en.wikipedia.org/wiki/Additive_smoothing) on bucket densities, which helps us mix contributions across many clients. With a single client, this might change the distribution: at some point the bucket value will be [`(P1+1/K) / (nreporting+1)`, or `(P1 + 1)/2` for a bucket](https://github.com/mozilla/bigquery-etl/blob/ca3c3c87f1425cc297b46c414e7a3fa033e7f286/sql/mozfun/glam/histogram_fill_buckets_dirichlet/udf.sql#L12-L15). I think this means the distribution is evenly flattened (i.e. it assumes a uniform prior). I'm getting a bit out of my depth here, but this could skew the results.

In terms of ETL, we keep metrics in an array of structs, with one row per client. Collapsing everything onto a single client would mean that each day there would be one client with approximately 1e8 elements in its array, and I'm worried that the ARRAY_AGG UDFs would fail with complexity errors.

If we instead used a new client id for every document, we would run into a different set of problems. We keep a table that accumulates metrics over the last 3 versions of each client's history; it ends up being around the size of `count(distinct client_id)` over a period of 3 months or so. If every document id became a new client id, the accumulated table would have around 1e10 rows (~1e8 * 90 days), and scanning it every day would be incredibly expensive. For reference, 1 day of nightly scalar aggregates seems to be about 100 GB from recent testing, and BigQuery charges $5/TB scanned.

The middle ground would be to assign document ids into smaller random groups (say 1e6 "clients") via a hash, e.g. `crc32(document_id) % pow(10, 6)` (sketched below). The table would then always have a fixed number of rows. The smoothing should have less of an effect since we'd have a sizable number of "clients" in the population, and each randomly assigned group gets equal representation in the final aggregate. The rows should have a reasonable array length (100-1000 elements to aggregate per day), and the history stays consistent because the pseudo-client identifiers are stable.
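A rough sketch of that middle ground, with caveats: BigQuery has no built-in `crc32`, so this uses `FARM_FINGERPRINT` as a stand-in, and the 1e6 bucket count is just the example figure from above, not a decided value.

```sql
-- Sketch of the "middle ground": hash each document id into one of ~1e6
-- pseudo-clients so the number of distinct "clients" stays fixed.
-- FARM_FINGERPRINT stands in for crc32; modulus and columns are illustrative.
SELECT
  DATE(submission_timestamp) AS submission_date,
  COALESCE(
    client_info.client_id,
    CAST(ABS(MOD(FARM_FINGERPRINT(document_id), 1000000)) AS STRING)
  ) AS client_id,
  "startup-timeline" AS ping_type
FROM
  `moz-fx-data-shared-prod.org_mozilla_fenix_stable.startup_timeline_v1`
WHERE
  DATE(submission_timestamp) = @submission_date
```

Each pseudo-client then accumulates a bounded number of documents per day, which keeps the per-row array lengths and the accumulated history table roughly constant in size.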

@mcomella do you have context on why the startup-timeline ping doesn't include client ids?

acmiyaguchi commented 3 years ago

I'm reading through some of the documentation. It looks like this ping is generally meant for automation, but it also looks like we're receiving quite a few of these pings in practice.

mcomella commented 3 years ago

> @mcomella do you have context on why the startup-timeline ping doesn't include client ids?

@acmiyaguchi Probably because I didn't know any better 😓 I can't remember exactly, but I'd speculate that I didn't think I'd need it to analyze the data in this ping. Thinking about it now, though, it's useful for answering questions like, "Can all the outliers be caused by an extremely small proportion of users?"

> it looks like this ping is generally meant for automation,

For performance, we want to be able to extract our timing in CI and in user telemetry using the same logic, to ensure they line up, so this ping is actually intended for both use cases (though in practice we haven't added any automation for this one). Sorry if the docs are unclear.

mcomella commented 3 years ago

We discussed that this might take a little while to fix, so I'll change the probe implementation to get the data sooner and solve the problem with the lowest-effort solution: https://github.com/mozilla-mobile/fenix/issues/17972