mozilla / glam

Mozilla's primary interactive dashboard for examining the distribution of telemetry values.
https://glam.telemetry.mozilla.org
Mozilla Public License 2.0

FoG client count is much lower than expected #2003

Closed Iinh closed 1 year ago

Iinh commented 2 years ago

The client count for FOG metrics does not look right.

For example:

[Screenshot: Screen Shot 2022-06-01 at 1:49:33 PM]

On May 30, nightly channel, wr.rasterize_glyphs_time only recorded 63 clients. This number is way lower than what we would expect, which should be in the 20k - 30k range as seen in this query.

The telemetry counterpart for this metric, wr_rasterize_glyphs_time, records ~10k clients. As confirmed by @chutten, while metrics ping DAU is expected to be lower, it should not be that much lower (~10k vs 63):

> DAU by "metrics" ping is, on release, expected to be 12-20% lower than DAU by "main" ping. DAU by "baseline" ping is, on release, expected to be higher (except on Mac, especially on weekends) than DAU by "main" ping.

Further investigation is needed. At the moment I suspect that something is off with the FOG ETL.

edugfilho commented 2 years ago

It looks like here is where the ETL calculates `total_users` for FOG and here is where it does so for telemetry.

In principle, looking at these, they don't seem to be doing anything different, but I'm also not the best person to look at it. And since Arkadiusz is on PTO, @relud would you mind taking a look at this, please?

relud commented 2 years ago

The dates being referenced here are not the dates the probe was submitted. They are the app_build_id of the builds that submitted the probe.

relud commented 2 years ago

> On May 30, nightly channel, wr.rasterize_glyphs_time only recorded 63 clients. This number is way lower than what we would expect, which should be in the 20k - 30k range as seen in [this query](https://sql.telemetry.mozilla.org/queries/86234/source#213540).

It is more accurate to say that app_build_id 2022053009 only recorded 63 clients on the most recent run date of 2022-05-31.
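To make the distinction concrete, here is a minimal sketch, assuming the first eight digits of the build ID encode the build date as YYYYMMDD. The helper name `build_date_from_build_id` is hypothetical and not part of GLAM or the ETL.

```python
from datetime import date, datetime


def build_date_from_build_id(app_build_id: str) -> date:
    """Return the date encoded in the first 8 digits (YYYYMMDD) of a build ID.

    Assumption: the build ID starts with the build date, as in 2022053009.
    """
    return datetime.strptime(app_build_id[:8], "%Y%m%d").date()


build_id = "2022053009"       # value shown on the GLAM x-axis
run_date = date(2022, 5, 31)  # most recent ETL run date mentioned above

build_date = build_date_from_build_id(build_id)
days_since_build = (run_date - build_date).days

# The x-axis value identifies a build, so the client count is
# "clients seen on this build so far", not "clients active on that day".
print(build_date.isoformat(), days_since_build)
```

Under that assumption, the point labeled May 30 covers only the clients that had reported from that one build as of the run date, which is a different population from the daily active clients counted in the STMO query.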

relud commented 2 years ago

Changing aggregation level from Build ID to Major Version causes all sorts of weirdness because the page doesn't correctly filter the inputs to WHERE app_build_id = "*", so there are multiple results per major version and it seems to be randomly selecting a value.
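A rough sketch of the filtering described here, using made-up rows. The row layout and the `rows_for_major_version` helper are assumptions for illustration only, not the actual GLAM schema or page code.

```python
# Illustrative rows only: the counts and version number are invented.
rows = [
    {"major_version": 100, "app_build_id": "2022053009", "total_users": 63},
    {"major_version": 100, "app_build_id": "2022053109", "total_users": 70},
    # Assumed rollup row for the whole major version, marked with "*".
    {"major_version": 100, "app_build_id": "*", "total_users": 25000},
]


def rows_for_major_version(rows, major_version):
    """Keep only the rollup row (app_build_id == "*") for a major version."""
    return [
        row
        for row in rows
        if row["major_version"] == major_version and row["app_build_id"] == "*"
    ]


unfiltered = [row for row in rows if row["major_version"] == 100]
filtered = rows_for_major_version(rows, 100)

# Without the "*" filter there are several candidates per major version,
# so whichever row the page happens to pick determines the displayed count.
print(len(unfiltered), len(filtered))
```

With the filter in place there is exactly one row per major version, so the displayed value no longer depends on which per-build row happens to be selected.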

Iinh commented 2 years ago

> Changing aggregation level from Build ID to Major Version causes all sorts of weirdness because the page doesn't correctly filter the inputs to WHERE app_build_id = "*", so there are multiple results per major version and it seems to be randomly selecting a value.

I'll look into it.

relud commented 2 years ago

I believe the problem here is that bigquery_etl/glam/templates/clients_histogram_aggregates_v1.sql doesn't correctly handle NULL when merging histogram_aggregates, causing histogram_aggregates to be emptied any time a client is not continuously active. https://github.com/mozilla/bigquery-etl/pull/3006
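To illustrate the failure mode in plain terms, here is a small Python analogy. It is not the SQL from the template or from the PR. The function names and the dict-based histogram shape are assumptions made for the example.

```python
from typing import Dict, Optional

Histogram = Dict[str, int]  # assumed shape: bucket label -> count


def add_histograms(a: Histogram, b: Histogram) -> Histogram:
    """Sum two histograms bucket by bucket."""
    merged = dict(a)
    for bucket, count in b.items():
        merged[bucket] = merged.get(bucket, 0) + count
    return merged


def merge_null_propagating(
    previous: Optional[Histogram], today: Optional[Histogram]
) -> Optional[Histogram]:
    """Mimic NULL propagation: a missing input wipes out the result."""
    if previous is None or today is None:
        return None
    return add_histograms(previous, today)


def merge_null_safe(
    previous: Optional[Histogram], today: Optional[Histogram]
) -> Histogram:
    """Treat a missing input as an empty histogram instead."""
    return add_histograms(previous or {}, today or {})


previous = {"1": 4, "2": 1}  # aggregates accumulated so far for one client
today = None                 # the client sent nothing on this run date

lost = merge_null_propagating(previous, today)
kept = merge_null_safe(previous, today)
print(lost, kept)
```

In this analogy a single inactive day erases everything accumulated for the client in the NULL-propagating version, while the NULL-safe version carries the earlier aggregates forward. That matches the symptom described above, where only continuously active clients keep their histograms.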

relud commented 2 years ago

I'm working on backfilling this using the backfill-slots project.

relud commented 2 years ago

I'm still working on backfilling all impacted tables, but the backfill for firefox_desktop nightly is complete.

@Iinh this page should be correct now: wr.rasterize_glyphs_time

relud commented 2 years ago

I completed the backfill, but clients volume still seems off for release.

Iinh commented 2 years ago

> I completed the backfill, but clients volume still seems off for release.

I did notice that the client count is really high for release, compared to nightly and beta (millions vs thousands). Querying in STMO shows that the high client count for release we see in GLAM is in line with what we should expect. I'm not clear on how this metric is instrumented, wondering if @chutten or anyone on the Glean team (@badboy?) could provide some context?

badboy commented 2 years ago

> I completed the backfill, but clients volume still seems off for release.

Also seems off for Nightly and Beta still?

> I did notice that the client count is really high for release, compared to nightly and beta (millions vs thousands). Querying in STMO shows that the high client count for release we see in GLAM is in line with what we should expect. I'm not clear on how this metric is instrumented, wondering if @chutten or anyone on the Glean team (@badboy?) could provide some context?

That general trend seems correct, but the numbers are off. Right now GLAM still says 182 clients for nightly. Your STMO query shows numbers that I would expect, and at least for release the number of clients with rasterize data seems to be on an uptick, much like any rollout graph. The metric is recorded in some webrender code that should be enabled for a majority of clients for sure.

I haven't read the full details here, but it seems that GLAM is still miscalculating the client count somehow.

Iinh commented 2 years ago

Pulling @akkomar in to take a further look at this issue, as he's been looking at the ETL code more closely. (Thanks Arkadiusz!)