mozilla / protodash

Apache License 2.0

Add visit counts for protodash pages #25

Open mreid-moz opened 3 years ago

mreid-moz commented 3 years ago

This would help us understand how much prototypes are being used.

wlach commented 3 years ago

If we implement #30, people could self-serve Google Analytics on subdomains without too much difficulty.

acmiyaguchi commented 3 years ago

It's possible to enable logs on individual GCP buckets via Terraform (see https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/storage_bucket#logging and https://cloud.google.com/storage/docs/access-logs). These logs contain the IP address and the object that was requested, and are dumped about once a day. It would require the bucket to be managed by ops, but something like the following could happen:

  1. Each bucket in the YAML file is added as a Terraform resource, with bucket logging directed to a centralized protodash logging bucket.
  2. Once a day, the generated usage logs are loaded into a BigQuery table. Immediately after the load job finishes, a query counts distinct IPs per resource per day.
  3. The query results are dumped into a file and served via protosaur.

The query would look something like this, assuming analysis.protodash_usage contains the bucket usage logs.

WITH
  extracted AS (
  SELECT
    TIMESTAMP_MICROS(time_micros) AS timestamp,
    cs_bucket,
    c_ip,
  FROM
    analysis.protodash_usage )
SELECT
  TIMESTAMP_TRUNC(timestamp, day) AS timestamp_day,
  cs_bucket,
  COUNT(DISTINCT c_ip) AS n
FROM
  extracted
GROUP BY
  1,
  2
ORDER BY
  1,
  2

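
For prototyping before anything is loaded into BigQuery, the same aggregation can be sketched locally in Python. This is only a rough sketch: it assumes the usage logs are CSV files with a header row containing the time_micros, c_ip, and cs_bucket columns used in the query, and the sample rows, bucket names, and the distinct_ips_per_day helper are made up for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical sample rows; only a subset of the usage-log columns is shown.
SAMPLE_LOG = """\
"time_micros","c_ip","cs_bucket","cs_object"
"1609459200000000","192.0.2.1","example-bucket","index.html"
"1609462800000000","192.0.2.1","example-bucket","app.js"
"1609466400000000","192.0.2.2","example-bucket","index.html"
"1609545600000000","192.0.2.1","example-bucket","index.html"
"1609549200000000","192.0.2.3","other-bucket","index.html"
"""


def distinct_ips_per_day(log_file):
    """Return {(day, bucket): number of distinct client IPs}, in UTC days."""
    seen = defaultdict(set)
    for row in csv.DictReader(log_file):
        # time_micros is microseconds since the Unix epoch.
        seconds = int(row["time_micros"]) // 1_000_000
        day = datetime.fromtimestamp(seconds, tz=timezone.utc).date().isoformat()
        seen[(day, row["cs_bucket"])].add(row["c_ip"])
    return {key: len(ips) for key, ips in sorted(seen.items())}


counts = distinct_ips_per_day(io.StringIO(SAMPLE_LOG))
for (day, bucket), n in counts.items():
    print(day, bucket, n)
```

Like the query, this counts distinct IPs rather than raw requests, so repeated loads from one address on the same day count once.
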
wlach commented 3 years ago

This topic came up again yesterday in the context of the numbers that matter dashboard. Knowing visit counts is helpful, but it is sometimes important to understand who is using a particular resource. For example, if a dashboard is aimed at high-level decision-makers, we would want to know whether they (or someone reporting to them) are looking at it.

Google Analytics has a "user id" feature which we could associate with the login on authenticated dashboards (which probably correspond to the cases where we'd want fine-grained analytics on who is accessing stuff and when):

https://support.google.com/analytics/answer/3123662?hl=en https://www.lovesdata.com/blog/google-analytics-user-id

acmiyaguchi commented 3 years ago

Server-side audit logging of the resources would probably have higher fidelity, especially when ad blockers interfere with GA's data collection. The audit log object includes authentication information for bucket access that's behind IAM of some sort.

acmiyaguchi commented 3 years ago

I've set up something simple for page visits to https://protosaur.dev/mps-deploys/ in this data-sandbox-terraform PR. The logs are written to a BigQuery table pretty much instantaneously, although the principal that is logged is the protodash service account (i.e. the request is being proxied). If the audit logs are set up in the main protodash project, it's likely we can count page loads by authenticated user.
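
As a rough illustration of counting page loads by authenticated user, here is a minimal Python sketch over audit log entries exported as JSON lines. It assumes the entries follow the Cloud Audit Logs layout (protoPayload.methodName, protoPayload.resourceName, protoPayload.authenticationInfo.principalEmail). The sample entries, the email addresses, and the choice to treat a read of index.html as one page load are all assumptions for the example, not what the PR does.

```python
import json
from collections import Counter

# Hypothetical, trimmed-down audit log entries, one JSON object per line.
SAMPLE_ENTRIES = """\
{"protoPayload": {"methodName": "storage.objects.get", "resourceName": "projects/_/buckets/example-bucket/objects/index.html", "authenticationInfo": {"principalEmail": "alice@example.com"}}}
{"protoPayload": {"methodName": "storage.objects.get", "resourceName": "projects/_/buckets/example-bucket/objects/app.js", "authenticationInfo": {"principalEmail": "alice@example.com"}}}
{"protoPayload": {"methodName": "storage.objects.get", "resourceName": "projects/_/buckets/example-bucket/objects/index.html", "authenticationInfo": {"principalEmail": "bob@example.com"}}}
{"protoPayload": {"methodName": "storage.objects.list", "resourceName": "projects/_/buckets/example-bucket", "authenticationInfo": {"principalEmail": "bob@example.com"}}}
{"protoPayload": {"methodName": "storage.objects.get", "resourceName": "projects/_/buckets/example-bucket/objects/index.html", "authenticationInfo": {"principalEmail": "alice@example.com"}}}
"""


def page_loads_by_principal(lines, page_object="index.html"):
    """Count object reads of `page_object` per authenticated principal."""
    loads = Counter()
    for line in lines:
        if not line.strip():
            continue
        payload = json.loads(line).get("protoPayload", {})
        # Only object reads count; listings and other methods are skipped.
        if payload.get("methodName") != "storage.objects.get":
            continue
        # Assumption: one read of the page object equals one page load.
        if not payload.get("resourceName", "").endswith("/" + page_object):
            continue
        auth = payload.get("authenticationInfo", {})
        loads[auth.get("principalEmail", "unknown")] += 1
    return loads


loads = page_loads_by_principal(SAMPLE_ENTRIES.splitlines())
for principal, n in loads.most_common():
    print(principal, n)
```

With the proxied setup described above, every entry would carry the protodash service account as the principal, so a breakdown like this only becomes meaningful once the end user's identity is what gets logged.
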

wlach commented 3 years ago

This seems like a good way forward if we can make it work.