mozilla / glean-dictionary

Public-facing dictionary of Glean (and Glean-derived) metadata
https://dictionary.telemetry.mozilla.org
Mozilla Public License 2.0

Add dataset (and maybe UDF?) metadata from bigquery-etl #496

Open wlach opened 3 years ago

wlach commented 3 years ago

The Glean Dictionary seems like a better long-term place for dataset documentation than bigquery-etl, since we can more easily cross-link the information with the lower-level table and metric data which produces it.

This issue will cover the initial MVP implementation of this.

Rough steps:

Tagging a few people who might be interested in this workstream: @scholtzan @Iinh @jklukas @relud @rafrombrc

jklukas commented 3 years ago

> Add support for outputting dataset and udf metadata to JSON endpoints in bigquery-etl, in addition to markdown documentation

Are you imagining including these as part of the rendered docs we publish to GitHub Pages? I'm open to this, but had imagined instead some scheduled build of Glean Dictionary that would pull down a tar archive of the generated-sql branch and do whatever data prep is necessary for Glean Dictionary.
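As a very rough sketch, such a build step might look something like the code below. It assumes the generated-sql branch can be pulled as a GitHub tarball and that per-table metadata lives in metadata.yaml files; both are assumptions here, not a description of an existing build.

```python
# Sketch only: pull a tarball of the generated-sql branch and collect any
# metadata.yaml files it contains. The URL and file layout are assumptions.
import io
import tarfile
import urllib.request

import yaml  # PyYAML

ARCHIVE_URL = "https://github.com/mozilla/bigquery-etl/archive/generated-sql.tar.gz"


def fetch_table_metadata(url=ARCHIVE_URL):
    """Return {archive path: parsed metadata} for every metadata.yaml in the archive."""
    with urllib.request.urlopen(url) as resp:
        archive = io.BytesIO(resp.read())
    metadata = {}
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith("metadata.yaml"):
                metadata[member.name] = yaml.safe_load(tar.extractfile(member))
    return metadata
```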

We should consider availability and whether it matters. For the probeinfo service, we push the data to S3 and serve it over Cloudfront, which is about as resilient a solution for serving static data over http as you're going to get. I have no reservations about uptime of Glean Dictionary relying on probeinfo endpoints. But perhaps GitHub pages is perfectly sufficient for our needs. Or perhaps this is a concern we can address later should availability become an issue.

wlach commented 3 years ago

> Are you imagining including these as part of the rendered docs we publish to GitHub Pages? I'm open to this, but had imagined instead some scheduled build of Glean Dictionary that would pull down a tar archive of the generated-sql branch and do whatever data prep is necessary for Glean Dictionary.

That works too and might make slightly more sense. I think for the purposes of prototyping, providing an "API" as part of the rendered documentation will work slightly better (since we can take advantage of the deploy preview feature to test things out), but we can keep this in our back pocket as an option.

> We should consider availability and whether it matters. For the probeinfo service, we push the data to S3 and serve it over Cloudfront, which is about as resilient a solution for serving static data over http as you're going to get. I have no reservations about uptime of Glean Dictionary relying on probeinfo endpoints. But perhaps GitHub pages is perfectly sufficient for our needs. Or perhaps this is a concern we can address later should availability become an issue.

Yeah, I'm not too worried about availability. If GitHub is down, a slightly out-of-date Glean Dictionary is probably the least of our worries. That said, I think it might make sense to move the rendered documentation for bigquery-etl to our internal systems for other reasons (e.g. being able to do pageload-based analytics).

wlach commented 3 years ago

#526 maps the namespace of the dataset name to application ids, which lets me generate a respectable set of the autogenerated tables we create for each Glean application. This does not include other datasets, however (e.g. the mobile search tables are not included because they live in their own dataset).
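For illustration, a namespace-to-application-id mapping along those lines could be built from the probeinfo app listings, something like the sketch below. The endpoint and field names are assumptions on my part, not necessarily what #526 does.

```python
# Sketch: map BigQuery dataset families (e.g. org_mozilla_fenix) to Glean app ids
# using the probeinfo app listings. Field names here are assumptions.
import json
import urllib.request

APP_LISTINGS = "https://probeinfo.telemetry.mozilla.org/v2/glean/app-listings"


def dataset_to_app_ids():
    """Return a map from BigQuery dataset family to the Glean app ids it contains."""
    with urllib.request.urlopen(APP_LISTINGS) as resp:
        listings = json.load(resp)
    mapping = {}
    for app in listings:
        mapping.setdefault(app["bq_dataset_family"], []).append(app["app_id"])
    return mapping
```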

I feel like maybe the best way forward is some kind of metadata standard inside bigquery-etl that links the tables it generates to either specific Glean application ids or application names (for tables that aggregate several application variants together). If a table or view does not specify this information, it will not appear in the Glean Dictionary. Does anyone have thoughts or preferences here? There are various alternatives (e.g. scraping SQL or using @amiyaguchi's etl-graph approach, https://etl-graph.protosaur.dev/), but this seems like the simplest solution.


I may end up writing a proposal for this after all.

scholtzan commented 3 years ago

I could imagine something like:

```yaml
...
applications:
  - app_id: org.mozilla.ios.FirefoxBeta
    channel: beta
...
```

to live in metadata.yaml. I guess it is possible that one table can be related to multiple Glean applications (like mobile_search_clients_daily).
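On the consuming side, something like the following could read that hypothetical applications section back out; the schema is the one from the snippet above, not an existing convention.

```python
# Sketch: read the proposed `applications` section out of a table's metadata.yaml.
import yaml


def applications_for_table(metadata_path):
    """Return the list of {app_id, channel} entries declared for a table, if any."""
    with open(metadata_path) as f:
        metadata = yaml.safe_load(f) or {}
    return metadata.get("applications", [])


# A table like mobile_search_clients_daily could then list several entries,
# one per application it aggregates.
```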

jklukas commented 3 years ago

In bigquery-etl, we already have labels.application which seems closely related to this need. We could have an expectation that this label is set to a relevant Glean app_name. The format for BQ labels is somewhat restricted, so would not accommodate app_id values, but should accommodate app_name and channel values.

Even better, we could replace labels.application with labels.app_name to make it explicit that these are app_name values. A combination of app_name and channel would make it unambiguous which app_id it's related to. I think it would be great to have these also appear as labels on the BQ tables.
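As a sketch of what attaching those labels to a table could look like with the google-cloud-bigquery client (the table id and values below are made up for illustration):

```python
# Sketch: attach app_name/channel labels to a derived table. BigQuery label keys
# and values are restricted (lowercase letters, digits, _ and -), which fits
# app_name/channel but not dotted app_id values like org.mozilla.ios.FirefoxBeta.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table(
    "moz-fx-data-shared-prod.org_mozilla_fenix_derived.example_table_v1"  # hypothetical
)
table.labels = {"app_name": "fenix", "channel": "beta"}
client.update_table(table, ["labels"])
```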

What I haven't thought through in the above is what to do for data not derived from Glean pings, and whether we'd want to allow additional non-Glean values for the app_name label.

wlach commented 3 years ago

Great ideas @scholtzan @jklukas. A few notes:

I don't know what to do for non-Glean data either. Maybe that's a later problem-- or this could also be another example of us providing a carrot to get people to migrate things over.

fbertsch commented 3 years ago

We do now require unique non-deprecated channel/app combos: https://github.com/mozilla/probe-scraper/pull/282

scholtzan commented 3 years ago

> In bigquery-etl, we already have labels.application which seems closely related to this need.

One shortcoming of using labels would be that only one application can be specified. Keys in YAML must be unique ("The content of a mapping node is an unordered set of key: value node pairs, with the restriction that each of the keys is unique").
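Put differently, a label map can only hold one value per key, while a list in metadata.yaml can reference several applications (purely illustrative):

```python
# Labels are a flat key/value map, so a second app_name would overwrite the first:
labels = {"app_name": "fenix", "channel": "beta"}

# whereas an `applications` list in metadata.yaml can reference several apps
# (hypothetical values, matching the proposed schema above):
applications = [
    {"app_id": "org.mozilla.ios.FirefoxBeta", "channel": "beta"},
    {"app_id": "org.mozilla.fenix", "channel": "nightly"},
]
```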