wlach opened 3 years ago
Add support for outputting dataset and udf metadata to JSON endpoints in bigquery-etl, in addition to markdown documentation
Are you imagining including these as part of the rendered docs we publish to GitHub Pages? I'm open to this, but had imagined instead some scheduled build of Glean Dictionary that would pull down a tar archive of the `generated-sql` branch and do whatever data prep is necessary for Glean Dictionary.
We should consider availability and whether it matters. For the probeinfo service, we push the data to S3 and serve it over Cloudfront, which is about as resilient a solution for serving static data over http as you're going to get. I have no reservations about uptime of Glean Dictionary relying on probeinfo endpoints. But perhaps GitHub pages is perfectly sufficient for our needs. Or perhaps this is a concern we can address later should availability become an issue.
> Are you imagining including these as part of the rendered docs we publish to GitHub Pages? I'm open to this, but had imagined instead some scheduled build of Glean Dictionary that would pull down a tar archive of the `generated-sql` branch and do whatever data prep is necessary for Glean Dictionary.
That works too and might make slightly more sense. I think for the purposes of prototyping, providing an "API" as part of the rendered documentation will work slightly better (since we can take advantage of the deploy-preview feature to test things out), but we can keep this in our back pocket as an option.
> We should consider availability and whether it matters. For the probeinfo service, we push the data to S3 and serve it over Cloudfront, which is about as resilient a solution for serving static data over http as you're going to get. I have no reservations about uptime of Glean Dictionary relying on probeinfo endpoints. But perhaps GitHub pages is perfectly sufficient for our needs. Or perhaps this is a concern we can address later should availability become an issue.
Yeah, I'm not too worried about availability. If GitHub is down, a slightly out-of-date Glean Dictionary is probably the least of our worries. That said, I think it might make sense to move the rendered documentation for bigquery-etl to our internal systems for other reasons (e.g. being able to do pageload-based analytics).
I feel like maybe the best way forward is some kind of metadata standard inside bigquery-etl that links the tables it generates to either specific Glean application ids or application names (for tables that aggregate several application variants together). If a table or view does not specify this information, it will not appear in the Glean Dictionary. Does anyone have thoughts or preferences here? There are various alternatives (e.g. scraping SQL, or using @amiyaguchi's etl-graph approach: https://etl-graph.protosaur.dev/), but this seems like the simplest solution.
I may end up writing a proposal for this after all.
I could imagine something like:

```yaml
...
applications:
  - app_id: org.mozilla.ios.FirefoxBeta
    channel: beta
...
```

to live in `metadata.yaml`. I guess it is possible that one table can be related to multiple Glean applications (like `mobile_search_clients_daily`).
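To make the multiple-applications case concrete, here's a minimal sketch (Python) of what a Glean Dictionary build step might do with such a file. The field names and values here are illustrative only; nothing about the schema is settled, and the parsed YAML is shown as a plain Python structure to keep the sketch self-contained:

```python
# Hypothetical: a metadata.yaml with an `applications` list, already parsed.
# Field names are assumptions for illustration, not a settled schema.
metadata = {
    "applications": [
        {"app_id": "org.mozilla.ios.FirefoxBeta", "channel": "beta"},
        {"app_id": "org.mozilla.fenix", "channel": "release"},
    ],
}

def app_ids_for_table(metadata: dict) -> list[str]:
    """Return the Glean app_ids a table relates to. A table with no
    `applications` entry would simply not appear in the dictionary."""
    return [app["app_id"] for app in metadata.get("applications", [])]

# A table like mobile_search_clients_daily can list several applications:
print(app_ids_for_table(metadata))
# A table without the section opts out of the dictionary:
print(app_ids_for_table({}))  # -> []
```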
In bigquery-etl, we already have `labels.application`, which seems closely related to this need. We could have an expectation that this label is set to a relevant Glean `app_name`. The format for BQ labels is somewhat restricted, so it would not accommodate `app_id` values, but should accommodate `app_name` and `channel` values.
Even better, we could replace `labels.application` with `labels.app_name` to make it explicit that these are `app_name` values. A combination of `app_name` and `channel` would make it unambiguous which `app_id` it's related to. I think it would be great to have these also appear as labels on the BQ tables.
What I haven't thought through in the above is what to do for data not derived from Glean pings, and whether we'd want to allow additional non-Glean values for the `app_name` label.
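A rough sketch of the label restriction being discussed (my understanding is that BigQuery label values allow only lowercase letters, digits, underscores, and dashes, up to 63 characters; the regex below is an approximation, and `firefox_ios` is just a plausible `app_name`-style value):

```python
import re

# Approximate check of BigQuery's label-value restrictions: lowercase
# letters, digits, underscores, and dashes; at most 63 characters.
_BQ_LABEL_VALUE = re.compile(r"^[a-z0-9_-]{0,63}$")

def is_valid_bq_label_value(value: str) -> bool:
    return bool(_BQ_LABEL_VALUE.match(value))

# An app_id won't fit as a label (dots and uppercase letters):
print(is_valid_bq_label_value("org.mozilla.ios.FirefoxBeta"))  # -> False
# app_name- and channel-style values should be fine:
print(is_valid_bq_label_value("firefox_ios"))  # -> True
print(is_valid_bq_label_value("beta"))         # -> True
```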
Great ideas @scholtzan @jklukas. A few notes:

* I don't think we should specify the `app_id` value in bigquery-etl; we can just refer people to https://dictionary.protosaur.dev/apps/fenix?itemType=app_ids if a reference is needed (presumably someone working on bigquery-etl is comfortable with these concepts).
* `app_name` and `channel` is not always a unique combination; see the fenix example above (although maybe the overlap between deprecated and non-deprecated applications doesn't really matter that much in practice, since these labels are just a hint).

I don't know what to do for non-Glean data either. Maybe that's a later problem, or this could also be another example of us providing a carrot to get people to migrate things over.
We do now require unique non-deprecated channel/app combos: https://github.com/mozilla/probe-scraper/pull/282
> In bigquery-etl, we already have `labels.application`, which seems closely related to this need.
One shortcoming of using labels would be that only one application can be specified. Keys in YAML must be unique ("The content of a mapping node is an unordered set of key: value node pairs, with the restriction that each of the keys is unique").
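A quick illustration of that limitation, with Python dicts standing in for both the label map and the parsed YAML (the application values are hypothetical):

```python
# Labels form a flat string-to-string map, so a second "application" entry
# replaces the first -- mirroring the YAML rule that mapping keys are unique.
labels = {"application": "fenix"}
labels["application"] = "firefox_ios"  # overwrites; no way to keep both
print(labels)  # -> {'application': 'firefox_ios'}

# A sequence in metadata.yaml has no such restriction:
applications = [
    {"app_name": "fenix", "channel": "release"},
    {"app_name": "firefox_ios", "channel": "release"},
]
print(len(applications))  # -> 2
```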
The Glean Dictionary seems like a better long-term place for dataset documentation than bigquery-etl, since we can more easily cross-link the information with the lower-level table and metric data that produce it.
This issue will cover the initial MVP implementation of this.
Rough steps:
Tagging a few people who might be interested in this workstream: @scholtzan @Iinh @jklukas @relud @rafrombrc