mozilla / gcp-ingestion

Documentation and implementation of telemetry ingestion on Google Cloud Platform
https://mozilla.github.io/gcp-ingestion/
Mozilla Public License 2.0

Periodically update schemas from mozilla-pipeline-schemas #116

Closed jklukas closed 5 years ago

jklukas commented 5 years ago

Currently, we fetch mozilla-pipeline-schemas from GitHub at ingestion-beam build time and include the content in the jar. This means we can only get schema updates by rebuilding, draining the existing Dataflow job, and instantiating a new job with the new code.

We could try to spin up a periodic task (every 5 minutes, perhaps?) to get the latest content from GitHub and update the collection of schemas. Perhaps this could be expressed nicely as a side input?

jklukas commented 5 years ago

I see two main options we could take to run a task periodically.

I'm inclined to use the @OnTimer method, though documentation is a bit scarce about how it's actually executed. The Beam docs state "each instance of your function object is accessed by a single thread at a time on a worker instance, unless you explicitly create your own threads", so I'm inferring that the @OnTimer method will block processing of additional records while it runs. This is probably fine, but we'll want to put a reasonably small timeout on our call to GitHub so that we don't accidentally clog the pipeline. Fail fast, log the error, and continue without updating the schemas.
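The fail-fast behavior described above could be sketched roughly as follows (the names here are illustrative, not from the actual codebase): wrap the GitHub call in a future with a hard timeout, and fall back to the cached schemas on any failure so the pipeline is never blocked.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class SchemaRefresh {

  /**
   * Run the fetch with a hard timeout; on timeout or any other failure,
   * log the error and return the cached value so processing continues.
   */
  static <T> T fetchOrFallback(Supplier<T> fetch, T cached, Duration timeout) {
    try {
      return CompletableFuture.supplyAsync(fetch)
          .get(timeout.toMillis(), TimeUnit.MILLISECONDS);
    } catch (Exception e) {
      // Fail fast: keep the existing schemas and try again on the next timer firing.
      System.err.println("Schema refresh failed, keeping cached schemas: " + e);
      return cached;
    }
  }
}
```

The same pattern would apply whether the refresh runs in an @OnTimer method or a background thread; the key point is that a slow or failing GitHub call only costs us the timeout, not the pipeline.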

jklukas commented 5 years ago

I'd like to use the same script (bin/update-schemas) to pull down mozilla-pipeline-schemas into resources at build time and to make files available for the @OnTimer method.

We'll maintain a mozillaPipelineSchemasTimestamp that tracks when the repo was last modified. The @OnTimer method will make a call to the GitHub API to see whether the repo (or perhaps the dev branch specifically) has a newer last-modified timestamp. If we already have the newest, it ends with no changes. If it gets a newer timestamp, then it runs the bin/update-schemas script and loads the schemas from those files.

If any of the calls to GitHub fail, we log and move on so we don't hold up processing of messages. We'll try again when the timer next fires.
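The check-then-update flow could look roughly like this (the helper names and the string-timestamp comparison are assumptions for illustration, not the actual implementation): compare the stored timestamp against the one returned by the GitHub API, and only shell out to bin/update-schemas when the remote is newer.

```java
import java.time.Instant;

public class SchemaUpdateCheck {

  /** True when the timestamp reported by GitHub is strictly newer than ours. */
  static boolean needsUpdate(String lastKnownTimestamp, String latestTimestamp) {
    return Instant.parse(latestTimestamp).isAfter(Instant.parse(lastKnownTimestamp));
  }

  /** Sketch of the timer body: compare timestamps, maybe re-run the script. */
  static String maybeUpdate(String lastKnown, String latest) throws Exception {
    if (!needsUpdate(lastKnown, latest)) {
      return lastKnown;  // already newest; end with no changes
    }
    // Hypothetical: re-fetch schemas, then reload them from the updated files.
    new ProcessBuilder("bin/update-schemas").inheritIO().start().waitFor();
    return latest;
  }
}
```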

jklukas commented 5 years ago

Rather than maintaining a timestamp, perhaps it's better to maintain the last commit id. We can use the get branch endpoint and extract commit.sha from this URL:

https://api.github.com/repos/mozilla-services/mozilla-pipeline-schemas/branches/dev
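For illustration, pulling commit.sha out of that branch payload could be as simple as the following sketch (a real implementation would use a proper JSON library such as Jackson; the regex shortcut here just grabs the first 40-hex "sha" field, which in this endpoint's response is commit.sha):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BranchSha {

  // Matches a "sha" field whose value is a full 40-character commit hash.
  private static final Pattern SHA_FIELD =
      Pattern.compile("\"sha\"\\s*:\\s*\"([0-9a-f]{40})\"");

  /** Extract the first "sha" field from the branches API response, or null. */
  static String extractCommitSha(String branchJson) {
    Matcher m = SHA_FIELD.matcher(branchJson);
    return m.find() ? m.group(1) : null;
  }
}
```

The timer body would then compare the extracted sha against the last one we loaded, and only re-run the update when they differ.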

jklukas commented 5 years ago

Wrote up a bug with design thoughts for a solution where we publish schema changes to PubSub instead: https://bugzilla.mozilla.org/show_bug.cgi?id=1502057

jklukas commented 5 years ago

Based on discussion in https://github.com/mozilla/gcp-ingestion/issues/145, we're simply going to expect a rebuild and redeploy in order to pick up new schemas.