Proposes a more fundamental, broader-impact solution to the problems that #92 attempts to address.
Proposes a workflow that would let us simply remove all of the BigQuery API calls that #96 (specifically this comment) identifies as seriously problematic.
Overview
This issue proposes an entirely new workflow for the broker's metadata collection process. It also documents our reasons for collecting specific metadata to begin with. #171 details fundamental problems with the current system design.
The new workflow can potentially solve many problems at once, at least two of which will be significant barriers to processing at LSST scale if left unaddressed. It is based on a new Pub/Sub feature: BigQuery subscriptions.
Here is an architecture diagram for reference (especially in the section outlining the new workflow).
Motivation for a metadata collector
Enable the following project-level goals:
Track data provenance. Specifically, the broker and classifier versions that processed each alert.
Outline of new workflow
BigQuery subscriptions are a new Pub/Sub feature that essentially lets us point a subscription directly at a BigQuery table. They include the following options, which make the new workflow possible: 1) write both the message data and its metadata to the table, and 2) automatically drop fields that are present in the message but not in the table's schema.
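To make this concrete, here is a minimal sketch of attaching such a subscription with the Python client (google-cloud-pubsub >= 2.13). The project, topic, subscription, and table names are placeholders, not the broker's actual resources.

```python
"""Minimal sketch: attach a BigQuery subscription to a module's output topic.

All resource names below are placeholders, not the broker's actual
projects/topics/tables.
"""
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"                        # placeholder
TOPIC_ID = "example-classifier"                  # placeholder module topic
SUBSCRIPTION_ID = f"{TOPIC_ID}-bigquery"
TABLE = f"{PROJECT_ID}.my_dataset.{TOPIC_ID}"    # placeholder destination table

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path(PROJECT_ID, TOPIC_ID)
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

# The options that make the new workflow possible:
#   use_topic_schema    -> map fields of the schema'd message to table columns
#   write_metadata      -> also store message metadata (publish time, attributes, ...)
#   drop_unknown_fields -> silently drop message fields that are not table columns
bigquery_config = pubsub_v1.types.BigQueryConfig(
    table=TABLE,
    use_topic_schema=True,
    write_metadata=True,
    drop_unknown_fields=True,
)

with subscriber:
    subscription = subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "bigquery_config": bigquery_config,
        }
    )
print(f"Created BigQuery subscription: {subscription.name}")
```

The same setup could be done with gcloud or Terraform; the point is that the table write happens in the subscription itself, with no BigQuery client code in the module.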
Basic envisioned workflow:
Each module will publish one output stream (i.e., topic) that contains:
1) the results of its processing,
2) relevant metadata, and
3) (usually) a copy of the incoming data/alert.
Each module's topic will have a BigQuery subscription attached. The subscription will automatically send the message to a BigQuery table, dropping any data/metadata that is not in the table's schema.
Data such as the original alert packet can be included in every message but stored in only one table; the other tables simply omit those fields from their schemas, and their subscriptions drop them.
Similar modules (like classifiers) can all push results to the same table.
Modules will no longer load data into BigQuery tables directly. (Those API calls can then be removed, which would completely eliminate this problem.)
As a result of the above, a given module's metadata will be stored in the same table as its results. Thus, tracking the journey of a single alert through our pipeline will involve table joins. We may want to create a materialized view to make such queries easier.
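For illustration, here is a minimal sketch of the kind of join that would trace a single alert across tables, using the google-cloud-bigquery client. The dataset, table, and column names (alerts, example_classifier, diaSourceId, etc.) are illustrative assumptions, not the broker's actual schema; the same SELECT could back the materialized view mentioned above, subject to BigQuery's restrictions on joins in materialized views.

```python
"""Minimal sketch: trace one alert through the pipeline by joining tables.

All dataset/table/column names are illustrative placeholders; the real
names would come from the Avro schemas attached to each topic.
"""
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT
  a.diaSourceId,
  a.publish_time   AS alert_published,          -- metadata written by the subscription
  c.publish_time   AS classification_published,
  c.classifier_version,
  c.broker_version
FROM `my-project.my_dataset.alerts` AS a
JOIN `my-project.my_dataset.example_classifier` AS c
  USING (diaSourceId)
WHERE a.diaSourceId = @dia_source_id
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("dia_source_id", "INT64", 123456789)
    ]
)
for row in client.query(QUERY, job_config=job_config).result():
    print(dict(row))
```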
Specific tasks that would be required
[ ] Design, create, and attach an Avro schema to every topic. Also, update every module to format outgoing messages using the schema attached to its topic (see the first sketch after this list).
[ ] Attach and configure a BigQuery subscription to each topic.
[ ] Remove the BigQuery API calls from every module that currently sends data there.
[ ] Review the tables we're currently keeping in BigQuery and decide whether we want to restructure them.
[ ] Drop the BigQuery Storage module. The full alert packet (minus cutouts) can go directly from the alerts stream to the alerts table. The DIASource table can either be a view on the alerts table (see the second sketch after this list), or the Cloud Storage module can produce the stream that populates it.
[ ] Drop the Metadata Collector module.
[ ] Consider creating a materialized view that joins the metadata from each table.
[ ] Redesign the broker's live-monitoring system (see below).
*Note: Some of the module's resources (its Pub/Sub "counter" subscriptions) also provide the data that is currently displayed by the broker's live-monitoring system. We should redesign that system to scrape data from logs instead, which is the more standard approach for live monitoring. Tagging #109.
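For the schema task above, here is a minimal sketch of registering an Avro schema, attaching it to a topic, and publishing a conformant message with the Python Pub/Sub client. The schema definition, field names, and resource names are all hypothetical placeholders; the real schemas would be designed per module.

```python
"""Minimal sketch: register an Avro schema, attach it to a module's topic,
and publish a schema-conformant message (JSON encoding).

The schema, field names, and resource names are hypothetical placeholders.
"""
import json
from google.cloud import pubsub_v1
from google.pubsub_v1.types import Encoding, Schema, SchemaSettings

PROJECT_ID = "my-project"
SCHEMA_ID = "example-classifier-output"
TOPIC_ID = "example-classifier"

AVRO_DEFINITION = json.dumps({
    "type": "record",
    "name": "ClassifierOutput",
    "fields": [
        {"name": "diaSourceId", "type": "long"},
        {"name": "probability", "type": "double"},
        {"name": "classifier_version", "type": "string"},  # provenance
        {"name": "broker_version", "type": "string"},       # provenance
    ],
})

# 1) Register the Avro schema with Pub/Sub.
schema_client = pubsub_v1.SchemaServiceClient()
schema = schema_client.create_schema(
    request={
        "parent": f"projects/{PROJECT_ID}",
        "schema_id": SCHEMA_ID,
        "schema": Schema(type_=Schema.Type.AVRO, definition=AVRO_DEFINITION),
    }
)

# 2) Create the topic with the schema attached. Pub/Sub will reject messages
#    that do not validate against it.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
publisher.create_topic(
    request={
        "name": topic_path,
        "schema_settings": SchemaSettings(schema=schema.name, encoding=Encoding.JSON),
    }
)

# 3) A module publishes its results + provenance metadata as one message.
message = {
    "diaSourceId": 123456789,
    "probability": 0.97,
    "classifier_version": "example-classifier v1.0",
    "broker_version": "broker v0.1",
}
future = publisher.publish(topic_path, json.dumps(message).encode("utf-8"))
print(f"Published message {future.result()}")
```

With JSON encoding the published bytes are just the UTF-8 JSON shown above; BINARY (Avro) encoding is also supported and is probably preferable at LSST scale.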
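And for the BigQuery Storage task, a sketch of defining the DIASource table as a logical view on the alerts table. It assumes the alerts table stores the full packet with a nested diaSource record; all names are placeholders.

```python
"""Minimal sketch: define DIASource as a view over the alerts table.

Assumes the alerts table stores the full packet with a nested `diaSource`
record; all names are placeholders.
"""
from google.cloud import bigquery

client = bigquery.Client()

view = bigquery.Table("my-project.my_dataset.DIASource")
view.view_query = """
    SELECT
      diaSource.diaSourceId,
      diaSource.ra,
      diaSource.decl,
      diaSource.midPointTai,
      diaSource.psFlux
    FROM `my-project.my_dataset.alerts`
"""
view = client.create_table(view)
print(f"Created view {view.full_table_id}")
```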