mwvgroup / Pitt-Google-Broker

A Google Cloud-based alert broker for LSST and ZTF
https://pitt-broker.readthedocs.io/en/latest/index.html

Replace the Metadata Collector with BigQuery subscriptions #172

Open troyraen opened 2 years ago

troyraen commented 2 years ago

Overview

This issue proposes an entirely new workflow for the broker's metadata collection process. It also documents our reasons for collecting specific metadata to begin with. #171 details fundamental problems with the current system design.

The new workflow can potentially solve many problems at once, at least two of which will be significant barriers to processing at LSST scale if left unaddressed. It is based on a new Pub/Sub feature: BigQuery subscriptions.

Here is an architecture diagram for reference (especially in the section outlining the new workflow).

Motivation for a metadata collector

Enable the following project-level goals:

Outline of new workflow

BigQuery subscriptions are a new Pub/Sub feature that lets you point a subscription directly at a BigQuery table. Two options make the new workflow possible: 1) write both the message data and its metadata to the table, and 2) automatically drop fields that are present in the message but not in the table's schema.
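To make option 2 concrete, here is a minimal stdlib-only sketch that simulates the "drop unknown fields" behavior. The column and field names are hypothetical, and the filtering shown here is actually performed server-side by Pub/Sub, not by our code:

```python
# Sketch (hypothetical names): simulate how a BigQuery subscription with
# "drop unknown fields" enabled filters a message before writing the row.

# Hypothetical table schema: the only columns this module's table accepts.
TABLE_COLUMNS = {"objectId", "candid", "classifier", "probability", "publish_time"}

def project_to_schema(message_row: dict, columns: set) -> dict:
    """Keep only the fields that exist in the table schema."""
    return {k: v for k, v in message_row.items() if k in columns}

# A published message carrying results, metadata, and a copy of the alert.
message = {
    "objectId": "ZTF21abcdefg",              # hypothetical alert ID
    "candid": 1234567890,
    "classifier": "SuperNNova",
    "probability": 0.97,
    "publish_time": "2022-08-01T00:00:00Z",  # Pub/Sub message metadata
    "alert_packet": b"\x00\x01",             # alert copy; not in this table
}

row = project_to_schema(message, TABLE_COLUMNS)
# "alert_packet" is dropped; only schema columns reach the table.
```

The point is that every module can publish the full message (results + metadata + alert copy), and each attached table keeps only the slice it was designed to store.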

Basic envisioned workflow:

  1. Each module will publish one output stream (i.e., topic) that contains: 1) the results of its processing, 2) relevant metadata, and 3) (usually) a copy of the incoming data/alert.
  2. Each module's topic will have a BigQuery subscription attached. The subscription will automatically send the message to a BigQuery table, dropping any data/metadata that is not in the table's schema.
    • Things like the original alert data can be included in every message, but only stored in a single table.
    • Similar modules (like classifiers) can all push results to the same table.
    • Modules will no longer load to BigQuery tables directly. (Those API calls can then be removed, which would completely eliminate this problem.)
  3. As a result of the above, a given module's metadata will be stored in the same table as its results. Thus, tracking the journey of a single alert through our pipeline will involve table joins. We may want to create a materialized view to make such queries easier.
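To illustrate step 3: tracing one alert through the pipeline becomes a join on a shared key across the per-module tables. All table, column, and value names below are hypothetical (not the broker's actual schema); this stdlib sketch mimics the inner join that a BigQuery query or materialized view would perform:

```python
# Hypothetical rows from two per-module tables, keyed on the alert's
# candidate ID ("candid"). In BigQuery this would be a JOIN between,
# e.g., a classifier-results table and a crossmatch-results table.

classifier_rows = [
    {"candid": 111, "classification": "SN Ia", "process_ms": 40},
    {"candid": 222, "classification": "AGN", "process_ms": 35},
]
xmatch_rows = [
    {"candid": 111, "nearest_source": "gaia-src-1", "process_ms": 12},
]

def join_on_candid(left: list, right: list) -> list:
    """Inner join two row lists on the shared 'candid' key,
    prefixing columns by their table of origin."""
    index = {r["candid"]: r for r in right}
    joined = []
    for row in left:
        match = index.get(row["candid"])
        if match is not None:
            merged = {"candid": row["candid"]}
            merged.update({f"classifier_{k}": v for k, v in row.items() if k != "candid"})
            merged.update({f"xmatch_{k}": v for k, v in match.items() if k != "candid"})
            joined.append(merged)
    return joined

journey = join_on_candid(classifier_rows, xmatch_rows)
# One joined row for candid 111, combining both modules' results
# and metadata (e.g., the per-module processing times).
```

A materialized view would precompute exactly this kind of join so that per-alert queries stay cheap and simple.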

Specific tasks that would be required


*Note: Some of the module's resources (its Pub/Sub "counter" subscriptions) also provide the data that is currently displayed by the broker's live-monitoring system. We should redesign that system to scrape data from logs instead, which is the more standard approach for monitoring. Tagging #109.

wmwv commented 2 years ago

I'm sold.

troyraen commented 1 year ago

tagging @hernandezc1