Support sink and network metadata within an `spkg`

New units of deployment:

Subgraphs are deployable units: packages that can be sent to someone with the proper runtime (i.e. graph-node), and turned into a long-running service.
Substreams are packageable, but are not deployable units themselves, since they offer neither persistence nor a query layer.
- They are only the transformation layer. The sinks are the components that turn a Substreams into something deployable, where with the proper runtime, it can be turned into a fully fledged service (graph-node being one of them).

What is needed:

Each sink needs a Package (spkg) that defines the source of its data, by definition.
Most sinks require metadata to be deployable autonomously (e.g. postgres needs some schemas, prometheus needs metrics descriptions, csv needs lists of tables / fields).
Instead of inventing a different file format for each sink, we will extend the spkg format to accommodate an optional sink_config configuration.
While we're there, we will add a target_network parameter, required by any sink that wants to issue a Substreams request.
This turns spkgs into optionally deployable units.

Concretely, that means the Package message definition would be augmented in this way:

import "google/protobuf/any.proto";

message Package {
  repeated google.protobuf.FileDescriptorProto proto_files = 1;
  reserved 2; // In case protosets add a field some day.
  ...
  uint64 version = 5;
  sf.substreams.v1.Modules modules = 6;
  repeated ModuleMetadata module_meta = 7;
  repeated PackageMetadata package_meta = 8;

  ////// ADD THESE NEW FIELDS:

  // Source network for Substreams to fetch its data from.
  string network = 9;

  google.protobuf.Any sink_config = 10;
  string sink_module = 11;
}

Benefits:

We keep the feature of the spkg file being a self-describing message.
That self-describing message can be dynamically augmented with the sink's protobuf models, allowing general purpose inspection of the contents of an spkg by the substreams CLI tool, both in terms of Protobuf schema definition and sink metadata.
Very simple tooling (e.g. substreams inspect, or even some simple bash scripts) allows an Indexer to discover how an spkg can be deployed.
The pattern allows for a sink to also embed a query layer, in the same fashion, keeping the same self-description properties, with the same ease of discovery.
Because of the format of a protobuf message, it's possible to merely append bytes to the spkg to tack on some additional proto_files and a sink_config, making it very easy to take a Substreams package and configure it for a given sink.
For a sink, opening the package allows it to pick up its source Substreams modules at the same time it picks up its runtime configuration (provided by the user). An operator can then blend user-provided configs, with operator-provided configs (and a mapping of network names -> endpoints), and deploy such a package in production.
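
The "append bytes" property in the list above comes from the protobuf wire format: concatenating two serialized messages yields the merge of both. The following is a minimal sketch of that idea using hand-rolled wire encoding and only the Python standard library. The field numbers 1, 9 and 11 come from the proposed Package message; the helper names and payloads are hypothetical, and only length-delimited fields are handled.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_len_field(field_number: int, payload: bytes) -> bytes:
    """Encode one length-delimited field (wire type 2): tag, length, payload."""
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(payload)) + payload


def decode_varint(buf: bytes, pos: int):
    """Decode a varint starting at pos; return (value, next_position)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_len_fields(buf: bytes):
    """List (field_number, payload) pairs of a message made of wire type 2 fields."""
    fields, pos = [], 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        if tag & 7 != 2:
            raise ValueError("this sketch only handles length-delimited fields")
        length, pos = decode_varint(buf, pos)
        fields.append((tag >> 3, buf[pos:pos + length]))
        pos += length
    return fields


# Stand-in for an existing spkg: one proto_files entry (field 1), opaque here.
base_spkg = encode_len_field(1, b"<file descriptor bytes>")

# Bytes tacked on afterwards: network (field 9) and sink_module (field 11).
suffix = encode_len_field(9, b"goerli") + encode_len_field(11, b"kv_out")

# Plain concatenation; the original bytes are left untouched.
bundled_fields = decode_len_fields(base_spkg + suffix)
```

A real tool would use a generated protobuf library rather than this hand decoder; the point is only that the bundled file is the original file plus a suffix.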

Prometheus example

The substreams-prometheus-sink reads Substreams output in a certain shape, and writes to Prometheus, a popular time series database. However, certain things need to be known about the data being written: schemas, help strings, declaration of metrics, etc. The current Package format has no space to accommodate such metadata.

Tooling

Example flow from developer to indexer operation, wanting to deploy a kvsink with the fixed gRPC endpoint:

Dev crafts the Substreams modules, writes the kv_out following https://github.com/streamingfast/substreams-sink-kv
Drafts a substreams.yaml manifest with included sink configuration:

specVersion: v0.1.0
package:
  name: my-eth-transfers-kvsink
  version: v2.3.2

imports:
  mod: ./substreams.yaml
  kvsink: https://github.com/releases/substreams-sink-kv-v1.0.1.spkg

protobuf:
  files:
    - sf/custom/v1/service.proto
  importPaths:
    - ./proto

network: goerli

sink:
  module: mod:kv_out
  type: sf.substreams.sink.kv.v1.WASMQueryConfig
  config:
    initialBlock: 12_000_000
    # @ for text files, @@ for binary files, \@ if you want an explicit at sign.
    wasmQueryModule: @@target/wasm32/release/mycode.wasm
    grpcService: sf.custom.v1.Service

### or:

#sink:
#  module: mod:kv_out
#  type: sf.substreams.sink.kv.v1.GenericConfig
#  config:
#    initialBlock: 12_000_000
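
The `@`, `@@` and `\@` convention noted in the manifest comment above could be resolved at pack time roughly as follows. This is a sketch under assumptions: the function name, the idea of resolving relative to a base directory, and the UTF-8 encoding for text files are not specified in the proposal.

```python
import tempfile
from pathlib import Path


def resolve_value(raw: str, base_dir: Path):
    """Resolve a manifest string according to the @ / @@ / \\@ convention.

    Assumed semantics:
      "\\@foo" -> the literal string "@foo"
      "@@path" -> contents of the file as bytes (binary file)
      "@path"  -> contents of the file as text
      other    -> the string unchanged
    """
    if raw.startswith("\\@"):
        return raw[1:]  # drop the backslash, keep the at sign
    if raw.startswith("@@"):
        return (base_dir / raw[2:]).read_bytes()
    if raw.startswith("@"):
        return (base_dir / raw[1:]).read_text(encoding="utf-8")
    return raw


# Small demonstration with throwaway files standing in for real assets.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "grafana.json").write_text('{"title": "demo"}', encoding="utf-8")
    (base / "mycode.wasm").write_bytes(b"\x00asm")

    text_value = resolve_value("@grafana.json", base)
    binary_value = resolve_value("@@mycode.wasm", base)
    literal_value = resolve_value("\\@handle", base)
    plain_value = resolve_value("sf.custom.v1.Service", base)
```

The `@@` check has to come before the `@` check, otherwise a binary reference would be read as a text file whose name starts with `@`.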

For a gRPC service like:


package sf.custom.v1;
service Services {
  rpc SayHello(HelloRequest) returns (HelloReply) {}
}

The dev runs:

$ substreams pack sink-kv.yaml
Writing mysink-v2.0.1.spkg

The spkg file contains the sf.substreams.sink.kv.v1.GenericService configuration protobuf message in the sink_config field, as well as the Substreams modules.

The dev ships the file to an indexer
The indexer then inspects the contents:

substreams sink-config ./mysink-v2.0.1.spkg
{"@type": "sf.substreams.sink.kv.v1.GenericService",
 "@module": "mod:kv_out",
 "@network": "ethereum:mainnet",
 "initialBlock": "123",
 "wasmQueryBinary": "F76G8720392831..02938102938=="
}

The prometheus sink could provide the tooling necessary to take in its manifest (ideally staying close to the Substreams experience, like prometheus-sink.yaml), and build up its protobuf message and "bundle" it with the spkg, turning the spkg package into a deployable unit.

Where an example prometheus manifest (prometheus_sink.yaml) could look like:

specVersion: v0.1.0
package:
  name: nicegraphs
  version: v2.0.4

imports:
  streams: ./substreams.yaml
  sink: https://github.com/releases/prometheus-sink-v1.2.0.spkg

sink:
  module: streams:prom_out
  type: sf.substreams.sink.prometheus.v1.PrometheusAndGrafana
  config:
    initialBlock: -1000
    metrics:
      "this_metric": Help string of the metric
      "this_other_metric": Help string of that metric
    labels:
      "this_label": Meaning of that label
    grafana_dashboards: @grafana.json

and where the grafana.json file would be packaged in, and attached to the spkg.

This package would have everything needed for a successful deployment, as a single deployable unit.

Implementable as:

substreams pack ./substreams-prometheus.yaml

You can then imagine a Kubernetes operator that is passed down such an spkg, and spins up services automatically.

Key/value sink specs:

Postgres:

message sf.substreams.postgres.v1.HasuraQueryService {
  string source_module = 2;
  string schema = 1;
  string override_initial_block = 3;
  string hasura_config = 4;
}
message sf.substreams.postgres.v1.WASMQueryService {
  string schema = 1;
  string source_module = 2;
  string override_initial_block = 3;
  // wasm exports: "pg_query"
  bytes wasm_query_module = 4;
  string grpc_service = 5;
}

MongoDB:

message sf.substreams.mongodb.v1.WASMQueryService {
  string source_module = 2;
  // wasm exports: "mongo_query"
  bytes wasm_query_module = 4;
}

Standardization of tools

Ideally, each tool standardizes around the same verb pack to take its manifest, and turn it into an spkg.

substreams-sink-postgres pack ./my-postgres-manifest.yaml
substreams-sink-prometheus pack ./my-prom-manifest.yaml
substreams-sink-kv pack ./my-kv-manifest.yaml

to kickstart some conventions.

`substreams inspect`

We want to ensure inspect outputs something consumable by scripts, structured in JSON, or as it is right now (in a sort of flag key / value display), so that someone can do:

substreams inspect my.spkg | grep -A15 "^sink_config:"

and do some simple env variable replacement, and parameter passing.

In JSON:

SINK_TYPE=$(substreams inspect --json my.spkg | jq -r '.sink_config["@type"]')
if [ "$SINK_TYPE" = "my.sink.v1.Target" ]; then ...

inspect would also decode any Any fields, recursively, if they are available in the proto_files specs of the Package itself.
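
For deployment machinery written in something other than shell, the same check could look like the sketch below. It assumes the proposed `--json` output is an object with a `sink_config` key carrying an `@type` entry, as in the jq example above; the sample document, the `SUPPORTED_SINKS` set and the function names are made up for illustration.

```python
import json
from typing import Optional

# Hypothetical list of sink types this operator knows how to deploy.
SUPPORTED_SINKS = {"sf.substreams.sink.kv.v1.GenericConfig"}


def sink_type(inspect_json: str) -> Optional[str]:
    """Return the @type of sink_config, or None if the package has no sink config."""
    doc = json.loads(inspect_json)
    sink_config = doc.get("sink_config")
    if not isinstance(sink_config, dict):
        return None
    return sink_config.get("@type")


def can_deploy(inspect_json: str) -> bool:
    """True when the package declares a sink type listed in SUPPORTED_SINKS."""
    return sink_type(inspect_json) in SUPPORTED_SINKS


# Made-up stand-ins for what `substreams inspect --json my.spkg` might print.
bundled_output = json.dumps({
    "network": "goerli",
    "sink_module": "kv_out",
    "sink_config": {
        "@type": "sf.substreams.sink.kv.v1.GenericConfig",
        "initialBlock": "12000000",
    },
})
plain_output = json.dumps({"network": "goerli"})
```

A package without any `sink_config` is simply reported as not deployable by this operator, which matches the "optionally deployable" framing of the proposal.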

👍 great proposal @abourget really like the idea of bundled Substreams packages meant to handle all components (Extract,Load,Query)

As for filenames, I'm not sure about the extended types: these files will most likely end up as IPFS hashes or randomly named packages, whereas the file extension usually doesn't change.

My preference would be to not worry too much about the entire filename itself but make sure it's using *.spkgs (ex: Bundled *.spkg)

Filenames would look like:

QmPpLvBDJ9TbG7syHGUjGvyiFzYB9H3M5FuZjzWnQovcQx.spkgs
mysubstreamsmod-v1.0.2.spkg (only substreams map/stores)
mysubstreamsmod-v1.0.2.spkgs (bundled substreams with services)

As long as there's a way to inspect the modules & services via:

substreams inspect <package>
substreams info [<manifest_file>]

As for bundling, could look something like this:

substreams bundle [<manifest_file>]

The bundle tool would need to be in each sink, though, unless there's a generalizable packer from sink manifests to a corresponding Protobuf message (that the substreams CLI doesn't know about initially?).

There are two options:

Either we have a single sink_meta in the Substreams Package. And perhaps that sink protobuf contains its own query_meta.
or we have sink_meta + query_meta in the Substreams Package.

Option 1) means that we can't know if there's a query config in the package, because the substreams CLI is general, and doesn't know about all the possible sinks. All it can know is that there is a sf.whatever.sink.v1.KV configured in this package. You'll need to use the sinkkv toolkit to view the specific configuration. UNLESS the bundle operation also adds the required protobuf, in which case you could have a JSON view of the sink_meta.

Option 2) would allow us to know more about what's in the package, and to know whether it's also a deployable query layer. Having that at the top level would allow the package to be characterised as "deployable reading software", and the sink_meta could be characterised as "deployable writing software" taking its input from a sink.

There may be multiple possible sinks for one substream? How to handle that?

This is a dump of the previous issue's content, for reference. The main comment of the issue will contain what we've decided to go forward with.

Subgraphs are deployable units: packages that can be sent to someone with the proper runtime, and turned into a fully fledged service.

Substreams are packageable, but are not deployable units themselves, since they don't offer a query layer. They are only the transformation layer. The sinks are the components that turn a Substreams into something deployable, where with the proper runtime, it can be turned into a fully fledged service (graph-node being one of them).

I propose that Substreams Packages be augmented with a single field, called sink_meta of type pbany.Any at the end of https://github.com/streamingfast/substreams/blob/develop/proto/sf/substreams/v1/package.proto#L10-L22

Adding a single field at the top-level Package, means that a serialized spkg file could simply be appended with additional metadata.

Prometheus example

With a conventional field of type Any (which includes a fully qualified protobuf message name, and serialized bytes for that message), the substreams CLI could print whether some sink metadata is attached (with its type). Optionally provide details if the sink metadata becomes well known. Sinks could read their Substreams dependencies and configuration from a single location: the spkg file.
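
To illustrate why a generic CLI can report the sink type without knowing the sink's schema: an Any is itself a small message with a `type_url` string (field 1) and an opaque `value` (field 2), so the type name can be read without touching the value. The sketch below hand-decodes that envelope with the Python standard library only. The helper names are hypothetical, the value bytes are a placeholder, and the `type.googleapis.com/` prefix is the usual default rather than something this proposal mandates.

```python
def put_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def put_len_field(field_number: int, payload: bytes) -> bytes:
    """Encode one length-delimited field (wire type 2)."""
    return put_varint((field_number << 3) | 2) + put_varint(len(payload)) + payload


def get_varint(buf: bytes, pos: int):
    """Decode a varint at pos; return (value, next_position)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def any_type_name(any_bytes: bytes) -> str:
    """Return the message name from an Any's type_url, leaving value undecoded.

    Assumes the Any only contains its two length-delimited fields.
    """
    pos = 0
    while pos < len(any_bytes):
        tag, pos = get_varint(any_bytes, pos)
        length, pos = get_varint(any_bytes, pos)
        payload = any_bytes[pos:pos + length]
        pos += length
        if tag >> 3 == 1:  # field 1 is type_url
            return payload.decode("utf-8").rsplit("/", 1)[-1]
    return ""


# A sink_meta as a sink's pack tool might produce it; the value is opaque here.
sink_meta = (
    put_len_field(1, b"type.googleapis.com/sf.substreams.sink.kv.v1.GenericConfig")
    + put_len_field(2, b"<serialized sink config>")
)
type_name = any_type_name(sink_meta)
```

Printing more than the type name requires the sink's descriptors, which is why bundling the sink's proto_files alongside the Any matters for inspection.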

Postgres example

Another example would be PostgreSQL sink:

the Substreams module could prepare the data for output, but it does not make sense for any map or store module to provide the schema of the target database
having sink_meta information could convey the SQL schema needed to initialize the database when the sink needs it.

Tooling

The prometheus sink could provide the tooling necessary to take in its manifest (ideally staying close to the Substreams experience, say prometheus-sink.yaml), and build up its protobuf message and "bundle" it with the spkg, turning the spkg package into a deployable unit.

I'm unsure if the extension should change to indicate that an spkg is now a bundled package. Some possibilities:

mysubstreamsmodule-v1.0.2.spkg (bundled and unbundled alike)
mysubstreamsmodule-v1.0.2.spkgs meaning bundled with a sink configuration
mysubstreamsmodule-v1.0.2.prometheus.spkg to indicate the spkg is bundled with prometheus configuration?
mysubstreamsmodule-v1.0.2.nicegraphs-v2.0.4+prometheus.spkg to indicate the versions of both layers (transform and load/query)
mysubstreamsmodule-v1.0.2.nicegraphs-v2.0.4.prometheus.spkg ? This seems like the most complete.

Where an example prometheus manifest (prometheus_sink.yaml) could look like:

package:
  name: nicegraphs
  version: v2.0.4

source:
  package: my-package.spkg
  module: prom_out
  #start_block: -1000

metrics:
  this_metric: Help string of the metric
  this_other_metric: Help string of that metric
labels:
  this_label: Meaning of that label

grafana_dashboards: grafana.json

and where the grafana.json file would be packaged in, and attached to the spkg.

This package would have everything needed for a successful deployment, as a single deployable unit.

Implementable as:

substreams-sink-prometheus pack ./manifest.yaml

Prior art

Docker containers are another deployable unit type that exists. However, they require a very different sandboxing and trust layer, making it more risky to receive packages to deploy from unknown participants in a permissionless network.
WASM (and WASI) in particular are becoming nicely isolated deployable components, where the WebAssembly runtimes can be much more isolated and secure than Docker images. However, this wouldn't accommodate a large amount of software that is desirable to deploy from an indexing perspective: the whole of Prometheus + Grafana, PostgreSQL and the likes, won't be able to run easily in the tight WASM environment.

Having specialized runtimes, with Substreams Packages as deployable units, seems to be very fitting to our use cases, and warrants the further development of its format.

Is this the right abstraction?

If we consider Substreams as the transformation layer, and the Prometheus insertion as the "load" layer, we should think of the Grafana dashboards as the "query" layer, separate from "load". If we stretch our thinking here, we should imagine how to accommodate that additional query_metadata field right away.

Filenames could look like:

mysubstreamsmod-v1.0.2.myprommetrics-v3.2.3.myhugedashboards-v4.3.2.prometheus.grafana.spkg
mysubstreamsmod-v1.0.2.myprommetrics-v3.2.3.prometheus.myhugedashboards-v4.3.2.grafana.spkg
mysubstreamsmod-v1.0.2_myprommetrics-v3.2.3_myhugedashboards-v4.3.2.spkg (and leave the users name things meaningfully)
mysubstreamsmod-v1.0.2_myprommetrics-v3.2.3_myhugedashboards-v4.3.2.spkg+prom+grafana

That's pretty unwieldy, but we can imagine someone wanting to just update the dashboards, without redeploying the sink with its config (not reload postgres from scratch, etc.).

/cc @DenisCarriere @azf20 @fubhy

Example flow from developer to indexer operation, wanting to deploy a kvsink with the fixed gRPC endpoint:

Dev crafts the Substreams modules, writes the kv_out following https://github.com/streamingfast/substreams-sink-kv
Drafts a substreams.yaml manifest with included sink configuration:

specVersion: v0.1.0
package:
  name: my-eth-transfers-kvsink
  version: v2.3.2

imports:
  mod: ./substreams.yaml
  kvsink: https://github.com/releases/substreams-sink-kv-v1.0.1.spkg

protobuf:
  files:
    - sf/custom/v1/service.proto
  importPaths:
    - ./proto

sink:
  @type: sf.substreams.sink.kv.v1.WASMQueryConfig
  inputModule: mod:kv_out
  initialBlock: 12_000_000
  # @ for text files, @@ for binary files, \@ if you want an explicit at sign.
  wasmQueryModule: @@target/wasm32/release/mycode.wasm
  grpcService: sf.custom.v1.Service

sink:
  @type: sf.substreams.sink.kv.v1.GenericConfig
  inputModule: mod:kv_out
  initialBlock: 12_000_000

For a gRPC service like:


package sf.custom.v1;
service Services {
  rpc SayHello(HelloRequest) returns (HelloReply) {}
}

The dev runs:

$ substreams-sink-kv pack sink-kv.yaml
Writing mysink-v2.0.1.spkg

The spkg file contains the sf.substreams.sink.kv.v1.GenericService configuration protobuf message in the sink_config field, as well as the Substreams modules.

The dev ships the file to an indexer
The indexer then inspects the contents:

substreams inspect ./mysink-v2.0.1.spkg | grep ^target_sink |grep sf.substreams.kv.v1.SinkGenericQuery

and deploys or not, depending on whether he knows how to deploy such a unit:

if [ $? -ne 0 ]; then echo "Unsupported sink"; exit 1; fi

# Transforms the `ethereum:mainnet` value in `target_network` into ETHEREUM_MAINNET, and resolves any aliases.
NETWORK=$(substreams tools network-env-var ./mysink-v2.0.1.spkg)
ENVVAR=MY_ENDPOINTS_CONFIGS_$NETWORK
ENDPOINT=${!ENVVAR}
if [ -z "$ENDPOINT" ]; then echo "Unsupported network $NETWORK"; exit 1; fi

substreams-sink-kv run -e $ENDPOINT mysink-v2.0.1.spkg

Of course, any more sophisticated deployment machinery can be built around, but the simple case is possible.
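
As one example of slightly more sophisticated machinery, the network-to-endpoint lookup from the script above can be sketched in Python. The `ethereum:mainnet` to `ETHEREUM_MAINNET` transformation and the `MY_ENDPOINTS_CONFIGS_` prefix come from the script; the alias table, the function names and the endpoint value are assumptions made for illustration.

```python
import re
from typing import Mapping, Optional

# Hypothetical alias table; the proposal only says aliases get resolved.
NETWORK_ALIASES = {"mainnet": "ethereum:mainnet"}


def network_env_var(network: str) -> str:
    """Turn a network name such as "ethereum:mainnet" into "ETHEREUM_MAINNET"."""
    canonical = NETWORK_ALIASES.get(network, network)
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^A-Za-z0-9]+", "_", canonical).upper()


def resolve_endpoint(
    network: str,
    env: Mapping[str, str],
    prefix: str = "MY_ENDPOINTS_CONFIGS_",
) -> Optional[str]:
    """Look up the operator-provided endpoint for a network, or None if unsupported."""
    return env.get(prefix + network_env_var(network))


# Operator-side configuration; in practice this would be os.environ.
operator_env = {"MY_ENDPOINTS_CONFIGS_ETHEREUM_MAINNET": "mainnet.example.com:443"}

mainnet_endpoint = resolve_endpoint("ethereum:mainnet", operator_env)
alias_endpoint = resolve_endpoint("mainnet", operator_env)
goerli_endpoint = resolve_endpoint("goerli", operator_env)
```

Returning `None` for an unknown network lets the caller decide between refusing the deployment and falling back to a default endpoint.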

No additional comments, this plan to package sinks into deployable units sounds like a great idea.

There will be some additional tooling required to "pack" in the various sinks, but that shouldn't be a barrier.

With simple manifests (ex: sink-kv.yaml), it shouldn't be too hard for users to add the additional configuration that relates to sinks and falls outside the scope of the map modules.

👍

Some thoughts:

From the indexer side: I feel like this is focusing too much on some kind of one-click managed solution that is likely not feasible in many cases. It requires indexers not only to run all kinds of managed services (Prometheus/Grafana/PostgreSQL/...), but also to figure out automated deployments of spkgs metadata (for example, I now need to figure out how to auto-configure Grafana to use some dashboard from the spkgs). It also requires a lot of security research (are we vulnerable to malicious input in the metadata or grafana.json dashboards, for example?). And then you might even need another layer wrapped around the spkgs to do things like secret injection (where do you put your Google API token, for example, when deploying a sheets-sink.spkgs?) or deployment configuration (how many virtual cores should be assigned to the deployment).
From the Substream developer side: I don't currently see a way how to make this easily deployable for myself. I want to have an easy and quick way to set this up locally for testing and development (including necessary dependencies such as PostgreSQL). I also want an easy way for me to deploy this to my own servers (without having to figure out how to set up my own runtime for deploying spkgs bundles).

I feel like the answer to both sides is likely Docker. I don't think sandboxing is too big of a deal; if we want to have a cloud solution, we could just deploy Substreams to a VPS on a cloud provider, for example. That way each Substreams deployment is contained in its own virtual machine, with no access to our internal networks.

This was a previous layout:

specVersion: v0.1.0
package:
  name: mysink
  version: v2.3.2

protobuf:
  files:
    - sf/mycustom/v1/service.proto
  importPaths:
    - ./proto

source:
  package: my-substreams-v1.0.2.spkg
  module: kv_out
  initialBlock: 12_000_000

service:
  kind: wasm
  binary: target/wasm32/release/mycode.wasm
  grpcService: sf.custom.v1.Services 

# service:
#   kind: generic

It's superseded by the comment at: https://github.com/streamingfast/substreams/issues/177#issuecomment-1440958604
