streamingfast / substreams

Powerful Blockchain streaming data engine, based on StreamingFast Firehose technology.
Apache License 2.0

Support sink and network metadata within an `spkg` #177

Closed abourget closed 1 year ago

abourget commented 1 year ago

New units of deployments:

What is needed:

  1. Each sink needs a Package (spkg) that defines the source of its data, by definition.
  2. Most sinks require metadata to be deployable autonomously (e.g. postgres needs some schemas, prometheus needs metrics descriptions, csv needs lists of tables / fields).
  3. Instead of inventing a different file format for each sink, we will extend the spkg format to accommodate an optional sink_config configuration.
  4. While we're there, we will add a target_network parameter, required by any sink that wants to issue a Substreams request.
  5. This turns spkgs into optionally deployable units.

Concretely, this means the Package message definition would be augmented as follows:

import "google/protobuf/any.proto";

message Package {
  repeated google.protobuf.FileDescriptorProto proto_files = 1;
  reserved 2; // In case protosets add a field some day.
  ...
  uint64 version = 5;
  sf.substreams.v1.Modules modules = 6;
  repeated ModuleMetadata module_meta = 7;
  repeated PackageMetadata package_meta = 8;

  ////// ADD THESE NEW FIELDS:

  // Source network for Substreams to fetch its data from.
  string network = 9;

  google.protobuf.Any sink_config = 10;
  string sink_module = 11;
}

Benefits:

Prometheus example

The substreams-prometheus-sink reads Substreams output in a certain shape, and writes to Prometheus, a popular time-series database. However, certain things need to be known about the data being written: schemas, help strings, declaration of metrics, etc. The current Package format has no space to accommodate such metadata.

Tooling

Example flow from developer to indexer operation, wanting to deploy a kvsink with the fixed gRPC endpoint:

specVersion: v0.1.0
package:
  name: my-eth-transfers-kvsink
  version: v2.3.2

imports:
  mod: ./substreams.yaml
  kvsink: https://github.com/releases/substreams-sink-kv-v1.0.1.spkg

protobuf:
  files:
    - sf/custom/v1/service.proto
  importPaths:
    - ./proto

network: goerli

sink:
  module: mod:kv_out
  type: sf.substreams.sink.kv.v1.WASMQueryConfig
  config:
    initialBlock: 12_000_000
    # @ for text files, @@ for binary files, \@ if you want an explicit at sign.
    wasmQueryModule: @@target/wasm32/release/mycode.wasm
    grpcService: sf.custom.v1.Service

### or:

#sink:
#  module: mod:kv_out
#  type: sf.substreams.sink.kv.v1.GenericConfig
#  config:
#    initialBlock: 12_000_000
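
The `@` / `@@` / `\@` convention mentioned in the manifest comment could be resolved by a small helper along these lines. This is only a sketch under stated assumptions: `resolve_value` is a hypothetical name, and encoding binary files as base64 is an assumption (it matches how protobuf JSON renders `bytes` fields), not something the proposal specifies.

```shell
#!/bin/sh
# Hypothetical helper: resolve a manifest value using the proposed prefixes.
#   \@text  -> literal value that starts with an at sign
#   @@path  -> contents of a binary file (base64-encoded here, by assumption)
#   @path   -> contents of a text file
#   other   -> the value itself, unchanged
resolve_value() {
  case "$1" in
    '\@'*) printf '%s\n' "${1#?}" ;;                  # drop the escaping backslash
    @@*)   base64 < "${1#@@}" | tr -d '\n'; echo ;;   # binary file, single-line base64
    @*)    cat "${1#@}" ;;                            # text file, as-is
    *)     printf '%s\n' "$1" ;;
  esac
}

resolve_value '\@handle'      # prints: @handle
resolve_value '12_000_000'    # prints: 12_000_000
```

The order of the `case` branches matters: the escaped form and the `@@` form have to be checked before the single `@` form, otherwise every binary reference would be treated as a text file whose name starts with `@`.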

For a gRPC service like:


package sf.custom.v1;
service Services {
  rpc SayHello(HelloRequest) returns (HelloReply) {}
}

$ substreams pack sink-kv.yaml
Writing mysink-v2.0.1.spkg

The spkg file contains the sf.substreams.sink.kv.v1.GenericService configuration protobuf message in the sink_config field, as well as the Substreams modules.

substreams sink-config ./mysink-v2.0.1.spkg
{"@type": "sf.substreams.sink.kv.v1.GenericService",
 "@module": "mod:kv_out",
 "@network": "ethereum:mainnet",
 "initialBlock": "123",
 "wasmQueryBinary": "F76G8720392831..02938102938=="
}

The prometheus sink could provide the tooling necessary to take in its manifest (ideally staying close to the Substreams experience, like prometheus-sink.yaml), and build up its protobuf message and "bundle" it with the spkg, turning the spkg package into a deployable unit.

Where an example prometheus manifest (prometheus_sink.yaml) could look like:

specVersion: v0.1.0
package:
  name: nicegraphs
  version: v2.0.4

imports:
  streams: ./substreams.yaml
  sink: https://github.com/releases/prometheus-sink-v1.2.0.spkg

sink:
  module: streams:prom_out
  type: sf.substreams.sink.prometheus.v1.PrometheusAndGrafana
  config:
    initialBlock: -1000
    metrics:
      "this_metric": Help string of the metric
      "this_other_metric": Help string of that metric
    labels:
      "this_label": Meaning of that label
    grafana_dashboards: @grafana.json

and where the grafana.json file would be packaged in, and attached to the spkg.

This package would have everything needed for a successful deployment, as a single deployable unit.

Implementable as:

substreams pack ./substreams-prometheus.yaml

You can then imagine a Kubernetes operator that is passed down such an spkg, and spins up services automatically.

Key/value sink specs:

Postgres:

message sf.substreams.postgres.v1.HasuraQueryService {
  string source_module = 2;
  string schema = 1;
  string override_initial_block = 3;
  string hasura_config = 4;
}
message sf.substreams.postgres.v1.WASMQueryService {
  string schema = 1;
  string source_module = 2;
  string override_initial_block = 3;
  // wasm exports: "pg_query"
  bytes wasm_query_module = 4;
  string grpc_service = 5;
}

MongoDB:

message sf.substreams.mongodb.v1.WASMQueryService {
  string source_module = 2;
  // wasm exports: "mongo_query"
  bytes wasm_query_module = 4;
}

Standardization of tools

Ideally, each tool standardizes on the same verb, pack, to take its manifest and turn it into an spkg.

substreams-sink-postgres pack ./my-postgres-manifest.yaml
substreams-sink-prometheus pack ./my-prom-manifest.yaml
substreams-sink-kv pack ./my-kv-manifest.yaml

to kickstart some conventions.
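
If every sink follows that naming convention, a wrapper script only needs the sink kind to find the right tool. The sketch below is an illustration, not part of the proposal: `pack_command` is a hypothetical helper, the list of kinds is just the three tools named above, and it prints the command line instead of running it so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: build the pack command for a sink kind, following the
# `substreams-sink-<kind> pack <manifest>` convention described above.
pack_command() {
  kind=$1
  manifest=$2
  case "$kind" in
    postgres|prometheus|kv)
      printf 'substreams-sink-%s pack %s\n' "$kind" "$manifest" ;;
    *)
      echo "unknown sink kind: $kind" >&2
      return 1 ;;
  esac
}

pack_command kv ./my-kv-manifest.yaml
# prints: substreams-sink-kv pack ./my-kv-manifest.yaml
```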

substreams inspect

We want to ensure inspect outputs something consumable by scripts, structured in JSON, or as it is right now (in a sort of flag key / value display), so that someone can do:

substreams inspect my.spkg | grep -A15 "^sink_config:"

and do some simple env variable replacement, and parameter passing.

In JSON:

SINK_TYPE=$(substreams inspect --json my.spkg | jq -r '.sink_config["@type"]')
if [ "$SINK_TYPE" = "my.sink.v1.Target" ]; then ...

inspect would also decode any Any fields, recursively, if they are available in the proto_files specs of the Package itself.

DenisCarriere commented 1 year ago

👍 great proposal @abourget really like the idea of bundled Substreams packages meant to handle all components (Extract,Load,Query)

As for filenames, I'm not sure about the extended types, because these files will most likely end up as IPFS hashes or randomly named packages; what usually doesn't change is the file extension.

My preference would be to not worry too much about the entire filename itself, but to make sure it uses the *.spkg extension (ex: Bundled *.spkg)

Filenames would look like:

As long as there's a way to inspect the modules & services via:

substreams inspect <package>
substreams info [<manifest_file>]

As for bundling, could look something like this:

substreams bundle [<manifest_file>]

abourget commented 1 year ago

The bundle tool would need to be in each sink, though, unless there's a generalizable packer from sink manifests to a corresponding Protobuf message (that the substreams CLI doesn't know about initially?).

There are two options:

  1. Either we have a single sink_meta in the Substreams Package, and perhaps that sink protobuf contains its own query_meta.
  2. Or we have sink_meta + query_meta in the Substreams Package.

Option 1) means that we can't know if there's a query config in the package, because the substreams CLI is general, and doesn't know about all the possible sinks. All it can know is that there is a sf.whatever.sink.v1.KV configured in this package. You'll need to use the sinkkv toolkit to view the specific configuration. UNLESS the bundle operation also adds the required protobuf, in which case you could have a JSON view of the sink_meta.

Option 2) would allow us to know more about what's in the package, and to know whether it's also a deployable query layer. Having that at the top level would allow the package to be characterised as "deployable reading software", and the sink_meta could be characterised as "deployable writing software" taking its input from a sink.

matthewdarwin commented 1 year ago

There may be multiple possible sinks for one substream? How to handle that?

abourget commented 1 year ago

This is a dump of the previous issue's content, for reference. The main comment of the issue will contain what we've decided to go forward with.


Subgraphs are deployable units: packages that can be sent to someone with the proper runtime, and turned into a fully fledged service.

Substreams are packageable, but are not deployable units themselves, since they don't offer a query layer. They are only the transformation layer. The sinks are the components that turn a Substreams into something deployable, where with the proper runtime, it can be turned into a fully fledged service (graph-node being one of them).

I propose that Substreams Packages be augmented with a single field, called sink_meta of type pbany.Any at the end of https://github.com/streamingfast/substreams/blob/develop/proto/sf/substreams/v1/package.proto#L10-L22

Adding a single field at the top level of Package means that a serialized spkg file could simply be appended with additional metadata.

Prometheus example

The substreams-prometheus-sink reads Substreams output in a certain shape, and writes to Prometheus, a popular time-series database. However, certain things need to be known about the data being written: schemas, help strings, declaration of metrics, etc. The current Package format has no space to accommodate such metadata.

With a conventional field of type Any (which includes a fully qualified protobuf message name, and serialized bytes for that message), the substreams CLI could print whether some sink metadata is attached (with its type). Optionally provide details if the sink metadata becomes well known. Sinks could read their Substreams dependencies and configuration from a single location: the spkg file.

Postgres example

Another example would be the PostgreSQL sink:

Tooling

The prometheus sink could provide the tooling necessary to take in its manifest (ideally staying close to the Substreams experience, say prometheus-sink.yaml), and build up its protobuf message and "bundle" it with the spkg, turning the spkg package into a deployable unit.

I'm unsure if the extension should change to indicate that an spkg is now a bundled package. Some possibilities:

Where an example prometheus manifest (prometheus_sink.yaml) could look like:

package:
  name: nicegraphs
  version: v2.0.4

source:
  package: my-package.spkg
  module: prom_out
  #start_block: -1000

metrics:
  this_metric: Help string of the metric
  this_other_metric: Help string of that metric
labels:
  this_label: Meaning of that label

grafana_dashboards: grafana.json

and where the grafana.json file would be packaged in, and attached to the spkg.

This package would have everything needed for a successful deployment, as a single deployable unit.

Implementable as:

substreams-sink-prometheus pack ./manifest.yaml

Prior art

Having specialized runtimes, with Substreams Packages as deployable units, seems very fitting for our use cases, and warrants the further development of the format.

Is this the right abstraction?

If we consider Substreams as the transformation layer, and the Prometheus insertion as the "load" layer, we should think of the Grafana dashboards as the "query" layer, separate from "load". If we stretch our thinking here, we should imagine how to accommodate that additional query_metadata field right away.

Filenames could look like:

That's pretty unwieldy, but we can imagine someone wanting to just update the dashboards, without redeploying the sink with its config (not reloading postgres from scratch, etc.)

/cc @DenisCarriere @azf20 @fubhy

abourget commented 1 year ago

Example flow from developer to indexer operation, wanting to deploy a kvsink with the fixed gRPC endpoint:

specVersion: v0.1.0
package:
  name: my-eth-transfers-kvsink
  version: v2.3.2

imports:
  mod: ./substreams.yaml
  kvsink: https://github.com/releases/substreams-sink-kv-v1.0.1.spkg

protobuf:
  files:
    - sf/custom/v1/service.proto
  importPaths:
    - ./proto

sink:
  @type: sf.substreams.sink.kv.v1.WASMQueryConfig
  inputModule: mod:kv_out
  initialBlock: 12_000_000
  # @ for text files, @@ for binary files, \@ if you want an explicit at sign.
  wasmQueryModule: @@target/wasm32/release/mycode.wasm
  grpcService: sf.custom.v1.Service

### or:

#sink:
#  @type: sf.substreams.sink.kv.v1.GenericConfig
#  inputModule: mod:kv_out
#  initialBlock: 12_000_000

For a gRPC service like:


package sf.custom.v1;
service Services {
  rpc SayHello(HelloRequest) returns (HelloReply) {}
}

$ substreams-sink-kv pack sink-kv.yaml
Writing mysink-v2.0.1.spkg

The spkg file contains the sf.substreams.sink.kv.v1.GenericService configuration protobuf message in the sink_config field, as well as the Substreams modules.

substreams inspect ./mysink-v2.0.1.spkg | grep '^target_sink' | grep 'sf.substreams.kv.v1.SinkGenericQuery'

and deploys or not, depending on whether he knows how to deploy such a unit:

if [ $? -ne 0 ]; then echo "Unsupported sink"; exit 1; fi

# Transforms the `ethereum:mainnet` value in `target_network` into ETHEREUM_MAINNET, and resolves any aliases.
NETWORK=$(substreams tools network-env-var ./mysink-v2.0.1.spkg)
ENVVAR=MY_ENDPOINTS_CONFIGS_$NETWORK
ENDPOINT=${!ENVVAR}
if [ -z "$ENDPOINT" ]; then echo "Unsupported network $NETWORK"; exit 1; fi

substreams-sink-kv run -e "$ENDPOINT" mysink-v2.0.1.spkg

Of course, any more sophisticated deployment machinery can be built around, but the simple case is possible.
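
For illustration, the network-to-endpoint lookup in that flow could be approximated in plain shell. This is a sketch, not the proposed `substreams tools network-env-var` command: `network_env_name` and `endpoint_for_network` are hypothetical helpers, alias resolution is left out, the endpoint value is a placeholder, and `eval` stands in for bash's `${!VAR}` so the snippet also runs under a POSIX `sh`.

```shell
#!/bin/sh
# Hypothetical helper: "ethereum:mainnet" -> "ETHEREUM_MAINNET".
network_env_name() {
  printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]' | tr ':' '_'
}

# Hypothetical helper: look up MY_ENDPOINTS_CONFIGS_<NETWORK> and print it.
# Returns non-zero when no endpoint is configured for that network.
endpoint_for_network() {
  name="MY_ENDPOINTS_CONFIGS_$(network_env_name "$1")"
  # Refuse anything that is not a plain variable name before using eval.
  case "$name" in
    *[!A-Z0-9_]*) return 1 ;;
  esac
  eval "value=\${$name:-}"
  [ -n "$value" ] || return 1
  printf '%s\n' "$value"
}

MY_ENDPOINTS_CONFIGS_ETHEREUM_MAINNET="mainnet.example.test:443"  # placeholder endpoint
endpoint_for_network "ethereum:mainnet"
# prints: mainnet.example.test:443
```

An operator script could call `endpoint_for_network` with the network recorded in the spkg and exit with "Unsupported network" when it returns non-zero, which mirrors the check in the flow above.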

DenisCarriere commented 1 year ago

No additional comments, this plan to package sinks into deployable units sounds like a great idea.

There will be some additional tooling required to "pack" in the various sinks, but that shouldn't be a barrier.

With simple manifests (ex: sink-kv.yaml), it shouldn't be too hard for users to add the additional configuration that relates to sinks and is outside the scope of the map modules.

👍

fschoell commented 1 year ago

Some thoughts:

  1. From the indexer side: I feel like this is focusing too much on some kind of one-click managed solution that is likely not feasible in many cases. It requires indexers not only to run all kinds of managed services (Prometheus/Grafana/PostgreSQL/...), but also to figure out automated deployment of spkg metadata (for example, I now need to figure out how to auto-configure Grafana to use some dashboard from the spkg). It also requires a lot of security research (are we vulnerable to malicious input in the metadata or the grafana.json dashboards, for example?). And then you might even need another layer wrapped around the spkg to do things like secret injection (where do you put your Google API token, for example, when deploying a sheets-sink.spkg?) or deployment configuration (how many virtual cores should be assigned to the deployment).

  2. From the Substreams developer side: I don't currently see how to make this easily deployable for myself. I want an easy and quick way to set this up locally for testing and development (including necessary dependencies such as PostgreSQL). I also want an easy way to deploy this to my own servers (without having to figure out how to set up my own runtime for deploying spkg bundles).

I feel like the answer to both sides is likely Docker. I don't think sandboxing is too big of a deal if we want to have a cloud solution; we could just deploy Substreams to a VPS on a cloud provider, for example. That way each Substreams deployment is contained in its own virtual machine, with no access to our internal networks.

abourget commented 1 year ago

This was a previous layout:

specVersion: v0.1.0
package:
  name: mysink
  version: v2.3.2

protobuf:
  files:
    - sf/mycustom/v1/service.proto
  importPaths:
    - ./proto

source:
  package: my-substreams-v1.0.2.spkg
  module: kv_out
  initialBlock: 12_000_000

service:
  kind: wasm
  binary: target/wasm32/release/mycode.wasm
  grpcService: sf.custom.v1.Services 

# service:
#   kind: generic

It's superseded by the comment at: https://github.com/streamingfast/substreams/issues/177#issuecomment-1440958604