👍 Great proposal @abourget, really like the idea of bundled Substreams packages meant to handle all components (Extract, Load, Query).
As for filenames, I'm not sure about the extended types: these files will most likely end up as IPFS hashes or randomly named packages, and usually the one thing that doesn't change is the file extension.
My preference would be to not worry too much about the entire filename itself, but to make sure bundled packages use *.spkgs (ex: bundled *.spkgs).

Filenames would look like:

- QmPpLvBDJ9TbG7syHGUjGvyiFzYB9H3M5FuZjzWnQovcQx.spkgs
- mysubstreamsmod-v1.0.2.spkg (only Substreams map/store modules)
- mysubstreamsmod-v1.0.2.spkgs (bundled Substreams with services)

As long as there's a way to inspect the modules & services via:

substreams inspect <package>
substreams info [<manifest_file>]
As for bundling, it could look something like this:

substreams bundle [<manifest_file>]

The bundle tool would need to be in each sink, though, unless there's a generalizable packer from sink manifests to a corresponding Protobuf message (that the substreams CLI doesn't know about initially?).
There are two options:

1. A single sink_meta in the Substreams Package. And perhaps that sink protobuf contains its own query_meta.
2. sink_meta + query_meta in the Substreams Package.

Option 1) means that we can't know if there's a query config in the package, because the substreams CLI is general, and doesn't know about all the possible sinks. All it can know is that there is a sf.whatever.sink.v1.KV configured in this package. You'll need to use the sinkkv toolkit to view the specific configuration. UNLESS the bundle operation also adds the required protobuf, in which case you could have a JSON view of the sink_meta.

Option 2) would allow us to know more about what's in the package, and know whether it's also a deployable query layer. Having that at the top level would allow the package to be characterised as "deployable reading software", and the sink_meta could be characterised as "deployable writing software" taking its input from a sink.
There may be multiple possible sinks for one substream? How to handle that?
This is a dump of the previous issue's content, for reference. The main comment of the issue will contain what we've decided to go forward with.
Subgraphs are deployable units: packages that can be sent to someone with the proper runtime, and turned into a fully fledged service.

Substreams are packageable, but are not deployable units themselves, since they don't offer a query layer. They are only the transformation layer. The sinks are the components that turn a Substreams into something deployable, where with the proper runtime, it can be turned into a fully fledged service (graph-node being one of them).
I propose that Substreams Packages be augmented with a single field, called sink_meta, of type pbany.Any, at the end of https://github.com/streamingfast/substreams/blob/develop/proto/sf/substreams/v1/package.proto#L10-L22

Adding a single field at the top-level Package means that a serialized spkg file could simply be appended with additional metadata.
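As a rough sketch of that proposal (the field name and type are taken from above; the field number and the elided existing fields are assumptions, not the actual package.proto contents):

import "google/protobuf/any.proto";

message Package {
  // ... existing Package fields (modules, proto files, metadata, ...) ...

  // Proposed: opaque, sink-specific metadata, self-describing via its type URL.
  google.protobuf.Any sink_meta = 99; // field number is illustrative only
}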
The substreams-prometheus-sink reads Substreams output in a certain shape, and writes to Prometheus, a popular time series database. However, certain things need to be known about the data being written: schemas, help strings, declaration of metrics, etc. The current Package format has no space to accommodate such metadata.

With a conventional field of type Any (which includes a fully qualified protobuf message name, and serialized bytes for that message), the substreams CLI could print whether some sink metadata is attached (with its type), and optionally provide details if the sink metadata becomes well known. Sinks could read their Substreams dependencies and configuration from a single location: the spkg file.
Another example would be the PostgreSQL sink:

- a map or store module to provide the schema of the target database
- sink_meta information could convey the SQL schema needed to initialize the database when the sink needs it.

The prometheus sink could provide the tooling necessary to take in its manifest (ideally staying close to the Substreams experience, say prometheus-sink.yaml), and build up its protobuf message and "bundle" it with the spkg, turning the spkg into a deployable unit.
I'm unsure if the extension should change to indicate that an spkg is now a bundled package. Some possibilities:

- mysubstreamsmodule-v1.0.2.spkg (bundled and unbundled alike)
- mysubstreamsmodule-v1.0.2.spkgs, meaning bundled with a sink configuration
- mysubstreamsmodule-v1.0.2.prometheus.spkg, to indicate the spkg is bundled with prometheus configuration?
- mysubstreamsmodule-v1.0.2.nicegraphs-v2.0.4+prometheus.spkg, to indicate the versions of both layers (transform and load/query)
- mysubstreamsmodule-v1.0.2.nicegraphs-v2.0.4.prometheus.spkg? This seems like the most complete.

Where an example prometheus manifest (prometheus_sink.yaml) could look like:
package:
  name: nicegraphs
  version: v2.0.4

source:
  package: my-package.spkg
  module: prom_out
  #start_block: -1000

metrics:
  this_metric: Help string of the metric
  this_other_metric: Help string of that metric

labels:
  this_label: Meaning of that label

grafana_dashboards: grafana.json
and where the grafana.json file would be packaged in, and attached to the spkg.
This package would have everything needed for a successful deployment, as a single deployable unit.
Implementable as:
substreams-sink-prometheus pack ./manifest.yaml
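As a hedged sketch, the protobuf message that such a pack step could build from the manifest above might carry the same fields; the message and field names here are illustrative, not an existing schema:

// Illustrative shape only, derived from the prometheus_sink.yaml fields above.
message PrometheusSinkMeta {
  string input_module = 1;          // source.module, e.g. "prom_out"
  map<string, string> metrics = 2;  // metric name -> help string
  map<string, string> labels = 3;   // label name -> meaning of the label
  bytes grafana_dashboards = 4;     // contents of the packaged grafana.json
}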
Having specialized runtimes, with Substreams Packages as deployable units, seems very fitting to our use cases, and warrants further development of the format.
If we consider Substreams as the transformation layer, and the Prometheus insertion as the "load" layer, we should think of the Grafana dashboards as the "query" layer, separate from "load". If we stretch our thinking here, we should imagine how to accommodate that additional query_metadata field right away.
Filenames could look like:

- mysubstreamsmod-v1.0.2.myprommetrics-v3.2.3.myhugedashboards-v4.3.2.prometheus.grafana.spkg
- mysubstreamsmod-v1.0.2.myprommetrics-v3.2.3.prometheus.myhugedashboards-v4.3.2.grafana.spkg
- mysubstreamsmod-v1.0.2_myprommetrics-v3.2.3_myhugedashboards-v4.3.2.spkg (and leave the users name things meaningfully)
- mysubstreamsmod-v1.0.2_myprommetrics-v3.2.3_myhugedashboards-v4.3.2.spkg+prom+grafana

That's pretty unwieldy, but we can imagine someone wanting to just update the dashboards, without redeploying the sink with its config (not reload postgres from scratch, etc.)
/cc @DenisCarriere @azf20 @fubhy
Example flow from developer to indexer operation, wanting to deploy a kvsink with the fixed gRPC endpoint:

- a kv_out module, following https://github.com/streamingfast/substreams-sink-kv
- a substreams.yaml manifest with included sink configuration:

specVersion: v0.1.0
package:
  name: my-eth-transfers-kvsink
  version: v2.3.2

imports:
  mod: ./substreams.yaml
  kvsink: https://github.com/releases/substreams-sink-kv-v1.0.1.spkg

protobuf:
  files:
    - sf/custom/v1/service.proto
  importPaths:
    - ./proto

sink:
  @type: sf.substreams.sink.kv.v1.WASMQueryConfig
  inputModule: mod:kv_out
  initialBlock: 12_000_000
  # @ for text files, @@ for binary files, \@ if you want an explicit at sign.
  wasmQueryModule: @@target/wasm32/release/mycode.wasm
  grpcService: sf.custom.v1.Service

# or, alternatively:
sink:
  @type: sf.substreams.sink.kv.v1.GenericConfig
  inputModule: mod:kv_out
  initialBlock: 12_000_000
For a gRPC service like:
package sf.custom.v1;

service Services {
  rpc SayHello(HelloRequest) returns (HelloReply) {}
}
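For completeness, the HelloRequest and HelloReply messages referenced above are not defined in the issue; a minimal illustrative shape could be:

message HelloRequest {
  string name = 1; // illustrative field
}

message HelloReply {
  string message = 1; // illustrative field
}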
$ substreams-sink-kv pack sink-kv.yaml
Writing mysink-v2.0.1.spkg
The spkg file contains the sf.substreams.sink.kv.v1.GenericService configuration protobuf message in the sink_config field, as well as the Substreams modules.
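As a sketch only, the kv sink configuration messages referenced by the manifest's @type values above could carry the same fields; these shapes are assumptions, not the actual sf.substreams.sink.kv.v1 definitions:

// Hypothetical shapes inferred from the manifest fields above.
message GenericConfig {
  string input_module = 1;  // e.g. "mod:kv_out"
  uint64 initial_block = 2; // e.g. 12_000_000
}

message WASMQueryConfig {
  string input_module = 1;
  uint64 initial_block = 2;
  bytes wasm_query_module = 3; // the packaged wasm binary
  string grpc_service = 4;     // e.g. "sf.custom.v1.Service"
}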
substreams inspect ./mysink-v2.0.1.spkg | grep ^target_sink | grep sf.substreams.kv.v1.SinkGenericQuery

and deploys or not, depending on whether they know how to deploy such a unit:

if [ $? -ne 0 ]; then echo "Unsupported sink"; exit 1; fi
# transforms the `ethereum:mainnet` value in `target_network` into ETHEREUM_MAINNET, and resolves any aliases
NETWORK=$(substreams tools network-env-var ./mysink-v2.0.1.spkg)
ENVVAR=MY_ENDPOINTS_CONFIGS_$NETWORK
ENDPOINT=${!ENVVAR}
if [ -z "$ENDPOINT" ]; then echo "Unsupported network $NETWORK"; exit 1; fi
substreams-sink-kv run -e $ENDPOINT mysink-v2.0.1.spkg
Of course, more sophisticated deployment machinery can be built around this, but the simple case is possible.
No additional comments, this plan to package sinks into deployable units sounds like a great idea.
There will be some additional tooling required to "pack" in the various sinks, but that shouldn't be a barrier.

Simple manifests (ex: sink-kv.yaml) shouldn't make it too hard for users to add the additional configuration that's related to sinks and outside the scope of the map modules.
👍
Some thoughts:
From the indexer side: I feel like this is focusing too much on some kind of one-click managed solution that is likely not feasible in many cases. It requires indexers not only to run all kinds of managed services (Prometheus/Grafana/PostgreSQL/...), but it also requires all of them to figure out automated deployments of spkgs metadata (for example, I now need to figure out a way to auto-configure Grafana to use some dashboard from the spkgs). It also requires a lot of security research (are we vulnerable to malicious input in the metadata or grafana.json dashboards, for example?). And then you might even need another layer wrapped around the spkgs to do things like secret injection (where do you put your Google API token, for example, when deploying a sheets-sink.spkgs?) or potential deployment configuration (how many virtual cores should be assigned to the deployment).

From the Substreams developer side: I don't currently see a way to make this easily deployable for myself. I want to have an easy and quick way to set this up locally for testing and development (including necessary dependencies such as PostgreSQL). I also want an easy way to deploy this to my own servers (without having to figure out how to set up my own runtime for deploying spkgs bundles).

I feel like the answer to both sides is likely Docker. I don't think sandboxing is too big of a deal; if we want to have a cloud solution, we could just deploy Substreams to a VPS on a cloud provider, for example. That way each Substreams deployment is contained in its own virtual machine, with no access to our internal networks.
This was a previous layout:
specVersion: v0.1.0
package:
  name: mysink
  version: v2.3.2

protobuf:
  files:
    - sf/mycustom/v1/service.proto
  importPaths:
    - ./proto

source:
  package: my-substreams-v1.0.2.spkg
  module: kv_out
  initialBlock: 12_000_000

service:
  kind: wasm
  binary: target/wasm32/release/mycode.wasm
  grpcService: sf.custom.v1.Services

# service:
#   kind: generic
It's superseded by the comment at: https://github.com/streamingfast/substreams/issues/177#issuecomment-1440958604
New units of deployments:

- Subgraphs are deployable units: packages that can be sent to someone with the proper runtime (graph-node), and turned into a long-running service.
- Substreams are packageable, but are not deployable units themselves, since they don't offer a query layer. The sinks are the components that turn a Substreams into something deployable, with the proper runtime turning it into a fully fledged service (graph-node being one of them).

What is needed:

- A Substreams package (spkg) that defines the source of its data, by definition.
- Extend the spkg format to accommodate an optional sink_config configuration.
- A target_network parameter, required by any sink that wants to issue a Substreams request.
- This turns spkgs into optionally deployable units.

Concretely, that means the Package message definition would be augmented in this way:
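A hedged sketch of that augmentation (field names taken from this issue; the field numbers and the elided existing fields are assumptions, not the actual package.proto contents):

import "google/protobuf/any.proto";

message Package {
  // ... existing Package fields ...

  // Network the package targets, e.g. "ethereum:mainnet".
  string target_network = 98;           // field number is illustrative only
  // Opaque, sink-specific configuration, self-describing via its type URL.
  google.protobuf.Any sink_config = 99; // field number is illustrative only
}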
Benefits:

- The spkg file being a self-describing message.
- Inspection of the spkg by the substreams CLI tool, both in terms of Protobuf schema definition as well as sink metadata.
- Tooling (substreams inspect) allows an Indexer to discover how an spkg can be deployed, even with some simple bash scripts.
- Sink tooling can take an spkg and tack on some additional proto_files and a sink_config, making it very easy to take a Substreams package and configure it for a given sink.

Prometheus example
The substreams-prometheus-sink reads Substreams output in a certain shape, and writes to Prometheus, a popular time series database. However, certain things need to be known about the data being written: schemas, help strings, declaration of metrics, etc. The current Package format has no space to accommodate such metadata.

Tooling
Example flow from developer to indexer operation, wanting to deploy a kvsink with the fixed gRPC endpoint:

- a kv_out module, following https://github.com/streamingfast/substreams-sink-kv
- a substreams.yaml manifest with included sink configuration (see the example above)

For a gRPC service like the sf.custom.v1 service shown above.

The spkg file contains the sf.substreams.sink.kv.v1.GenericService configuration protobuf message in the sink_config field, as well as the Substreams modules.

The prometheus sink could provide the tooling necessary to take in its manifest (ideally staying close to the Substreams experience, like prometheus-sink.yaml), and build up its protobuf message and "bundle" it with the spkg, turning the spkg into a deployable unit.

Where an example prometheus manifest (prometheus_sink.yaml) could look like the one shown above, and where the grafana.json file would be packaged in, and attached to the spkg.

This package would have everything needed for a successful deployment, as a single deployable unit.
Implementable as:
substreams pack ./substreams-prometheus.yaml
You can then imagine a Kubernetes operator that is passed down such an spkg, and spins up services automatically.
Key/value sink specs:
Postgres:
MongoDB:
Standardization of tools

Ideally, each tool standardizes around the same verb pack to take its manifest and turn it into an spkg, to kickstart some conventions.
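For instance, hypothetical invocations (only the kv and prometheus forms appear elsewhere in this issue; the postgres one is an assumed analogue):

substreams-sink-kv pack ./sink-kv.yaml
substreams-sink-prometheus pack ./prometheus-sink.yaml
substreams-sink-postgres pack ./sink-postgres.yaml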
substreams inspect

We want to ensure inspect outputs something consumable by scripts, structured in JSON, or as it is right now (in a sort of flat key / value display), so that someone can do some simple env variable replacement and parameter passing (as in the bash example above).

In JSON, inspect would also decode any Any fields, recursively, if they are available in the proto_files specs of the Package itself.