Substreams Sink to KV (raw)

maoueh commented 1 year ago

We are going to write a sink, substreams-sink-kv, that will be a plain binary like the other sinks we have, as well as a "library" that will be used in a follow-up task (putting a GraphQL API on top of the sink tool).

So the goal of this task is to:

Create a new substreams-sink-kv project (probably copied from an existing one, then trimmed down)
Define the output format that the Substreams should output to be consumed by substreams-sink-kv
- A good understanding of https://github.com/streamingfast/substreams/issues/94 is required to define the model properly. For example:
```
syntax = "proto3";

message KV {
  repeated Entry entries = 1;
}

message Entry {
  enum Operation {
    UNKNOWN = 0; // Protobuf default, should not be used
    SET = 1;
  }
  string key = 1;
  bytes value = 2;
  Operation operation = 3;
}
```
  Are we going to be able to properly know the value type for the reflection part of the GraphQL API? (Just a consideration.)
The sink should leverage kvdb to avoid having to re-invent the wheel
- It seems we might hit some roadblocks here: there appears to be some kind of problem between wasm-time and kvdb around the zstd encoder/decoder. I think we should be able to fix it, but it might require some research or a dependency update.
The sink should do work similar to other sinks such as the Postgres one: sink received data to a key/value store, batch and optimize writes, handle undo/redo, etc.
Write the documentation for it in the README, just like the other sinks.
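
To make the undo/redo requirement above concrete, here is a minimal sketch of one way a sink could record undo segments per block. It uses an in-memory dict as a stand-in for the store. The class `UndoableKV` and its methods are hypothetical names for illustration only; this is not how kvdb or the existing sinks implement it.

```python
from typing import Dict, List, Optional, Tuple


class UndoableKV:
    """In-memory stand-in for a key/value store with per-block undo segments.

    Hypothetical illustration only, not the kvdb API.
    """

    def __init__(self) -> None:
        self.data: Dict[str, bytes] = {}
        # block number -> list of (key, previous value, or None if the key was absent)
        self.undo_log: Dict[int, List[Tuple[str, Optional[bytes]]]] = {}

    def apply_block(self, block_num: int, writes: Dict[str, bytes]) -> None:
        """Write all entries of one block, remembering what they replaced."""
        segment: List[Tuple[str, Optional[bytes]]] = []
        for key, value in writes.items():
            segment.append((key, self.data.get(key)))
            self.data[key] = value
        self.undo_log[block_num] = segment

    def undo_block(self, block_num: int) -> None:
        """Revert one block by restoring the previous values in reverse order."""
        segment = self.undo_log.pop(block_num)
        for key, previous in reversed(segment):
            if previous is None:
                self.data.pop(key, None)
            else:
                self.data[key] = previous
```

A real sink would presumably also persist the cursor in the same write batch and drop undo segments once blocks become final; both are left out here.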

Keyspace Consideration

prefix k is reserved for keys set by the Substreams: k + user-gend-key -> protobuf encoded message
prefix i will be reserved for indexing (a key pointing to another key): i + user-gend-key -> KEY reference (pointing to a key that starts with k)
prefix x is the system prefix
- xc => cursor saving
- xr:ffffff123 => tracking of undo/redo segments

This way, one could write indexes, and query indexes without duplicating the data. Also, this allows us to support stores that do not distinguish between an empty value and an absent key.
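
As a rough sketch of the keyspace described above, the helpers below build prefixed keys and resolve an index entry to the data it points at. A plain dict stands in for the store; the function names and the choice to store the pointer as a UTF-8 encoded key are assumptions made for illustration.

```python
from typing import Dict, Optional

DATA_PREFIX = "k"    # keys set by the Substreams
INDEX_PREFIX = "i"   # index entries pointing to a "k" key
CURSOR_KEY = "xc"    # system key used for cursor saving


def data_key(user_key: str) -> str:
    """Full key under which the protobuf-encoded message is stored."""
    return DATA_PREFIX + user_key


def index_key(user_key: str) -> str:
    """Full key of an index entry; its value is a reference to a data key."""
    return INDEX_PREFIX + user_key


def lookup_by_index(store: Dict[str, bytes], index_user_key: str) -> Optional[bytes]:
    """Follow an index entry to the data it references, without duplicating it."""
    pointer = store.get(index_key(index_user_key))
    if pointer is None:
        return None
    # Assumption: the pointer is the full data key, UTF-8 encoded.
    return store.get(pointer.decode("utf-8"))
```

For example, writing a payload under `data_key("token:1")` and then storing `data_key("token:1").encode()` under `index_key("symbol:ABC")` lets `lookup_by_index(store, "symbol:ABC")` return that payload while it is stored only once.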

Important: When doing the work, let's keep in mind that it is going to be used as a "library" in the follow-up task, which is to create a GraphQL API on top of it (https://github.com/streamingfast/substreams/issues/94).

abourget commented 1 year ago

Keyspace:

k + user-gend-key -> protobuf encoded message
i + user-gend-key -> KEY reference (pointing to a key that starts with k)
x system prefix
- xc => cursor saving
- xr:ffffff123 => tracking of undo/redo segments

This way, one could write indexes, and query indexes without duplicating the data. Also, this allows us to support stores that do not distinguish between an empty value and an absent key.

abourget commented 1 year ago

```
message BatchUpdate {
  repeated KVOperation kv_ops = 1;
//  repeated IndexOperation index_ops = 2;
}
message KVOperation {
  Operation operation = 1;
  KV kv = 2;
}
//message IndexOperation {
//  Operation operation = 1;
//  Index index = 2;
//}
enum Operation {
  UNKNOWN = 0;    // Protobuf default should not be used
  SET = 1;
  DELETE = 2;
}
message KV {
  string key = 1;
  bytes value = 2;
}
//message Index {
//  string key = 1;
//  string pointer = 2;
//}
```
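
To illustrate how a consumer might interpret the proposed messages, here is a small sketch that mirrors them as Python dataclasses and applies a batch to a dict standing in for the store. The dataclasses are hand-written stand-ins rather than generated protobuf code, and `apply_batch` is a hypothetical helper; rejecting `UNKNOWN` is an assumption based on the comment in the enum.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List


class Operation(IntEnum):
    UNKNOWN = 0  # protobuf default, should not be used
    SET = 1
    DELETE = 2


@dataclass
class KV:
    key: str
    value: bytes = b""


@dataclass
class KVOperation:
    operation: Operation
    kv: KV


@dataclass
class BatchUpdate:
    kv_ops: List[KVOperation] = field(default_factory=list)


def apply_batch(store: Dict[str, bytes], batch: BatchUpdate) -> None:
    """Apply each operation in order; later operations win on the same key."""
    for op in batch.kv_ops:
        if op.operation == Operation.SET:
            store[op.kv.key] = op.kv.value
        elif op.operation == Operation.DELETE:
            store.pop(op.kv.key, None)
        else:
            # Assumption: an unset operation is treated as a producer bug.
            raise ValueError(f"unsupported operation for key {op.kv.key!r}")
```

Applying operations in order means a `SET` followed by a `DELETE` on the same key within one batch leaves the key absent.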

sduchesneau commented 1 year ago

How is there a problem between wasm-time and kvdb around the zstd encoder/decoder, given that we don't require wasm-time for any insertion or gRPC-based reading?

When we get to prototyping the wasm-based "resolvers", we can figure out those minor issues imho

sduchesneau commented 1 year ago

streamingfast / substreams

Substreams Sink to KV (raw) #93