pinax-network / subtivity-substreams

Subtivity Substreams - powered by Pinax
https://subtivity-substreams.vercel.app
MIT License

Thoughts on Substream structuring #5

Closed fschoell closed 1 year ago

fschoell commented 1 year ago

I've put some thought into how we could structure the Substreams; open for discussion.

The parts I see:

  1. Chain specific code
  2. Chain agnostic code
  3. Database output

1. Chain specific code

This part should transform a chain-specific block like eth::Block into a chain-agnostic BlockStats. Proposal for BlockStats:

message BlockStats {
  string chain = 1;                         // the blockchain we are running on
  int64 block_num = 2;                      // block number for which we hold the block stats 
  google.protobuf.Timestamp timestamp = 3;  // timestamp of the block
  string block_id = 4;                      // block id 

  int64 transaction_count = 5;       // number of successfully executed transactions in this block
  int64 event_count = 6;             // number of successfully executed events in this block
  int64 transactions_per_second = 7; // transaction_count / block_time
  int64 events_per_second = 8;       // event_count / block_time

  bool is_first_block = 9;        // true if this is the first block of the chain
  bool is_first_day_block = 10;   // true if this is the first block of the day
  bool is_first_hour_block = 11;  // true if this is the first block of the hour

  repeated string accounts = 12; // list of unique accounts/wallets used in this block
}

This will allow all subsequent maps to operate completely chain-agnostically. To produce the full block stats we likely need two maps and one store per chain. It could look like this:

  1. map_partial_blocks(<chain>::Block) -> BlockStats: parses the chain-specific <chain>::Block and emits a BlockStats with every field it can fill on its own (fields 1-6 and 12).
  2. store_partial_block_stats(BlockStats): stores each partial BlockStats from above, keyed by block_num.
  3. map_full_block_stats(BlockStats, store_partial_block_stats): takes the partial BlockStats, looks up the previous block's BlockStats in store_partial_block_stats, and fills in the remaining fields. With both block x and block x-1 available, it knows how much time has passed since the last block, so it can calculate the transactions/events per second and determine whether this block is the first of the day/hour.
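The fill-up step in the last map could be sketched in plain Rust roughly as below. This is a hedged illustration, not module code: the struct mirrors the proposed BlockStats message but drops the string fields, the timestamp is simplified to unix seconds instead of google.protobuf.Timestamp, and the previous block is passed as an argument where the real module would read it from the store.

```rust
// Illustrative stand-in for the proposed BlockStats message (fields 2, 3, 5-11).
#[derive(Clone, Debug, PartialEq)]
struct BlockStats {
    block_num: i64,
    timestamp: i64, // unix seconds, stands in for google.protobuf.Timestamp
    transaction_count: i64,
    event_count: i64,
    transactions_per_second: i64,
    events_per_second: i64,
    is_first_block: bool,
    is_first_day_block: bool,
    is_first_hour_block: bool,
}

/// Fill the derived fields of `current` from the previous block's stats,
/// mirroring what map_full_block_stats would do after its store lookup.
fn fill_block_stats(mut current: BlockStats, previous: Option<&BlockStats>) -> BlockStats {
    match previous {
        None => {
            // No predecessor in the store: this is the chain's first block.
            current.is_first_block = true;
            current.is_first_day_block = true;
            current.is_first_hour_block = true;
        }
        Some(prev) => {
            // Time elapsed since the previous block, in seconds.
            let block_time = current.timestamp - prev.timestamp;
            if block_time > 0 {
                // Integer division, matching the int64 fields in the proposal.
                current.transactions_per_second = current.transaction_count / block_time;
                current.events_per_second = current.event_count / block_time;
            }
            // First block of the day/hour: the bucket changed vs. the previous block.
            current.is_first_day_block = current.timestamp / 86_400 != prev.timestamp / 86_400;
            current.is_first_hour_block = current.timestamp / 3_600 != prev.timestamp / 3_600;
        }
    }
    current
}

fn main() {
    let prev = BlockStats {
        block_num: 99, timestamp: 86_399, transaction_count: 10, event_count: 20,
        transactions_per_second: 0, events_per_second: 0,
        is_first_block: false, is_first_day_block: false, is_first_hour_block: false,
    };
    let cur = BlockStats { block_num: 100, timestamp: 86_401, ..prev.clone() };
    println!("{:?}", fill_block_stats(cur, Some(&prev)));
}
```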

2. Chain agnostic code

The accumulation and max tps/aps stores can now be completely chain agnostic and don't need to handle any logic regarding block times or similar. These should be common functions we can re-use across all Substreams; they would be responsible for aggregating daily/hourly transaction/event counts as well as the maximum transactions/events per second.

3. Database output

We might want to consider putting this into its own Substream, so that if we change something on the database side or add more outputs we don't need to re-sync the full Substream (because its hash has changed).

This would basically contain the db_out module, which is responsible for creating database_changes whenever:

Which parts to combine in a Substream

We could have:

YaroShkvorets commented 1 year ago

Looks good. Do we need block_id for the last_block table in the database?

I like 1+2 combined and 3 separate. It would also let us test composability. But it feels like it could be a handful to maintain. Tough choice.

fschoell commented 1 year ago

Do we need block_id for the last_block table in the database?

not sure if we need that, probably not.

I like 1+2 combined and 3 separate. It would also let us test composability. But it feels like it could be a handful to maintain. Tough choice.

Yes, I think 3 is the most likely to change, and its changes might not necessarily impact the stores. So I think it's useful to bundle it into its own Substream.

I'm not sure maintenance is actually a big issue here. The code for 2 and 3 should be shared among all Substreams, so it won't actually differ between them. We could probably get away with a simple script that replaces the dependencies in the substreams.yaml file and bundles them; no need to manually create all these Substreams.
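The "simple script" idea could be as small as the sketch below: render a shared substreams.yaml template per chain by substituting the chain-specific dependency. The placeholder names and the manifest snippet are illustrative assumptions, not the real manifest schema.

```rust
/// Render a shared manifest template for one chain by replacing the
/// hypothetical {{chain}} and {{partial_blocks_spkg}} placeholders.
fn render_manifest(template: &str, chain: &str, partial_blocks_spkg: &str) -> String {
    template
        .replace("{{chain}}", chain)
        .replace("{{partial_blocks_spkg}}", partial_blocks_spkg)
}

fn main() {
    // Illustrative template; the actual substreams.yaml layout will differ.
    let template = "\
specVersion: v0.1.0
package:
  name: subtivity_{{chain}}
imports:
  partial_blocks: {{partial_blocks_spkg}}
";
    let manifest = render_manifest(template, "eth", "subtivity-eth-v0.1.0.spkg");
    println!("{manifest}");
}
```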

DenisCarriere commented 1 year ago

Seems implemented