helium / blockchain-core

Apache License 2.0

Streams with parallel processing, lazy filtering and random sampling #1389

Open xandkar opened 2 years ago


TL;DR

This is a prerequisite to #1334; it also adds optimal random sampling of RocksDB records and other reusable stream operations.

Summary

  1. generalizes the stream type as data_stream:t/1, initially used for streaming blocks from a ledger snapshot;
  2. implements general sequence operations on streams:
    • eager, immediately consuming the whole stream:
      • foreach
      • fold
      • sample: implements an optimal reservoir sampling algorithm - usable wherever we need to pick random records from RocksDB;
      • pmap_to_bag: maps a stream to a list in parallel, with arbitrary reordering - this was used to parallelize build_hash_chain for a ~6x speedup in #1334;
    • lazy, deferring stream consumption until access time:
      • map
      • filter (can be used in https://github.com/helium/erlang-libp2p/pull/437; see blockchain_rocks_SUITE:t_sample_filtered/1)
      • append (chains N streams into one);
  3. exposes RocksDB access as a stream (blockchain_rocks), offering all of the above operations.
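For reference, the "optimal" reservoir sampling mentioned above is commonly Algorithm L (Li, 1994), which draws far fewer random numbers than the classic Algorithm R by skipping ahead geometrically. The actual sample implementation in this PR is Erlang; the following is only a Python sketch of the algorithm, with illustrative names:

```python
import math
import random

def reservoir_sample(stream, k, rng=None):
    """Pick k items uniformly at random from an iterable of unknown
    length, in one pass, using Algorithm L: instead of rolling a die
    per element, compute how many elements to skip before the next
    replacement."""
    rng = rng or random.Random()
    it = iter(stream)
    reservoir = []
    for _ in range(k):
        try:
            reservoir.append(next(it))
        except StopIteration:
            return reservoir  # stream shorter than k: return it whole
    w = math.exp(math.log(rng.random()) / k)
    while True:
        # number of elements to skip before the next candidate
        skip = math.floor(math.log(rng.random()) / math.log(1.0 - w))
        try:
            for _ in range(skip):
                next(it)
            item = next(it)
        except StopIteration:
            return reservoir
        reservoir[rng.randrange(k)] = item  # replace a random slot
        w *= math.exp(math.log(rng.random()) / k)
```

With k = 1 this degenerates to picking a single random record, the RocksDB use case named above.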
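The pmap_to_bag shape - parallel map whose results come back in completion order rather than stream order, hence a "bag" - can be sketched in Python as follows (the PR's implementation is Erlang; this function name and worker count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def pmap_to_bag(stream, f, workers=8):
    """Apply f to every element of the stream in parallel and collect
    the results into an unordered list: completion order, not input
    order, determines result order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(f, x) for x in stream]
        return [fut.result() for fut in as_completed(futures)]
```

Giving up ordering is what buys the speedup: no worker ever waits for a slower predecessor before its result can be emitted.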
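The lazy operations (map, filter, append) defer all work until the stream is consumed. A minimal Python sketch using generators, with hypothetical names standing in for the Erlang API:

```python
def smap(f, stream):
    # Lazily apply f; nothing is computed until a consumer pulls.
    for x in stream:
        yield f(x)

def sfilter(pred, stream):
    # Lazily drop elements failing pred.
    for x in stream:
        if pred(x):
            yield x

def sappend(*streams):
    # Chain N streams into one, consuming each only on demand.
    for s in streams:
        yield from s
```

Because each stage pulls from the previous one element at a time, a pipeline such as sfilter over smap over a RocksDB stream touches only as many records as the consumer actually requests.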

Pitch

Our data access patterns share the same broad shape - stream processing - and the same specific patterns, including, but not limited to:

While we have accumulated ad hoc solutions, they:

This PR canonicalizes a stream abstraction adaptable to any of those needs, implements the common stream-processing patterns, and adapts them to RocksDB, with tests at every step.

Reservations

Something like erlang-hel (standing for Helium Library) could be a good home for these utilities.