parallelize extrinsic decoding

👋 first time contributor here, hope my formatting is ok!

When running polkadot-archive with wasm execution & tracing off, I noticed that all or most of the work to archive blocks & extrinsics was happening on a single thread. This became the bottleneck for how quickly we can index a Substrate chain, so I've written a very basic parallel version of extrinsic decoding.

Initial tests have shown a significant speedup when decoding batches of extrinsics (with larger batches parallelizing better), and thus a significant increase in throughput of the archiver. However, this has only been tested on a few machines, and I don't think it's perfect yet - sometimes it pushes 100% CPU usage across all cores, sometimes not.

This has been running in production for a few days, so I feel fairly confident in it. Not sure if we need to add any configuration options for this - currently it hardcodes splitting the work into 16 sub-batches no matter the size of the input batch. Happy to add configuration for that if it's wanted.

This splits the input into batches, has each batch accumulate its own results, and then flattens those accumulated results. It may also be viable to completely flatten this & rely on rayon to decide how to parallelize, instead of batching.
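
For anyone skimming, here is a rough, std-only sketch of the batching shape described above. It is not the code in this PR: `Decoded`, `decode_one`, and `decode_batched` are made-up names, the "decoding" is a placeholder, and it uses `std::thread::scope` instead of rayon purely so the snippet stands alone. Only the fixed count of 16 sub-batches comes from the description.

```rust
use std::thread;

// Hypothetical stand-in for a decoded extrinsic; the real type differs.
#[derive(Debug, PartialEq)]
struct Decoded {
    len: usize,
}

// Hypothetical placeholder for the real decoding step.
fn decode_one(raw: &[u8]) -> Result<Decoded, String> {
    if raw.is_empty() {
        Err("empty extrinsic".to_string())
    } else {
        Ok(Decoded { len: raw.len() })
    }
}

// Fixed number of sub-batches, independent of the input size.
const SUB_BATCHES: usize = 16;

fn decode_batched(raw: &[Vec<u8>]) -> Result<Vec<Decoded>, String> {
    if raw.is_empty() {
        return Ok(Vec::new());
    }
    // Ceiling division, so the input yields at most SUB_BATCHES chunks.
    let chunk_size = (raw.len() + SUB_BATCHES - 1) / SUB_BATCHES;

    // Each worker decodes one chunk and accumulates its own Vec.
    let per_batch: Vec<Result<Vec<Decoded>, String>> = thread::scope(|s| {
        // Spawn every worker first, then join, so they run concurrently.
        let handles: Vec<_> = raw
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .map(|r| decode_one(r))
                        .collect::<Result<Vec<_>, _>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("decode worker panicked"))
            .collect()
    });

    // Flatten the per-batch results, preserving input order and
    // stopping at the first batch that reported an error.
    let mut out = Vec::with_capacity(raw.len());
    for batch in per_batch {
        out.extend(batch?);
    }
    Ok(out)
}
```

Calling `decode_batched(&raw)` on a slice of encoded extrinsics returns the decoded items in their original order. With a small input most of the 16 workers get only a handful of items, which is one reason larger batches would be expected to parallelize better. The fully flattened alternative would drop the explicit chunking and hand each item to rayon's parallel iterators, letting its work-stealing scheduler decide the split.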

paritytech / substrate-archive

parallelize extrinsic decoding #481