Parallelize index merging

Build	Forward decls	TU count	Indexing time (Ti) (s)	Merging time (Tm) (s)	Ti/Tm
Dev	No	100	16.1	0.6	26.8
Dev	No	1000	363.9	12.9	28.2
Dev	No	4564	1503.8	39.9	37.7
Opt	No	4564	270.8	9.3	29.1
Opt	WIP	4564	306.6	90.9	3.4
Opt	No	6371	1081.9	17.7	61
Opt	Final	6371	1137.1	105.9	10.7

Writing down some design notes here, because I think we should do this, but there are non-trivial data dependencies.

(Nw = number of workers)

One potential design which avoids fine-grained synchronization:

1. Loop over the main indexes and work-stealing parallel-map over their documents, accumulating vectors of pair<SymbolName, PointerUnion<SymbolInformation *, SymbolInformationBuilder *>> in an Nw x Nw matrix: one row per worker, with the column chosen by hash(SymbolName) % Nw.
2. Parallel-map over the matrix columns, accumulating into an Nw-length vector of (SymbolName -> PointerUnion<SymbolInformation *, SymbolInformationBuilder *>) maps, where column i (the entries with hash(SymbolName) % Nw == i) feeds map i.
3. Work-stealing parallel-map over the forward decl indexes, accumulating vectors of pair<SymbolName, SymbolInformation *> in an Nw x Nw matrix, as in step 1.
4. Parallel-map over the forward decl matrix columns. Each worker only performs lookups and modifies pointers in the map at the matching index in the vector created in step 2. Since the modified pointers are disjoint, there is no need for synchronization.
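A rough sketch of the four steps above, to make the data flow concrete. Everything here is an assumption for illustration: symbols are (std::string, int) pairs instead of SymbolInformation pointers, a shared atomic counter stands in for real work-stealing, std::unordered_map stands in for flat_hash_map, and the names bucketByHash, mergeColumns and applyForwardDecls are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins: a symbol is a name plus an int payload, and a
// document is a list of symbols.
using Symbol = std::pair<std::string, int>;
using Document = std::vector<Symbol>;
using Matrix = std::vector<std::vector<std::vector<Symbol>>>;
using Shard = std::unordered_map<std::string, std::vector<int>>;

// Run fn(workerId) on nw threads and wait for all of them to finish.
template <typename Fn> void runWorkers(std::size_t nw, Fn fn) {
  std::vector<std::thread> threads;
  for (std::size_t w = 0; w < nw; ++w)
    threads.emplace_back(fn, w);
  for (auto &t : threads)
    t.join();
}

// Steps 1 and 3: bucket symbols into an Nw x Nw matrix. Row = worker that
// produced the entry, column = hash(name) % Nw. Each worker writes only to
// its own row, so no locking is needed.
Matrix bucketByHash(const std::vector<Document> &docs, std::size_t nw) {
  Matrix matrix(nw, std::vector<std::vector<Symbol>>(nw));
  std::atomic<std::size_t> next{0}; // shared counter, standing in for work-stealing
  runWorkers(nw, [&](std::size_t w) {
    for (std::size_t i = next.fetch_add(1); i < docs.size(); i = next.fetch_add(1))
      for (const Symbol &sym : docs[i])
        matrix[w][std::hash<std::string>{}(sym.first) % nw].push_back(sym);
  });
  return matrix;
}

// Step 2: worker c merges column c into shard c. Columns are disjoint, so
// each shard has exactly one writer.
std::vector<Shard> mergeColumns(const Matrix &matrix) {
  std::size_t nw = matrix.size();
  std::vector<Shard> shards(nw);
  runWorkers(nw, [&](std::size_t c) {
    for (std::size_t row = 0; row < nw; ++row)
      for (const Symbol &sym : matrix[row][c])
        shards[c][sym.first].push_back(sym.second);
  });
  return shards;
}

// Step 4: worker c applies the forward decl entries in column c to shard c
// only. Entries with no matching symbol are ignored in this sketch.
void applyForwardDecls(const Matrix &fwd, std::vector<Shard> &shards) {
  std::size_t nw = shards.size();
  assert(fwd.size() == nw && "both matrices must use the same Nw");
  runWorkers(nw, [&](std::size_t c) {
    for (std::size_t row = 0; row < nw; ++row)
      for (const Symbol &sym : fwd[row][c]) {
        auto it = shards[c].find(sym.first);
        if (it != shards[c].end())
          it->second.push_back(sym.second);
      }
  });
}
```

The only synchronization points are the joins between phases. The cost is that every entry is copied once into the matrix before it reaches its shard.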

We could perhaps have some abstraction similar to the WorkerPool in Sorbet, utilizing an SPMC/MPMC queue for implementing work-stealing.
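For reference, a minimal sketch of what such an abstraction might look like. This is not Sorbet's actual WorkerPool API: SimpleWorkerPool is a hypothetical name, and a mutex-protected std::queue stands in for a proper lock-free SPMC/MPMC queue. Tasks receive the id of the worker that picked them up, so they can write to per-worker state (such as a matrix row) without locking.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class SimpleWorkerPool {
public:
  using Task = std::function<void(std::size_t /*workerId*/)>;

  explicit SimpleWorkerPool(std::size_t numWorkers) {
    for (std::size_t w = 0; w < numWorkers; ++w)
      workers_.emplace_back([this, w] { run(w); });
  }

  // Enqueue a task. Whichever worker becomes idle first picks it up, which
  // gives load balancing similar to work-stealing.
  void submit(Task task) {
    assert(task && "task must be callable");
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

  // Lets the workers drain the remaining tasks, then joins them.
  ~SimpleWorkerPool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      done_ = true;
    }
    cv_.notify_all();
    for (auto &t : workers_)
      t.join();
  }

private:
  void run(std::size_t workerId) {
    for (;;) {
      Task task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
        if (tasks_.empty())
          return; // done_ is set and nothing is left to run
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task(workerId); // run outside the lock
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<Task> tasks_;
  bool done_ = false;
  std::vector<std::thread> workers_;
};
```

With one task per document, the queue is the only contended structure. A lock-free queue could replace it later without changing the interface.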

Potential tweaks to above:

Use an absl::Mutex to skip the ceremony in step 3. Its memory footprint seems to be just 1 word. We can't directly stick it in the value of a flat_hash_map, since values are required to be MoveConstructible and a mutex isn't. We could use a separate vector holding pair<absl::Mutex, PointerUnion<...>> and store indexes into that vector as the map values in step 2.
Use a concurrent hash map and simplify steps 1 and 2 into just performing inserts into the concurrent map. IME, concurrent hash maps don't scale very well with the number of threads (e.g. in https://github.com/greg7mdp/parallel-hashmap/blob/master/html/parallel_hashmap.md, using 8 threads doesn't even increase throughput to 2x), whereas the above suggestion should scale linearly, because it has minimal synchronization. The other practical problem is that the more popular concurrent hash map types are the ones in Folly and TBB, both of which are big dependencies.
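A sketch of the first tweak (per-entry mutex in a side table), under some assumptions: std::mutex and std::unordered_map stand in for absl::Mutex and flat_hash_map so the example has no dependencies, an int stands in for the PointerUnion payload, and LockedSymbolTable is a hypothetical name. It uses a std::deque rather than a vector for the side table, because the non-movable slots rule out a vector that grows.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// A value guarded by its own lock. The mutex member makes Slot non-movable,
// which is why it cannot be the map's value type directly.
struct Slot {
  std::mutex mu;
  int value = 0; // hypothetical stand-in for the PointerUnion<...> payload
};

struct LockedSymbolTable {
  std::unordered_map<std::string, std::size_t> indexOf; // name -> slot index
  std::deque<Slot> slots; // emplace_back never moves existing elements

  // Build phase (step 2 in the notes). Not thread-safe: call it from a
  // single thread, or from the one worker that owns this table.
  void insert(const std::string &name, int value) {
    auto it = indexOf.find(name);
    if (it == indexOf.end()) {
      it = indexOf.emplace(name, slots.size()).first;
      slots.emplace_back();
    }
    slots[it->second].value += value;
  }

  // Concurrent update phase. The map and deque are only read here, so the
  // per-slot lock is the only synchronization needed.
  bool update(const std::string &name, int delta) {
    auto it = indexOf.find(name);
    if (it == indexOf.end())
      return false;
    assert(it->second < slots.size());
    Slot &slot = slots[it->second];
    std::lock_guard<std::mutex> lock(slot.mu);
    slot.value += delta;
    return true;
  }
};
```

Any worker can then apply a forward decl directly via update(), without first bucketing it into a matrix column. The trade-off is one lock acquisition per forward decl, which is contended only when several workers hit the same symbol at once.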
