pnnl / chgl

Chapel HyperGraph Library (CHGL) - HPC-class Hypergraphs in Chapel
https://pnnl.github.io/chgl/
MIT License
29 stars 8 forks

Pipeline BTER to generate affinity blocks in parallel W.R.T their calculation #14

Closed LouisJenkinsCS closed 4 years ago

LouisJenkinsCS commented 5 years ago

Currently, BTER calculates an affinity block and then generates it, then calculates the next block and generates that, and so on; each block is handled strictly sequentially.

https://github.com/pnnl/chgl/blob/47dd0239f4428cf0620024cec0043a44e7b5cf0f/src/modules/Generation.chpl#L496-L542

The issue here is that the calculation of the affinity blocks is orthogonal to their generation, and since affinity blocks are by definition disjoint from each other, it is safe to generate them in parallel. One optimization is to calculate all affinity blocks in advance and then dispatch their generation in parallel (see the sketch below). This should yield some improvement, but it is still limited by the fact that after generating one affinity block we have to await termination before processing the next.
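A minimal sketch of that first optimization, assuming hypothetical `AffinityBlock`, `computeAffinityBlock`, and `generateAffinityBlock` helpers rather than the actual internals of `Generation.chpl`:

```chapel
// Sketch only: precompute every affinity block descriptor up front, then
// generate the (disjoint) blocks in parallel. The record and both procs are
// placeholders, not CHGL's real API.
record AffinityBlock {
  var numVertices : int;
  var numEdges : int;
  var probability : real;
}

proc computeAffinityBlock(i : int) : AffinityBlock {
  // Placeholder: derive the block's parameters from the degree distributions
  // as BTER normally would.
  return new AffinityBlock(numVertices = i + 1, numEdges = i + 1, probability = 0.5);
}

proc generateAffinityBlock(blk : AffinityBlock) {
  // Placeholder: insert the block's vertex/edge inclusions into the hypergraph.
}

proc generateAllBlocks(numBlocks : int) {
  // Phase 1: calculation is independent of generation, so do it all up front.
  var blocks : [0..#numBlocks] AffinityBlock;
  forall i in 0..#numBlocks do
    blocks[i] = computeAffinityBlock(i);

  // Phase 2: blocks are disjoint, so generating them in parallel is safe.
  forall blk in blocks do
    generateAffinityBlock(blk);
}
```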

The optimization I am thinking of involves a work queue: one task generates the affinity block descriptors asynchronously and adds metadata for each block to be generated to the queue. The work queue is set up with one task per core per locale, and each locale handles generating the portion of the hypergraph that is allocated on that locale. This not only adds parallelism to generating the affinity blocks, but also avoids the overhead of repeatedly spawning a task on each core of every locale and then waiting for termination; the number of spawn/termination rounds drops from O(NumAffinityBlocks) to O(1). A sketch of this pipeline follows.
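A sketch of that pipeline, again with hypothetical helpers: a producer task publishes block descriptors while one consumer task per core on every locale claims and generates them. An atomic counter stands in for a real distributed work queue; it is not CHGL's actual queue API.

```chapel
// Sketch only: one producer task computes affinity block descriptors
// asynchronously; one consumer task per core per locale claims blocks and
// generates the locally-owned portion of the hypergraph. Tasks are spawned
// exactly once, so the spawn/termination cost is O(1), not O(numBlocks).
config const numBlocks = 1 << 16;

record AffinityBlock { var id : int; }                 // placeholder descriptor
proc computeAffinityBlock(i : int) { return new AffinityBlock(i); }
proc generateLocalPortion(blk : AffinityBlock) { }     // placeholder generation

proc main() {
  var blocks : [0..#numBlocks] AffinityBlock;
  var produced : atomic int;   // descriptors published by the producer so far
  var claimed  : atomic int;   // next descriptor index a consumer may take

  // Producer: runs concurrently with the consumers below.
  begin with (ref blocks) {
    for i in 0..#numBlocks {
      blocks[i] = computeAffinityBlock(i);
      produced.add(1);
    }
  }

  // Consumers: one task per core on every locale; each locale generates only
  // the part of the hypergraph it owns.
  coforall loc in Locales do on loc {
    coforall tid in 1..here.maxTaskPar {
      while true {
        const idx = claimed.fetchAdd(1);
        if idx >= numBlocks then break;        // all blocks claimed
        while produced.read() <= idx { }       // spin until the producer publishes it
        generateLocalPortion(blocks[idx]);
      }
    }
  }
}
```

In a real implementation the atomic-counter "queue" and the busy-wait would be replaced by a proper blocking work queue, but the spawn-once structure is the point of the sketch.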

github-actions[bot] commented 4 years ago

This issue is stale and should either be closed or eventually resolved.