Aggregate critter critical path costs before Reduction

Currently, after a BSP step, we iterate over all MPI routines tracked by critter and, for each one, find the max over five different metrics across processes. This results in a factor of NUM_CRITTERS more synchronizations than necessary.

_critter::compute_max_crit(...) should simply write its local costs into that routine's window of a shared array. At the end of the loop over routines, we can perform a single MPI_Allreduce on the whole array, and then write each reduced entry back to the member variables of the corresponding MPI routine.
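
To make the proposed layout concrete, here is a minimal sketch of the pack / reduce / unpack pattern. It is not critter's actual code: `Routine`, `kNumMetrics`, `pack_costs`, `unpack_costs`, and `elementwise_max` are hypothetical names, and the five metrics are treated as plain doubles. To keep the sketch runnable without an MPI installation, `elementwise_max` stands in for the one collective; in the real code that step would presumably be a single `MPI_Allreduce` with `MPI_MAX` over the whole buffer, as noted in the comment.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Five metrics per tracked routine, per the description above (assumed to be doubles).
constexpr std::size_t kNumMetrics = 5;

// Hypothetical stand-in for a per-routine critter tracker; real member names may differ.
struct Routine {
    std::array<double, kNumMetrics> crit{};  // local critical-path costs
};

// Each routine writes its local costs into its own window of one contiguous buffer.
std::vector<double> pack_costs(const std::vector<Routine>& routines) {
    std::vector<double> buf(routines.size() * kNumMetrics);
    for (std::size_t i = 0; i < routines.size(); ++i)
        for (std::size_t m = 0; m < kNumMetrics; ++m)
            buf[i * kNumMetrics + m] = routines[i].crit[m];
    return buf;
}

// After the reduction, copy each reduced entry back to the routine that owns its window.
void unpack_costs(const std::vector<double>& buf, std::vector<Routine>& routines) {
    assert(buf.size() == routines.size() * kNumMetrics);
    for (std::size_t i = 0; i < routines.size(); ++i)
        for (std::size_t m = 0; m < kNumMetrics; ++m)
            routines[i].crit[m] = buf[i * kNumMetrics + m];
}

// Serial stand-in for the single collective. With MPI this step would be roughly:
//   MPI_Allreduce(MPI_IN_PLACE, buf.data(), static_cast<int>(buf.size()),
//                 MPI_DOUBLE, MPI_MAX, comm);
void elementwise_max(const std::vector<double>& other, std::vector<double>& inout) {
    assert(other.size() == inout.size());
    for (std::size_t k = 0; k < inout.size(); ++k)
        inout[k] = std::max(inout[k], other[k]);
}
```

With this layout the number of collectives per BSP step no longer depends on NUM_CRITTERS: one call reduces NUM_CRITTERS * 5 values instead of one call (or more) per routine.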
