m-a-d-n-e-s-s / madness

Multiresolution Adaptive Numerical Environment for Scientific Simulation
GNU General Public License v2.0

RMI can handle more than 2^16 tasks per rank-to-rank pair #516

Closed evaleev closed 10 months ago

evaleev commented 10 months ago

as of https://github.com/m-a-d-n-e-s-s/madness/commit/3c3c2ba6a71c80a56a68f6e44c6f76f5543d18f5#diff-a10540bf42d111c837a8df49188e5ddb09cc03a68ecf0d444a55c10b09797985 RMI message counters have been 16 bits long; recent apps using MADWorld via https://github.com/ValeevGroup/tiledarray exceed this limit, which breaks the in-order queue processing and leaves messages unprocessed, causing hangs.

the current solution still uses 16-bit counters, with custom sorting logic that can withstand counter overflow/wraparound. a cleaner solution would be to extend the counters to 64 bits, which would be enough for apps with up to 2^64 tasks per rank-to-rank pair; however, this would require a more significant redesign of the code.
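
for illustration only, a minimal sketch of wraparound-tolerant ordering of 16-bit counters using serial-number-style comparison. this is not the code in this PR: `seq_before` and `sort_pending` are hypothetical names, and the sketch assumes that all counters being compared are fewer than 2^15 messages apart (outside that window the comparison is not a valid ordering).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not the MADNESS implementation.
// Returns true if counter `a` was issued before counter `b`,
// assuming the two are fewer than 2^15 messages apart.
constexpr bool seq_before(std::uint16_t a, std::uint16_t b) {
    // The cast reduces the difference modulo 2^16, so a small nonzero
    // forward distance means `b` is ahead of `a` even across wraparound.
    const std::uint16_t diff = static_cast<std::uint16_t>(b - a);
    return diff != 0 && diff < 0x8000u;
}

// Hypothetical helper: orders pending message counters by issue order.
// Valid only if every pair of counters is within the 2^15 window.
inline std::vector<std::uint16_t> sort_pending(std::vector<std::uint16_t> pending) {
    std::sort(pending.begin(), pending.end(), seq_before);
    return pending;
}

// Compile-time checks of the wraparound case.
static_assert(seq_before(65535, 0), "65535 precedes 0 after wraparound");
static_assert(!seq_before(0, 65535), "0 does not precede 65535 in the window");
```

with this comparison, `sort_pending({1, 65534, 0, 65535})` yields `65534, 65535, 0, 1`, whereas a plain `<` would place `0` and `1` first. widening the counters to 64 bits would make a plain `<` sufficient, at the cost of the redesign mentioned above.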

includes misc cleanup to improve usability/debuggability