Memory re-use for nested reductions

The CGO17 benchmarks MM NVIDIA and NBody NVIDIA contain nested reductions whose memory could be re-used. DPIA currently cannot model this, so we use twice the required memory and introduce unnecessary copies. The performance impact is significant: we observed roughly 30% to 75% performance loss for these benchmarks, depending on the target hardware.
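
To make the problem concrete, here is a minimal C sketch of the two shapes of code. It is only an illustration under assumptions, not what DPIA actually generates: the function names, `ACC_LEN`, `TILE`, and the data layout are all hypothetical. The outer loop reduces over tiles and the inner loop reduces within a tile. The first variant gives the inner reduction its own buffer and copies in and out, the second lets the inner reduction write straight into the outer accumulator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ACC_LEN 2 /* hypothetical: number of partial results per accumulator */
#define TILE 2    /* hypothetical: elements reduced by one inner reduction */

/* Assumed layout: tiles[(t * TILE + k) * ACC_LEN + i] */

/* Variant 1: the inner reduction owns a separate buffer.
 * This needs a second accumulator and two copies per outer step. */
void reduce_with_copy(const float *tiles, size_t num_tiles, float *acc) {
    float tmp[ACC_LEN];
    assert(tiles != NULL && acc != NULL);
    for (size_t t = 0; t < num_tiles; ++t) {
        memcpy(tmp, acc, sizeof tmp);      /* copy in */
        for (size_t k = 0; k < TILE; ++k)  /* inner reduction */
            for (size_t i = 0; i < ACC_LEN; ++i)
                tmp[i] += tiles[(t * TILE + k) * ACC_LEN + i];
        memcpy(acc, tmp, sizeof tmp);      /* copy out */
    }
}

/* Variant 2: the inner reduction re-uses the outer accumulator.
 * No second buffer and no copies. */
void reduce_in_place(const float *tiles, size_t num_tiles, float *acc) {
    assert(tiles != NULL && acc != NULL);
    for (size_t t = 0; t < num_tiles; ++t)
        for (size_t k = 0; k < TILE; ++k)  /* inner reduction */
            for (size_t i = 0; i < ACC_LEN; ++i)
                acc[i] += tiles[(t * TILE + k) * ACC_LEN + i];
}
```

Both variants compute the same result. The difference is only in the extra `tmp` buffer and the two `memcpy` calls per outer iteration, which is the kind of overhead described above.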

Any comments on this issue, @bastian-koepcke?