lanl / benchmarks

Benchmarks
BSD 3-Clause "New" or "Revised" License

Need to rethink how we use CUDA backend in Spatter #51

Closed gshipman closed 10 months ago

gshipman commented 1 year ago

Currently the Spatter CUDA backend assumes the pattern array can be partitioned all the way down to a single thread block (1K elements per block, i.e. 8KB for 8-byte elements), so that each thread block can keep its slice of the pattern array in shared memory. The target/source buffer remains in main memory. This doesn't match the Flag and xRAGE use cases: we need an option for the pattern array to be main-memory resident and shared by all thread blocks. We also need to ensure that scatters use atomics to avoid races on writes.
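For discussion, a minimal sketch of what the proposed option could look like: the pattern array stays resident in global (main) memory and is read by all thread blocks, and scatter writes go through an atomic exchange to avoid write races. This is a hypothetical illustration, not Spatter's actual kernel; the kernel name, the `delta` stride parameter, and the `blockIdx.y` iteration indexing are all assumptions made for the sketch.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch: scatter with a main-memory-resident pattern array
// shared by all thread blocks (no per-block shared-memory staging), and
// atomic writes to avoid races when pattern entries collide across blocks.
__global__ void scatter_global_pattern(double *target,
                                       const double *source,
                                       const size_t *pattern,  // global-memory resident
                                       size_t pattern_len,
                                       size_t delta)           // assumed per-iteration stride
{
    size_t gid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (gid < pattern_len) {
        // Every block reads the same pattern array directly from main memory.
        size_t idx = pattern[gid] + blockIdx.y * delta;

        // Atomic write: atomicExch has no double overload, so the 64-bit
        // payload is reinterpreted as unsigned long long.
        atomicExch(reinterpret_cast<unsigned long long *>(&target[idx]),
                   (unsigned long long)__double_as_longlong(source[gid]));
    }
}
```

The trade-off is that pattern reads now go through the global-memory hierarchy (L2/L1 caching) instead of shared memory, which is exactly the behavior the Flag and xRAGE use cases exhibit.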