
GUNDAM, for Generalized and Unified Neutrino Data Analysis Methods, is a suite of applications which aims at performing various statistical analyses with different purposes and setups.

Merging the CacheManager library with the DialDictionary #555

Closed: nadrino closed this issue 1 month ago

nadrino commented 1 month ago

As we built GUNDAM, the CacheManager introduced by @ClarkMcGrew took care of squeezing dials so that they could be used for CPU or GPU computation using HEMI.
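
To illustrate the "single source" idea that HEMI enables, here is a minimal sketch of a dial response evaluated either on the GPU or on the CPU from the same function body. All names and the toy piece-wise linear response are illustrative, not the actual CacheManager code:

```cpp
// Sketch only: hypothetical names, not the actual CacheManager/HEMI code.
// The key point is that a single function body is compiled for both CPU and GPU.
#include <cuda_runtime.h>

// A dial "squeezed" into a flat array of knot pairs: [x0, y0, x1, y1, ...].
__host__ __device__
float evalDial(const float* knots, int nKnots, float x) {
  // Toy piece-wise linear response standing in for the real spline evaluation.
  if (x <= knots[0]) return knots[1];
  for (int k = 1; k < nKnots; ++k) {
    if (x <= knots[2*k]) {
      float x0 = knots[2*(k-1)], y0 = knots[2*(k-1)+1];
      float x1 = knots[2*k],     y1 = knots[2*k+1];
      return y0 + (y1 - y0) * (x - x0) / (x1 - x0);
    }
  }
  return knots[2*(nKnots-1)+1];
}

// GPU entry point: one thread per event.
__global__
void evalDialsKernel(const float* knots, int nKnots, const float* x,
                     float* weights, int nEvents) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < nEvents) weights[i] = evalDial(knots, nKnots, x[i]);
}

// CPU entry point: the same evalDial body, called from a plain loop.
void evalDialsCpu(const float* knots, int nKnots, const float* x,
                  float* weights, int nEvents) {
  for (int i = 0; i < nEvents; ++i) weights[i] = evalDial(knots, nKnots, x[i]);
}
```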

The DialDictionary already has a structure that handles the dials and their associated cache, although the way it is implemented does not allow HEMI to be applied.
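
As an illustration of why that is (this is my assumption about the obstacle, and all names below are made up rather than taken from the DialDictionary), a dictionary of polymorphic dial objects built on the host heap cannot have its virtual calls invoked directly from device code, which is the kind of restriction a HEMI/CUDA path runs into:

```cpp
// Illustration only: hypothetical names, not the actual DialDictionary classes.
#include <memory>
#include <vector>

struct DialBase {
  virtual ~DialBase() = default;
  virtual double evalResponse(double parValue) const = 0;  // host-side virtual call
};

struct NormDial : DialBase {
  double evalResponse(double parValue) const override { return parValue; }
};

struct DialCollection {
  std::vector<std::unique_ptr<DialBase>> dials;  // host heap, not device memory
  double totalWeight(double parValue) const {
    double w = 1.0;
    // Virtual dispatch on host-allocated objects: not callable as-is from a __global__ kernel.
    for (const auto& d : dials) w *= d->evalResponse(parValue);
    return w;
  }
};
```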

To make future additions to the dials library more straightforward, I'd like to get those two libraries merged, or at least make sure they don't have any overlap. Up until now, I have implemented new dials in the DialDictionary but let Clark do the corresponding implementation in the CacheManager, which is probably not optimal.

ClarkMcGrew commented 1 month ago

The parallel design is completely intentional, and you are making the wrong assumption that the techniques that should be used for the CPU are the same as for the GPU. Don't fall into the trap of confusing the two similar meanings of "cache": the Cache::Manager is managing the GPU-to-CPU communication and keeping the on-device caches consistent, while the Dial code is caching intermediate results. They are not the same thing.
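
To make the distinction concrete, here is a rough sketch of the two meanings (all names and the toy response are illustrative, not the actual GUNDAM classes):

```cpp
// Illustration of the two different things called a "cache" in this discussion.
#include <vector>
#include <cuda_runtime.h>

// Sense 1 (Cache::Manager-like): a host buffer mirrored on the device, with
// explicit transfers to keep the two copies consistent.
struct MirroredBuffer {
  std::vector<float> host;
  float* device = nullptr;
  explicit MirroredBuffer(size_t n) : host(n, 0.f) {
    cudaMalloc((void**)&device, n * sizeof(float));
  }
  MirroredBuffer(const MirroredBuffer&) = delete;
  ~MirroredBuffer() { cudaFree(device); }
  void toDevice() {
    cudaMemcpy(device, host.data(), host.size() * sizeof(float),
               cudaMemcpyHostToDevice);
  }
  void toHost() {
    cudaMemcpy(host.data(), device, host.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
  }
};

// Sense 2 (dial-side): memoizing an intermediate result so it is not
// recomputed when the input parameter has not changed.
struct CachedDial {
  double lastPar = 0.;
  double lastResponse = 1.;
  bool valid = false;
  double eval(double par) {
    if (!valid || par != lastPar) {
      lastResponse = 1. + 0.1 * par;  // stand-in for the real response calculation
      lastPar = par;
      valid = true;
    }
    return lastResponse;
  }
};
```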

nadrino commented 1 month ago

Oh I see! So you mean that the CacheManager is a library dedicated to the GPU <-> CPU memory transfers, right? Meaning that this library is not intended to run on the CPU in production mode, but only for testing the GPU calculations?

ClarkMcGrew commented 1 month ago

Why would you run CPUs during production (tongue-in-cheek)?

I think you are talking about the Cache::Manager CPU calculation. There are multiple layers there (a minimal sketch of the first and third points follows below):

1. Both calculations come for "free", since CUDA generates both CPU and GPU versions of the functions (so why not provide both).
2. For some kernels, the CPU can be faster (we are not in that category).
3. A CPU is a fallback when there isn't a GPU (I didn't do that, since CPU-only kernels are slower than the raw Dial calculation, so why bother).
4. Most people are more familiar with debugging on a CPU, so it's easier for them to debug code using the CPU version and then check it on the GPU (personally, I find GPU debugging to be easier).
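
The sketch of points 1 and 3: one function body compiled for both sides, with the plain CPU loop used when no usable GPU is found (illustrative names, not the actual Cache::Manager code):

```cpp
#include <vector>
#include <cuda_runtime.h>

// The same toy weight function is compiled for host and device.
__host__ __device__ float toyWeight(float x) { return 1.f + 0.1f * x; }

__global__ void toyWeightKernel(const float* x, float* w, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) w[i] = toyWeight(x[i]);
}

bool hasUsableGpu() {
  int n = 0;
  return cudaGetDeviceCount(&n) == cudaSuccess && n > 0;
}

void fillWeights(const std::vector<float>& x, std::vector<float>& w) {
  const int n = static_cast<int>(x.size());
  w.resize(n);
  if (n == 0) return;
  if (hasUsableGpu()) {
    float* dx = nullptr;
    float* dw = nullptr;
    cudaMalloc((void**)&dx, n * sizeof(float));
    cudaMalloc((void**)&dw, n * sizeof(float));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    toyWeightKernel<<<(n + 255) / 256, 256>>>(dx, dw, n);
    cudaMemcpy(w.data(), dw, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dw);
  } else {
    // Fallback: same arithmetic, plain host loop.
    for (int i = 0; i < n; ++i) w[i] = toyWeight(x[i]);
  }
}
```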

In seriousness though, CPUs and GPUs have very different strengths, and are really two different machines connected by a rather slow "network" connection (3 to 30 GB/s depending on the bus). The GPU memory and arithmetic are A LOT faster (but you "cannot" use if statements). For GUNDAM, the GPU is exceptionally well suited to applying event-by-event weights and accumulating them into a histogram, and it makes debugging and verification super simple. BTW: the HEMI part of Cache::Manager manages the actual CPU/GPU communications, and the rest of the code freezes the Propagator/Event/Dial/Histogram data into a SIMD calculation (see the sketch below).
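
A rough sketch of the kind of kernel this refers to, applying a per-event weight and accumulating it into a histogram bin (illustrative, not the actual GUNDAM kernel):

```cpp
#include <cuda_runtime.h>

// One thread per event: read the event weight and add it atomically to the
// histogram bin the event falls into.  There is no per-thread branching beyond
// the bounds check, which is what maps well onto SIMD-style GPU execution.
__global__
void fillHistogram(const float* eventWeights, const int* eventBins,
                   float* histogram, int nEvents) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < nEvents) {
    atomicAdd(&histogram[eventBins[i]], eventWeights[i]);
  }
}
```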