The retention for buffer pools was too low. In the case where a large dynamic upload buffer was long-lived (i.e. multiple frames), it was incredibly likely that any replacement buffers would age out of the pool before being requested again, meaning that the pool effectively became useless. Bump from 5 flushes to 100; this should be sufficient to catch simple multi-frame lifetimes without bloating memory usage too much.
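To illustrate why the retention window matters, here is a minimal sketch of a pool that ages out free buffers by flush count. The names (`BufferPool`, `PooledBuffer`, `Trim`) are hypothetical and do not mirror the real pool implementation; the sketch assumes C++17 and that buffers are returned in non-decreasing flush order.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

// Hypothetical stand-in for a pooled GPU buffer.
struct PooledBuffer {
    std::size_t size;
    std::uint64_t returnedAtFlush; // flush count at the time it was returned
};

class BufferPool {
public:
    explicit BufferPool(std::uint64_t retentionFlushes)
        : m_retention(retentionFlushes) {}

    // Put a buffer back into the pool, remembering when it was returned.
    void Return(std::size_t size, std::uint64_t currentFlush) {
        m_free.push_back({size, currentFlush});
    }

    // Drop buffers that have sat unused for more than the retention window.
    // Entries are ordered by return time, so only the front needs checking.
    void Trim(std::uint64_t currentFlush) {
        while (!m_free.empty() &&
               currentFlush - m_free.front().returnedAtFlush > m_retention) {
            m_free.pop_front();
        }
    }

    // Reuse the first free buffer that is large enough, if any survived.
    std::optional<PooledBuffer> Acquire(std::size_t size) {
        for (auto it = m_free.begin(); it != m_free.end(); ++it) {
            if (it->size >= size) {
                PooledBuffer found = *it;
                m_free.erase(it);
                return found;
            }
        }
        return std::nullopt;
    }

    std::size_t FreeCount() const { return m_free.size(); }

private:
    std::uint64_t m_retention;
    std::deque<PooledBuffer> m_free;
};
```

With a retention of 5, a buffer returned at flush 0 is gone by the time it is requested at flush 30; with a retention of 100 it is still available for reuse.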
Improve memory layout for resource state storage, placing subresource 0 inline in the structures, instead of appending it to the end. Since subresource 0 is the only state used for buffers, and is also the whole-resource state used for textures, this should be the most common state that's accessed, so keep it close to the other relevant data.
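A rough sketch of the layout idea: subresource 0 lives directly in the object, and only the remaining subresources go to separate storage. `SubresourceStates` and `ResourceState` are illustrative names, and using a `std::vector` for the tail is an assumption of this sketch rather than a description of the actual structures.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the real resource state type (assumption).
using ResourceState = std::uint32_t;

class SubresourceStates {
public:
    SubresourceStates(std::size_t subresourceCount, ResourceState initial)
        : m_first(initial),
          m_rest(subresourceCount > 1 ? subresourceCount - 1 : 0, initial) {}

    // Subresource 0 is read without touching the out-of-line storage.
    ResourceState Get(std::size_t index) const {
        return index == 0 ? m_first : m_rest[index - 1];
    }

    void Set(std::size_t index, ResourceState state) {
        if (index == 0) {
            m_first = state;
        } else {
            m_rest[index - 1] = state;
        }
    }

    std::size_t Count() const { return 1 + m_rest.size(); }

private:
    ResourceState m_first;             // inline: the only state a buffer needs
    std::vector<ResourceState> m_rest; // subresources 1..N-1, empty for buffers
};
```

For a buffer, `m_rest` stays empty, so every state lookup hits memory that sits next to the rest of the object.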
A significant rework of the residency manager. It no longer tracks the DirectX-Graphics-Samples version, and is instead tailored specifically for this codebase. Relevant changes:
Residency management can no longer be disabled, removing some branches in common codepaths.
Move from header-only to header+cpp; no real functional changes, but better hygiene, and it resolves include-order issues when using code from the rest of the codebase.
Replace the "sync point" construct with a 3-fence-value tuple tracking the 3 translation layer queue timelines. This means:
There are no additional CPU allocations per flush
There are no additional fences created and stored in private data
There are no additional fence signals per flush; it had always bothered me that we had 2 signals for every submission, and now we're down to 1
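As a sketch of what replacing a sync point with a tuple of fence values can look like: the point in time becomes a plain value type with one fence value per queue, and completion is a comparison against each queue's last completed value. `TimelinePoint`, `IsComplete`, and `MarkUsed` are hypothetical names, and treating 0 as "never used on this queue" is an assumption of the sketch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Three queue timelines, as described above.
constexpr std::size_t kQueueCount = 3;

// A point in time across all queues: one fence value per queue.
// Plain value type, so recording one needs no allocation and no extra fence.
struct TimelinePoint {
    std::array<std::uint64_t, kQueueCount> fenceValues{}; // 0 = never used
};

// True once every queue has completed the fence value recorded for it.
inline bool IsComplete(
    const TimelinePoint& point,
    const std::array<std::uint64_t, kQueueCount>& completed) {
    for (std::size_t i = 0; i < kQueueCount; ++i) {
        if (point.fenceValues[i] > completed[i]) {
            return false;
        }
    }
    return true;
}

// Record that a submission on `queue` will signal `fenceValue`.
// Reuses the fence value the queue already signals for the submission.
inline void MarkUsed(TimelinePoint& point, std::size_t queue,
                     std::uint64_t fenceValue) {
    if (fenceValue > point.fenceValues[queue]) {
        point.fenceValues[queue] = fenceValue;
    }
}
```

Because the tuple only stores values from fences the queues already signal, no second signal per submission is needed.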
Add caching of budget info. It turns out that querying the budget isn't actually that cheap, so only do it roughly once per second. Since we also get our usage from there, this could result in us transiently being over budget, but that's probably not a big deal; we'll try to trim down the next time we query the usage. We could track usage ourselves rather than querying it from the budget, but we don't residency-manage all resources, so the tracked size would be too low. Querying is probably the better option.
Add caching of fence waits. Submitting redundant fence waits for the residency fence was surprisingly expensive.
Cache some CPU allocations, and replace some manual memory management with std::vector.
Delete the "master set" construct, since this codebase only ever submits one residency set at a time.
Some minor tweaks to the memory layout for resource bind counts.
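The budget caching described above can be sketched as a small time-based cache. This is only an illustration: `CachedBudget` and `BudgetInfo` are made-up names, the query callback stands in for whatever the real budget query is, and the one-second default mirrors the "~second or so" mentioned in the list.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical snapshot of what the budget query returns.
struct BudgetInfo {
    std::uint64_t budget = 0;
    std::uint64_t currentUsage = 0;
};

class CachedBudget {
public:
    using Clock = std::chrono::steady_clock;

    CachedBudget(std::function<BudgetInfo()> query,
                 Clock::duration maxAge = std::chrono::seconds(1))
        : m_query(std::move(query)), m_maxAge(maxAge) {}

    // Returns the cached snapshot, re-querying only when it is too old.
    // Usage may be stale for up to maxAge, so callers can be transiently
    // over budget until the next refresh.
    const BudgetInfo& Get(Clock::time_point now = Clock::now()) {
        if (!m_valid || now - m_lastQuery >= m_maxAge) {
            m_cached = m_query(); // the expensive call
            m_lastQuery = now;
            m_valid = true;
        }
        return m_cached;
    }

private:
    std::function<BudgetInfo()> m_query;
    Clock::duration m_maxAge;
    Clock::time_point m_lastQuery{};
    BudgetInfo m_cached{};
    bool m_valid = false;
};
```

Passing `now` explicitly keeps the cache easy to exercise in a test; in normal use the default argument reads the steady clock.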
This change must be done in lock-step with 11on12 and 9on12, which need to be updated due to the residency management changes:
New commits pushed that, when paired with 9on12 changes, further increase perf on discrete GPUs. Combined, I see a ~30% improvement in FPS in the Heaven benchmark at low resolution + low settings (i.e. CPU-bound).