Open aleino-nv opened 5 days ago
I believe this is caused by a circular dependency in the shaderCache when using specialized pipelines with Vulkan.
Only Vulkan uses these, and only in the tests you listed plus tests/compute/dynamic-dispatch, so that explains the repro steps.
The shaderCache is cleared by rhi::vk::~DeviceImpl; however, to get there the device refCount must first drop to zero. But the shaderCache stores specializedPipelines, which hold a ComputePipelineImpl, which holds a ShaderProgramImpl, which holds an m_device that points back to the shaderCache's owner. It is a Breakable Reference, but nothing breaks it.
shaderCache.specializedPipelines
└─ Pipeline [rhi::vk::ComputePipelineImpl]
   └─ m_program rhi::RefPtr<rhi::ShaderProgram>
      └─ pointer [rhi::vk::ShaderProgramImpl]
         └─ m_device
If this is right, then one option is to clear the cache after each run, e.g. around app.finalize() -- but RHI might not have an API for that yet. Alternatively, disable the cache during tests. Either way, caching pipelines from run to run risks non-deterministic effects. If caching is important, perhaps force a clear at shutdown?
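To make the suspected cycle concrete, here is a minimal sketch using std::shared_ptr in place of rhi::RefPtr. The types Device, ShaderCache, ComputePipeline and ShaderProgram and the function deviceDestroyedOnRelease are hypothetical stand-ins, not the slang-rhi classes, and the real Breakable Reference mechanism is not modeled. The only point is that a strong back-pointer from a cached object to the cache's owner keeps the owner alive until the cache is cleared explicitly.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-ins for the slang-rhi types named above.
struct Device;

struct ShaderProgram {
    std::shared_ptr<Device> device; // plays the role of m_device (strong back-reference)
};

struct ComputePipeline {
    std::shared_ptr<ShaderProgram> program; // plays the role of m_program
};

struct ShaderCache {
    std::vector<std::shared_ptr<ComputePipeline>> specializedPipelines;
};

struct Device {
    // The cache is destroyed with the device, but only if the device's
    // reference count ever reaches zero.
    ShaderCache shaderCache;
};

// Builds device -> cache -> pipeline -> program -> device, then drops the
// external device handle. Returns true if the device was destroyed.
bool deviceDestroyedOnRelease(bool clearCacheFirst) {
    auto device = std::make_shared<Device>();
    std::weak_ptr<Device> watch = device; // observes lifetime without owning

    {
        auto program = std::make_shared<ShaderProgram>();
        program->device = device;
        auto pipeline = std::make_shared<ComputePipeline>();
        pipeline->program = program;
        device->shaderCache.specializedPipelines.push_back(pipeline);
    }

    if (clearCacheFirst) {
        // The "force clear at shutdown" option: break the cycle by hand.
        device->shaderCache.specializedPipelines.clear();
    }

    device.reset(); // the application releases its last handle
    bool destroyed = watch.expired();

    // Clean up so the sketch itself does not leak when the cycle was left intact.
    if (auto leaked = watch.lock()) {
        leaked->shaderCache.specializedPipelines.clear();
    }
    return destroyed;
}
```

In this sketch, deviceDestroyedOnRelease(false) reports that the device survives the release of its last external handle, while deviceDestroyedOnRelease(true) reports that it is destroyed. Whether an explicit clear or a weak back-reference is the better fit for slang-rhi is a separate design question.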
Thanks! That sounds like it could be part of the story. WebGPU also has some kind of leak. I didn't try any other backends besides Vulkan and WebGPU yet.
Thanks for clarifying, I didn't test WebGPU and spoke too soon :)
I hacked a repro in core slang-rhi: https://github.com/shader-slang/slang-rhi/compare/main...bprb:slang-rhi:bpe_leak
I added a specialization to test-compute-trivial, disabled all but webGpu, and added an internal counter to rhi::wgpu::DeviceImpl.
The specialization shader leaves a device refCount of 1 even after gCachedDevices.clear(), while the original shader ends with 0.
So with just one platform and just one shader (ETA: using --test-case=compute-trivial), ~Device appears to never run, and could keep any Device::slangContext.globalSession alive.
Great progress! Now we have a much simpler repro case, at least for one leak.
Affected tests:
tests/compute/interface-shader-param.slang (wgpu, vk)
(Note: the test is passing before the assert happens)
tests/compute/interface-shader-param-in-struct.slang (wgpu)
(Note: the test is passing before the assert happens)
This assert is triggering:
session->debugGetReferenceCount() is returning 7 for wgpu and 4 for vk, for tests/compute/interface-shader-param.slang.
Call stack: