slang-test: Test context Slang session is leaking

aleino-nv commented 5 days ago

Affected tests:

tests/compute/interface-shader-param.slang (wgpu, vk) (Note: the test is passing before the assert happens)
tests/compute/interface-shader-param-in-struct.slang (wgpu) (Note: the test is passing before the assert happens)

This assert is triggering:

SLANG_API void spDestroySession(SlangSession* inSession)
{
    if (!inSession)
        return;

    Slang::Session* session = Slang::asInternal(inSession);
    // It is assumed there is only a single reference on the session (the one placed
    // with spCreateSession) if this function is called
    SLANG_ASSERT(session->debugGetReferenceCount() == 1);
    // Release
    session->release();
}

session->debugGetReferenceCount() is returning 7 for wgpu and 4 for vk, for tests/compute/interface-shader-param.slang.

Call stack:

KernelBase.dll!00007ffbd9ebfa4c()
[External Code]
slang.dll!Slang::handleSignal(Slang::SignalType type, const char * message) Line 65
    at C:\Users\aleino\workspaces\slang\source\core\slang-signal.cpp(65)
slang.dll!spDestroySession(slang::IGlobalSession * inSession) Line 184
    at C:\Users\aleino\workspaces\slang\source\slang\slang-api.cpp(184)
slang-test.exe!TestContext::~TestContext() Line 114
    at C:\Users\aleino\workspaces\slang\tools\slang-test\test-context.cpp(114)
slang-test.exe!innerMain(int argc, char * * argv) Line 4922
    at C:\Users\aleino\workspaces\slang\tools\slang-test\slang-test-main.cpp(4922)
slang-test.exe!main(int argc, char * * argv) Line 4928
    at C:\Users\aleino\workspaces\slang\tools\slang-test\slang-test-main.cpp(4928)
[External Code]

bprb commented 5 days ago

I believe this is caused by a circular dependency in the shaderCache when using specialized pipelines with Vulkan.

Only Vulkan uses these, and only in the tests you listed plus tests/compute/dynamic-dispatch, so that explains the repro steps.

The shaderCache is cleared by rhi::vk::~DeviceImpl, however to get there the device refCount must first drop to zero. But the shaderCache stores specializedPipelines, which hold a ComputePipelineImpl, which holds a ShaderProgramImpl, which holds an m_device that points back to the shaderCache's owner. It is a Breakable Reference, but nothing breaks it.

shaderCache.specializedPipelines
└─ Pipeline [rhi::vk::ComputePipelineImpl]
   └─ m_program     rhi::RefPtr<rhi::ShaderProgram>
      └─ pointer    [rhi::vk::ShaderProgramImpl]
         └─ m_device

If this is right, then one option is to clear the cache after each run, eg. around app.finalize() -- but RHI might not have an API for that yet. Alternatively, disable the cache during tests. Either way caching pipelines from run to run risks non-deterministic effects. If caching is important, a force clear at shutdown then?

aleino-nv commented 5 days ago

Thanks! That sounds like it could be part of the story. WebGPU also has some kind of leak. I didn't try any other backends besides Vulkan and WebGPU yet.

bprb commented 4 days ago

Thanks for clarifying, I didn't test webGpu and spoke too soon :)

I hacked a repro in core slang-rhi: https://github.com/shader-slang/slang-rhi/compare/main...bprb:slang-rhi:bpe_leak

I added a specialization to test-compute-trivial, disabled all but webGpu, and added an internal counter to rhi::wgpu::DeviceImpl. The specialization shader leaves a device refCount of 1 even after gCachedDevices.clear(), while the original shader ends with 0.

So with just one platform and just one shader (ETA: using --test-case=compute-trivial), ~Device appears to never run, and could keep any Device::slangContext.globalSession alive.

aleino-nv commented 4 days ago

Great progress! Now we have a much simpler repro case, at least for one leak.

shader-slang / slang

slang-test: Test context Slang session is leaking #5610