I am very curious about this question as well, especially since the README contains the following text:

> Realistically, it's now abandoned and will only possibly receive updates to run workloads I am personally interested in (DLSS).

Being able to run DLSS on AMD hardware would be a game-changer for sure, and I am curious what it will take to get there. Amazing project, and I am hoping this won't die now that funding has stopped.
Attempt no. 1 (Base ZLUDA build)

Details: DXVK-NVAPI requires patches in order to bypass AMD GPU detection; the rest of the libraries work fine. GPU architecture is set to `NV_GPU_ARCHITECTURE_TU100`.

Observation: the game(s) fail at `NVSDK_NGX_D3D12_GetFeatureRequirements`. Two examples are provided (Minecraft RTX, R&C Rift Apart), although preliminary initialization does succeed.
I did not take notes last time I poked there, so it might not be 100% accurate, but:

Hook `NvAPI_D3D11_CreateCubinComputeShaderWithName` and dump the `pCubin` argument; it's a nice PTX assembly, and that's how far I got:

```
cvt.u64.u32 %rd58, %r86;
...
sust.p.2d.v4.b32.zero [%rd58, {%r195,%r196}], {%f576,%f1418,%f1418,%f1418};
```

Then map `NvAPI_D3D11_CreateCubinComputeShaderWithName` -> `cuModuleLoadData` + `cuModuleGetFunction`, and `NvAPI_D3D11_LaunchCubinShader` -> `cuLaunchKernel`. Might require touching up the PTX manually or doing some pre/post processing on the host side. This is all to prove that we fully understand the DLSS API flow and that there's no action at a distance. Might not be possible.

Then there are the missing PTX instructions: `tex.` with LOD, `mma.` or `wmma.`. At this point they don't have to be fast, just correct.

There's the `amdgpu_cs` calling convention in LLVM; maybe that's enough?

`zluda_api` is the ZLUDA subproject with enough nvapi implemented for the purpose of debugging and development of DLSS.

Regarding the Vulkan interop, it seems HIP should support it, although I'm not sure if a kernel can be bound to an arbitrary command list as you specify. See the example here
D3D12 interop should also be available on Windows. There are no HIP samples, but the recently released Orochi 2.0 does contain a DX12 sample: here
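To make the proposed NvAPI -> CUDA flow from the earlier comment concrete, here is a minimal sketch with the CUDA driver side mocked out. A real implementation would call `cuModuleLoadData`, `cuModuleGetFunction` and `cuLaunchKernel`; every name below (`Translator`, `CubinShader`, ...) is illustrative, not ZLUDA's actual internals:

```rust
/// Handle returned by the mocked shader-creation call.
struct CubinShader {
    module: usize,    // index into the "loaded modules", i.e. a mock CUmodule
    function: String, // kernel name, i.e. a mock CUfunction lookup key
}

#[derive(Default)]
struct Translator {
    modules: Vec<String>, // loaded PTX/cubin blobs
}

impl Translator {
    // NvAPI_D3D11_CreateCubinComputeShaderWithName
    //   -> cuModuleLoadData + cuModuleGetFunction
    fn create_cubin_compute_shader_with_name(&mut self, cubin: &str, name: &str) -> CubinShader {
        self.modules.push(cubin.to_owned()); // stands in for cuModuleLoadData
        CubinShader {
            module: self.modules.len() - 1,
            function: name.to_owned(), // stands in for cuModuleGetFunction
        }
    }

    // NvAPI_D3D11_LaunchCubinShader -> cuLaunchKernel
    fn launch_cubin_shader(&self, shader: &CubinShader, grid: (u32, u32, u32)) -> Result<(), String> {
        match self.modules.get(shader.module) {
            Some(_) => {
                let _ = grid; // would become cuLaunchKernel's grid dimensions
                Ok(())
            }
            None => Err(format!("no module loaded for kernel {}", shader.function)),
        }
    }
}
```

The point of the sketch is only the shape of the mapping: creation is a load-plus-lookup, launch is a single kernel dispatch, and the PTX blob may need host-side pre/post processing in between.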
I ran some tests and want to report my findings, which are very bare-bones.

I tried to run the DLSS sample that was linked on an AMD GPU on Linux. The first issue is that Vulkan requires a few extensions that AMD GPUs do not have:

- `VK_NVX_binary_import` - allows applications to import CuBIN binaries and execute them.
- `VK_NVX_image_view_handle` - allows applications to query an opaque handle from an image view for use as a sampled image or storage image.

I patched the source to bypass those checks, along with some other changes:

- `NGXWrapper::IsFeatureSupported` to always return true
- `deviceParams.swapChainBufferCount = 3;`
- `surfaceCaps.maxImageCount < m_DeviceParams.swapChainBufferCount` to `surfaceCaps.maxImageCount != 0 && surfaceCaps.maxImageCount < m_DeviceParams.swapChainBufferCount`
- `.setBufferDeviceAddressMultiDevice(true);`
- `nvrhi::Format::D24S8` to `nvrhi::Format::D32S8` (fixes broken rendering)

After this I finally got the sample running; sadly, I did not get any calls to NVAPI. This is likely because NGX does a feature check and refuses to initialize on unsupported GPUs. At this point I have some questions: is there a way to force-enable extensions? (I tried `MESA_EXTENSION_OVERRIDE` but no luck; I assume it's OpenGL only.) I think there are three options:
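For reference, the swapchain patch above is needed because in Vulkan a `VkSurfaceCapabilitiesKHR::maxImageCount` of 0 means "no upper limit", which the sample's original `maxImageCount < requested` test did not account for. The corrected check as a standalone sketch (the function name is mine):

```rust
/// Does the requested swapchain buffer count exceed the surface caps?
/// In Vulkan, maxImageCount == 0 means the surface imposes no upper
/// limit, so only a non-zero cap can actually be exceeded.
fn buffer_count_exceeds_caps(max_image_count: u32, requested: u32) -> bool {
    max_image_count != 0 && max_image_count < requested
}
```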
Update:

Compiling Mesa from source was surprisingly easy. I updated RADV to report that it supports the two extensions mentioned above. I also figured out that on Linux the architecture check is done using `libnvidia-ml` instead of NVAPI. I updated the ZLUDA implementation of that lib to report my card as Turing.

This got me further, but DLSS support is still not being reported. I feel `NVSDK_NGX_VULKAN_GetFeatureRequirements` is what's locking up the system. I'll try to find a way to enable some more logging in NGX; maybe it will shed some light on what's going wrong.
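For context, spoofing Turing through the nvidia-ml path amounts to answering queries such as `nvmlDeviceGetCudaComputeCapability` with Turing-class values: Turing parts (TU102/TU104/TU106/TU116) have CUDA compute capability 7.5. A trivial, hypothetical stub, not ZLUDA's actual code:

```rust
/// Hypothetical nvidia-ml stub: report the device as a Turing-class GPU.
/// Turing corresponds to CUDA compute capability 7.5.
fn device_get_cuda_compute_capability() -> (i32, i32) {
    (7, 5) // (major, minor)
}
```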
Update 2:

After running the demo with logging enabled via the `__NGX_LOG_LEVEL=10` environment variable, I was able to pinpoint why the application is failing. These are the relevant snippets:

```
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkCreateCuModuleNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkCreateCuFunctionNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkDestroyCuModuleNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkDestroyCuFunctionNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkCmdCuLaunchKernelNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkGetImageViewHandleNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkGetImageViewAddressNVX
[tid:124644461002368][CreateKernel:388] Error: vkCreateCuModuleNVX() and/or vkCreateCuFunctionNVX() functions are not available. Request VK_NVX_binary_import extension
```
This brings us to a point where we have to implement these functions in RADV. Before doing that, I thought it would be wise to test the Vulkan/HIP interop example I linked to above. Sadly, after compiling, the program crashes with the following error:

```
hipVulkan: /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.2/rocclr/device/rocm/rocdevice.hpp:241: virtual bool roc::NullDevice::importExtSemaphore(void**, const amd::Os::FileDesc&, amd::ExternalSemaphoreHandleType): Assertion `false && "ShouldNotReachHere()"' failed.
```

It appears `hipImportExternalSemaphore` and a few other functions are not implemented in the ROCm CLR backend, only in the PAL one, which I was unable to easily compile, so it will be for another day. I opened an issue about it: https://github.com/ROCm/clr/issues/72

It is likely the HIP/Vulkan interop functions require a special build of rocm-clr and the AMDVLK driver instead of RADV. I guess the next step will be attempting to get that working, followed by compiling a custom version of AMDVLK that supports the required NVX extensions.

Windows might also be a better platform to target, as I bet the interop will work there. But I have no idea if there is a way to inject extensions into the Vulkan/DX12 implementation.
I managed to get a Vulkan layer "working"; at least I can get calls passed from the DLSS example to ZLUDA, and I now have the PTX dumped.

I had to implement/stub the following instructions:

- `tex.base`
- `sust.p`

I am now missing the following:

- `cvt.rn.f16x2.f32`
- `ldmatrix`
- `mma`
- `fma.relu`
- `cp.async.cg`

@vosen My question now is about `cvt` with `f16x2`. What do you think is the best way to implement it, as it has three arguments instead of the usual two for `cvt`?
Is it better to create a new struct like this:
```rust
pub struct Arg3Cvt<P: ArgParams> {
    pub dst: P::Operand,
    pub src1: P::Operand,
    pub src2: Option<P::Operand>,
}
```
and substitute it as the argument for all versions of `cvt`? Or is it better to make a new `cvtHalf` instruction?
Or is there a third, better way? I'm still learning Rust.
Best to decide depending on what sort of LLVM bitcode we need to emit for it. `ast::CvtDetails` already splits the conversion cases into separate variants (IntFromInt, FloatFromFloat, IntFromFloat, FloatFromInt) which are handled differently. One more distinct variant is going to fit in fine with the rest of the code. On the other hand, adding a distinct `ast::Instruction` variant just for `cvt.rn.f16x2.f32` is fine too: there is no reason for the other `cvt` variants to care about a src2 that will always be `None`, and there's already precedent for this sort of code in the codebase (e.g. `Atom` vs `AtomCas`).

Both solutions are good; choose whichever turns out to be more convenient.
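For whoever picks this up: semantically, `cvt.rn.f16x2.f32` is just two independent f32 -> f16 conversions with round-to-nearest-even, packed into one 32-bit register. A pure-Rust sketch of the conversion (the packing order of the two halves below is my assumption; verify it against the PTX ISA before relying on it):

```rust
/// f32 -> f16 bit pattern with round-to-nearest-even.
fn f32_to_f16_rne(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32;
    let man = bits & 0x007f_ffff;

    if exp == 0xff {
        // Inf / NaN (collapse the NaN payload to a single quiet bit)
        return sign | 0x7c00 | if man != 0 { 0x0200 } else { 0 };
    }
    let e = exp - 127 + 15; // rebase exponent from f32 bias to f16 bias
    if e >= 0x1f {
        return sign | 0x7c00; // overflow -> infinity
    }
    if e <= 0 {
        if e < -10 {
            return sign; // too small even for an f16 subnormal -> signed zero
        }
        // f16 subnormal: shift the 24-bit significand into place and round
        let sig = man | 0x0080_0000;
        let shift = (14 - e) as u32;
        let kept = sig >> shift;
        let rem = sig & ((1 << shift) - 1);
        let halfway = 1 << (shift - 1);
        let round_up = rem > halfway || (rem == halfway && (kept & 1) != 0);
        return sign | (kept as u16 + round_up as u16);
    }
    // Normal case: keep the top 10 mantissa bits, round the 13 dropped ones.
    let mut h = sign | ((e as u16) << 10) | ((man >> 13) as u16);
    let rem = man & 0x1fff;
    if rem > 0x1000 || (rem == 0x1000 && (h & 1) != 0) {
        h += 1; // a mantissa carry correctly bumps the exponent (up to Inf)
    }
    h
}

/// cvt.rn.f16x2.f32 d, a, b — a packed into the upper half, b into the
/// lower half (assumed order).
fn cvt_rn_f16x2_f32(a: f32, b: f32) -> u32 {
    ((f32_to_f16_rne(a) as u32) << 16) | f32_to_f16_rne(b) as u32
}
```

Whichever AST shape is chosen, the lowering itself stays this simple, which is an argument for picking the representation that disturbs the existing `cvt` handling least.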
So after the rollback, is it still possible to work on DLSS, or is it a far-off goal now?
It’s pretty much a pipe dream for the foreseeable future.
What would it realistically take to implement support for DLSS within this project? Could we make use of DXVK-NVAPI for such a feature?
This is a wonderful project and I'm looking forward to contributing some of my own work to it.