I am very curious about this question as well, especially since the README contains the following text:

> Realistically, it's now abandoned and will only possibly receive updates to run workloads I am personally interested in (DLSS).

Being able to run DLSS on AMD hardware would be a game-changer for sure, and I am curious what it will take to get there. Amazing project, and I am hoping this won't die now that funding has stopped.
Attempt no. 1 (Base ZLUDA build)

Details: DXVK-NVAPI requires patches in order to bypass AMD GPU detection; the rest of the libraries work fine. GPU architecture is set to `NV_GPU_ARCHITECTURE_TU100`.

Observation: the game(s) fail at `NVSDK_NGX_D3D12_GetFeatureRequirements`. Two examples are provided (Minecraft RTX, R&C Rift Apart), although preliminary initialization does succeed.
I did not take notes last time I poked there, so it might not be 100% accurate, but:

Hook `NvAPI_D3D11_CreateCubinComputeShaderWithName` and dump the `pCubin` argument; it's a nice PTX assembly, and that's how far I got:

```
cvt.u64.u32 %rd58, %r86;
...
sust.p.2d.v4.b32.zero [%rd58, {%r195,%r196}], {%f576,%f1418,%f1418,%f1418};
```

Then map `NvAPI_D3D11_CreateCubinComputeShaderWithName` -> `cuModuleLoadData` + `cuModuleGetFunction`, and `NvAPI_D3D11_LaunchCubinShader` -> `cuLaunchKernel`. Might require touching up the PTX manually or doing some pre/post processing on the host side. This is all to prove that we fully understand the DLSS API flow and that there's no action at a distance. Might not be possible.

Then there are the missing PTX instructions: `tex.` with LOD, `mma.` or `wmma.`. At this point they don't have to be fast, just correct.

There's the `amdgpu_cs` calling convention in LLVM; maybe that's enough?

`zluda_api` is the ZLUDA subproject with enough nvapi implemented for the purpose of debugging and development of DLSS.

Regarding the Vulkan interop, it seems HIP should support it, although I'm not sure if a kernel can be bound to an arbitrary command list as you specify. See the example here
D3D12 interop should also be available on Windows. There are no HIP samples, but the recently released Orochi 2.0 does contain a DX12 sample: here
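To make the proposed NvAPI -> CUDA flow from the earlier comment concrete, here is a minimal sketch with the CUDA driver side mocked out. A real implementation would call `cuModuleLoadData`, `cuModuleGetFunction` and `cuLaunchKernel`; every name below (`Translator`, `CubinShader`, ...) is illustrative, not ZLUDA's actual internals:

```rust
/// Handle returned by the mocked shader-creation call.
struct CubinShader {
    module: usize,    // index into the "loaded modules", i.e. a mock CUmodule
    function: String, // kernel name, i.e. a mock CUfunction lookup key
}

#[derive(Default)]
struct Translator {
    modules: Vec<String>, // loaded PTX/cubin blobs
}

impl Translator {
    // NvAPI_D3D11_CreateCubinComputeShaderWithName
    //   -> cuModuleLoadData + cuModuleGetFunction
    fn create_cubin_compute_shader_with_name(&mut self, cubin: &str, name: &str) -> CubinShader {
        self.modules.push(cubin.to_owned()); // stands in for cuModuleLoadData
        CubinShader {
            module: self.modules.len() - 1,
            function: name.to_owned(), // stands in for cuModuleGetFunction
        }
    }

    // NvAPI_D3D11_LaunchCubinShader -> cuLaunchKernel
    fn launch_cubin_shader(&self, shader: &CubinShader, grid: (u32, u32, u32)) -> Result<(), String> {
        match self.modules.get(shader.module) {
            Some(_) => {
                let _ = grid; // would become cuLaunchKernel's grid dimensions
                Ok(())
            }
            None => Err(format!("no module loaded for kernel {}", shader.function)),
        }
    }
}
```

The point of the sketch is only the shape of the mapping: creation is a load-plus-lookup, launch is a single kernel dispatch, and the PTX blob may need host-side pre/post processing in between.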
I ran some tests and want to report my findings, which are very bare-bones.

I tried to run the DLSS sample that was linked on an AMD GPU on Linux. The first issue is that Vulkan requires a few extensions that AMD GPUs do not have:

- `VK_NVX_binary_import` - allows applications to import CuBIN binaries and execute them.
- `VK_NVX_image_view_handle` - allows applications to query an opaque handle from an image view for use as a sampled image or storage image.

I patched the source to bypass those checks, along with some other changes:

- `NGXWrapper::IsFeatureSupported` to always return true
- `deviceParams.swapChainBufferCount = 3;`
- `surfaceCaps.maxImageCount < m_DeviceParams.swapChainBufferCount` to `surfaceCaps.maxImageCount != 0 && surfaceCaps.maxImageCount < m_DeviceParams.swapChainBufferCount`
- `.setBufferDeviceAddressMultiDevice(true);`
- `nvrhi::Format::D24S8` to `nvrhi::Format::D32S8` (fixes broken rendering)

After this I finally got the sample running; sadly, I did not get any calls to NVAPI. This is likely because NGX does a feature check and refuses to initialize on unsupported GPUs. At this point I have some questions: is there a way to force-enable extensions? (I tried `MESA_EXTENSION_OVERRIDE` but no luck; I assume it's OpenGL only.) I think there are three options:
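For reference, the swapchain patch above is needed because in Vulkan a `VkSurfaceCapabilitiesKHR::maxImageCount` of 0 means "no upper limit", which the sample's original `maxImageCount < requested` test did not account for. The corrected check as a standalone sketch (the function name is mine):

```rust
/// Does the requested swapchain buffer count exceed the surface caps?
/// In Vulkan, maxImageCount == 0 means the surface imposes no upper
/// limit, so only a non-zero cap can actually be exceeded.
fn buffer_count_exceeds_caps(max_image_count: u32, requested: u32) -> bool {
    max_image_count != 0 && max_image_count < requested
}
```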
Update:

Compiling Mesa from source was surprisingly easy. I updated RADV to report that it supports the two extensions mentioned above. I also figured out that on Linux the architecture check is done using `libnvidia-ml` instead of NVAPI. I updated the ZLUDA implementation of that lib to report my card as Turing.

This got me further, but DLSS support is still not being reported. I feel `NVSDK_NGX_VULKAN_GetFeatureRequirements` is what's locking up the system. I'll try to find a way to enable some more logging in NGX; maybe it will shed some light on what's going wrong.
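For context, spoofing Turing through the nvidia-ml path amounts to answering queries such as `nvmlDeviceGetCudaComputeCapability` with Turing-class values: Turing parts (TU102/TU104/TU106/TU116) have CUDA compute capability 7.5. A trivial, hypothetical stub, not ZLUDA's actual code:

```rust
/// Hypothetical nvidia-ml stub: report the device as a Turing-class GPU.
/// Turing corresponds to CUDA compute capability 7.5.
fn device_get_cuda_compute_capability() -> (i32, i32) {
    (7, 5) // (major, minor)
}
```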
Update 2:

After running the demo with logging enabled via the `__NGX_LOG_LEVEL=10` environment variable, I was able to pinpoint why the application is failing. These are the relevant snippets:

```
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkCreateCuModuleNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkCreateCuFunctionNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkDestroyCuModuleNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkDestroyCuFunctionNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkCmdCuLaunchKernelNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkGetImageViewHandleNVX
[tid:124644461002368][LoadVkEntrypoints:95] Failed to load VK device function: vkGetImageViewAddressNVX
[tid:124644461002368][CreateKernel:388] Error: vkCreateCuModuleNVX() and/or vkCreateCuFunctionNVX() functions are not available. Request VK_NVX_binary_import extension
```
This brings us to a point where we have to implement these functions in RADV. Before doing that, I thought it would be wise to test the Vulkan/HIP interop example I linked to above. Sadly, after compiling, the program crashes with the following error:

```
hipVulkan: /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.2/rocclr/device/rocm/rocdevice.hpp:241: virtual bool roc::NullDevice::importExtSemaphore(void**, const amd::Os::FileDesc&, amd::ExternalSemaphoreHandleType): Assertion `false && "ShouldNotReachHere()"' failed.
```

It appears `hipImportExternalSemaphore` and a few other functions are not implemented in the ROCm CLR backend, only in the PAL one, which I was unable to easily compile, so it will be for another day. I opened an issue about it: https://github.com/ROCm/clr/issues/72

It is likely the HIP/Vulkan interop functions require a special build of rocm-clr and the AMDVLK driver instead of RADV. I guess the next step will be attempting to get that working, followed by compiling a custom version of AMDVLK that supports the required NVX extensions.

Windows might also be a better platform to target, as I bet the interop will work there. But I have no idea if there is a way to inject extensions into the Vulkan/DX12 implementation.
I managed to get a Vulkan layer "working"; at least I can get calls passed from the DLSS example to ZLUDA, and I now have the PTX dumped.

I had to implement/stub the following instructions:

- `tex.base`
- `sust.p`

I am now missing the following:

- `cvt.rn.f16x2.f32`
- `ldmatrix`
- `mma`
- `fma.relu`
- `cp.async.cg`

@vosen My question now is about `cvt` with `f16x2`. What do you think is the best way to implement it, as it has three arguments instead of the usual two for `cvt`?
Is it better to create a new struct like this:
```rust
pub struct Arg3Cvt<P: ArgParams> {
    pub dst: P::Operand,
    pub src1: P::Operand,
    pub src2: Option<P::Operand>,
}
```
and substitute it as the argument for all versions of `cvt`? Or is it better to make a new `cvtHalf` instruction?
Or is there a third, better way? I'm still learning Rust.
Best to decide depending on what sort of LLVM bitcode we need to emit for it. `ast::CvtDetails` already splits the conversion cases into separate variants (IntFromInt, FloatFromFloat, IntFromFloat, FloatFromInt) which are handled differently. One more distinct variant is going to fit in fine with the rest of the code. On the other hand, adding a distinct `ast::Instruction` variant just for `cvt.rn.f16x2.f32` is fine too: there is no reason for the other `cvt` variants to care about a src2 that will always be `None`, and there's already precedent for this sort of code in the codebase (e.g. `Atom` vs `AtomCas`).

Both solutions are good; choose whichever turns out to be more convenient.
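For whoever picks this up: semantically, `cvt.rn.f16x2.f32` is just two independent f32 -> f16 conversions with round-to-nearest-even, packed into one 32-bit register. A pure-Rust sketch of the conversion (the packing order of the two halves below is my assumption; verify it against the PTX ISA before relying on it):

```rust
/// f32 -> f16 bit pattern with round-to-nearest-even.
fn f32_to_f16_rne(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32;
    let man = bits & 0x007f_ffff;

    if exp == 0xff {
        // Inf / NaN (collapse the NaN payload to a single quiet bit)
        return sign | 0x7c00 | if man != 0 { 0x0200 } else { 0 };
    }
    let e = exp - 127 + 15; // rebase exponent from f32 bias to f16 bias
    if e >= 0x1f {
        return sign | 0x7c00; // overflow -> infinity
    }
    if e <= 0 {
        if e < -10 {
            return sign; // too small even for an f16 subnormal -> signed zero
        }
        // f16 subnormal: shift the 24-bit significand into place and round
        let sig = man | 0x0080_0000;
        let shift = (14 - e) as u32;
        let kept = sig >> shift;
        let rem = sig & ((1 << shift) - 1);
        let halfway = 1 << (shift - 1);
        let round_up = rem > halfway || (rem == halfway && (kept & 1) != 0);
        return sign | (kept as u16 + round_up as u16);
    }
    // Normal case: keep the top 10 mantissa bits, round the 13 dropped ones.
    let mut h = sign | ((e as u16) << 10) | ((man >> 13) as u16);
    let rem = man & 0x1fff;
    if rem > 0x1000 || (rem == 0x1000 && (h & 1) != 0) {
        h += 1; // a mantissa carry correctly bumps the exponent (up to Inf)
    }
    h
}

/// cvt.rn.f16x2.f32 d, a, b — a packed into the upper half, b into the
/// lower half (assumed order).
fn cvt_rn_f16x2_f32(a: f32, b: f32) -> u32 {
    ((f32_to_f16_rne(a) as u32) << 16) | f32_to_f16_rne(b) as u32
}
```

Whichever AST shape is chosen, the lowering itself stays this simple, which is an argument for picking the representation that disturbs the existing `cvt` handling least.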
So after the rollback, is it still possible to work on DLSS, or is it a far-off goal now?
It’s pretty much a pipe dream for the foreseeable future.
What would it realistically take to implement support for DLSS within this project? Could we make use of DXVK-NVAPI for such a feature?
This is a wonderful project and I'm looking forward to contributing some of my own work to it.