microsoft / DirectXShaderCompiler

This repo hosts the source for the DirectX Shader Compiler which is based on LLVM/Clang.

Does programmable blending (aka RT Read) really need a new hardware feature? #2293

Closed gongminmin closed 5 years ago

gongminmin commented 5 years ago

Programmable blending (aka RT Read) has been listed on the roadmap for quite a while, but hasn't seen any progress. This feature has existed on other platforms for 5+ years: first on Apple's platform as GL_APPLE_shader_framebuffer_fetch, later on OpenGL and OpenGL ES as GL_EXT_shader_framebuffer_fetch. These extensions are supported by all mobile GPUs and by Intel GPUs. With OpenGL 4.5 or the GL_ARB_texture_barrier extension, RT read can also be achieved on NVIDIA and AMD GPUs. However, HLSL/Windows has no equivalent, so even when the hardware supports RT read, apps can't use it.

Furthermore, OpenGL ES 3.2 has the "advanced blend equation" feature, which allows setting the blending equation to multiply, screen, color burn, exclusion, etc. Without RT read, these blending equations have to be implemented by ping-ponging between render targets. On tile-based GPUs, every RT switch is a disaster for performance.

The RT read intrinsic needs to be ready first; then hardware vendors can prepare their drivers to support it. No hardware modification is required to read from an RT.
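
For illustration, a minimal sketch of what such an intrinsic could look like, here implementing the "screen" blend equation. ReadRenderTarget is a hypothetical name, not an existing HLSL intrinsic; it stands in for the framebuffer-fetch functionality described above:

```hlsl
// ReadRenderTarget is HYPOTHETICAL, not part of HLSL today; it stands in for
// framebuffer fetch (cf. GL_EXT_shader_framebuffer_fetch).
float4 PSMain(float4 pos : SV_Position, float4 src : COLOR0) : SV_Target0
{
    float4 dst = ReadRenderTarget(0);   // last value written to RT0 at this pixel
    return dst + src - dst * src;       // "screen" blend: 1 - (1 - dst) * (1 - src)
}
```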

kingofthebongo2008 commented 5 years ago

Hello Minmin,

As far as I understand, the scenarios you describe can be done very effectively with compute shaders. dxc has its roots in discrete desktop GPUs; it hasn't gone far into the mobile space so far. Maybe this will change in the future. Overall, programming is moving toward compute workloads. And maybe dxc needs a voting system, so people can vote on features.

In reality, if I were keen on the feature, I would implement it and offer a pull request to dxc. I have done this in the past with some bug reports and small bug fixes. That's the good thing about public code.

gongminmin commented 5 years ago

Thanks. Yes, some functionality can be replaced by compute shaders. However, if you need programmable blending after rasterization, a compute shader costs one more pass. Another way is to create the RT with the unordered-access flag and read/write it in the PS via a UAV, but the depth/stencil test is ignored in that case.
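
A minimal sketch of that UAV approach (the names and binding slot are illustrative): the texture is created with the unordered-access flag and bound as a UAV rather than an RTV, so the shader can read-modify-write it, at the cost of the fixed-function depth/stencil behavior noted above:

```hlsl
// "Render target" bound as a UAV instead of an RTV (slot u1 is illustrative).
RWTexture2D<float4> backBuffer : register(u1);

[earlydepthstencil] // can force the fixed-function test to run first, but does
                    // not restore full late depth/stencil behavior
void PSMain(float4 pos : SV_Position, float4 src : COLOR0)
{
    uint2 p = uint2(pos.xy);
    float4 dst = backBuffer[p]; // read the previously written value
    backBuffer[p] = dst * src;  // e.g. a "multiply" blend
}
```

Note there is also no ordering guarantee between overlapping fragments with a plain UAV.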

On desktop there is also demand for advanced blending. For example, designers like to use Photoshop's blend modes to compose UI, but when mapped to code, those modes can't be implemented in one pass.

My key point is that on other platforms there are ways to do RT read on existing hardware. It shouldn't require new hardware.

zhaijialong commented 5 years ago

+1 for supporting RT read.

As more and more projects choose to use DXC + spirv-cross to generate their GLSL/MSL for mobile platforms (we use that combination in our in-house engine), it would be great to have framebuffer fetch (and/or PLS) support in HLSL.

AFAIK the next-gen Intel iGPUs will be tile-based, so this feature will be very important for good performance on the PC platform too.

Degerz commented 5 years ago

You do realize that even with a texture barrier you're still effectively doing a ping-pong, right, OP?

https://docs.microsoft.com/en-us/windows/desktop/direct3d11/rasterizer-order-views

Also, if you want to do "programmable blending", you can use HLSL ROVs to similar effect ...
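
For reference, a minimal ROV sketch; RasterizerOrderedTexture2D is the real HLSL type (Shader Model 5.1+, requires ROV-capable hardware), while the names and binding slot are illustrative:

```hlsl
// ROV: reads and writes are ordered between overlapping pixels, which is what
// makes blending inside the shader well defined.
RasterizerOrderedTexture2D<float4> target : register(u1);

void PSMain(float4 pos : SV_Position, float4 src : COLOR0)
{
    uint2 p = uint2(pos.xy);
    float4 dst = target[p];            // ordered read of the prior fragment
    target[p] = lerp(dst, src, src.a); // classic straight-alpha "over" blend
}
```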

gongminmin commented 5 years ago

Hi @Degerz ,

No. With a texture barrier, you can read from and write to the same texture (with some restrictions). It's not ping-pong from the API's point of view.

I know PS + UAV can do programmable blending; I had a prototype of it. But that method has its own restrictions; see my comment about the depth/stencil test. ROV requires even newer hardware, and its performance can be very different from framebuffer fetch on tile-based GPUs.

Degerz commented 5 years ago

@gongminmin Could the early depth/stencil tests be emulated in the shader using HLSL's wave operations, specifically the quad-wide shuffle operations?

Using a texture barrier is a bad idea since it causes a flush on the GPU, and I don't think it works in the case of self-intersecting geometry either. ROV might be the only performant way to do programmable blending currently, even if it's only available on the latest GPUs.

Also, render target reads might not be supported by other vendors such as AMD or Nvidia.

gongminmin commented 5 years ago

It's still possible to implement depth/stencil tests in the shader, but it's far from the convenience and elegance of shader_framebuffer_fetch.
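
For example, a hedged sketch of a manual depth test, assuming a copy of the depth buffer is bound as an SRV (the name and slot are illustrative); it reproduces the test's result, but not the efficiency of the fixed-function path:

```hlsl
Texture2D<float> sceneDepth : register(t0); // SRV copy of the depth buffer

float4 PSMain(float4 pos : SV_Position, float4 src : COLOR0) : SV_Target0
{
    // Manual LESS_EQUAL depth test against the stored depth.
    float stored = sceneDepth[uint2(pos.xy)];
    clip(stored - pos.z); // clip() discards the fragment when its argument is negative
    return src;
}
```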

The first example in ARB_texture_barrier is to "accomplish a limited form of programmable blending for applications where a single Draw call does not self-intersect, by binding the same texture as both render target and texture and applying blending operations in the fragment shader." The whole idea of RT read is not about solving the self-intersecting geometry problem; the fragment order is still unsorted.

The typical usage of RT read is a deferred shading algorithm optimized for tile-based architectures: it requires no RT switches during shading, and reads from different RTs at different steps. The other common usage is advanced blending in window composition, such as in DWM. With RT read, the blending in composition can be dramatically simplified.
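
As a sketch of that deferred-shading usage, again using the hypothetical ReadRenderTarget intrinsic from above (ShadePixel is likewise a hypothetical helper): the lighting pass reads the G-buffer for the current pixel directly, so on a tile-based GPU the G-buffer never has to leave tile memory:

```hlsl
// Lighting pass: the geometry pass wrote the G-buffer to RT0..RT2, which is
// still resident in tile memory; ReadRenderTarget is hypothetical HLSL.
float4 LightingPS(float4 pos : SV_Position) : SV_Target0
{
    float4 albedo   = ReadRenderTarget(0);
    float4 normal   = ReadRenderTarget(1);
    float4 material = ReadRenderTarget(2);
    return ShadePixel(albedo, normal, material); // hypothetical shading helper
}
```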

Shader features should be a reasonable union of hardware features from different vendors, with shader model tiers or capability bits telling developers whether a feature is available. All mobile GPUs support it, next-gen Intel GPUs support it, AMD and NVIDIA GPUs can more or less simulate it, and people want it. Why not just add RT read to DXIL and push hardware vendors to implement/optimize it?

Degerz commented 5 years ago

Yes, but with RT reads you don't have issues with self-intersecting geometry the way a texture barrier does, which gives you incorrect blending in that case.

As far as a "reasonable union of hardware features" is concerned, RT reads are not that 'reasonable' a union, since only one desktop GPU vendor supports them, and D3D is based on desktop GPUs rather than mobile GPUs, so the latter don't have much relevance.

I don't see much purpose in exposing a hardware feature that is effectively only available from one hardware vendor. Even if you do push other hardware vendors to implement the feature, they don't necessarily have to do it natively in hardware; they can just emulate it in the driver by modifying fragment shaders to do early depth/stencil testing ...

gongminmin commented 5 years ago

When vertex texture fetch came out, only NVIDIA supported it. That didn't stop SM3 from having the VTF feature. DXIL is not just for desktop; many people connect dxc and SPIR-V tools to cross-compile HLSL to all kinds of platforms. Lacking this important feature means people can't write efficient shaders for tile-based GPUs. NVIDIA GPUs are also moving toward tile-based: starting from Maxwell, the GPU first divides the screen into huge tiles (256 x something pixels), then rasterizes pixels in 4x8 blocks. DXIL should be one step ahead of hardware vendors instead of waiting for hardware features.

Degerz commented 5 years ago

DXIL is pretty much made with only desktop in mind, and who really cares about tile-based mobile GPUs? There are other issues with tile-based GPUs, like their poor drivers even on Vulkan, and currently there's no Vulkan/SPIR-V extension for an RT read, so why even bother?

Also, despite what Nvidia marketing says, the only thing that's "tile-based" about their recent GPUs is how they do the binning process, so it's not at all comparable to a mobile GPU's "tile-based deferred rendering". Those are sort-middle renderers, while Nvidia's are still largely immediate-mode renderers (sort-last) ...

DXIL being one step ahead of HW vendors would be a disaster, considering Microsoft has standardized suboptimal ideas in the past, like geometry shaders, stream-output/transform feedback, and tessellation, all of which are very anti-mobile-GPU features. I'm not against standardizing RT read, but I don't know if the other IHVs (AMD & NV) would be as welcoming to the proposal, since they'd have to enforce more API ordering constraints with early depth/stencil tests, which would be harmful to their highly parallel architectures ...

gongminmin commented 5 years ago

Well, "made with only desktop in mind" is something that needs to change. Even if you only care about Windows, there are ARM devices, such as WOA and HoloLens 2. SPIR-V has an extension mechanism, but DXIL doesn't; adding a new feature to DXIL can take much longer than adding a SPIR-V extension.

Tile-based is the trend; desktop GPU vendors are (slowly) moving toward it. Whether it's deferred or not, the last fragment output should be available to the shader.

(Agreed, the geometry shader is a disaster, even on desktop :(. On the other hand, stream-output was important before we had compute shaders. And tessellation doesn't affect parallelism or mobile much; all recent mobile GPUs support tessellation.)

Degerz commented 5 years ago

I am convinced that there's absolutely no room for mobile GPUs to flourish in D3D, and with the advent of standardized mesh shaders, application graphics design between desktop and mobile GPUs is going to become irreconcilable. It will be very difficult for mobile GPUs to cope with high scene complexity in the future. DXIL has extensions as well; just because IHVs can't update WDDM the way they can their ICDs doesn't mean DXIL can't get extensions. Sure, it's slower, but Microsoft has a reliable track record when it comes to conformance testing ...

Tiling on desktop GPUs isn't all that comparable in implementation to mobile GPUs. On mobile GPUs, such as Apple's independently designed A11/A12 GPUs, it's advantageous to deploy a sort-middle architecture so that you can serialize per-tile lists of geometry to save bandwidth. Tiled rendering on desktop GPUs and newer mobile GPUs is mostly used to bin some triangles per tile in screen space to exploit spatial coherence, so that the caches/on-chip buffers have higher hit rates. This largely makes them immediate-mode renderers, or sort-last architectures.

Stream-output is a really bad idea on mobile GPUs, since it would disable their ability to do per-tile binning: they'd be forced to execute the entire vertex shader for the whole scene to keep the output data in order with the input data. As for tessellation not affecting mobile that much, I know Apple would like to have a word with you there. ;)

If anything, tile-based rendering is dying out in favour of immediate mode because it cannot meet the demands of higher scene complexity, so this makes doing RT read in the future less desirable even on mobile GPUs, since it hurts parallelism in the long run.

gongminmin commented 5 years ago

Tiling on desktop GPUs isn't the same as on mobile, but it's not just triangle binning; the rasterization order is also tile-based. For example, on a GTX 960, drawing a full-window quad, the pixel drawing order can be visualized as: [image: GTX 960 tile-order visualization]

A tile-based mobile GPU (an Adreno 430 here) uses a much smaller tile size for its order: [image: Adreno 430 tile-order visualization]

And a purely traditional desktop GPU rasterizes the quad from left to right: [image: traditional left-to-right rasterization order]

Anyway, the point is, DXIL shouldn't be designed just for immediate-mode GPUs (that doesn't even cover all Windows devices, because of Windows on ARM).

Degerz commented 5 years ago

Anyway, the point is, DXIL shouldn't be designed just for immediate-mode GPUs (that doesn't even cover all Windows devices, because of Windows on ARM).

The strong lobbying from AMD and Nvidia says otherwise ... :wink:

Many Windows on ARM devices probably don't even support bindless textures, tiled resources, conservative rasterization, or even access to barycentric coordinates inside pixel shaders, among many other features ... (Do mobile GPUs even support cross-lane operations?)

DXIL is clearly made for desktop GPUs and should continue to cater to them, since mobile graphics technology is the laughing stock of the industry when most vendors don't even care to provide working drivers ...

I dearly hope the others (ARM/QCOM) will come to license graphics technology from AMD/NV just like Samsung did ...

BitMD commented 5 years ago

Thanks for the feedback here and interesting discussion. I'll withhold my personal opinion on this subject for now. :)