melonDS-emu / melonDS

DS emulator, sorta
https://melonds.kuribo64.net
GNU General Public License v3.0

Compute renderer displays garbage at high IRs #2047

Open FireNX70 opened 1 month ago

FireNX70 commented 1 month ago

Resolutions higher than 6x are very prone to displaying garbage. 6x only displays garbage in certain spots, though Zelda Spirit Tracks triggers it pretty easily. 5x seems fine. Tested on a 1080Ti + Win10. Tested with https://github.com/melonDS-emu/melonDS/commit/a72b79a55ad2d61811af11b1b911f6af863f66c2, so https://github.com/melonDS-emu/melonDS/commit/5df83c97c766bff3da8ba5a1504a6a5974467133 doesn't fix it.

Some examples (screenshots attached): Pokemon BW (Driftveil Bridge) at 6x; Pokemon Platinum at 5x and 6x in one spot and at 6x and 7x right next to it; Zelda Phantom Hourglass at 6x and 7x; Zelda Spirit Tracks at 5x and 6x.

RSDuck commented 1 month ago

Sorry, I have an Intel iGPU, I can't debug these things.

They probably just need even more work tiles, or some other index starts overflowing. I need to come up with a smarter way of allocating the work tile amount, or at least give an option to override it.

FireNX70 commented 1 month ago

I tried doubling MaxWorkTiles and it didn't fix it. Increasing the size of FinalTileMemory didn't do anything either.

takeshineale128 commented 1 month ago

No issues on my GTX 1660 at 16x IR on Win 11 with driver version 552.44.

FireNX70 commented 1 month ago

I've tested both updating the driver and using a T1000. Neither fixed this. So it's not the driver version or the architecture. I really don't think the OS would make a difference, especially if it's Win10 vs Win11. Were you really testing with the compute renderer?

takeshineale128 commented 1 month ago

opengl

FireNX70 commented 1 month ago

There are two OpenGL renderers. Are you testing with a dev version or are you using 0.9.5?

takeshineale128 commented 1 month ago

0.9.5

RSDuck commented 1 month ago

> Increasing the size of FinalTileMemory didn't do anything either.

FinalTileMemory isn't dynamically allocated, so it cannot overflow.

Not really sure what the issue could be here. It definitely seems tile-related, and the index should not overflow (yet).

nadiaholmquist commented 1 month ago

Can't seem to reproduce this on an AMD GPU (Steam Deck). Will try on other stuff later.

user18081972 commented 1 month ago

> opengl

@takeshineale128 This is the OpenGL renderer we are talking about: the compute shader one, not the classic one (screenshot attached).

user18081972 commented 1 month ago

https://github.com/melonDS-emu/melonDS/assets/124319511/2d446be1-d844-46dd-9723-3143277cb115

Here's a video example, hopefully it helps.

FireNX70 commented 1 month ago

@nadiaholmquist I've tested with a Vega 8 (3500U) on Win10 with an old driver from 2023 and it works fine even at 16x, although attempting to switch resolutions while the game was running completely froze the renderer. I've also tested with LLVMpipe in a VM and it also works fine even at 16x. It looks like a bug specific to Nvidia. I'll test Nvidia on Linux with the proprietary driver; I expect it will behave just like it does on Win10.

Edit: Nvidia on Linux with the proprietary driver gives the exact same result as on Win10.

RSDuck commented 1 month ago

oh no, there still might be undefined behaviour somewhere in the shaders. I'm going to run the shaders with mesa and a debug context eventually to see whether it warns about something.

takeshineale128 commented 1 month ago

Not using dev builds; my 0.9.5 build has no OpenGL option for classic or compute shader, so it's probably classic (screenshot attached).

KostaSaizo7 commented 1 month ago

Blue Dragon - Awakened Shadow, Windows 10, Nvidia GPU, compute shader at x8 (x7 and below is fine): https://www.youtube.com/shorts/TX5Nq3RdX5Q

user18081972 commented 1 month ago

I decided to downgrade my GPU driver (Nvidia RTX 3080) to the oldest supported one (version 456.71) to see if it might be a 'regression'. The same issue occurs on this driver version from October 7, 2020.

FireNX70 commented 4 weeks ago

glDispatchComputeIndirect will not log OpenGL errors even if the compute work group counts are too high, and checking these would be annoying since I'd have to read the buffer back from VRAM. However, RenderDoc did capture the values used for these. At 6x none of the calls ever go above the minimum of 65535. At 7x there is one call that does exceed that minimum (glDispatchComputeIndirect(<1, 1, 79414>)). This exceeds the maximum supported by my 1080Ti with the 552 driver (it sticks to the minimum of 65535 in the Z axis). I tried replacing the glDispatchComputeIndirect call in the loop with glDispatchCompute(1, 1, 65535). This DID have an effect on the garbage, but it did not fix it. It's likely this is the problem on Nvidia. The local sizes all seem fine.

Edit: Here's a gist on all the testing I've been doing https://gist.github.com/FireNX70/10c72169ea0105f998bab6f51443d42e
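
For reference, the per-axis limits involved here can be queried directly. A minimal sketch (standard OpenGL 4.3+ calls, not melonDS code; assumes a current GL context and a loader such as glad):

```cpp
#include <cstdio>
#include <glad/glad.h> // any loader exposing the GL 4.3 compute API works

void PrintComputeDispatchLimits()
{
    GLint maxCount[3];
    for (GLuint axis = 0; axis < 3; axis++)
        glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, axis, &maxCount[axis]);

    // The spec only guarantees 65535 groups per axis; the captured
    // glDispatchComputeIndirect(<1, 1, 79414>) call exceeds what this driver
    // reports for Z, which matches the breakage starting at 7x.
    std::printf("GL_MAX_COMPUTE_WORK_GROUP_COUNT: x=%d y=%d z=%d\n",
                maxCount[0], maxCount[1], maxCount[2]);
}
```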

RSDuck commented 4 weeks ago

thank you for debugging that.

Regarding the barriers, that is obviously a mistake. GL_SHADER_STORAGE_BUFFER is defined as 0x90D2 (which should result in a GL error), though if the driver accepts it and only uses the valid bits, most GPUs don't have barriers that fine-grained anyway, so there's probably at least one bit set which will insert a correct barrier.

Huh, I always assumed you could do work groups as long as it fits inside a 32-bit int (as it is stored in one). The reason it didn't fix it could be multiple things. If you don't change any of the bindings you will end up just doing the same work multiple times (possibly at the same time).

I need to look into what parallel-rdp does again; maybe it has a solution to this. I simplified some things, e.g. we always process all polygons at once, because they are limited to a fixed amount on the DS. One possible easy out would probably just be to increase the tile size to 16x16 at some point. That would also solve the possible issue of the tile buffer index overrunning.

FireNX70 commented 4 weeks ago

> thank you for debugging that.

No problem; once this is fixed I can fully switch to melonDS.

> Huh, I always assumed you could do work groups as long as it fits inside a 32-bit int (as it is stored in one). The reason it didn't fix it could be multiple things. If you don't change any of the bindings you will end up just doing the same work multiple times (possibly at the same time).

There's a required minimum limit of 65535; implementations are free to establish higher limits. I haven't checked what limits Mesa and Radeon report; Nvidia has a higher limit on the X dimension but sticks to 65535 for the Y and Z dimensions.

> I need to look into what parallel-rdp does again; maybe it has a solution to this. I simplified some things, e.g. we always process all polygons at once, because they are limited to a fixed amount on the DS. One possible easy out would probably just be to increase the tile size to 16x16 at some point. That would also solve the possible issue of the tile buffer index overrunning.

I'm not well versed in graphics or OpenGL, certainly not enough to understand what the compute renderer's doing. Having said that, it looks to me as though the right thing to do would be to cap VariantWorkCount's Z component to the respective GL_MAX_COMPUTE_WORK_GROUP_COUNT and do multiple passes per variant if needed. Easy to say, probably a pain in the ass to implement. I don't think larger tiles would guarantee the work group count wouldn't go above the limit at higher internal resolutions.
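
A rough sketch of that idea, purely illustrative (the names are hypothetical, not from melonDS, and as the next comment explains, the per-variant count isn't actually available CPU-side on the indirect path):

```cpp
#include <algorithm>
#include <cstdint>
#include <glad/glad.h>

// Dispatch a variant whose work is laid out along Z, split into chunks that
// stay under the driver's per-axis limit. Assumes the compute program is
// already bound; the shader would need the chunk offset (e.g. as a uniform)
// so each pass covers a different slice instead of redoing the same work.
void DispatchVariantChunked(GLint chunkOffsetUniform,
                            uint32_t totalGroupsZ, uint32_t maxGroupsZ)
{
    for (uint32_t first = 0; first < totalGroupsZ; first += maxGroupsZ)
    {
        uint32_t count = std::min(maxGroupsZ, totalGroupsZ - first);
        glUniform1ui(chunkOffsetUniform, first);
        glDispatchCompute(1, 1, count);
    }
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
}
```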

RSDuck commented 3 weeks ago

The problem is that the amount of work done per variant is not known to the CPU at the time it issues the indirect compute calls (that's the reason they are indirect in the first place). Letting the GPU do the binning, then waiting on that and then dispatching as many compute jobs as necessary would work, but it would just waste CPU time on waiting.

After sleeping on this, there is technically another solution: use the other two axes as well and then put all the axes together into one index in the shader.

I quite like the idea of increasing the tile size at very high resolutions though. The increased tile size results in fewer tiles covering the same area. As long as the total number of tiles is less than 2^16, no single variant can result in a compute job larger than that.

What of course remains a potential issue is that the tile buffer can overflow, as it has a fixed size. It would be possible to conservatively estimate the size necessary each frame by using the bounding boxes of each polygon, but with some polygon shapes that might result in excessive allocations.

The two solutions don't exclude each other; I guess I can just implement them both.
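
A minimal sketch of the "fold the count into the other axes" idea, with hypothetical names (in the real renderer the split would have to happen GPU-side, in whatever shader writes the indirect parameters):

```cpp
#include <algorithm>
#include <cstdint>

struct DispatchSize { uint32_t x, y, z; };

// Fold a flat group count into Y and Z so neither axis exceeds the guaranteed
// 65535 limit. The consuming shader would recover the flat index as
//   uint flatIndex = gl_WorkGroupID.z * gl_NumWorkGroups.y + gl_WorkGroupID.y;
// and return early when flatIndex >= realCount (y*z may round up past it).
DispatchSize SplitWorkCount(uint32_t realCount, uint32_t maxPerAxis = 65535)
{
    DispatchSize d{1, 1, 1};
    d.y = std::max<uint32_t>(1, std::min(realCount, maxPerAxis));
    d.z = (realCount + d.y - 1) / d.y;
    return d;
}
```

For the tile-size route the numbers seem to work out too: at 16x a 256x192 screen becomes 4096x3072 pixels, which with 16x16 tiles is 256x192 = 49152 tiles, still below 2^16.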

FireNX70 commented 3 weeks ago

Scaling the tile size with the internal resolution sounds good. Do keep in mind there's a limit on local size too, although the minimum of 1024 should be fine (1024x1024 for a single tile would be huge). As for the tile buffer: IMO better safe than sorry, especially if it doesn't overestimate by a lot.

Edit: 11x (almost 4K on a single screen) uses about 2650 MB of VRAM right now. It's not nothing, but there's some margin for the tile buffer to grow.
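
For reference, the local-size limits mentioned above can be queried the same way as the dispatch-count limits (standard GL queries, not melonDS code; assumes a current GL context):

```cpp
#include <cstdio>
#include <glad/glad.h>

void PrintLocalSizeLimits()
{
    GLint maxInvocations;
    GLint maxSize[3];
    glGetIntegerv(GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS, &maxInvocations);
    for (GLuint axis = 0; axis < 3; axis++)
        glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_SIZE, axis, &maxSize[axis]);

    // The spec minimums are 1024 total invocations per group and 1024/1024/64
    // per axis; drivers may report more.
    std::printf("max invocations %d, max local size %dx%dx%d\n",
                maxInvocations, maxSize[0], maxSize[1], maxSize[2]);
}
```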

FireNX70 commented 2 weeks ago

I've been messing around with scaled tiles. 16x16 tiles work fine; 32x32 tiles (the max size, since the tile size is used for the X and Y local sizes in certain shaders and I'm sticking to the minimum GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS of 1024) present a couple of problems.

glDispatchCompute(TilesPerLine*TileLines/32, 1, 1), used for ShaderClearCoarseBinMask, is a problem because the total tile count (TilesPerLine*TileLines) with 32x32 tiles will not divide by 32 nicely. I think we divide TilesPerLine*TileLines by 32 because of the shader's local size. That division works fine with both 16 and 48, though neither is optimal for Nvidia (warp size of 32) or Radeon (RDNA wave size is 32, GCN wave size is 64). I'm not sure if the divider for CoarseBinStride has to match the local size too. If so, this only works if it's 16, which is as far from optimal as 48 for Nvidia and RDNA but probably worse for GCN.

glDispatchCompute((gpu.GPU3D.RenderNumPolygons + 31) / 32, ScreenWidth/CoarseTileW, ScreenHeight/CoarseTileH), used for ShaderBinCombined, is also a problem. With 32x32 tiles, CoarseTileH would end up being 128. At lower tile sizes it's 32 or 64, which 192 is a multiple of, so the vertical resolution was guaranteed to be a multiple of CoarseTileH. 192 is not a multiple of 128, so a CoarseTileH of 128 becomes a problem at certain resolutions (specifically, odd multiples of the base resolution). This (I think) is what would cause a blank strip at the bottom of the screen with odd scales. Making CoarseTileCountY variable should fix it, but it introduced more bands (the number of bands matches the scale). With 6-tile-tall coarse tiles (when using 32x32 tiles) the height of a coarse tile is 192 (that way we make sure the internal resolution is a multiple of CoarseTileH), so the number of coarse tile lines matches the scale. What I think is really happening here is that there's actually one two-tile-thick band per coarse tile, because there's a hardcoded 4 somewhere, so 2 rows of tiles are ignored per coarse tile.

Just in case here's what I've done: https://github.com/FireNX70/melonDS/tree/try-fixing-the-compute-renderer
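
To make the divisibility problem concrete, a standalone sketch (illustrative only; it assumes the hardcoded factor of 4 tile rows per coarse tile mentioned above, which is what makes CoarseTileH 128 with 32x32 tiles):

```cpp
#include <cstdio>

// With 32x32 tiles and 4 tile rows per coarse tile, CoarseTileH is 128.
// 192*scale is a multiple of 128 only for even scales, which is where the
// blank strip at the bottom shows up on odd scales.
int main()
{
    const int coarseTileH = 32 * 4;
    for (int scale = 1; scale <= 16; scale++)
    {
        int height = 192 * scale;
        int leftover = height % coarseTileH;
        std::printf("scale %2d: height %4d, leftover rows %3d%s\n",
                    scale, height, leftover, leftover ? " <- strip" : "");
    }
    return 0;
}
```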

user18081972 commented 2 weeks ago

I built your branch, and it fixes it until 10x IR; then static blue lines appear in ChibiRobo (screenshot attached).

And in Avalon Code, static stripes of missing pixels (screenshot attached).

FireNX70 commented 2 weeks ago

Yeah, those are the stripes I was talking about. Starting at 10x it uses 32x32 tiles.

FireNX70 commented 2 weeks ago

I think what's happening with the stripes is that BinCombined's local_size_x must match CoarseTileCountX * CoarseTileCountY. A quick hack to do that (and changing the associated glDispatchCompute call to add 47 and divide by 48 instead of 31 and 32) gets rid of the stripes but causes a bunch of geometry to fail to render.

I've also found (through testing with 16x16 tiles) that the scale at which garbage may start being displayed is effectively 3.5x with 8x8 tiles (7x with 16x16 tiles in practice). The worst case scenario seems to be MKDS' desert track: if the fireballs (which are only there on the last 2 laps) get extremely close to the camera, and the scale is high enough and the tile size small enough, you'll get garbage. Driving through them seems to be the best way to trigger this. The next worst scenario, Pokemon BW Driftveil Bridge, only starts breaking at 5x (8x8) and is fine at 4x (8x8), 7x (16x16) and 8x (16x16).

FireNX70 commented 2 weeks ago

I think I got it working. Work count X just wasn't related to the local size for BinCombined. The only remaining problem is ClearCoarseBinMask, but I think changing the local size to 48 and dividing TilesPerLine*TileLines by 48 should fix that.
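
A quick sketch of why 48 divides evenly in that case (illustrative arithmetic using the names from the comments above, not the actual melonDS code):

```cpp
#include <cassert>

// With 32x32 tiles: TilesPerLine = 256*scale/32 = 8*scale and
// TileLines = 192*scale/32 = 6*scale, so TilesPerLine*TileLines = 48*scale*scale,
// which is always an exact multiple of a local size of 48 (unlike 32, which
// only divides it when the scale is even).
void CheckClearCoarseBinMaskDispatch(int scale)
{
    int tilesPerLine = 256 * scale / 32;
    int tileLines    = 192 * scale / 32;
    int totalTiles   = tilesPerLine * tileLines;

    assert(totalTiles % 48 == 0);
    int groups = totalTiles / 48;  // what glDispatchCompute(groups, 1, 1) would get
    (void)groups;
}
```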

user18081972 commented 2 weeks ago

On the Pokemon BW2 Driftveil Bridge I do indeed get corruption at 4x, 8x and 16x; any other resolution works perfectly (tested on https://github.com/FireNX70/melonDS/commit/6f8ce9fe897fab0277754209122df4181fb4a177; screenshot attached).