FYI on Linux, the same device shows 140fps at the starting point, and this benchmark takes 16.779 seconds wall clock time. glxinfo calls the GPU a "Mesa Intel(R) HD Graphics 500 (APL 2)".
https://github.com/divVerent/aaaaxy/blob/main/nodirectx_windows.go - FYI my workaround to default to OpenGL until this is resolved.
What draw calls are executed? You can see them with -tags=ebitendebug
I failed to run your aaaaxy:
2022/07/09 11:43:49.383029 [ERROR] cannot open out my version: could not open local:/generated/version.txt: open third_party/yd_pressure/assets/generated/version.txt: no such file or directory
goroutine 1 [running, locked to thread]:
runtime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
/usr/local/go/src/runtime/debug/stack.go:16 +0x19
github.com/divVerent/aaaaxy/internal/log.Fatalf({0x44f0a3d, 0x1d}, {0xc00013bf48, 0x1, 0x1})
/Users/hajimehoshi/ebitengine-games/aaaaxy/internal/log/log.go:101 +0x3c
main.main()
/Users/hajimehoshi/ebitengine-games/aaaaxy/main.go:98 +0x135
2022/07/09 11:43:49.383077 [FATAL] could not initialize game: could not initialize version: could not open local:/generated/version.txt: open third_party/yd_pressure/assets/generated/version.txt: no such file or directory
exit status 125
This likely means you got the wrong binary - the one from GitHub Actions requires a source checkout that has performed "make generate" with the correct GOOS and GOARCH.
To reproduce, the binary here will work: https://github.com/divVerent/aaaaxy/releases/download/v1.2.141/aaaaxy-windows-amd64-v1.2.141.zip (just tested that on my Windows box).
Nevertheless, now building a "release" with ebitendebug in it so I can run that on Windows (don't have a dev environment there).
Uploaded an ebitendebug build on https://drive.google.com/drive/folders/1QfiiH53DsoV48EKIXF3U9V9yxVaR7txb?usp=sharing - will test it on the machines when I find time to see if anything suspicious is in the render calls list.
Typical draw call list on Linux/OpenGL (did a force-quit while the game screen was open so the blur behind the menu doesn't show up):
Update count per frame: 1
Internal image sizes:
2: (16, 16)
3: (16, 16)
4: (1024, 512)
5: (1024, 512)
6: (2048, 1024)
7: (1680, 1050)
8: (2048, 2048)
10: (1024, 512)
11: (1024, 512)
12: (1024, 512)
13: (1024, 512)
14: (128, 16)
Graphics commands:
draw-triangles: dst: 11 <- src: [8, (nil), (nil), (nil)], dst region: (x:1, y:1, width:640, height:360), num of indices: 6, colorm: {}, mode: copy, filter: nearest, address: unsafe, even-odd: false
draw-triangles: dst: 11 <- src: [8, (nil), (nil), (nil)], dst region: (x:1, y:1, width:640, height:360), num of indices: 1980, colorm: {}, mode: source-over, filter: nearest, address: unsafe, even-odd: false
draw-triangles: dst: 12 <- src: [8, (nil), (nil), (nil)], dst region: (x:1, y:1, width:640, height:360), num of indices: 6, colorm: {}, mode: copy, filter: nearest, address: unsafe, even-odd: false
draw-triangles: dst: 12 <- src: [8, (nil), (nil), (nil)], dst region: (x:1, y:1, width:640, height:360), num of indices: 1929, colorm: {}, mode: source-over, filter: nearest, address: unsafe, even-odd: false
draw-triangles: dst: 13, shader, num of indices: 6, mode copy
draw-triangles: dst: 12, shader, num of indices: 6, mode copy
draw-triangles: dst: 4, shader, num of indices: 6, mode copy
draw-triangles: dst: 13, shader, num of indices: 6, mode copy
draw-triangles: dst: 10, shader, num of indices: 6, mode copy
draw-triangles: dst: 5, shader, num of indices: 6, mode copy
draw-triangles: dst: 6, shader, num of indices: 6, mode copy
draw-triangles: dst: 7 (screen) <- src: [8, (nil), (nil), (nil)], dst region: (x:0, y:0, width:1680, height:1050), num of indices: 6, colorm: {}, mode: copy, filter: nearest, address: unsafe, even-odd: false
draw-triangles: dst: 7 (screen) <- src: [6, (nil), (nil), (nil)], dst region: (x:0, y:0, width:1680, height:1050), num of indices: 6, colorm: {}, mode: copy, filter: nearest, address: unsafe, even-odd: false
This matches my expectations - there is screen clearing, tile rendering, polygon rendering for the visible area, blurring that polygon, mixing the two together with the previous frame, blurring the output for the next frame, and finally copying all of that to the screen with a CRT filter, after which Ebiten will blit that to the screen again (but with a nearest filter, thanks to SetScreenFilterEnabled). Haven't checked yet if it looks any different when using DirectX.
The render call list seems to be the same when using the DirectX backend. I am sure I am actually using that backend, because whenever I launch with DirectX, at early startup there is a white rectangle on the screen where my command prompt was - with OpenGL this doesn't happen.
From your ebitendebug result, nothing looks odd.
I'd like to modify and try your aaaaxy on my local machine (macOS and/or Windows). Would it be possible to build it myself?
EDIT: I forgot to read the README. Thanks,
@divVerent Could you try a32a137fa805f8dca08e499a85f6e84fb96361c8? Thanks,
Current profiling result (a32a137fa805f8dca08e499a85f6e84fb96361c8, -vsync=false, warp (software rendering) on Parallels)
I will try your change - I do not think this issue is vsync=off specific, however "unnecessary flushes" is certainly a possibility.
Although I'd be surprised if this is due to ReadPixels/ReplacePixels being on a different command chain - I never do those outside precaching at the start of the game or text rendering in my menu (in-game text is precached too to avoid performance loss).
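(For context, "precached" here just means the text is drawn once into an offscreen ebiten.Image outside the frame loop. A rough sketch of that general pattern with the stock text package - not AAAAXY's actual code:)

package game

import (
    "image/color"

    "github.com/hajimehoshi/ebiten/v2"
    "github.com/hajimehoshi/ebiten/v2/text"
    "golang.org/x/image/font"
)

// cachedLabel holds the pre-rendered text so the frame loop only copies an
// existing texture instead of rasterizing glyphs (and uploading pixels) per frame.
var cachedLabel *ebiten.Image

// precacheLabel renders s once into an offscreen image at startup.
func precacheLabel(face font.Face, s string) {
    b := text.BoundString(face, s)
    cachedLabel = ebiten.NewImage(b.Dx()+2, b.Dy()+2)
    text.Draw(cachedLabel, s, face, -b.Min.X+1, -b.Min.Y+1, color.White)
}

// drawLabel blits the cached image; no text rendering happens here.
func drawLabel(dst *ebiten.Image, x, y float64) {
    op := &ebiten.DrawImageOptions{}
    op.GeoM.Translate(x, y)
    dst.DrawImage(cachedLabel, op)
}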
Note: I cannot patch https://github.com/hajimehoshi/ebiten/commit/a32a137fa805f8dca08e499a85f6e84fb96361c8 on top of Ebiten v2.3.5, but I am going to retest against Ebiten main which contains the change as well as 0035ba0bd1a35c4a27c2933af17276af7b7b7e1d.
Before the fix, commands were flushed every time DrawTriangles was called, regardless of copyCommandList usage. So, even though you don't call ReadPixels/ReplacePixels (and indeed you don't), commands were flushed and then unnecessary waiting happened.
Note that I don't plan to backport this change to 2.3 branch as this is just a performance improvement.
With your changes I now get 35fps at game start (OpenGL remains at 110fps). Way better than 19fps, so the flushing fixes certainly helped, but I'd really like to get up to 60fps before I can make DirectX mode default.
At fastest render settings (in the menu: graphics=SVGA quality=Lowest) I now get 150fps with DirectX, 215fps with OpenGL.
Phasing up render settings on DirectX again:
So the big steps are from low to medium, and from high to max. Looking at the source code (https://github.com/divVerent/aaaaxy/blob/0878d763d4bedad077d9416eaa13b2bd5e3251c3/internal/menu/settings.go#L255), they are:
Peculiarly, though, if I move quality to max but graphics to SVGA, I also get 100fps, which is very much acceptable. So the complex dither shader is expensive, and I can have either the dither shader (https://github.com/divVerent/aaaaxy/blob/main/assets/shaders/dither.kage.tmpl) or the CRT shader active, but not both, if I want to be better than 60fps.
I wonder if the reality is that all complex shaders are more expensive in DirectX than in OpenGL mode, and that there is also a hard cap on the framerate (in OpenGL, at the lowest possible settings, I can reach 220fps at most, BTW, but I bet that's simply CPU bound by my render code).
Thank you for the trial! I'll take a look further. My current suspect is how much the shader programs are optimized.
BTW there quite certainly are things in those shaders I could maybe write better; if it helps, here are the template settings the dither shader runs with:
.BayerSize = 0
.RandomDither = 0
.PlasticDither = 1
.TwoColor = 1
The linear2xcrt shader runs with:
.CRT = 1
(this rather complex part can be turned off by passing -screen_filter=linear2x, which makes it a fancy upscaler that no longer contains the scanline and bending effects)
I simply added an optimization flag to D3DCompile (bf0f3d304bd5c92f26d9df2b5591d1f848a255f1). The same method might work for Metal.
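(For reference, the gist of such a change is OR-ing D3DCOMPILE_OPTIMIZATION_LEVEL3 into the Flags1 argument of D3DCompile. A standalone Go sketch of that call - an illustration only, not Ebitengine's actual code:)

package d3dsketch

import (
    "fmt"
    "unsafe"

    "golang.org/x/sys/windows"
)

// D3DCOMPILE_OPTIMIZATION_LEVEL3 as defined in d3dcompiler.h.
const d3dcompileOptimizationLevel3 = 1 << 15

var (
    d3dcompiler    = windows.NewLazySystemDLL("d3dcompiler_47.dll")
    procD3DCompile = d3dcompiler.NewProc("D3DCompile")
)

// compileHLSL compiles HLSL source with full optimization enabled and returns
// the resulting ID3DBlob pointer. Error-blob handling is omitted for brevity.
func compileHLSL(src, entry, target string) (unsafe.Pointer, error) {
    var code, errs unsafe.Pointer
    srcBytes := []byte(src)
    entryC := append([]byte(entry), 0)
    targetC := append([]byte(target), 0)
    r, _, _ := procD3DCompile.Call(
        uintptr(unsafe.Pointer(&srcBytes[0])), uintptr(len(srcBytes)),
        0, 0, 0, // pSourceName, pDefines, pInclude
        uintptr(unsafe.Pointer(&entryC[0])), uintptr(unsafe.Pointer(&targetC[0])),
        d3dcompileOptimizationLevel3, 0, // Flags1, Flags2
        uintptr(unsafe.Pointer(&code)), uintptr(unsafe.Pointer(&errs)),
    )
    if uint32(r) != 0 { // non-zero HRESULT means failure
        return nil, fmt.Errorf("D3DCompile failed: HRESULT=%#x", uint32(r))
    }
    return code, nil
}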
I'll take a look further later (maybe tomorrow), but my current guess is that the HLSL code generated by Kage might not be good. Thank you for a lot of helpful information.
I'd be happy if you could take a look at bf0f3d304bd5c92f26d9df2b5591d1f848a255f1. Thanks,
With my Windows PC (Vaio LAPTOP-31PU6LDL), the FPS was about 70 in the original aaaaxy with Ebitengine v2.3.5, and about 100 with Ebitengine bf0f3d3. The FPS was 220 with OpenGL. So the FPS did increase, but it is still about 2x lower than with OpenGL.
I'm trying to add more optimization. Remaining tasks I can do are:
EDIT: My current guess is that the output of Kage is not yet mature, and the HLSL compiler's optimization doesn't work well on it. For example, examples/airship uses about 8 draw commands with the Ebitengine default shaders, but can keep over 400 FPS on my machine with DirectX, and 600 FPS with OpenGL.
EDIT2: I'm not 100% sure, but 4c121ae5eb13bffc6bb85e3d74fdc7b98cf5350e significantly improved the situation.
With the current latest commit b8367da7e235036e9c1a9834de50a0a604ec69d8, aaaaxy could keep 150-200 FPS!
On my machine, with Ebiten at b8367da7e235036e9c1a9834de50a0a604ec69d8 (TODO: should verify I actually built against that and there wasn't some caching effect): game starts out at 21fps but if I let it sit there, it soon moves to 31fps and stays there.
At SVGA/Max I get 119fps.
At VGA/High I get 122fps.
This is somewhat illogical - a VGA/Max frame should never take longer to render than one SVGA/Max frame plus one VGA/High frame, which would yield 1/(1/119+1/122) ~ 60fps, but it's substantially slower than that. Is there any way those shaders could negatively interact with each other? They're in different render passes, after all.
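(Spelling out that arithmetic as a sanity check:)

package main

import "fmt"

func main() {
    // If VGA/Max merely did the work of one SVGA/Max frame (119fps) plus one
    // VGA/High frame (122fps), the combined frame time would be the sum of
    // the two frame times, i.e. still about 60fps.
    t := 1.0/119 + 1.0/122
    fmt.Printf("%.1f fps\n", 1/t) // prints "60.2 fps"
}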
Confirmed I was actually including current code - the comment from b8367da7e235036e9c1a9834de50a0a604ec69d8 is in the binary I tested.
One thing I will do later (likely not before the end of next week) is experiment with my shader code - comment things out to see which parts are the expensive ones. There is a way to do this without recompiling (mainly a note for myself so I know how to speed this up when I have time for it):
aaaaxy-windows-amd64 -dump_embedded_assets=data
# make edits in data/assets/shaders/*
aaaaxy-windows-amd64 -cheat_replace_embedded_assets=data -batch
(-batch turns off the error dialog at the end when cheating, useful if I want to use Measure-Command with this as above)
As for a possible interaction between the shaders: both palette reduction (enabled when graphics is set to VGA or lower) and the CRT filter (enabled at max quality) add one render pass; the former adds a 640x360->640x360 pass, and the latter adds a 640x360->intermediate_res pass and changes Ebiten's final pass from 640x360->output_res to intermediate_res->output_res (where intermediate_res is the min of 2560x1440 and output_res).
Do note that this postprocessing uses the same input as the round of the two blur render passes that remember a blurred version of previous screen contents for the fade out effect in the "fog of war" area. As there is no data dependency on that output within the same frame, it is conceivable that these two operations might run partially in parallel (not sure how smart DirectX is, but OpenGL probably is not smart enough to do that kind of optimization).
Are there any DirectX-level debugging tools that could tell me if any such interaction might exist? Like a DirectX equivalent of apitrace?
-draw_outside=false disables the blur pass that remembers previous screen content, but keeps the two postprocessing shaders active - above 100fps with that.
With dither.kage.tmpl neutered (all commented out, and a Fragment function added that just returns imageSrc0UnsafeAt(texCoord)), I still get ~30fps.
Same treatment also done to linear2xcrt.kage.tmpl, and I get 37fps. Still nowhere near the 100fps.
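(For reference, such a neutered pass-through shader is essentially just this in Kage - a sketch using the standard Fragment entry point, not the exact file contents:)

package main

// Fragment simply samples the source image at the incoming texture
// coordinate, so the pass copies its input unchanged.
func Fragment(position vec4, texCoord vec2, color vec4) vec4 {
    return imageSrc0UnsafeAt(texCoord)
}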
(BTW: when doing this, be sure to check log messages on the console - my game code tends to skip shaders and work without them if compilation fails and it can detect that - too bad GLSL/HLSL-level compile errors happen as part of the main loop where I can't detect them, so this only really helps if Kage changes incompatibly)
So now I have ruled out the contents of the shaders (as seen above, optimization did help, but only to some extent); the slowness comes from the render passes themselves.
So the FPS is still around 30 with the default state, right?
EDIT: What about github.com/hajimehoshi/ebiten/v2/examples/airship on your machine?
Sorry but I'm not familiar with DirectX tools. It is possible that OpenGL implicitly executes some commands in parallel, while DirectX doesn't unless they are explicitly ordered. And, Ebitengine doesn't specify parallel executions.
I'm quite confused at what kind of shaders and how they interact in your application... A figure would be helpful. Thanks,
Very interesting. Perhaps, does the destination size matter?
Issue may be GPU specific though - I have this issue on one of these: https://www.amazon.com/2019office%E3%80%91-Ultra-Light-High-Speed-High-Performance-Notebook/dp/B09CQ22335/ref=sr_1_3?keywords=7+inch+laptop&qid=1657310835&sr=8-3 - according to Device Manager I have an Intel(R) HD Graphics 500.
Celeron J4125 has UHD Graphics 600 instead of UHD Graphics 500.
Could you confirm that this is the machine you are testing?
To be clear, I got a device that looks very much the same on AliExpress and has all connectors in the same place - I assume the Amazon one is the same, but it is possible that the innards change without the exterior changing.
My device has a Celeron J3455 according to /proc/cpuinfo, so yeah, it isn't quite the same.
The exact set of render calls depends on the settings, so a figure is rather hard to make. The general process at high settings is the one I described above for the Linux/OpenGL draw call list.
Thanks. The current performance bottleneck is the existence of the shader step "send P to CRT shader, output is C (typically at screen res, capped to 2560x1440)", and whether the shader's content is empty or not doesn't matter. Do I understand correctly?
I'm looking for a machine with the same chipset (Celeron J3455)
https://www.amazon.co.jp/dp/B0875LXTRC https://www.amazon.co.jp/dp/B096S7Y23N https://www.amazon.co.jp/dp/B0B14Z49GD https://www.amazon.co.jp/dp/B09R4FWC4D https://www.amazon.co.jp/dp/B07TXYRXW4
EDIT: I bought a used EZBook X3 Pro 64G
I am not yet sure that this is the bottleneck. Yes, removing that pass fixed the framerate, but removing the one that applies the palette (even if the shader is a NOP) fixes it too.
Which makes me think that the issue may be something else. Am I e.g. exceeding some limit in VRAM usage? Does it otherwise matter how many passes run?
But then why is the OpenGL backend not affected equally?
It's possible that the OpenGL driver does more sophisticated things than I do with DirectX. I'll take a look further after the machine I ordered arrives.
I now tried revamping how textures are allocated to have different strategies rather than always using the same temp texture for the same purpose - but this does not change DirectX performance at all, so we now know that's not it either.
This is in my branch managed-offscreens in my game - I am unsure if I really want to merge that, but it eliminates two 640x360 textures by default. In particular this rules out "VRAM exhaustion".
I found wrong descriptor table usages (#2201) and fixed them. Could you try b3267a712681fd46bbf99519eb1233a5dd12d08f? Thanks,
I've received the machine with Intel HD 500 Graphics so I'll try tomorrow.
~Did this work on your machine with high FPS?~ OK so this doesn't change the performance...
Not seeing any differences even now - but also, peculiarly, I cannot run dxcap.exe to get a capture of DirectX usage. In capture mode (dxcap -file aaaaxy.vsglog -c aaaaxy-windows-amd64), it just hangs around after Ebiten opens its window. Having said that, I've never used dxcap before, so this might be user error.
I also can no longer reproduce getting 100fps, even with the binary that I had before; I will retest later, suspecting this simply to be some background activity.
PIX (https://devblogs.microsoft.com/pix/download/) shows a lot of warnings - 131 "redundant transition to unused state" in a single frame - as well as some redundant ResourceBarriers. Maybe that is related?
Can't do much in PIX though; this laptop has an 800x480 screen and I can't reach half of its UI.
OK so FPS doesn't change... (though I believe the fix is necessary to use GPUs correctly)
"redundant transition to unused state" might be a very good hint. PIX didn't work well on my Parallels machine. I'll try the new machine (Intel HD Graphics 500) later anyway.
I think I could reproduce your issue, but the situation might be different.
In all the cases I disabled vsync.
With OpenGL, the FPS is around 110 even with the max quality.
EDIT: Oops, I tested this with Ebitengine v2.3 accidentally. With the latest commit, it reached 88 FPS with the max quality.
I couldn't see such warnings. How did you see them?
I realized that the FPS depends on the player's position, and in some places the FPS is actually less than 30. I'll take a look further.
I launched PIX, selected the game binary and set the environment variable EBITEN_GRAPHICS_LIBRARY to directx there, then launched the game from there and once all stabilized, hit print screen.
I may then have had to click something in the bottom area to let it actually play back the frame, and the warnings view then showed something - including links to click to get more warnings.
As for the numbers on your system - interesting that you do not get such a sharp cutoff. I assume that in OpenGL mode the framerate is substantially higher for you too?
To get the test more similar, maybe try hitting F (toggle full screen) then resize the window to about 800x480 (which is all my 7" laptop does)?
I'll try pressing print screen later, thanks.
Yes, higher and more stable with OpenGL.
I'm already using windowed mode at 1280x720. I'll try 800x480 later, but I don't think the window size matters here.
I pressed print screen and a .wpix file was created. I confirmed the DirectX calls and warnings: "Consecutive calls to ResourceBarrier". Yeah, I think this is it. The blur rendering uses an offscreen and this causes state switches. I'll look for a better way to use offscreens.
EDIT: ResourceBarrier can take multiple transitions in a single call. I should have batched them.
EDIT2: Hmmm, batching ResourceBarrier slowed applications... 🤔 https://github.com/hajimehoshi/ebiten/tree/issue-2188-batching
EDIT3: Never mind, this didn't cause regressions (but didn't improve the performance very much). I'll merge https://github.com/hajimehoshi/ebiten/pull/2203 later anyway.
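(For anyone following along: the batching idea is to collect pending transitions and submit them in a single ResourceBarrier call right before the next draw, since the D3D12 API accepts an array of barriers. A hypothetical Go sketch - the types are placeholders, not Ebitengine's real internal bindings:)

package barriersketch

// resourceBarrier stands in for D3D12_RESOURCE_BARRIER and commandList for
// ID3D12GraphicsCommandList; both are hypothetical placeholders.
type resourceBarrier struct {
    resource      uintptr
    before, after uint32
}

type commandList interface {
    // ResourceBarrier mirrors the D3D12 call, which accepts any number of
    // barriers at once, so transitions can be submitted as a single batch.
    ResourceBarrier(barriers []resourceBarrier)
}

// barrierBatcher collects pending state transitions instead of issuing one
// ResourceBarrier call per transition.
type barrierBatcher struct {
    pending []resourceBarrier
}

func (b *barrierBatcher) add(bar resourceBarrier) {
    b.pending = append(b.pending, bar)
}

// flush submits all pending transitions in one ResourceBarrier call,
// typically right before recording the next draw.
func (b *barrierBatcher) flush(cl commandList) {
    if len(b.pending) == 0 {
        return
    }
    cl.ResourceBarrier(b.pending)
    b.pending = b.pending[:0]
}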
Easiest way to reproduce for now is with my game AAAAXY, which currently (for this reason) defaults to OpenGL rather than DirectX:
In PowerShell, run:
(to view runtime fps, can run
.\aaaaxy-windows-amd64.exe -load_config=false -vsync=false -show_fps
which shows me 110fps at the start of the game in OpenGL, and 19fps in DirectX - the lower difference in TotalMilliseconds is primarily due to loading time "equalizing" things somewhat)
Issue may be GPU specific though - I have this issue on one of these: https://www.amazon.com/2019office%E3%80%91-Ultra-Light-High-Speed-High-Performance-Notebook/dp/B09CQ22335/ref=sr_1_3?keywords=7+inch+laptop&qid=1657310835&sr=8-3 - according to Device Manager I have an Intel(R) HD Graphics 500.
-vsync=false is most certainly not at fault - with vsync on, I can't reach 60fps either, which is very noticeable.