Open Venemo opened 5 years ago
Actually most motherboards have very few PCIe x16 slots, usually one or two. While there might be more full-sized PCIe slots, they actually work as x4 ones. So maybe @axeldavy could try and test his video card in an x4 PCIe slot.
@Venemo, you could try the radeontop program, which shows the utilization of specific GPU subsystems. It doesn't sound like the bottleneck is there, but it might give some useful hint.
You could also take a look at what options are available to you in R600_DEBUG; despite its name, it works on the latest Radeons too. Run R600_DEBUG=help glxgears and you will get all available options (some drivers have more). Try everything that is not about dumping stuff to the console, one option at a time.
Thunderbolt 3 definitely complicates things.
I would add testing all working iommu types, in case you do need hardware support for it.
I had a similar slowdown caused by the fact that my distribution had an i586-optimized glibc that used a memcpy() with a hack for the original Pentium: it would read from the destination memory before overwriting it with real data. When used over my PCIe x16 link it would show something like 25% CPU in perf top and 50 fps instead of 70 fps (benchmark average).
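To illustrate the kind of hack being described here (a hypothetical sketch, not the actual glibc i586 routine):

```c
/* Hypothetical sketch of the i586 memcpy trick described above; NOT the
 * actual glibc source. On a Pentium, reading the destination first pulls
 * the cache line in, making the subsequent writes cheap. Over a PCIe
 * mapping, that same read becomes a slow uncached bus round-trip per
 * line, which is where the CPU time disappears. */
#include <stddef.h>

void *memcpy_i586_style(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < n; i += 32) {          /* 32-byte P5 cache lines */
        volatile unsigned char touch = d[i];      /* the "hack": read dst first */
        (void)touch;
        size_t end = (i + 32 < n) ? i + 32 : n;
        for (size_t j = i; j < end; j++)
            d[j] = s[j];
    }
    return dst;
}
```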
This reminds me...
You had two separate perf top logs that showed si_set_constant_buffer and amdgpu_mm_rreg as the top functions... but I could not quite understand what you had done differently to get one result or the other. It's not compilation or inlining, as one is from mesa3d and the other is from the kernel.
If you are compiling your own mesa3d, try to use -O1 or -Og; it might give some hints. Actually, compiling with debug support would be a very good idea.
When you run perf top you can select a function with the arrow keys. Press Enter once, then press it a second time to enter Annotate mode. In that mode it will show you how hot each instruction in that function is. If you have debug support it might even show you the source.
Also, I've tried to make a line that would show all the graphs on the screen; try it in case it provides some hint.
export GALLIUM_HUD=\
".dfps+cpu+GPU-load+temperature,.dGTT-usage+VRAM-usage,num-compilations+num-shaders-created;"\
"primitives-generated+draw-calls,samples-passed+ps-invocations+vs-invocations,buffer-wait-time;"\
"CS-thread-busy+gallium-thread-busy,dma-calls+cp-dma-calls,num-bytes-moved;"\
"num-vs-flushes+num-ps-flushes+num-cs-flushes+num-CB-cache-flushes+num-DB-cache-flushes"
You can check whether you have extra graphs with GALLIUM_HUD=help glxgears and add them too.
@iiv3 Thanks for the suggestions! I'll try them.
I had a similar slowdown caused by the fact that my distribution had an i586-optimized glibc that used a memcpy() with a hack for the original Pentium: it would read from the destination memory before overwriting it with real data.
What was the solution to this one? How do I check if this is the case?
You had two separate perf top logs that showed si_set_constant_buffer and amdgpu_mm_rreg as the top functions... but I could not quite understand what you had done differently to get one result or the other.
I honestly don't know, it is possible that I have messed up something in that WINEPREFIX. I will re-install the game in a clean prefix and use that for further testing.
Every frame, vertices must be sent to the graphics card for the draw calls. There are also the draw commands sent by the driver. Good games don't upload anything else.
As mentioned already, the vertices for SC2 are stored in a permanently mapped buffer. The allocation is done in GTT, i.e. graphics memory in CPU RAM. At draw time, the graphics card fetches the data there. The thing is, if that were the limiting factor, the GPU load would be higher: the GPU is running shaders while fetching this vertex data, and that should count as GPU running. Still, you can try using VRAM for these allocations (the size of the mappable VRAM is restricted to 64MB, but it should be enough here). Replace all PIPE_USAGE_STREAM in buffer9.c and nine_buffer_upload.c with PIPE_USAGE_DEFAULT. Possibly in your case it would be faster, as the data would make it to the GPU sooner.
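To make the suggested substitution concrete, here is a sketch of the kind of change meant (illustrative only; the real allocation sites in buffer9.c and nine_buffer_upload.c carry more context than this):

```c
#include "pipe/p_defines.h"  /* PIPE_USAGE_* */
#include "pipe/p_state.h"    /* struct pipe_resource */

/* PIPE_USAGE_STREAM  -> buffer lives in GTT (CPU RAM); the GPU pulls the
 *                       vertex data across PCIe at every draw.
 * PIPE_USAGE_DEFAULT -> buffer lives in (CPU-mappable) VRAM; draws read
 *                       it locally once the data has been uploaded. */
struct pipe_resource templ = {
    .usage = PIPE_USAGE_DEFAULT,   /* was: PIPE_USAGE_STREAM */
};
```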
As for the draw commands, there's not much to do.
Maybe, as iive suggested, you have wrong flags somewhere, which leads to suboptimal functions somewhere.
Okay, so I upgraded to Fedora 29, kernel 5.0-rc1 and wine staging 4.0-rc1. Then I deleted the wine prefix and reinstalled the game in a clean prefix.
If this doesn't help either, try entering "amdgpu_mm_rreg" in perf top and see which instructions are getting the most heat.
There is a mov instruction which takes ~96% of the time spent in that function.
You could try the radeontop program, which shows the utilization of specific GPU subsystems. It doesn't sound like the bottleneck is there, but it might give some useful hint.
Here is the output. I couldn't copy-paste the bars from the terminal, but the percentages are clearly visible.
Graphics pipe                  39,17%
Event Engine                    0,00%
Vertex Grouper + Tesselator    10,00%
Texture Addresser              24,17%
Shader Export                  31,67%
Sequencer Instruction Cache     8,33%
Shader Interpolator            33,33%
Scan Converter                 34,17%
Primitive Assembly             10,83%
Depth Block                    30,83%
Color Block                    30,00%
2815M / 3995M VRAM             70,44%
73M / 4093M GTT                 1,79%
1,75G / 1,75G Memory Clock    100,00%
1,24G / 1,24G Shader Clock    100,00%
You could also take a look at what options are available to you in R600_DEBUG
I took a look and tried a few options, but some of them caused a crash, while others caused a noticeable performance decrease, so I didn't get too far with it. Is there a specific option you want me to try?
I would add testing all working iommu types, in case you do need hardware support for it.
With iommu=off my wifi card is not recognized, which is a problem. With iommu=on intel_iommu=on I get roughly +5 fps, and the "gallium thread busy" line from GALLIUM_HUD goes to zero. So I guess that is a good thing at least.
If you are compiling your own mesa3d, try to use -O1 or -Og; it might give some hints. Actually, compiling with debug support would be a very good idea.
I just use the mesa that comes with Fedora. I'll also try the version from the "che" repo which is supposed to be a newer version with more optimizations enabled, will report back if it helps.
Also, I've tried to make a line that would show all the graphs on the screen; try it in case it provides some hint.
Here is a screenshot of the graphs with intel_iommu=on:
Oops, I accidentally hit "close and comment", sorry. Reopening it now.
@axeldavy Thanks! How come the samples passed and the ps/vs invocations are so low on your machine? Also, the primitives generated seem much lower than mine.
It depends on the scene and resolution.
Okay, I tried the patch suggested by @axeldavy and replaced PIPE_USAGE_STREAM with PIPE_USAGE_DEFAULT in the specified files. It definitely helps, but does not solve the problem.
Hotspots from perf top:
13,66% d3dadapter9.so.1.0.0 [.] si_set_constant_buffer
4,88% anonmap.WTz65d (deleted) [.] 0x000000000023f415
2,69% d3dadapter9.so.1.0.0 [.] NineDevice9_SetIndices
2,57% d3dadapter9.so.1.0.0 [.] amdgpu_do_add_real_buffer
2,19% d3dadapter9.so.1.0.0 [.] amdgpu_cs_add_buffer
1,30% dsound.dll.so [.] 0x000000000002997a
1,07% anonmap.WTz65d (deleted) [.] 0x000000000023f403
0,63% libc-2.28.so [.] __memmove_avx_unaligned_erms
0,59% anonmap.WTz65d (deleted) [.] 0x0000000001272ed6
0,58% dsound.dll.so [.] 0x00000000000298e1
Here is the hot spot from the annotated si_set_constant_buffer:
      │ si_descriptors.c:1212
 0,01 │ mov (%rsi),%rax
 0,00 │ lea (%rax,%r14,8),%rdx
      │ ../../../../src/gallium/auxiliary/util/u_inlines.h:139
 0,07 │ mov (%rdx),%rsi
      │ ../../../../src/gallium/auxiliary/util/u_inlines.h:79
 0,01 │ cmp %rsi,%r8
      │ ↓ je 8c
      │ ../../../../src/gallium/auxiliary/util/u_inlines.h:87
 0,00 │ test %rsi,%rsi
      │ ↓ je 8c
      │ ../../../../src/gallium/auxiliary/util/u_inlines.h:89
98,08 │ lock subl $0x1,(%rsi)
 1,16 │ ↓ jne 8c
      │ ../../../../src/gallium/auxiliary/util/u_inlines.h:146
 0,00 │ 4d: mov 0x20(%rsi),%rax
      │ ../../../../src/gallium/auxiliary/util/u_inlines.h:148
      │ mov 0x28(%rsi),%rcx
      │ mov %rdx,0x8(%rsp)
Hotspots according to perf top:
16,67% radeonsi_dri.so      [.] u_upload_alloc
 3,12% radeonsi_dri.so      [.] cso_set_vertex_buffers
 2,27% radeonsi_dri.so      [.] tc_call_draw_vbo
 1,84% radeonsi_dri.so      [.] amdgpu_cs_add_buffer
 1,68% libc-2.28.so         [.] __memmove_avx_unaligned_erms
 1,53% radeonsi_dri.so      [.] amdgpu_do_add_real_buffer
 1,25% wined3d.dll.so       [.] 0x000000000005a390
 1,20% wined3d.dll.so       [.] 0x000000000005b5e4
 1,17% wined3d.dll.so       [.] 0x000000000005b870
 0,90% libpthread-2.28.so   [.] __pthread_mutex_lock
 0,79% ntdll.dll.so         [.] RtlEnterCriticalSection
 0,73% wined3d.dll.so       [.] 0x000000000005a3af
 0,66% wined3d.dll.so       [.] 0x000000000005a39a
 0,64% wined3d.dll.so       [.] 0x000000000005b52d
 0,63% wined3d.dll.so       [.] 0x000000000005b5d0
 0,63% wined3d.dll.so       [.] 0x000000000005b670
 0,61% wined3d.dll.so       [.] 0x000000000005b4a6
 0,61% ntdll.dll.so         [.] RtlLeaveCriticalSection
 0,59% [amdgpu]             [k] amdgpu_mm_rreg
 0,57% wined3d.dll.so       [.] 0x000000000005aad0
 0,57% libpthread-2.28.so   [.] __pthread_mutex_unlock_usercnt
 0,53% wined3d.dll.so       [.] 0x000000000005b5ed
 0,51% radeonsi_dri.so      [.] _mesa_unmarshal_dispatch_cmd
 0,50% wined3d.dll.so       [.] 0x000000000005aabd
Which means that wined3d+PBA still beats nine by a dozen or so FPS. What stands out to me is:
Further notes:
The mesa.spec and the whole x86_64 mesa package were rebuilt, but I only installed the mesa-d3d packages from the build output. (I assume the change doesn't affect the rest of the packages.)
Where do we go from here?
Here is what lspci -vvv has to say about the GPU:
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef) (prog-if 00 [VGA controller])
Subsystem: Sapphire Technology Limited Device e343
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 128 bytes
Interrupt: pin A routed to IRQ 156
Region 0: Memory at 60000000 (64-bit, prefetchable) [size=256M]
Region 2: Memory at 70000000 (64-bit, prefetchable) [size=2M]
Region 4: I/O ports at 2000 [size=256]
Region 5: Memory at ac000000 (32-bit, non-prefetchable) [size=256K]
Expansion ROM at ac040000 [disabled] [size=128K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (ok), Width x4 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee006b8 Data: 0000
Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [200 v1] Resizable BAR <?>
Capabilities: [270 v1] Secondary PCI Express <?>
Capabilities: [2b0 v1] Address Translation Service (ATS)
ATSCap: Invalidate Queue Depth: 00
ATSCtl: Enable-, Smallest Translation Unit: 00
Capabilities: [2c0 v1] Page Request Interface (PRI)
PRICtl: Enable- Reset-
PRISta: RF- UPRGI- Stopped+
Page Request Capacity: 00000020, Page Request Allocation: 00000000
Capabilities: [2d0 v1] Process Address Space ID (PASID)
PASIDCap: Exec+ Priv+, Max PASID Width: 10
PASIDCtl: Enable- Exec- Priv-
Capabilities: [320 v1] Latency Tolerance Reporting
Max snoop latency: 71680ns
Max no snoop latency: 71680ns
Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [370 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=0us PortTPowerOnTime=170us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=0ns
L1SubCtl2: T_PwrOn=10us
Kernel driver in use: amdgpu
Kernel modules: amdgpu
Here is what amdgpu's sysfs has to say:
[root@timur-xps ~]# cat /sys/class/drm/card1/device/pp_dpm_pcie
0: 2.5GT/s, x8
1: 8.0GT/s, x16 *
Reducing the bandwidth using echo 0 > /sys/class/drm/card1/device/pp_dpm_pcie did not seem to have any effect on the frame rate, but I'm honestly not sure how reliable pp_dpm_pcie is. Or maybe the TB3 bottleneck is more severe than the restriction that it can impose.
Trying to determine the available bandwidth using clpeak is not possible due to the missing PCI-E atomics support:
kfd kfd: skipped device 1002:67df, PCI rejects atomics
However, this egpu.io page states the following: "Intel throttles 32Gbps-TB3 to 22Gbps which benchmarks as 20.40Gbps (AMD) or 18.91Gbps (Nvidia) in the important H2D direction."
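For context, some back-of-the-envelope arithmetic (my numbers, assuming standard PCIe 3.0 signaling, not taken from that page): a native x4 link is 4 lanes × 8 GT/s × 128/130 encoding ≈ 31.5 Gbps ≈ 3.9 GB/s, so the throttled 20.40 Gbps (≈ 2.5 GB/s) that TB3 benchmarks at is only about two thirds of what a real x4 slot can move.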
I took a look at the hot spots for Nine in perf top and investigated a bit:

si_set_constant_buffer -> pipe_resource_reference -> pipe_reference_described -> p_atomic_dec_zero

which is a call to either __atomic_sub_fetch or __sync_sub_and_fetch (preferring the __atomic version if available).

Looking at the other items like NineDevice9_SetIndices and amdgpu_do_add_real_buffer, all of them wait for a similar synchronization primitive.
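For reference, p_atomic_dec_zero expands to roughly the following (simplified from mesa's src/util/u_atomic.h; the exact macro selection varies between mesa versions):

```c
/* Simplified from mesa's src/util/u_atomic.h; the exact selection logic
 * varies between versions. This atomic decrement is what compiles to the
 * "lock subl $0x1,(%rsi)" seen in the annotation above: the lock prefix
 * serializes access to the cache line, so a contended or cache-cold
 * refcount can easily dominate a profile. */
#ifdef __ATOMIC_ACQ_REL
#  define p_atomic_dec_zero(v) (__atomic_sub_fetch((v), 1, __ATOMIC_ACQ_REL) == 0)
#else
#  define p_atomic_dec_zero(v) (__sync_sub_and_fetch((v), 1) == 0)
#endif
```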
Cannot reproduce here either.
Without vsync: (screenshot) and with vsync: (screenshot)
@dhewg @axeldavy Which distro do you guys use? Just asking because I might just try running that distro as an experiment. Maybe there is some package that is better optimized there than here, and maybe that makes the difference. Or maybe mesa itself is compiled with different flags?
I'm on Debian, @axeldavy on Arch iirc.
I built mesa myself though, using CFLAGS="-march=native" CXXFLAGS="-march=native" meson ~/src/mesa --prefix /opt/andre/mesa -Ddri-drivers= -Dgallium-drivers=radeonsi,swrast -Dvulkan-drivers=amd -Dgallium-nine=true -Dosmesa=gallium
Okay, so @axeldavy suggested an approach that helps find out whether the constant data is the bottleneck or not:
08:20 mannerov: in nine_state.c
08:20 mannerov: prepare_vs_constants_userbuf
08:20 mannerov: when cb.buffer_size is set
08:21 mannerov: set it to the maximum value instead, that is sizeof(float[4]) * NINE_MAX_CONST_ALL
08:21 mannerov: if it doesn't decrease fps, it is not limiting
Here is how the patch file looks:
From debd5413b1d14a28f26ae40c1d907df621044c8a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Timur=20Krist=C3=B3f?= <venemo@msn.com>
Date: Fri, 18 Jan 2019 14:38:49 +0100
Subject: [PATCH] Change prepare_vs_constants_userbuf to always use the maximum
size.
---
src/gallium/state_trackers/nine/nine_state.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/gallium/state_trackers/nine/nine_state.c b/src/gallium/state_trackers/nine/nine_state.c
index bf39ebc9b4..209518ac2e 100644
--- a/src/gallium/state_trackers/nine/nine_state.c
+++ b/src/gallium/state_trackers/nine/nine_state.c
@@ -431,7 +431,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
struct pipe_constant_buffer cb;
cb.buffer = NULL;
cb.buffer_offset = 0;
- cb.buffer_size = context->vs->const_used_size;
+ cb.buffer_size = sizeof(float[4]) * NINE_MAX_CONST_ALL;
cb.user_buffer = context->vs_const_f;
if (context->swvp) {
--
2.20.1
When I run the game with this patch, perf top now looks like this:
17,21% d3dadapter9.so.1.0.0 [.] si_decompress_textures
12,59% d3dadapter9.so.1.0.0 [.] si_set_constant_buffer
4,72% d3dadapter9.so.1.0.0 [.] nine_context_draw_indexed_primitive_priv
1,58% d3dadapter9.so.1.0.0 [.] amdgpu_do_add_real_buffer
1,31% dsound.dll.so [.] 0x000000000002997a
1,06% d3dadapter9.so.1.0.0 [.] amdgpu_cs_add_buffer
0,74% anonmap.qaDqCx (deleted) [.] 0x000000000023f415
0,65% d3dadapter9.so.1.0.0 [.] si_set_active_descriptors_for_shader
0,55% dsound.dll.so [.] 0x00000000000298e1
0,55% libc-2.28.so [.] __memmove_avx_unaligned_erms
0,54% d3dadapter9.so.1.0.0 [.] amdgpu_add_fence_dependencies_bo_list
0,49% d3dadapter9.so.1.0.0 [.] si_bind_vs_shader
0,49% [unknown] [.] 0000000000000000
0,45% d3dadapter9.so.1.0.0 [.] nine_update_state
0,44% d3dadapter9.so.1.0.0 [.] amdgpu_lookup_buffer
0,42% dsound.dll.so [.] 0x0000000000029971
Interesting how si_decompress_textures now appears in there. I also annotated it:
      │ si_blit.c:780
 1,71 │ 18: push %r15
44,26 │ push %r14
 7,00 │ push %r13
30,18 │ push %r12
16,11 │ push %rbp
 0,02 │ push %rbx
      │ mov %rdi,%rbx
 0,01 │ sub $0x28,%rsp
@dhewg Is there anything in here that stands out to you? https://src.fedoraproject.org/rpms/mesa/blob/f29/f/mesa.spec#_365
I also tried the other suggestion by @axeldavy; here are the results.
23:43 mannerov: Venemo_j: another thing you could try is in si_pipe.c in the const_uploader = u_upload_create call to replace DEFAULT by STREAM
Here is what I experience when I run the game with this: now it's just amdgpu_mm_rreg that is still there in perf top.
This is the patch:
From 7025c93aca84713c25b3a73ff9e2db91493f217c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Timur=20Krist=C3=B3f?= <venemo@msn.com>
Date: Fri, 18 Jan 2019 14:38:49 +0100
Subject: [PATCH] Change si_create_context to use STREAM for const_uploader.
---
src/gallium/drivers/radeonsi/si_pipe.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 6b36893698..ce77a4fd4c 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -409,7 +409,7 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
goto fail;
sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
- 0, PIPE_USAGE_DEFAULT,
+ 0, PIPE_USAGE_STREAM,
SI_RESOURCE_FLAG_32BIT |
(sscreen->cpdma_prefetch_writes_memory ?
0 : SI_RESOURCE_FLAG_READ_ONLY));
--
2.20.1
perf top result:
12,26% [amdgpu] [k] amdgpu_mm_rreg
1,98% dsound.dll.so [.] 0x000000000002997a
0,92% d3dadapter9.so.1.0.0 [.] amdgpu_cs_add_buffer
0,81% dsound.dll.so [.] 0x00000000000298e1
0,78% libc-2.28.so [.] __memmove_avx_unaligned_erms
0,65% [kernel] [k] update_blocked_averages
0,64% anonmap.8QZ4Mi (deleted) [.] 0x0000000001c16777
0,63% dsound.dll.so [.] 0x0000000000029971
0,60% dsound.dll.so [.] 0x00000000000298c5
0,60% d3dadapter9.so.1.0.0 [.] amdgpu_lookup_buffer
0,57% d3dadapter9.so.1.0.0 [.] nine_context_set_texture
0,54% [kernel] [k] psi_task_change
0,50% d3dadapter9.so.1.0.0 [.] NineDevice9_SetTexture
OOOPS, looks like I forgot to revert the other patch before applying this one.
Will do another test with just the si_pipe patch.
Cannot reproduce here either
- Xeon E3-1231 v3 @ 3.40GHz
- R9 380X
- Linux 4.19.13
- mesa master from ~2 weeks ago
- wine 4.0~rc6
- sc2 set to 1080p with ultra gfx settings
@Venemo's setup involves Thunderbolt to run the card outside the laptop. This limits it to PCIe x4 speeds.
To recreate his setup you should at least move your video card to a PCIe x4 slot. Most motherboards have only 1-2 PCIe x16 slots, and the rest are x4, even though they use the full x16 connector.
You may try it if this is not going to void your PC's warranty.
Rebuilt mesa with the correct patch and updated my last comment.
@Venemo: No idea about rpm spec stuff, but nothing stands out, I guess. I just tried against the mesa packages from Debian, and it looks like the fps is roughly the same on sc2.
While the si_pipe patch does help a lot, it's not perfect. Even though Gallium says that the GPU utilization is 99%, radeontop disagrees and says the utilization is ~50% (up from the previous ~30%).
@dhewg, to reproduce the issue you need to move the video card to a PCIe x4 slot.
As I said in my previous comment, a lot of motherboards have full-sized slots that are actually x4.
@iiv3 yeah, I get that it's a bottleneck that I don't run into. Just confirming that it's not a general issue
I ported the change to r600_pipe_common.c::706 and ran a benchmark.
Without the change l4d2 got 83,55 and 83,72 fps.
After the change it got 84,27 and 83,92 fps.
I'd say that it is a consistent improvement, even though 0.5% is around the margin of error.
I get better performance with the constant compacting code, even though it's still not 100% usage.
With vsync disabled in-game, and with thread_submit=true as well as tearfree_discard=true, my GPU load is maxed out. That is with upstream mesa, so none of those new patches.
I built the latest master (as of today) from the ixit repo and ran SC2 with the mesa built from there. The results are definitely an improvement, but not as amazing as the radeonsi patch that I tried earlier.
I also tried combining this with the patch that replaces all PIPE_USAGE_STREAM in buffer9.c and nine_buffer_upload.c with PIPE_USAGE_DEFAULT.
Investigated a bit further. I'm not 100% sure that the bottleneck is VS constants. I made the following changes:
- In si_pipe.c, in si_create_context, added the SI_RESOURCE_FLAG_32BIT flag to stream_uploader, which allows it to receive shader constants.
- In nine_state.c, in prepare_vs_constants_userbuf, changed the u_upload_alloc and u_upload_unmap calls to use the stream_uploader.
This is a huge hack (the stream uploader is not supposed to handle shader constants), but it yields a huge improvement over everything else we've tried so far. It now reaches ~95 fps in the same scene. Note that this is even better than the version where I set all constants to use the stream uploader (which was ~80 fps).
Conclusion: something about the vertex shader constants in SC2 causes a bottleneck in nine when the PCI-E bandwidth is limited.
Can you try to switch const_uploader from DEFAULT to STREAM? If that's not enough, can you increase the size from 128 * 1024 to 1024 * 1024?
@marekolsak Already tried that, you can find the exact results in the comments above. As noted, setting just the VS constants to use the stream uploader yields better performance than setting the const uploader to PIPE_USAGE_STREAM.
I'd like to identify the reason why stream_uploader is faster.
It's possible that const_uploader needs to use a bigger size.
@marekolsak
I've run a series of tests for you. I'm posting the patches here too. All of these were applied on top of the ixit/master branch. In every case the same SC2 replay was used and I looked at the FPS counter at the same moment during the replay.
It's possible that const_uploader needs to use a bigger size.
The maximum size of the buffer that comes from prepare_vs_constants_userbuf is 4096 bytes: the maximum number of shader constants in nine is 256, and each of them has a maximum size of 4 * sizeof(float) = 16 bytes, so 256 * 16 = 4096. To confirm, I added an assert(cb.buffer_size <= 4096); below the u_upload_alloc call, and it is never bigger than that.
I'd like to identify the reason why stream_uploader is faster.
@axeldavy has a better understanding than me, maybe he can chime in with a better explanation, but I think it isn't really faster; this is just a symptom of an inefficiency somewhere in the code. For example, if there are a lot of unused shader constants, then the const_uploader would needlessly upload all of them, but the stream_uploader would not (as far as I understand).
In fact, changing everything to use STREAM is slower than changing just the VS constants to use STREAM.
The following patch gives ~50 fps, so no noticeable improvement.
diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 92ce1e0699c..8c610152942 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -434,7 +434,7 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
if (!sctx->b.stream_uploader)
goto fail;
- sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
+ sctx->b.const_uploader = u_upload_create(&sctx->b, 1024 * 1024,
0, PIPE_USAGE_DEFAULT,
SI_RESOURCE_FLAG_32BIT |
(sscreen->cpdma_prefetch_writes_memory ?
Gives ~75-80 fps.
diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 92ce1e0699c..939bf31e46f 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -435,7 +435,7 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
goto fail;
sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
- 0, PIPE_USAGE_DEFAULT,
+ 0, PIPE_USAGE_STREAM,
SI_RESOURCE_FLAG_32BIT |
(sscreen->cpdma_prefetch_writes_memory ?
0 : SI_RESOURCE_FLAG_READ_ONLY));
Gives ~75-80 fps, so the same as before.
diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 92ce1e0699c..1ede2de6580 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -434,8 +434,8 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
if (!sctx->b.stream_uploader)
goto fail;
- sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
- 0, PIPE_USAGE_DEFAULT,
+ sctx->b.const_uploader = u_upload_create(&sctx->b, 1024 * 1024,
+ 0, PIPE_USAGE_STREAM,
SI_RESOURCE_FLAG_32BIT |
(sscreen->cpdma_prefetch_writes_memory ?
0 : SI_RESOURCE_FLAG_READ_ONLY));
This one yields ~90 fps.
diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 92ce1e0699c..ac001ddd7b7 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -430,7 +430,7 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
sctx->b.stream_uploader = u_upload_create(&sctx->b, 1024 * 1024,
0, PIPE_USAGE_STREAM,
- SI_RESOURCE_FLAG_READ_ONLY);
+ SI_RESOURCE_FLAG_32BIT | SI_RESOURCE_FLAG_READ_ONLY);
if (!sctx->b.stream_uploader)
goto fail;
diff --git a/src/gallium/state_trackers/nine/nine_state.c b/src/gallium/state_trackers/nine/nine_state.c
index 40c6ed10446..34d7720f4cb 100644
--- a/src/gallium/state_trackers/nine/nine_state.c
+++ b/src/gallium/state_trackers/nine/nine_state.c
@@ -485,7 +485,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
* to have efficient WC. Thus for this case we really want
* that intermediate buffer. */
- u_upload_alloc(context->pipe->const_uploader,
+ u_upload_alloc(context->pipe->stream_uploader,
0,
cb.buffer_size,
256, /* Be conservative about alignment */
@@ -509,7 +509,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
}
}
- u_upload_unmap(context->pipe->const_uploader);
+ u_upload_unmap(context->pipe->stream_uploader);
cb.user_buffer = NULL;
/* Free previous resource */
This one yields ~80-85 fps, so worse than using just the VS constants with the stream uploader.
diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 92ce1e0699c..9056268cd53 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -430,11 +430,11 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
sctx->b.stream_uploader = u_upload_create(&sctx->b, 1024 * 1024,
0, PIPE_USAGE_STREAM,
- SI_RESOURCE_FLAG_READ_ONLY);
+ SI_RESOURCE_FLAG_32BIT | SI_RESOURCE_FLAG_READ_ONLY);
if (!sctx->b.stream_uploader)
goto fail;
- sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
+ sctx->b.const_uploader = u_upload_create(&sctx->b, 4 * 1024,
0, PIPE_USAGE_DEFAULT,
SI_RESOURCE_FLAG_32BIT |
(sscreen->cpdma_prefetch_writes_memory ?
diff --git a/src/gallium/state_trackers/nine/nine_state.c b/src/gallium/state_trackers/nine/nine_state.c
index 40c6ed10446..1bebdf440a5 100644
--- a/src/gallium/state_trackers/nine/nine_state.c
+++ b/src/gallium/state_trackers/nine/nine_state.c
@@ -485,7 +485,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
* to have efficient WC. Thus for this case we really want
* that intermediate buffer. */
- u_upload_alloc(context->pipe->const_uploader,
+ u_upload_alloc(context->pipe->stream_uploader,
0,
cb.buffer_size,
256, /* Be conservative about alignment */
@@ -494,6 +494,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
(void**)&upload_ptr);
assert(cb.buffer && upload_ptr);
+ assert(cb.buffer_size <= 4096);
if (!context->cso_shader.vs_const_ranges) {
memcpy(upload_ptr, cb.user_buffer, cb.buffer_size);
@@ -509,7 +510,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
}
}
- u_upload_unmap(context->pipe->const_uploader);
+ u_upload_unmap(context->pipe->stream_uploader);
cb.user_buffer = NULL;
/* Free previous resource */
@@ -573,7 +574,7 @@ prepare_ps_constants_userbuf(struct NineDevice9 *device)
if (!cb.buffer_size)
return;
- u_upload_alloc(context->pipe->const_uploader,
+ u_upload_alloc(context->pipe->stream_uploader,
0,
cb.buffer_size,
256, /* Be conservative about alignment */
@@ -597,7 +598,7 @@ prepare_ps_constants_userbuf(struct NineDevice9 *device)
}
}
- u_upload_unmap(context->pipe->const_uploader);
+ u_upload_unmap(context->pipe->stream_uploader);
cb.user_buffer = NULL;
/* Free previous resource */
Just a few pieces of information:
- If a shader is using indirect addressing, loops and/or booleans, you can have slightly more than 256 constants. I guess that doesn't happen for this game.
- With this git tree, there are patches to compress the constant ranges, so nine only uploads what is going to be used by the shader, except in the case of constant indirect addressing (which can only happen with vs), where we have to upload the 256 float slots. It's likely not all of them end up used, which may be why STREAM would perform better (because the transfer via the PCI bus would only be for the used constants). That doesn't explain why the stream_uploader would be faster.
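A sketch of the idea behind that constant-range compaction (my illustration, not the actual patches in that tree):

```c
#include <stdint.h>
#include <string.h>

/* Illustration of constant-range compaction, not the actual ixit code:
 * instead of uploading every slot up to the highest one used, copy only
 * the ranges the shader is known to read. With indirect addressing no
 * such range information exists, so all 256 float[4] slots go over the
 * bus. */
struct const_range { uint16_t first, count; };  /* in float[4] slots */

static unsigned
compact_constants(float *dst, const float *src,
                  const struct const_range *ranges, unsigned num_ranges)
{
    unsigned written = 0;  /* float[4] slots written so far */
    for (unsigned i = 0; i < num_ranges; i++) {
        memcpy(dst + written * 4,
               src + ranges[i].first * 4,
               ranges[i].count * sizeof(float[4]));
        written += ranges[i].count;
    }
    return written * sizeof(float[4]);  /* bytes actually uploaded */
}
```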
@axeldavy Is it possible to somehow check if the game uses indirect addressing? Just to confirm or rule out that possibility.
The game uses indirect addressing. I don't know the ratio of 'constants sent because the shader uses indirect addressing' to 'all constants sent', but given that a shader usually needs fewer than 25 float[4] slots, even if only a few shaders use indirect addressing, the ratio can be high.
Can we somehow eliminate the indirect addressing?
Thanks. I think the problem with DEFAULT is that more buffers are competing for the visible VRAM window, which makes the memory manager move buffers more often.
If you use STREAM too much, you might decrease GPU performance, because PCIe is slower.
The same game has good performance on PCIe x16 but not on Thunderbolt. It seems surprising to me that the culprit for the performance would be the memory manager moving the buffers in VRAM, as otherwise PCIe x16 would be affected as well. Except maybe if the buffers get moved to GTT? That would explain a lot.
Can you test this branch? https://cgit.freedesktop.org/~mareko/mesa/log/?h=const-uploads-sdma
@marekolsak Wow, thank you. I merged your branch with the ixit master, compiled a release build and ran the game with SDMA=true. Now in the same scene I'm getting ~85-90 fps. GPU usage is 98% according to GALLIUM_HUD and 95% according to radeontop.
perf top result:
9,56% [amdgpu] [k] amdgpu_mm_rreg
1,12% libc-2.28.so [.] __memmove_avx_unaligned_erms
0,99% d3dadapter9.so.1.0.0 [.] nine_update_state
0,86% d3dadapter9.so.1.0.0 [.] amdgpu_lookup_or_add_slab_buffer
0,84% d3dadapter9.so.1.0.0 [.] amdgpu_cs_add_buffer
0,68% [kernel] [k] update_blocked_averages
0,68% d3dadapter9.so.1.0.0 [.] amdgpu_add_fence_dependencies_bo_list
0,61% libc-2.28.so [.] __memcmp_avx2_movbe
0,45% d3dadapter9.so.1.0.0 [.] nine_context_set_texture
0,40% [kernel] [k] psi_task_change
0,37% d3dadapter9.so.1.0.0 [.] si_emit_spi_map
0,36% d3dadapter9.so.1.0.0 [.] NineDevice9_SetTexture
Some more experiments that I tried:
Thanks for adding SDMA=true; I forgot to tell you about it.
I've updated the branch with a performance fix. It should now reach 100% GPU load: https://cgit.freedesktop.org/~mareko/mesa/log/?h=const-uploads-sdma
@marekolsak, would it be possible to implement the same changes for the r600 driver?
@marekolsak I've run another series of tests for you.
Branch | FPS | GTT | VRAM | GPU use | hotspot
---|---|---|---|---|---
ixit/master | 55 | 60 MB | 1.7 GB | 45-50% | 25% nine_update_state, 20% amdgpu_mm_reg
mareko/const-uploads-sdma | 105 | 70 MB | 1.7 GB | 100% | 10% amdgpu_mm_reg
merged | 105-110 | 70 MB | 1.7 GB | 100% | 10% amdgpu_mm_reg
merged + this patch | 110 | 45 MB | 1.7 GB | 100% | 10% amdgpu_mm_reg
Conclusions:
- amdgpu_mm_reg is now the hot spot according to perf top. Would be interesting to find out why.

@marekolsak, would it be possible to implement the same changes for the r600 driver?
It would be possible, but the performance improvement wouldn't be as high, because the DMA engine is not asynchronous on radeon. For r600, I recommend changing const_uploader from DEFAULT to STREAM.
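A sketch of what that r600 change might look like, modeled on the si_pipe.c patches above (the exact u_upload_create call site and arguments in r600_pipe_common.c may differ between mesa versions):

```c
/* Sketch only; treat the size and flag arguments as placeholders and
 * keep whatever the r600 call site already passes. The suggested change
 * is just the usage parameter: */
rctx->b.const_uploader = u_upload_create(&rctx->b, 128 * 1024, 0,
                                         PIPE_USAGE_STREAM, /* was: PIPE_USAGE_DEFAULT */
                                         0);
```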
@Venemo My series decreases performance for trivial benchmarks by 20%, so I need some amdgpu kernel changes to get the performance back.
Surprising. Any idea what bad kernel behaviour is causing the performance drop?
Surprising. Any idea what bad kernel behaviour is causing the performance drop?
We submit 2 CS ioctls. We need to merge them into 1.
@marekolsak Can you shed some light on what amdgpu_mm_reg is and why it is the hot spot? As far as I understand it has to do with memory-mapped registers, but I couldn't quite figure out why it is eating 10% of the CPU time in the game.
amdgpu_mm_reg is used for getting the GPU load.
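In other words (my simplified reading of the driver, with illustrative register names; the real code differs per ASIC): sampling the GPU load means periodically reading a busy bit from a status register through the MMIO helper, and each such read is an uncached round-trip across the PCIe/Thunderbolt link:

```c
/* Illustrative sketch, not the actual amdgpu code: the load sensor
 * samples a status register via the MMIO read helper. Every read is an
 * uncached transaction over the bus, which is expensive behind TB3. */
static bool gpu_is_busy(struct amdgpu_device *adev)
{
    uint32_t status = amdgpu_mm_rreg(adev, mmGRBM_STATUS, 0);
    return (status & GRBM_STATUS__GUI_ACTIVE_MASK) != 0;
}
```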
Long story short, I observed a scenario where I get low frame rates even though CPU and GPU utilization are low. This is a continuation of this issue where we started discussing a performance problem with Starcraft 2 when running on Gallium Nine standalone.
Hardware details: the system is a Dell XPS 13 9370 with a 4-core / 8-thread Intel 8550U, a Sapphire AMD Radeon RX 570 ITX card is connected through Thunderbolt 3 (which is roughly equivalent to a x4 PCI-E connection), and a 4K Dell U2718Q screen is connected to the RX 570's DisplayPort output.
I didn't measure the frame rates in a scientific way, but the difference is very noticeable:
Some observations from the other thread that are worth mentioning here:
- According to GALLIUM_HUD, CPU utilization is around 30% and GPU utilization is around 50%.
- According to perf top, the majority of time is spent in functions like si_set_constant_buffer, amdgpu_mm_rreg and NineDevice9_SetIndices, from which I got the impression that some buffers are copied to/from the GPU and that may be a problem.

@axeldavy
That does sound awesome, and makes me think that the problem may be with my setup, or that maybe my setup triggers some sort of corner case within Nine. After thinking about it more, I've got three ideas:

Idea 1. Maybe there's something wrong with my kernel command line? Basically what I do is disable all the spectre/meltdown mitigations, enable the PowerPlay features in amdgpu and blacklist the i915 module.

Idea 2. Maybe somehow Nine performs more IO through PCI-E than wined3d and the Thunderbolt 3 port is the bottleneck. Is it a possibility that there are some operations within Nine that are not noticeable when using a PCI-E x16 connection but become problematic when running over PCI-E x4? I don't know how to verify this theory, but maybe there is a way to check the PCI-E port usage.

Idea 3. It occurs to me that I installed DirectX 9 from winetricks in this Wine prefix before I started using Nine. Is it possible that this interferes with Nine somehow?

@iiv3
No, I've got 4.19 from Fedora:
4.19.13-300.fc29.x86_64
Good idea, I will try that.
Sure, I'll take a look, though I'm afraid I'm not an expert in perf.
Sorry for the long post!