iXit / Mesa-3D

Please use official https://gitlab.freedesktop.org/mesa/mesa/ !
https://github.com/iXit/Mesa-3D/wiki

Low frame rates on Starcraft 2 while CPU and GPU utilization is low #333

Open Venemo opened 5 years ago

Venemo commented 5 years ago

Long story short, I observed a scenario where I get low frame rates even though the CPU and GPU utilization was low. This is a continuation of this issue where we started discussing a performance problem with Starcraft 2 when running on Gallium Nine standalone.

Hardware details: the system is a Dell XPS 13 9370 with a 4-core / 8-thread Intel 8550U, a Sapphire AMD Radeon RX 570 ITX card is connected through Thunderbolt 3 (which is roughly equivalent to a x4 PCI-E connection), and a 4K Dell U2718Q screen is connected to the RX 570's DisplayPort output.

I didn't measure the frame rates in a scientific way, but the difference is very noticeable:

Some observations from the other thread that are worth mentioning here:

@axeldavy

On the first campaign, after the videos, I get about 140 fps at full HD, everything maxed out on my Radeon RX 480. The GPU load is about 90%. My 4 CPU threads are all around 60-70%. I used GALLIUM_HUD=GPU-load,fps,cpu0+cpu1+cpu2+cpu3,shader-clock+memory-clock+temperature tearfree_discard=true WINEDEBUG=-all

That does sound awesome, and makes me think that the problem may be with my setup, or that maybe my setup triggers some sort of corner case within Nine. After thinking about it more, I've got three ideas:

Idea 1. Maybe there's something wrong with my kernel command line? Basically what I do is disable all the Spectre/Meltdown mitigations, enable the PowerPlay features in amdgpu, and blacklist the i915 module.

resume=UUID=0dc28d3d-cf9a-4a1d-b980-e7f78ad7aaee rd.luks.uuid=luks-98f2c2d3-77e1-444a-b3f2-d3396b53e16e rhgb quiet mem_sleep_default=deep pti=off spectre_v2=off l1tf=off nospec_store_bypass_disable no_stf_barrier amdgpu.ppfeaturemask=0xffffffff i915.enable_guc=0 module_blacklist=i915 3

Idea 2. Maybe somehow Nine performs more IO through PCI-E than wined3d and the Thunderbolt 3 port is the bottleneck. Is it possible that there are some operations within Nine that are not noticeable when using a PCI-E x16 connection but become problematic when running over PCI-E x4? I don't know how to verify this theory, but maybe there is a way to check the PCI-E port usage.

Idea 3. It occurs to me that I installed DirectX 9 from winetricks in this Wine prefix before I started using Nine. Is it possible that this interferes with Nine somehow?

@iiv3

do you compile your own kernel? If so, you might have some additional tracing enabled.

No, I've got 4.19 from Fedora: 4.19.13-300.fc29.x86_64

If not... Could you try to disable "iommu=off", in case you are hitting some "soft" emulation that involves page faults.

Good idea, I will try that.

If this doesn't help either, try entering the "amdgpu_mm_rreg" in perf top and see which instructions are getting most heat.

Sure, I'll take a look, though I'm afraid I'm not an expert in perf.

Sorry for the long post!

iiv3 commented 5 years ago

Actually, most motherboards have very few PCIe x16 slots, usually one or two. While there might be more full-sized PCIe slots, they actually work as x4 ones. So maybe @axeldavy could try and test his video card in an x4 PCIe slot.

@Venemo, you could try the radeontop program, which shows the utilization of specific GPU subsystems. It doesn't sound like the bottleneck is there, but it might give some useful hints.
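For example (just a sketch; the exact flags may differ between radeontop versions):

radeontop
radeontop -d radeontop.log -l 60

The first form gives the interactive per-block view; the second dumps samples to a file so you can read them back after the game run.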

You could also take a look at what options are available to you in R600_DEBUG; despite its name it works on the latest Radeons too. Run R600_DEBUG=help glxgears and you will get all the available options (some drivers have more). Try everything that is not about dumping stuff to the console, one option at a time.
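Roughly (the option names vary with the driver and Mesa version, and the game command line below is only illustrative):

R600_DEBUG=help glxgears
R600_DEBUG=<option> wine SC2.exe

i.e. list what the build supports, then launch the game with one non-dump option at a time and compare the fps.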

Thunderbolt 3 definitely complicates things. I would add testing all working IOMMU types, in case you do need hardware support for it.

I had a similar slowdown caused by the fact that my distribution had an i586-optimized glibc that used a memcpy() with a hack for the original Pentium. It would read from the destination memory before overwriting it with real data. When used over my PCIe x16 it would get something like 25% CPU in perf top and 50 fps instead of 70 fps (benchmark avg).

This reminds me... You had two separate perf top logs that showed si_set_constant_buffer and amdgpu_mm_rreg as the top functions... but I could not quite understand what you had done differently to get one or the other result. It's not compilation or inlining, as one is from mesa3d and the other is from the kernel.

If you are compiling your own mesa3d, try -O1 or -Og; it might give some hints. Actually, compiling with debug support would be a very good idea. When you run perf top you can select a function with the arrow keys. Press Enter once, then press it a second time to enter Annotate mode. In that mode it shows how hot each instruction in that function is. If you have debug support it might even show you the source.
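A rough sketch of that workflow, assuming a Meson-based Mesa checkout (the build options are illustrative, not a recommendation):

meson setup build -Dbuildtype=debugoptimized -Dgallium-drivers=radeonsi -Dgallium-nine=true
ninja -C build
sudo perf top

Then, while the game is running, select the hot function with the arrow keys and press Enter twice to get the per-instruction annotation.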

Also, I've tried to make a line that would show all graphs on the screen, try it in case it provides some hint.

export GALLIUM_HUD=\
".dfps+cpu+GPU-load+temperature,.dGTT-usage+VRAM-usage,num-compilations+num-shaders-created;"\
"primitives-generated+draw-calls,samples-passed+ps-invocations+vs-invocations,buffer-wait-time;"\
"CS-thread-busy+gallium-thread-busy,dma-calls+cp-dma-calls,num-bytes-moved;"\
"num-vs-flushes+num-ps-flushes+num-cs-flushes+num-CB-cache-flushes+num-DB-cache-flushes"

You can check whether you have any extra graphs with GALLIUM_HUD=help glxgears and add them too.

Venemo commented 5 years ago

@iiv3 Thanks for the suggestions! I'll try them.

I had a similar slowdown caused by the fact that my distribution had an i586-optimized glibc that used a memcpy() with a hack for the original Pentium. It would read from the destination memory before overwriting it with real data.

What was the solution to this one? How do I check if this is the case?

You had two separate perf top logs that showed si_set_constant_buffer and amdgpu_mm_rreg as the top functions... but I could not quite understand what you had done differently to get one or the other result.

I honestly don't know, it is possible that I have messed up something in that WINEPREFIX. I will re-install the game in a clean prefix and use that for further testing.

axeldavy commented 5 years ago

Every frame, vertices must be sent to the graphics card for the draw calls. There are also the draw commands sent by the driver. Good games don't upload anything else.

As mentioned already, the vertices for SC2 are stored in a permanently mapped buffer. The allocation is done in GTT, i.e. graphics memory in CPU RAM. At draw time, the graphics card fetches the data there. The thing is, if that were the limiting factor, the GPU-load would be higher: indeed the GPU is running shaders while fetching these vertex data, and it should count as the GPU running. Still, you can try using VRAM for these allocations (the size of the mappable VRAM is restricted to 64MB, but it should be enough here). Replace all PIPE_USAGE_STREAM in buffer9.c and nine_buffer_upload.c with PIPE_USAGE_DEFAULT. Possibly in your case it will be faster, as the data would make it to the GPU sooner.
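A blunt way to try that experiment (just a sketch; the paths assume a Mesa source tree from that era):

sed -i 's/PIPE_USAGE_STREAM/PIPE_USAGE_DEFAULT/g' src/gallium/state_trackers/nine/buffer9.c src/gallium/state_trackers/nine/nine_buffer_upload.c

then rebuild and re-run the game.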

As for the draw commands, there's not much to do.

Maybe, as iive suggested, you have wrong flags somewhere, which leads to suboptimal functions being used.

Venemo commented 5 years ago

Okay, so I upgraded to Fedora 29, kernel 5.0-rc1 and wine staging 4.0-rc1. Then I deleted the wine prefix and reinstalled the game in a clean prefix.

If this doesn't help either, try entering the "amdgpu_mm_rreg" in perf top and see which instructions are getting most heat.

There is a mov instruction which takes ~96% of the time spent in that function.

You could try the radeontop program, which shows the utilization of specific GPU subsystems. It doesn't sound like the bottleneck is there, but it might give some useful hints.

Here is the output. I couldn't copy-paste the bars from the terminal, but the percentages are clearly visible.

                         Graphics pipe  39,17% │
───────────────────────────────────────────────┼─
                 Event          Engine   0,00% │
                                               │
           Vertex Grouper + Tesselator  10,00% │
                                               │
                     Texture Addresser  24,17% │
                                               │
                         Shader Export  31,67% │
           Sequencer Instruction Cache   8,33% │
                   Shader Interpolator  33,33% │
                                               │
                        Scan Converter  34,17% │
                    Primitive Assembly  10,83% │
                                               │
                           Depth Block  30,83% │
                           Color Block  30,00% │
                                               │
                    2815M / 3995M VRAM  70,44% │
                       73M / 4093M GTT   1,79% │
            1,75G / 1,75G Memory Clock 100,00% │
            1,24G / 1,24G Shader Clock 100,00% │

You could also take a look at what options are available to you in R600_DEBUG

I took a look and tried a few options, but some of them caused a crash, while others caused a noticeable performance decrease, so I didn't get too far with it. Is there a specific option you want me to try?

I would add testing all working IOMMU types, in case you do need hardware support for it.

With iommu=off my wifi card is not recognized, which is a problem. With iommu=on intel_iommu=on I get roughly +5 fps, and the "gallium thread busy" line from GALLIUM_HUD goes to zero. So I guess that is a good thing at least.

If you are compiling your own mesa3d, try -O1 or -Og; it might give some hints. Actually, compiling with debug support would be a very good idea.

I just use the mesa that comes with Fedora. I'll also try the version from the "che" repo which is supposed to be a newer version with more optimizations enabled, will report back if it helps.

Also, I've tried to make a line that would show all graphs on the screen, try it in case it provides some hint.

Here is a screenshot from the graphs with intel_iommu=on:

Here is a graph

Venemo commented 5 years ago

Oops, I accidentally hit "close and comment", sorry. Reopening it now.

axeldavy commented 5 years ago

[performance screenshot]

Venemo commented 5 years ago

@axeldavy Thanks! How come the samples passed and the ps/vs invocations are so low on your machine? Also, the primitives generated seem much lower than mine.

axeldavy commented 5 years ago

It depends on the scene and resolution.

Venemo commented 5 years ago

Okay, I tried the patch suggested by @axeldavy and replaced PIPE_USAGE_STREAM with PIPE_USAGE_DEFAULT in the specified files. It definitely helps, but does not solve the problem.

Here are the graphs from the patched nine:

Graphs from patched nine

Hotspots from perf top:

  13,66%  d3dadapter9.so.1.0.0      [.] si_set_constant_buffer
   4,88%  anonmap.WTz65d (deleted)  [.] 0x000000000023f415
   2,69%  d3dadapter9.so.1.0.0      [.] NineDevice9_SetIndices
   2,57%  d3dadapter9.so.1.0.0      [.] amdgpu_do_add_real_buffer
   2,19%  d3dadapter9.so.1.0.0      [.] amdgpu_cs_add_buffer
   1,30%  dsound.dll.so             [.] 0x000000000002997a
   1,07%  anonmap.WTz65d (deleted)  [.] 0x000000000023f403
   0,63%  libc-2.28.so              [.] __memmove_avx_unaligned_erms
   0,59%  anonmap.WTz65d (deleted)  [.] 0x0000000001272ed6
   0,58%  dsound.dll.so             [.] 0x00000000000298e1

Here is the hot spot from the annotated si_set_constant_buffer:

       │     si_descriptors.c:1212                                                                                                                                                                                  ▒
  0,01 │       mov    (%rsi),%rax                                                                                                                                                                                   ▒
  0,00 │       lea    (%rax,%r14,8),%rdx                                                                                                                                                                            ▒
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:139                                                                                                                                                 ▒
  0,07 │       mov    (%rdx),%rsi                                                                                                                                                                                   ▒
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:79                                                                                                                                                  ▒
  0,01 │       cmp    %rsi,%r8                                                                                                                                                                                      ▒
       │     ↓ je     8c                                                                                                                                                                                            ▒
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:87                                                                                                                                                  ▒
  0,00 │       test   %rsi,%rsi                                                                                                                                                                                     ▒
       │     ↓ je     8c                                                                                                                                                                                            ▒
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:89                                                                                                                                                  ▒
 98,08 │       lock   subl   $0x1,(%rsi)                                                                                                                                                                            ▒
  1,16 │     ↓ jne    8c                                                                                                                                                                                            ▒
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:146                                                                                                                                                 ▒
  0,00 │ 4d:   mov    0x20(%rsi),%rax                                                                                                                                                                               ▒
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:148                                                                                                                                                 ▒
       │       mov    0x28(%rsi),%rcx                                                                                                                                                                               ▒
       │       mov    %rdx,0x8(%rsp)   

I also examined the same graph while running wined3d with the PBA patch:

Graphs from wined3d PBA

Hotspots according to perf top:

  16,67%  radeonsi_dri.so           [.] u_upload_alloc                                                                                                                                                              ◆
   3,12%  radeonsi_dri.so           [.] cso_set_vertex_buffers                                                                                                                                                      ▒
   2,27%  radeonsi_dri.so           [.] tc_call_draw_vbo                                                                                                                                                            ▒
   1,84%  radeonsi_dri.so           [.] amdgpu_cs_add_buffer                                                                                                                                                        ▒
   1,68%  libc-2.28.so              [.] __memmove_avx_unaligned_erms                                                                                                                                                ▒
   1,53%  radeonsi_dri.so           [.] amdgpu_do_add_real_buffer                                                                                                                                                   ▒
   1,25%  wined3d.dll.so            [.] 0x000000000005a390                                                                                                                                                          ▒
   1,20%  wined3d.dll.so            [.] 0x000000000005b5e4                                                                                                                                                          ▒
   1,17%  wined3d.dll.so            [.] 0x000000000005b870                                                                                                                                                          ▒
   0,90%  libpthread-2.28.so        [.] __pthread_mutex_lock                                                                                                                                                        ▒
   0,79%  ntdll.dll.so              [.] RtlEnterCriticalSection                                                                                                                                                     ▒
   0,73%  wined3d.dll.so            [.] 0x000000000005a3af                                                                                                                                                          ▒
   0,66%  wined3d.dll.so            [.] 0x000000000005a39a                                                                                                                                                          ▒
   0,64%  wined3d.dll.so            [.] 0x000000000005b52d                                                                                                                                                          ▒
   0,63%  wined3d.dll.so            [.] 0x000000000005b5d0                                                                                                                                                          ▒
   0,63%  wined3d.dll.so            [.] 0x000000000005b670                                                                                                                                                          ▒
   0,61%  wined3d.dll.so            [.] 0x000000000005b4a6                                                                                                                                                          ▒
   0,61%  ntdll.dll.so              [.] RtlLeaveCriticalSection                                                                                                                                                     ▒
   0,59%  [amdgpu]                  [k] amdgpu_mm_rreg                                                                                                                                                              ▒
   0,57%  wined3d.dll.so            [.] 0x000000000005aad0                                                                                                                                                          ▒
   0,57%  libpthread-2.28.so        [.] __pthread_mutex_unlock_usercnt                                                                                                                                              ▒
   0,53%  wined3d.dll.so            [.] 0x000000000005b5ed                                                                                                                                                          ▒
   0,51%  radeonsi_dri.so           [.] _mesa_unmarshal_dispatch_cmd                                                                                                                                                ▒
   0,50%  wined3d.dll.so            [.] 0x000000000005aabd    

Which means that wined3d+PBA still beats nine by a dozen or so FPS. What stands out to me is:

Further notes:

Where do we go from here?

Venemo commented 5 years ago

Here is what lspci -vvv has to say about the GPU:

08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef) (prog-if 00 [VGA controller])
    Subsystem: Sapphire Technology Limited Device e343
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 128 bytes
    Interrupt: pin A routed to IRQ 156
    Region 0: Memory at 60000000 (64-bit, prefetchable) [size=256M]
    Region 2: Memory at 70000000 (64-bit, prefetchable) [size=2M]
    Region 4: I/O ports at 2000 [size=256]
    Region 5: Memory at ac000000 (32-bit, non-prefetchable) [size=256K]
    Expansion ROM at ac040000 [disabled] [size=128K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <1us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s (ok), Width x4 (downgraded)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
             AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
             AtomicOpsCtl: ReqEn-
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
             EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee006b8  Data: 0000
    Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [150 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [200 v1] Resizable BAR <?>
    Capabilities: [270 v1] Secondary PCI Express <?>
    Capabilities: [2b0 v1] Address Translation Service (ATS)
        ATSCap: Invalidate Queue Depth: 00
        ATSCtl: Enable-, Smallest Translation Unit: 00
    Capabilities: [2c0 v1] Page Request Interface (PRI)
        PRICtl: Enable- Reset-
        PRISta: RF- UPRGI- Stopped+
        Page Request Capacity: 00000020, Page Request Allocation: 00000000
    Capabilities: [2d0 v1] Process Address Space ID (PASID)
        PASIDCap: Exec+ Priv+, Max PASID Width: 10
        PASIDCtl: Enable- Exec- Priv-
    Capabilities: [320 v1] Latency Tolerance Reporting
        Max snoop latency: 71680ns
        Max no snoop latency: 71680ns
    Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 1
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [370 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
              PortCommonModeRestoreTime=0us PortTPowerOnTime=170us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
               T_CommonMode=0us LTR1.2_Threshold=0ns
        L1SubCtl2: T_PwrOn=10us
    Kernel driver in use: amdgpu
    Kernel modules: amdgpu

Here is what amdgpu's sysfs has to say:

[root@timur-xps ~]# cat /sys/class/drm/card1/device/pp_dpm_pcie 
0: 2.5GT/s, x8 
1: 8.0GT/s, x16 *

Reducing the bandwidth using echo 0 > /sys/class/drm/card1/device/pp_dpm_pcie did not seem to have any effect on the frame rate, but I'm honestly not sure how reliable pp_dpm_pcie is. Or maybe the TB3 bottleneck is more severe than the restriction that it can impose.
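For reference, this is roughly how forcing a PCIe DPM state is supposed to work on amdgpu (a sketch; the sysfs layout can vary by kernel version, and pp_dpm_* writes usually only take effect in manual mode):

echo manual > /sys/class/drm/card1/device/power_dpm_force_performance_level
echo 0 > /sys/class/drm/card1/device/pp_dpm_pcie
cat /sys/class/drm/card1/device/pp_dpm_pcie
echo auto > /sys/class/drm/card1/device/power_dpm_force_performance_level

The cat should show the forced state marked with '*'; the last line restores automatic management.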

Trying to determine the available bandwidth using clpeak is not possible due to the missing PCI-E atomics support:

kfd kfd: skipped device 1002:67df, PCI rejects atomics

However this egpu.io page states the following: Intel throttles 32Gbps-TB3 to 22Gbps which benchmarks as 20.40Gbps (AMD) or 18.91Gbps (Nvidia) in the important H2D direction.

Venemo commented 5 years ago

I took a look at the hot spots for Nine in perf top and investigated a bit:

si_set_constant_buffer -> pipe_resource_reference -> pipe_reference_described -> p_atomic_dec_zero, which is a call to either __atomic_sub_fetch or __sync_sub_and_fetch (preferring the __atomic version if available).

Looking at the other items like NineDevice9_SetIndices and amdgpu_do_add_real_buffer, all of them wait for a similar synchronization primitive.

dhewg commented 5 years ago

Cannot reproduce here either

without vsync: screenshot_2019-01-18_13-10-30 and with vsync: screenshot_2019-01-18_13-56-08

Venemo commented 5 years ago

@dhewg @axeldavy Which distro do you guys use? Just asking because I might just try running that distro as an experiment. Maybe there is some package that is better optimized there than here, and maybe that makes the difference. Or maybe mesa itself is compiled with different flags?

dhewg commented 5 years ago

I'm on Debian, @axeldavy on Arch iirc. I built mesa myself though, using CFLAGS="-march=native" CXXFLAGS="-march=native" meson ~/src/mesa --prefix /opt/andre/mesa -Ddri-drivers= -Dgallium-drivers=radeonsi,swrast -Dvulkan-drivers=amd -Dgallium-nine=true -Dosmesa=gallium

Venemo commented 5 years ago

Okay, so @axeldavy suggested an approach that helps determine whether the constant data is the bottleneck or not:

08:20 mannerov: in nine_state.c
08:20 mannerov: prepare_vs_constants_userbuf
08:20 mannerov: when cb.buffer_size is set
08:21 mannerov: set it to the maximum value instead, that is sizeof(float[4]) * NINE_MAX_CONST_ALL
08:21 mannerov: if it doesn't decrease fps, it is not limiting

Here is how the patch file looks:

From debd5413b1d14a28f26ae40c1d907df621044c8a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Timur=20Krist=C3=B3f?= <venemo@msn.com>
Date: Fri, 18 Jan 2019 14:38:49 +0100
Subject: [PATCH] Change prepare_vs_constants_userbuf to always use the maximum
 size.

---
 src/gallium/state_trackers/nine/nine_state.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/gallium/state_trackers/nine/nine_state.c b/src/gallium/state_trackers/nine/nine_state.c
index bf39ebc9b4..209518ac2e 100644
--- a/src/gallium/state_trackers/nine/nine_state.c
+++ b/src/gallium/state_trackers/nine/nine_state.c
@@ -431,7 +431,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
     struct pipe_constant_buffer cb;
     cb.buffer = NULL;
     cb.buffer_offset = 0;
-    cb.buffer_size = context->vs->const_used_size;
+    cb.buffer_size = sizeof(float[4]) * NINE_MAX_CONST_ALL;
     cb.user_buffer = context->vs_const_f;

     if (context->swvp) {
-- 
2.20.1

When I run the game with this patch:

Now the perf top looks like this:

  17,21%  d3dadapter9.so.1.0.0      [.] si_decompress_textures
  12,59%  d3dadapter9.so.1.0.0      [.] si_set_constant_buffer
   4,72%  d3dadapter9.so.1.0.0      [.] nine_context_draw_indexed_primitive_priv
   1,58%  d3dadapter9.so.1.0.0      [.] amdgpu_do_add_real_buffer
   1,31%  dsound.dll.so             [.] 0x000000000002997a
   1,06%  d3dadapter9.so.1.0.0      [.] amdgpu_cs_add_buffer
   0,74%  anonmap.qaDqCx (deleted)  [.] 0x000000000023f415
   0,65%  d3dadapter9.so.1.0.0      [.] si_set_active_descriptors_for_shader
   0,55%  dsound.dll.so             [.] 0x00000000000298e1
   0,55%  libc-2.28.so              [.] __memmove_avx_unaligned_erms
   0,54%  d3dadapter9.so.1.0.0      [.] amdgpu_add_fence_dependencies_bo_list
   0,49%  d3dadapter9.so.1.0.0      [.] si_bind_vs_shader
   0,49%  [unknown]                 [.] 0000000000000000
   0,45%  d3dadapter9.so.1.0.0      [.] nine_update_state
   0,44%  d3dadapter9.so.1.0.0      [.] amdgpu_lookup_buffer
   0,42%  dsound.dll.so             [.] 0x0000000000029971

Interesting how si_decompress_textures now appears in there. I also annotated it:

       │     si_blit.c:780                                                                                                                                                                                          ▒
  1,71 │ 18:   push   %r15                                                                                                                                                                                          ▒
 44,26 │       push   %r14                                                                                                                                                                                          ▒
  7,00 │       push   %r13                                                                                                                                                                                          ▒
 30,18 │       push   %r12                                                                                                                                                                                          ▒
 16,11 │       push   %rbp                                                                                                                                                                                          ▒
  0,02 │       push   %rbx                                                                                                                                                                                          ▒
       │       mov    %rdi,%rbx                                                                                                                                                                                     ▒
  0,01 │       sub    $0x28,%rsp   

Venemo commented 5 years ago

@dhewg Is there anything in here that stands out to you? https://src.fedoraproject.org/rpms/mesa/blob/f29/f/mesa.spec#_365

Venemo commented 5 years ago

I also tried the other suggestion by @axeldavy - here are the results.

23:43 mannerov: Venemo_j: another thing you could try is in si_pipe.c in the const_uploader = u_upload_create call to replace DEFAULT by STREAM

Here is what I experience when I run the game with this:

This is the patch:

From 7025c93aca84713c25b3a73ff9e2db91493f217c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Timur=20Krist=C3=B3f?= <venemo@msn.com>
Date: Fri, 18 Jan 2019 14:38:49 +0100
Subject: [PATCH] Change si_create_context to use STREAM for const_uploader.

---
 src/gallium/drivers/radeonsi/si_pipe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 6b36893698..ce77a4fd4c 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -409,7 +409,7 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
        goto fail;

    sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
-                          0, PIPE_USAGE_DEFAULT,
+                          0, PIPE_USAGE_STREAM,
                           SI_RESOURCE_FLAG_32BIT |
                           (sscreen->cpdma_prefetch_writes_memory ?
                                0 : SI_RESOURCE_FLAG_READ_ONLY));
-- 
2.20.1

perf top result:

  12,26%  [amdgpu]                  [k] amdgpu_mm_rreg
   1,98%  dsound.dll.so             [.] 0x000000000002997a
   0,92%  d3dadapter9.so.1.0.0      [.] amdgpu_cs_add_buffer
   0,81%  dsound.dll.so             [.] 0x00000000000298e1
   0,78%  libc-2.28.so              [.] __memmove_avx_unaligned_erms
   0,65%  [kernel]                  [k] update_blocked_averages
   0,64%  anonmap.8QZ4Mi (deleted)  [.] 0x0000000001c16777
   0,63%  dsound.dll.so             [.] 0x0000000000029971
   0,60%  dsound.dll.so             [.] 0x00000000000298c5
   0,60%  d3dadapter9.so.1.0.0      [.] amdgpu_lookup_buffer
   0,57%  d3dadapter9.so.1.0.0      [.] nine_context_set_texture
   0,54%  [kernel]                  [k] psi_task_change
   0,50%  d3dadapter9.so.1.0.0      [.] NineDevice9_SetTexture

Venemo commented 5 years ago

Oops, looks like I forgot to revert the other patch before applying this one.

Will do another test with just the si_pipe patch.

iiv3 commented 5 years ago

Cannot reproduce here either

  • Xeon E3-1231 v3 @ 3.40GHz
  • R9 380X
  • Linux 4.19.13
  • mesa master from ~2 weeks ago
  • wine 4.0~rc6
  • sc2 set to 1080p with ultra gfx settings

@Venemo's setup involves Thunderbolt to run the card outside the laptop. This limits it to PCIe x4 speeds.

To recreate his setup you should at least move your video card to a PCIe x4 slot. Most motherboards have only 1-2 PCIe x16 slots, and the rest are x4, even though they use the full x16 connector.

You may try it if this is not going to void your PC warranty.

Venemo commented 5 years ago

Rebuilt mesa with the correct patch and updated my last comment.

dhewg commented 5 years ago

@Venemo: No idea about rpm spec stuff, but nothing stands out, I guess. I just tried against the mesa packages from Debian, and it looks like the fps is roughly the same in SC2.

Venemo commented 5 years ago

While the si_pipe patch does help a lot, it's not perfect. Even though Gallium says that the GPU utilization is 99%, radeontop disagrees, and says the utilization is ~50% (up from the previous ~30%).

iiv3 commented 5 years ago

@dhewg, to reproduce the issue you need to move the video card to a PCIe x4 slot.

As I said in my previous comment to you, a lot of motherboards have full-sized slots that are x4.

dhewg commented 5 years ago

@iiv3 yeah, I get that it's a bottleneck that I don't run into. Just confirming that it's not a general issue

iiv3 commented 5 years ago

I ported the change to r600_pipe_common.c::706 and ran a benchmark. Without the change L4D2 got 83.55 and 83.72 fps. After the change it got 84.27 and 83.92 fps.

I'd say that it is a consistent improvement, even though 0.5% is around the margin of error.

axeldavy commented 5 years ago

[new screenshot]

I get better performance with the constant compacting code, even though it's still not 100% usage.

dhewg commented 5 years ago

With vsync disabled in-game and thread_submit=true as well as tearfree_discard=true, my GPU load is maxed out. That is with upstream mesa, so none of those new patches.

Venemo commented 5 years ago

I built the latest master (as of today) from the ixit repo and ran SC2 with the mesa built from there. The results are definitely an improvement, but not as amazing as the radeonsi patch that I tried earlier.

I also tried combining this with the patch that replaces all PIPE_USAGE_STREAM in buffer9.c and nine_buffer_upload.c with PIPE_USAGE_DEFAULT.

Venemo commented 5 years ago

Investigated a bit further. I'm not 100% sure that the bottleneck is VS constants. I made the following changes:

This is a huge hack (the stream uploader is not supposed to handle shader constants), but it yields a huge improvement over everything else we've tried so far. It now reaches ~95 fps in the same scene. Note that this is even better than the version where I set all constants to use the stream uploader (which was ~80 fps).

Conclusion: something about the vertex shader constants in SC2 causes a bottleneck in nine when the PCI-E bandwidth is limited.

marekolsak commented 5 years ago

Can you try to switch const_uploader from DEFAULT to STREAM? If that's not enough, can you increase the size from 128 * 1024 to 1024 * 1024?

Venemo commented 5 years ago

@marekolsak Already tried that, you can find the exact results in the comments above. As noted, setting just the VS constants to use the stream uploader yields better performance than setting the const_uploader to STREAM.

marekolsak commented 5 years ago

I'd like to identify the reason why stream_uploader is faster.

marekolsak commented 5 years ago

It's possible that const_uploader needs to use a bigger size.

Venemo commented 5 years ago

@marekolsak

I've run a series of tests for you. I'm posting the patches here too. All of these were applied on top of the ixit/master branch. In every case the same SC2 replay was used and I looked at the FPS counter at the same moment during the replay.

It's possible that const_uploader needs to use a bigger size.

The maximum size of the buffer that comes from prepare_vs_constants_userbuf is 4096 bytes (the maximum number of shader constants in nine is 256, and each of them has a maximum size of 4 * sizeof(float)). To confirm, I added an assert(cb.buffer_size <= 4096); below the u_upload_alloc call, and it is never bigger than that.

I'd like to identify the reason why stream_uploader is faster.

@axeldavy has a better understanding than I do, so maybe he can chime in with a better explanation, but I think it isn't really faster; this is just a symptom of an inefficiency somewhere in the code. For example, if there are a lot of unused shader constants, then the const_uploader would needlessly upload all of them, but the stream_uploader would not (as far as I understand).

In fact, changing everything to use STREAM is slower than changing just the VS constants to use STREAM.

Increasing the default size of the const uploader

The following patch gives ~50 fps, so no noticeable improvement.

diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 92ce1e0699c..8c610152942 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -434,7 +434,7 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
        if (!sctx->b.stream_uploader)
                goto fail;

-       sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
+       sctx->b.const_uploader = u_upload_create(&sctx->b, 1024 * 1024,
                                                   0, PIPE_USAGE_DEFAULT,
                                                   SI_RESOURCE_FLAG_32BIT |
                                                   (sscreen->cpdma_prefetch_writes_memory ?

Changing the const uploader to STREAM

Gives ~75-80 fps.

diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 92ce1e0699c..939bf31e46f 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -435,7 +435,7 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
                goto fail;

        sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
-                                                  0, PIPE_USAGE_DEFAULT,
+                                                  0, PIPE_USAGE_STREAM,
                                                   SI_RESOURCE_FLAG_32BIT |
                                                   (sscreen->cpdma_prefetch_writes_memory ?
                                                            0 : SI_RESOURCE_FLAG_READ_ONLY));

Changing the const uploader to STREAM while also increasing its default size

Gives ~75-80 fps, so the same as before.

diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 92ce1e0699c..1ede2de6580 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -434,8 +434,8 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
        if (!sctx->b.stream_uploader)
                goto fail;

-       sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
-                                                  0, PIPE_USAGE_DEFAULT,
+       sctx->b.const_uploader = u_upload_create(&sctx->b, 1024 * 1024,
+                                                  0, PIPE_USAGE_STREAM,
                                                   SI_RESOURCE_FLAG_32BIT |
                                                   (sscreen->cpdma_prefetch_writes_memory ?
                                                            0 : SI_RESOURCE_FLAG_READ_ONLY));

Hack that makes the VS constants use the stream uploader

This one yields ~90 fps.

diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 92ce1e0699c..ac001ddd7b7 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -430,7 +430,7 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,

        sctx->b.stream_uploader = u_upload_create(&sctx->b, 1024 * 1024,
                                                    0, PIPE_USAGE_STREAM,
-                                                   SI_RESOURCE_FLAG_READ_ONLY);
+                                                   SI_RESOURCE_FLAG_32BIT | SI_RESOURCE_FLAG_READ_ONLY);
        if (!sctx->b.stream_uploader)
                goto fail;

diff --git a/src/gallium/state_trackers/nine/nine_state.c b/src/gallium/state_trackers/nine/nine_state.c
index 40c6ed10446..34d7720f4cb 100644
--- a/src/gallium/state_trackers/nine/nine_state.c
+++ b/src/gallium/state_trackers/nine/nine_state.c
@@ -485,7 +485,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
      * to have efficient WC. Thus for this case we really want
      * that intermediate buffer. */

-    u_upload_alloc(context->pipe->const_uploader,
+    u_upload_alloc(context->pipe->stream_uploader,
                   0,
                   cb.buffer_size,
                   256, /* Be conservative about alignment */
@@ -509,7 +509,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
         }
     }

-    u_upload_unmap(context->pipe->const_uploader);
+    u_upload_unmap(context->pipe->stream_uploader);
     cb.user_buffer = NULL;

     /* Free previous resource */

Hacking VS and PS constants to use the stream uploader

This one yields ~80-85 fps, so worse than using just the VS constants with the stream uploader.

diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 92ce1e0699c..9056268cd53 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -430,11 +430,11 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,

        sctx->b.stream_uploader = u_upload_create(&sctx->b, 1024 * 1024,
                                                    0, PIPE_USAGE_STREAM,
-                                                   SI_RESOURCE_FLAG_READ_ONLY);
+                                                   SI_RESOURCE_FLAG_32BIT | SI_RESOURCE_FLAG_READ_ONLY);
        if (!sctx->b.stream_uploader)
                goto fail;

-       sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
+       sctx->b.const_uploader = u_upload_create(&sctx->b, 4 * 1024,
                                                   0, PIPE_USAGE_DEFAULT,
                                                   SI_RESOURCE_FLAG_32BIT |
                                                   (sscreen->cpdma_prefetch_writes_memory ?
diff --git a/src/gallium/state_trackers/nine/nine_state.c b/src/gallium/state_trackers/nine/nine_state.c
index 40c6ed10446..1bebdf440a5 100644
--- a/src/gallium/state_trackers/nine/nine_state.c
+++ b/src/gallium/state_trackers/nine/nine_state.c
@@ -485,7 +485,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
      * to have efficient WC. Thus for this case we really want
      * that intermediate buffer. */

-    u_upload_alloc(context->pipe->const_uploader,
+    u_upload_alloc(context->pipe->stream_uploader,
                   0,
                   cb.buffer_size,
                   256, /* Be conservative about alignment */
@@ -494,6 +494,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
                   (void**)&upload_ptr);

     assert(cb.buffer && upload_ptr);
+    assert(cb.buffer_size <= 4096);

     if (!context->cso_shader.vs_const_ranges) {
         memcpy(upload_ptr, cb.user_buffer, cb.buffer_size);
@@ -509,7 +510,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
         }
     }

-    u_upload_unmap(context->pipe->const_uploader);
+    u_upload_unmap(context->pipe->stream_uploader);
     cb.user_buffer = NULL;

     /* Free previous resource */
@@ -573,7 +574,7 @@ prepare_ps_constants_userbuf(struct NineDevice9 *device)
     if (!cb.buffer_size)
         return;

-    u_upload_alloc(context->pipe->const_uploader,
+    u_upload_alloc(context->pipe->stream_uploader,
                   0,
                   cb.buffer_size,
                   256, /* Be conservative about alignment */
@@ -597,7 +598,7 @@ prepare_ps_constants_userbuf(struct NineDevice9 *device)
         }
     }

-    u_upload_unmap(context->pipe->const_uploader);
+    u_upload_unmap(context->pipe->stream_uploader);
     cb.user_buffer = NULL;

     /* Free previous resource */

axeldavy commented 5 years ago

Just a few pieces of information:

  • If a shader is using indirect addressing, loops and/or booleans, you can have slightly more than 256 constants. I guess that doesn't happen for this game.
  • With this git tree, there are patches to compress the constant ranges, so nine only uploads what is going to be used by the shader, except for the case of constant indirect addressing (which can only happen with vs), where we have to upload the 256 float slots. It's likely not all of them end up used, which may be why STREAM would perform better (because the transfer via the PCI bus would only be for the used constants). That doesn't explain why the stream_uploader would be faster.

Venemo commented 5 years ago

@axeldavy Is it possible to somehow check if the game uses indirect addressing? Just to confirm or rule out that possibility.

axeldavy commented 5 years ago

The game uses indirect addressing. I don't know the ratio of 'constants sent because the shader uses indirect addressing' over 'all constants sent', but given that a shader usually needs fewer than 25 float[4] slots, even if a few shaders use indirect addressing the ratio can be high.

Venemo commented 5 years ago

Can we somehow eliminate the indirect addressing?

marekolsak commented 5 years ago

Thanks. I think the problem with DEFAULT is that more buffers are competing for the visible VRAM window, which makes the memory manager move buffers more often.

If you use STREAM too much, you might decrease GPU performance, because PCIe is slower.

axeldavy commented 5 years ago

The same game has good performance on PCIe x16 but not on Thunderbolt. It seems surprising to me that the culprit would be the memory manager moving the buffers in VRAM, as then PCIe x16 would be affected as well. Except maybe if the buffers get moved to GTT? That would explain a lot.

marekolsak commented 5 years ago

Can you test this branch? https://cgit.freedesktop.org/~mareko/mesa/log/?h=const-uploads-sdma

Venemo commented 5 years ago

@marekolsak Wow, thank you. I merged your branch with the ixit master, compiled a release build and ran the game with SDMA=true. Now in the same scene I'm getting ~85-90 fps. GPU usage is 98% according to GALLIUM_HUD and 95% according to radeontop.
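For reference, the command line was roughly of this shape (the executable name is hypothetical; the replay and settings were the same as in the earlier tests):

SDMA=true GALLIUM_HUD=fps,GPU-load wine SC2.exe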

perf top result:

   9,56%  [amdgpu]                  [k] amdgpu_mm_rreg
   1,12%  libc-2.28.so              [.] __memmove_avx_unaligned_erms
   0,99%  d3dadapter9.so.1.0.0      [.] nine_update_state
   0,86%  d3dadapter9.so.1.0.0      [.] amdgpu_lookup_or_add_slab_buffer
   0,84%  d3dadapter9.so.1.0.0      [.] amdgpu_cs_add_buffer
   0,68%  [kernel]                  [k] update_blocked_averages
   0,68%  d3dadapter9.so.1.0.0      [.] amdgpu_add_fence_dependencies_bo_list
   0,61%  libc-2.28.so              [.] __memcmp_avx2_movbe
   0,45%  d3dadapter9.so.1.0.0      [.] nine_context_set_texture
   0,40%  [kernel]                  [k] psi_task_change
   0,37%  d3dadapter9.so.1.0.0      [.] si_emit_spi_map
   0,36%  d3dadapter9.so.1.0.0      [.] NineDevice9_SetTexture

Some more experiments that I tried:

marekolsak commented 5 years ago

Thanks for running with SDMA=true. I forgot to tell you about it.

I've updated the branch with a performance fix. It should now reach 100% GPU load: https://cgit.freedesktop.org/~mareko/mesa/log/?h=const-uploads-sdma

iiv3 commented 5 years ago

@marekolsak, would it be possible to implement the same changes for the r600 driver?

Venemo commented 5 years ago

@marekolsak I've run another series of tests for you.

Branch                      FPS       GTT     VRAM     GPU use   Hotspot
ixit/master                 55        60 MB   1.7 GB   45-50%    25% nine_update_state, 20% amdgpu_mm_rreg
mareko/const-uploads-sdma   105       70 MB   1.7 GB   100%      10% amdgpu_mm_rreg
merged                      105-110   70 MB   1.7 GB   100%      10% amdgpu_mm_rreg
merged + this patch         110       45 MB   1.7 GB   100%      10% amdgpu_mm_rreg

Conclusions:

marekolsak commented 5 years ago

@marekolsak, would it be possible to implement the same changes for the r600 driver?

It would be possible, but the performance improvement wouldn't be as high, because the DMA engine is not asynchronous on radeon. For r600, I recommend changing const_uploader from DEFAULT to STREAM.

@Venemo My series decreases performance for trivial benchmarks by 20%, so I need some amdgpu kernel changes to get the performance back.

axeldavy commented 5 years ago

Surprising. Any idea what bad kernel behaviour is causing the performance drop?

marekolsak commented 5 years ago

Surprising. Any idea what bad kernel behaviour is causing the performance drop?

We submit 2 CS ioctls. We need to merge them into 1.

Venemo commented 5 years ago

@marekolsak Can you shed some light on what amdgpu_mm_rreg is and why it is the hot spot? As far as I understand it has to do with memory-mapped registers, but I couldn't quite figure out why it is eating 10% of the CPU time in the game.

marekolsak commented 5 years ago

amdgpu_mm_rreg is used for getting the GPU load.
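If that is the GALLIUM_HUD GPU-load query (or radeontop) polling a status register behind the scenes, the overhead should mostly disappear from the profile when the load query isn't requested, e.g. (hypothetical invocation):

GALLIUM_HUD=fps wine SC2.exe

i.e. keep the fps counter but drop GPU-load, and check whether amdgpu_mm_rreg falls out of perf top.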