neutrinolabs / xorgxrdp

Xorg drivers for xrdp

xorgxrdp uses high CPU in CRC #300

Closed metalefty closed 2 months ago

metalefty commented 3 months ago

GFX (v0.10) uses significantly more CPU than v0.9.

Sample web page to reproduce:

The page draws random rectangles to a canvas. After clicking "Press to toggle" it keeps drawing rectangles, but white on white, so the screen updates are invisible. The hotspot is crc_process_data.

I'm still investigating why GFX does more CRC calculation, and whether we can reduce the CRC work or speed it up with a faster algorithm or library such as crc32c.
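To illustrate why invisible updates can still cost CPU, here is a minimal Python sketch (it is not xorgxrdp's C code). It assumes the CRC is computed per tile of the captured frame and compared with the previous value to decide whether the tile changed. The 64-pixel tile size, the 4 bytes per pixel, the helper names and the use of `zlib.crc32` are all assumptions made for the example.

```python
import zlib

TILE = 64  # hypothetical tile edge in pixels
BPP = 4    # bytes per pixel (32-bit pixels assumed)


def tile_crcs(frame: bytes, width: int, height: int) -> dict:
    """Return {(tile_x, tile_y): crc32} for every TILE x TILE tile of the frame."""
    stride = width * BPP
    crcs = {}
    for ty in range(0, height, TILE):
        for tx in range(0, width, TILE):
            crc = 0
            row_bytes = min(TILE, width - tx) * BPP
            # Feed the tile row by row into a running CRC.
            for y in range(ty, min(ty + TILE, height)):
                start = y * stride + tx * BPP
                crc = zlib.crc32(frame[start:start + row_bytes], crc)
            crcs[(tx // TILE, ty // TILE)] = crc
    return crcs


def changed_tiles(prev: dict, cur: dict) -> list:
    """Tiles whose CRC differs from the previous frame."""
    return sorted(key for key, crc in cur.items() if prev.get(key) != crc)


width, height = 128, 128
white = bytes([0xFF]) * (width * height * BPP)
before = tile_crcs(white, width, height)

# White on white: something was "drawn", but the bytes are identical.
# Every tile is still hashed again to find that out.
after_invisible = tile_crcs(white, width, height)

# A visible draw: one non-white pixel at (70, 5), which lies in tile (1, 0).
frame = bytearray(white)
offset = (5 * width + 70) * BPP
frame[offset:offset + BPP] = b"\x00\x00\x00\xff"
after_visible = tile_crcs(bytes(frame), width, height)

invisible_changes = changed_tiles(before, after_invisible)
visible_changes = changed_tiles(before, after_visible)
print(invisible_changes)
print(visible_changes)
```

Under these assumptions the white-on-white case reports no changed tiles, yet the full hashing cost is paid anyway, which would be consistent with crc_process_data showing up as the hotspot.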

@jsorg71 @Nexarian Any thoughts?

metalefty commented 3 months ago

I found a benchmark comparing CRC algorithms.

Nexarian commented 3 months ago

Yup, this has been a known issue with RFX for a while. The CRC hashing algorithm is slow. One way to address this is https://github.com/neutrinolabs/xorgxrdp/pull/167 (using other hashing algorithms), which I incorporated into mainline_merge but left out because it wasn't fully tested and hadn't been merged originally.

We need to investigate other hash algorithms that perform better, and I think there is also room to use SIMD-optimized hashes for even more performance.

Nexarian commented 3 months ago

If you'd like, I can rebase the relevant bits from https://github.com/neutrinolabs/xorgxrdp/pull/167 onto the latest code and we can see if it helps?

jsorg71 commented 3 months ago

When I run that page with Xorg glamor, it's using almost no CPU, hehe. I don't really see why GFX should be more than RFX. I'll test that. They should both be using the slow CRC32. We can switch to something else like a hash. Don't worry about the EGL / glamor CRC stuff. That will all get moved to the helper later.

jsorg71 commented 3 months ago

I got working SSE2 accel for librfxcodec progressive. I still have to accel the RGB to YUV in Xorg. I think the CRC change to hash and SSE improvements should really help.

metalefty commented 3 months ago

Thank you both. I remembered #167 now.

I thought CRC was required by the protocol specification, but it sounds like it is not. We can try any fast hash if nothing binds us to CRC. I think we should use libraries that have built-in SIMD acceleration. I'll try some hash algorithms, including wyhash.

metalefty commented 3 months ago

> When I run that page with Xorg glamor, it's using almost no CPU, hehe.

I see CRC checking is skipped using glamor. https://github.com/neutrinolabs/xorgxrdp/blob/af43a8b36ddc8a0aa1a7d0c4eaa499948aac0294/module/rdpEgl.c#L608-L621

I have raised #301. Wyhash reduces CPU usage by 1/2 to 1/3.

metalefty commented 3 months ago

> I don't really see why GFX should be more than RFX. I'll test that.

@jsorg71 It turned out this is not a regression.

This issue was reported by an enterprise user. They reported it as a regression between v0.9 and v0.10; however, they are using xorgxrdp with the wyhash patch applied in their internal build. The difference in CPU usage is almost entirely due to wyhash versus CRC.

jsorg71 commented 3 months ago

The CRC is still done in rdpEgl.c, it is just done on the GPU. One reason I didn't like the wyhash algorithm is that it produces a 64-bit hash. The largest you can return with the GPU is 32 bits (one pixel). I was looking at adler32 but that was getting too many false hits. Then I kinda ran out of steam on the topic. But it's OK, we can do one thing with the CPU and another with the GPU. Speed is most important here.
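As a small illustration of what a false hit looks like, the Python sketch below shows two different 3-byte inputs that share the same Adler-32 value while CRC32 tells them apart. The inputs are made up for the example; whether this is the kind of collision that was seen on the GPU path is an assumption.

```python
import zlib

# Two different byte sequences chosen so that both Adler-32 running sums agree:
# the plain byte sum is 2 in both cases, and the position-weighted sum matches too.
a = bytes([0, 2, 0])
b = bytes([1, 0, 1])

adler_a, adler_b = zlib.adler32(a), zlib.adler32(b)
crc_a, crc_b = zlib.crc32(a), zlib.crc32(b)

print(f"adler32: {adler_a:#010x} {adler_b:#010x} equal={adler_a == adler_b}")
print(f"crc32:   {crc_a:#010x} {crc_b:#010x} equal={crc_a == crc_b}")
```

With a hash used for change detection, equal values for different data mean a changed region is treated as unchanged, so small compensating changes in neighbouring bytes are a plausible source of false hits with a sum-based checksum.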

metalefty commented 3 months ago

We can also try xxhash. It is also a fast hash algorithm, and there is an XXH32 variant that returns a 32-bit hash.

metalefty commented 3 months ago

Quick test: XXH32 is slower than XXH64 and XXH128. I ran xxhsum on a 10 GiB file.

       -HHASHTYPE
              Hash selection. HASHTYPE means 0=XXH32, 1=XXH64, 2=XXH128,
              3=XXH3. Note that -H3 triggers --tag, which can't be skipped
              (this is to reduce risks of confusion with -H2 (XXH64)).
              Alternatively, HASHTYPE 32=XXH32, 64=XXH64, 128=XXH128. Default
              value is 1 (XXH64)
% ls -l CentOS-Stream-ec2-9-20230605.0.x86_64.raw
-rw-r--r--  1 root wheel 10737418240  6月  6  2023 CentOS-Stream-ec2-9-20230605.0.x86_64.raw

% for i in 0 1 2 3; do time xxhsum -H${i} CentOS-Stream-ec2-9-20230605.0.x86_64.raw; done
325ef4d5  CentOS-Stream-ec2-9-20230605.0.x86_64.raw
xxhsum -H0 CentOS-Stream-ec2-9-20230605.0.x86_64.raw  1.50s user 0.89s system 99% cpu 2.388 total
f559c7d9b8a9d78a  CentOS-Stream-ec2-9-20230605.0.x86_64.raw
xxhsum -H1 CentOS-Stream-ec2-9-20230605.0.x86_64.raw  0.78s user 0.82s system 99% cpu 1.604 total
01998f10a7e89bae4bc57d94b32e0867  CentOS-Stream-ec2-9-20230605.0.x86_64.raw
xxhsum -H2 CentOS-Stream-ec2-9-20230605.0.x86_64.raw  0.54s user 0.79s system 99% cpu 1.334 total
XXH3 (CentOS-Stream-ec2-9-20230605.0.x86_64.raw) = 4bc57d94b32e0867
xxhsum -H3 CentOS-Stream-ec2-9-20230605.0.x86_64.raw  0.52s user 0.86s system 99% cpu 1.379 total
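The timings above can be turned into rough throughput figures. The Python sketch below only redoes the arithmetic on the numbers printed by `time`; it divides the file size by the wall-clock time and, separately, by the user CPU time. The wall-clock figures include file I/O, so they understate pure hashing speed.

```python
SIZE_BYTES = 10737418240  # file size from the `ls -l` output above
GIB = 2**30

# (user seconds, total seconds) copied from the `time` output above
timings = {
    "XXH32": (1.50, 2.388),
    "XXH64": (0.78, 1.604),
    "XXH128": (0.54, 1.334),
    "XXH3": (0.52, 1.379),
}

wall_gib_s = {name: SIZE_BYTES / GIB / total for name, (_, total) in timings.items()}
user_gib_s = {name: SIZE_BYTES / GIB / user for name, (user, _) in timings.items()}

for name in timings:
    print(f"{name:7s} wall {wall_gib_s[name]:5.2f} GiB/s   user {user_gib_s[name]:5.2f} GiB/s")
```

On this single run, XXH32 comes out at roughly 4.2 GiB/s of wall-clock throughput and the 64-bit and 128-bit variants at roughly 6.2 to 7.5 GiB/s.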

My processor is Ryzen 7 5700G.

CPU: AMD Ryzen 7 5700G with Radeon Graphics          (3793.00-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0xa50f00  Family=0x19  Model=0x50  Stepping=0
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
  AMD Features2=0x75c237ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX,ADMSKX>
  Structured Extended Features=0x219c97a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,PQM,PQE,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,SHA>
  Structured Extended Features2=0x40069c<UMIP,PKU,OSPKE,VAES,VPCLMULQDQ,RDPID>
  Structured Extended Features3=0x10<FSRM>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  AMD Extended Feature Extensions ID EBX=0x191ef657<CLZERO,IRPerf,XSaveErPtr,RDPRU,WBNOINVD,IBPB,IBRS,STIBP,STIBP_ALWAYSON,PREFER_IBRS,SSBD>
  SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=32768
  TSC: P-state invariant, performance statistics

metalefty commented 3 months ago

I ran a benchmark to find out which hash algorithm is fastest. wyhash with lazy color convert is the winner. However, wyhash and xxhash64 are very close. The result might be different if I implemented lazy color convert for xxhash64.

The result is:

wyhash w/lazy > wyhash ≈ xxhash64 > xxhash32
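The thread does not spell out what "lazy color convert" does. One plausible reading, offered here purely as an assumption, is that the color conversion of a tile is deferred until its hash shows that the tile actually changed. The Python sketch below compares that with converting every tile up front. `convert_tile`, `TileEncoder` and the use of `zlib.crc32` as a stand-in hash are hypothetical names and choices for the example, not taken from the patches.

```python
import zlib


def convert_tile(tile: bytes) -> bytes:
    """Stand-in for a costly color conversion: swap bytes 0 and 2 of each 4-byte pixel."""
    out = bytearray(tile)
    out[0::4], out[2::4] = tile[2::4], tile[0::4]
    return bytes(out)


class TileEncoder:
    """Hashes each tile and returns converted pixels only for tiles that changed."""

    def __init__(self, lazy: bool):
        self.lazy = lazy
        self.last_hash = {}
        self.conversions = 0  # how many times the costly conversion ran

    def _convert(self, tile: bytes) -> bytes:
        self.conversions += 1
        return convert_tile(tile)

    def process(self, key, tile: bytes):
        # Eager mode converts first, whether or not the tile changed.
        converted = None if self.lazy else self._convert(tile)
        digest = zlib.crc32(tile)  # stand-in for wyhash or xxhash
        if self.last_hash.get(key) == digest:
            return None  # unchanged tile: nothing to send
        self.last_hash[key] = digest
        # Lazy mode converts only now that the tile is known to have changed.
        return self._convert(tile) if self.lazy else converted


red = bytes([0, 0, 255, 255]) * 16
blue = bytes([255, 0, 0, 255]) * 16
frames = [red, red, blue]  # the same tile position over three frames

eager, lazy = TileEncoder(lazy=False), TileEncoder(lazy=True)
eager_out = [eager.process((0, 0), tile) for tile in frames]
lazy_out = [lazy.process((0, 0), tile) for tile in frames]

print(eager.conversions, lazy.conversions)
```

Both modes return the same data, but in this toy run the lazy mode skips the conversion for the unchanged middle frame. The saving would grow with the share of tiles that do not change between frames.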

Environment

| Item | Detail |
|---|---|
| CPU | Ryzen 7 5700G |
| OS | FreeBSD 14-STABLE |
| xrdp | 0.10.0-beta.2 |
| xorgxrdp | 0.10.0 + wyhash / wyhash + lazy color convert / xxhash32 / xxhash64 patch |
| Screen geometry | 1920x1200 |
| Browser | Iridium |
| Desktop Environment | Xfce4 |

Test method

Testing #301, #302 and vanilla v0.10.0.

  1. Run cmatrix in a maximized terminal.
  2. Open the following webpage in the browser.

Observe CPU usage with top.

The screen that I'm testing looks like this: (two screenshots)

Result

| Hash type | Content | Xorg CPU usage |
|---|---|---|
| CRC32 | cmatrix | 51-62% |
| CRC32 | random rectangles | 63-66% |
| CRC32 | random rectangles (white on white) | 62-64% |
| wyhash | cmatrix | 27-29% |
| wyhash | random rectangles | 34-36% |
| wyhash | random rectangles (white on white) | 34-35% |
| wyhash w/lazy | cmatrix | 22-24% |
| wyhash w/lazy | random rectangles | 27-29% |
| wyhash w/lazy | random rectangles (white on white) | 26-27% |
| xxhash32 | cmatrix | 30-32% |
| xxhash32 | random rectangles | 35-36% |
| xxhash32 | random rectangles (white on white) | 34-35% |
| xxhash64 | cmatrix | 28-30% |
| xxhash64 | random rectangles | 34-35% |
| xxhash64 | random rectangles (white on white) | 33-34% |

metalefty commented 2 months ago

Here's an additional result xxhash64 w/lazy.

| Hash type | Content | Xorg CPU usage |
|---|---|---|
| xxhash64 w/lazy | cmatrix | 23-25% |
| xxhash64 w/lazy | random rectangles | 29-31% |
| xxhash64 w/lazy | random rectangles (white on white) | 28-29% |
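To put the reported ranges on one scale, the Python sketch below expresses each hash's Xorg CPU usage as a fraction of the CRC32 figure for the same content. Taking the midpoint of each reported range is a simplification chosen for this example, not part of the original measurement.

```python
# Xorg CPU usage ranges in percent, copied from the result tables above.
results = {
    "CRC32": {"cmatrix": (51, 62), "random rectangles": (63, 66)},
    "wyhash": {"cmatrix": (27, 29), "random rectangles": (34, 36)},
    "wyhash w/lazy": {"cmatrix": (22, 24), "random rectangles": (27, 29)},
    "xxhash32": {"cmatrix": (30, 32), "random rectangles": (35, 36)},
    "xxhash64": {"cmatrix": (28, 30), "random rectangles": (34, 35)},
    "xxhash64 w/lazy": {"cmatrix": (23, 25), "random rectangles": (29, 31)},
}


def midpoint(cpu_range):
    """Midpoint of a (low, high) range; a simplification for comparison only."""
    return sum(cpu_range) / 2


def relative_to_crc32(hash_type: str, content: str) -> float:
    """CPU usage of hash_type as a fraction of the CRC32 usage for the same content."""
    return midpoint(results[hash_type][content]) / midpoint(results["CRC32"][content])


for hash_type in results:
    cells = ", ".join(
        f"{content}: {relative_to_crc32(hash_type, content):.0%}"
        for content in results[hash_type]
    )
    print(f"{hash_type:16s} {cells}")
```

By this measure, plain wyhash lands at about half of the CRC32 CPU usage, and the two lazy variants at roughly 40 to 47 percent.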

metalefty commented 2 months ago

wyhash

pros

cons

xxhash64

pros

cons

metalefty commented 2 months ago

Wyhash has been merged (#301).