xorgxrdp uses high CPU in CRC

metalefty commented 7 months ago

GFX (v0.10) uses significantly more CPU than v0.9.

Sample web page to reproduce:

https://people.freebsd.org/~meta/tmp/whitebox.html

The page draws random rectangles to a canvas. After clicking "Press to toggle" it does the same but draws white on white, so the screen updates are invisible. The hotspot is crc_process_data.

I'm still investigating why GFX does more CRC calculation, and whether we can reduce the amount of CRC calculation or speed it up with a faster algorithm or library such as crc32c.
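For context on why a CRC routine can show up as a hotspot: a classic table-driven CRC-32 consumes one byte per loop iteration, and each step depends on the result of the previous one. The sketch below is a generic textbook implementation, not the actual crc_process_data code; the names crc32_init and crc32_bytes are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_table[256];

/* Build the lookup table for the reflected polynomial 0xEDB88320. */
static void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        crc_table[i] = c;
    }
}

/* One table lookup per input byte; every step needs the previous crc. */
static uint32_t crc32_bytes(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = crc_table[(crc ^ data[i]) & 0xFFu] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}
```

After crc32_init(), calling crc32_bytes on the 9-byte ASCII string "123456789" returns 0xCBF43926, the standard check value for this CRC. Assuming 32-bit pixels, one full pass over a 1920x1200 screen is about 9.2 million iterations of that loop.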

@jsorg71 @Nexarian Any thoughts?

metalefty commented 7 months ago

I found a benchmark comparing CRC algorithms.

https://github.com/htot/crc32c?tab=readme-ov-file#benchmarks

Nexarian commented 7 months ago

Yup, this has been a known issue with RFX for a while. The CRC hashing algorithm is slow. One way to address this has been https://github.com/neutrinolabs/xorgxrdp/pull/167 (using other hashing algorithms), which I incorporated into mainline_merge but left out because it wasn't fully tested and merged originally.

We need to investigate other hashing codecs that will be better, and I think there is also room to use SIMD optimized hashes for even more performance.

Nexarian commented 7 months ago

If you'd like I can get the relevant bits from https://github.com/neutrinolabs/xorgxrdp/pull/167 rebased again off of the latest and we can see if it helps?

jsorg71 commented 7 months ago

When I run that page with Xorg glamor, it's using almost no CPU, hehe. I don't really see why GFX should be more than RFX. I'll test that. They should both be using the slow CRC32. We can switch to something else like a hash. Don't worry about the EGL / glamor CRC stuff. That will all get moved to the helper later.

jsorg71 commented 7 months ago

I got working SSE2 accel for librfxcodec progressive. I still have to accel the RGB to YUV in Xorg. I think the CRC change to hash and SSE improvements should really help.

metalefty commented 7 months ago

Thank you both. I remembered #167 now.

I thought CRC was required by the protocol specification, but it sounds like it isn't. We can try any fast hash if nothing binds us to CRC. I think we should use a library that has built-in SIMD acceleration. I'll try some hash algorithms, including wyhash.
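To illustrate why the choice of hash is free when the value never leaves the server: if the hash is only compared against the previous hash of the same tile, any function with the same shape can be dropped in. This is a minimal sketch under that assumption. The tile_state struct and tile_changed function are hypothetical names, and FNV-1a is only a small stand-in for wyhash or xxhash, not a recommendation.

```c
#include <stddef.h>
#include <stdint.h>

/* Any function with this shape can act as the tile hash. */
typedef uint64_t (*tile_hash_fn)(const uint8_t *data, size_t len);

/* FNV-1a 64-bit, a stand-in for a faster hash such as wyhash. */
static uint64_t fnv1a64(const uint8_t *data, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Hypothetical per-tile state: hash of the content last sent. */
struct tile_state {
    uint64_t last_hash;
    int valid;
};

/* Returns 1 if the tile must be sent, 0 if it looks unchanged. */
static int tile_changed(struct tile_state *st, tile_hash_fn hash,
                        const uint8_t *pixels, size_t len)
{
    uint64_t h = hash(pixels, len);
    if (st->valid && st->last_hash == h)
        return 0;
    st->last_hash = h;
    st->valid = 1;
    return 1;
}
```

Swapping the algorithm then means passing a different tile_hash_fn; nothing else in the comparison logic changes.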

metalefty commented 7 months ago

> When I run that page with Xorg glamor, it's using almost no CPU, hehe.

I see CRC checking is skipped using glamor. https://github.com/neutrinolabs/xorgxrdp/blob/af43a8b36ddc8a0aa1a7d0c4eaa499948aac0294/module/rdpEgl.c#L608-L621

I have raised #301. Wyhash reduces CPU usage by 1/2 to 1/3.

metalefty commented 7 months ago

> I don't really see why GFX should be more than RFX. I'll test that.

@jsorg71 It turned out this is not a regression.

This issue was reported by an enterprise user. They reported it as a regression between v0.9 and v0.10, but they are using xorgxrdp with the wyhash patch applied in their internal build. The difference in CPU usage is almost entirely due to wyhash vs. CRC.

jsorg71 commented 7 months ago

The CRC is still done in rdpEGL.c, it's just done on the GPU. One reason I didn't like the wyhash algorithm is that it produces a 64-bit hash. The largest value you can return from the GPU is 32 bits (one pixel). I was looking at adler32 but that was getting too many false hits. Then I kinda ran out of steam on the topic. But it's ok, we can do one thing with the CPU and another with the GPU. Speed is most important here.
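One plausible contributor to those false hits, offered as an assumption and not something measured in this thread: Adler-32 is built from two running sums, so small compensating byte changes can cancel out. The sketch below is a plain reference implementation (the name adler32_bytes is made up) that makes such a collision easy to reproduce.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference Adler-32: a is 1 plus the byte sum, b is the sum of all
 * intermediate a values, both modulo 65521. */
static uint32_t adler32_bytes(const uint8_t *data, size_t len)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/* Two different 3-byte inputs with the same checksum:
 *   {1, 0, 1} and {0, 2, 0} both give 0x00070003,
 * because the byte deltas (+1, -2, +1) cancel in both sums. */
```

Neighbouring pixel values that shift by small opposite amounts fit this pattern, which would make a changed tile look unchanged.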

metalefty commented 7 months ago

We can also try xxhash. It is also a fast hash algorithm, and there is an XXH32 variant that returns a 32-bit hash.

metalefty commented 7 months ago

Quick test: XXH32 is slower than XXH64 and XXH128. I ran xxhsum on a 10GB file.

       -HHASHTYPE
              Hash selection. HASHTYPE means 0=XXH32, 1=XXH64, 2=XXH128,
              3=XXH3. Note that -H3 triggers --tag, which can't be skipped
              (this is to reduce risks of confusion with -H2 (XXH64)).
              Alternatively, HASHTYPE 32=XXH32, 64=XXH64, 128=XXH128. Default
              value is 1 (XXH64)

% ls -l CentOS-Stream-ec2-9-20230605.0.x86_64.raw
-rw-r--r--  1 root wheel 10737418240  6月  6  2023 CentOS-Stream-ec2-9-20230605.0.x86_64.raw

% for i in 0 1 2 3; do time xxhsum -H${i} CentOS-Stream-ec2-9-20230605.0.x86_64.raw; done
325ef4d5  CentOS-Stream-ec2-9-20230605.0.x86_64.raw
xxhsum -H0 CentOS-Stream-ec2-9-20230605.0.x86_64.raw  1.50s user 0.89s system 99% cpu 2.388 total
f559c7d9b8a9d78a  CentOS-Stream-ec2-9-20230605.0.x86_64.raw
xxhsum -H1 CentOS-Stream-ec2-9-20230605.0.x86_64.raw  0.78s user 0.82s system 99% cpu 1.604 total
01998f10a7e89bae4bc57d94b32e0867  CentOS-Stream-ec2-9-20230605.0.x86_64.raw
xxhsum -H2 CentOS-Stream-ec2-9-20230605.0.x86_64.raw  0.54s user 0.79s system 99% cpu 1.334 total
XXH3 (CentOS-Stream-ec2-9-20230605.0.x86_64.raw) = 4bc57d94b32e0867
xxhsum -H3 CentOS-Stream-ec2-9-20230605.0.x86_64.raw  0.52s user 0.86s system 99% cpu 1.379 total

My processor is Ryzen 7 5700G.

CPU: AMD Ryzen 7 5700G with Radeon Graphics          (3793.00-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0xa50f00  Family=0x19  Model=0x50  Stepping=0
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
  AMD Features2=0x75c237ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX,ADMSKX>
  Structured Extended Features=0x219c97a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,PQM,PQE,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,SHA>
  Structured Extended Features2=0x40069c<UMIP,PKU,OSPKE,VAES,VPCLMULQDQ,RDPID>
  Structured Extended Features3=0x10<FSRM>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  AMD Extended Feature Extensions ID EBX=0x191ef657<CLZERO,IRPerf,XSaveErPtr,RDPRU,WBNOINVD,IBPB,IBRS,STIBP,STIBP_ALWAYSON,PREFER_IBRS,SSBD>
  SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=32768
  TSC: P-state invariant, performance statistics
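To put the xxhsum timings above on a common scale, the wall-clock totals can be turned into throughput. This is only arithmetic on the numbers already shown; the helper name throughput_gib_per_s is made up, and the totals include time spent reading the file, so they understate the raw hash speed.

```c
#include <stdint.h>

/* Bytes processed per second of wall-clock time, expressed in GiB/s. */
static double throughput_gib_per_s(uint64_t bytes, double seconds)
{
    return (double)bytes / (1024.0 * 1024.0 * 1024.0) / seconds;
}
```

With the 10737418240-byte file and the totals 2.388 s, 1.604 s, 1.334 s and 1.379 s, this gives roughly 4.2, 6.2, 7.5 and 7.3 GiB/s for -H0 through -H3.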

metalefty commented 7 months ago

I did a benchmark to find out which hash algorithm is fastest. wyhash with lazy color convert is the winner. However, wyhash and xxhash64 are very close. The result may be different if I implement lazy color convert for xxhash64.

The result is:

wyhash w/lazy > wyhash ≈ xxhash64 > xxhash32

Environment

Item	Detail
CPU	Ryzen 7 5700G
OS	FreeBSD 14-STABLE
xrdp	0.10.0-beta.2
xorgxrdp	0.10.0 + wyhash/wyhash+lazy color convert/xxhash32/xxhash64 patch
Screen geometry	1920x1200
Browser	Iridium
Desktop Environment	Xfce4

Test method

Testing #301, #302 and vanilla v0.10.0.

Run cmatrix in a maximized terminal.
Open the following webpage in browser.
- https://people.freebsd.org/~meta/tmp/whitebox.html

See CPU usage with top.

The screen that I'm testing looks like this:

Result

Hash type	Content	Xorg CPU usage
CRC32	cmatrix	51-62%
CRC32	random rectangles	63-66%
CRC32	random rectangles (white on white)	62-64%
wyhash	cmatrix	27-29%
wyhash	random rectangles	34-36%
wyhash	random rectangles (white on white)	34-35%
wyhash w/lazy	cmatrix	22-24%
wyhash w/lazy	random rectangles	27-29%
wyhash w/lazy	random rectangles (white on white)	26-27%
xxhash32	cmatrix	30-32%
xxhash32	random rectangles	35-36%
xxhash32	random rectangles (white on white)	34-35%
xxhash64	cmatrix	28-30%
xxhash64	random rectangles	34-35%
xxhash64	random rectangles (white on white)	33-34%
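The thread does not spell out what "lazy color convert" does. One reading, stated here purely as an assumption, is that the source pixels are hashed first and the color conversion runs only when the hash shows the tile changed. The sketch below illustrates that ordering. All names (lazy_tile, lazy_update, swap_red_blue) are hypothetical, FNV-1a stands in for the real hash, and the red/blue swap stands in for the real conversion.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*convert_fn)(const uint8_t *src, uint8_t *dst, size_t len);

/* FNV-1a 64-bit, a stand-in for wyhash or xxhash64. */
static uint64_t fnv1a64(const uint8_t *data, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Placeholder conversion: swap bytes 0 and 2 of each 4-byte pixel. */
static void swap_red_blue(const uint8_t *src, uint8_t *dst, size_t len)
{
    for (size_t i = 0; i + 3 < len; i += 4) {
        dst[i] = src[i + 2];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i];
        dst[i + 3] = src[i + 3];
    }
}

struct lazy_tile {
    uint64_t last_hash;
    int valid;
};

/* Hash the unconverted source; convert only if the tile changed.
 * Returns 1 if dst was refreshed, 0 if the conversion was skipped. */
static int lazy_update(struct lazy_tile *t, const uint8_t *src,
                       uint8_t *dst, size_t len, convert_fn convert)
{
    uint64_t h = fnv1a64(src, len);
    if (t->valid && t->last_hash == h)
        return 0;
    convert(src, dst, len);
    t->last_hash = h;
    t->valid = 1;
    return 1;
}
```

Under this reading, unchanged tiles cost one hash pass and no conversion, which would be consistent with the lower CPU figures in the "w/lazy" rows.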

metalefty commented 6 months ago

Here's an additional result for xxhash64 w/lazy.

Hash type	Content	Xorg CPU usage
xxhash64 w/lazy	cmatrix	23-25%
xxhash64 w/lazy	random rectangles	29-31%
xxhash64 w/lazy	random rectangles (white on white)	28-29%

metalefty commented 6 months ago

wyhash

pros

- a little bit faster than xxhash64 in our use case
- has been tested for years by an enterprise user
- public domain

cons

- the original author retired from development

xxhash64

pros

- integrated CPU acceleration
- might be faster than wyhash on some arch?
- development is still active

cons

- requires external dependency

metalefty commented 6 months ago

Wyhash has been merged (#301).
