Closed metalefty closed 6 months ago
I found a benchmark comparing CRC algorithms.
Yup, this has been a known issue with RFX for a while. The CRC hashing algorithm is slow. One way to address this has been https://github.com/neutrinolabs/xorgxrdp/pull/167 (using other hashing algorithms), which I incorporated into `mainline_merge` but left out because it wasn't fully tested and merged originally.
We need to investigate other hashing algorithms that will be better, and I think there is also room to use SIMD-optimized hashes for even more performance.
If you'd like I can get the relevant bits from https://github.com/neutrinolabs/xorgxrdp/pull/167 rebased again off of the latest and we can see if it helps?
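For context on why a CRC tends to be slow on the CPU, here is a minimal sketch of a table-driven CRC-32 (the common reflected polynomial 0xEDB88320). This is not taken from xorgxrdp; the function names are hypothetical and the actual `crc_process_data` code may differ. The point is only that each byte needs a table lookup that depends on the previous result, so the loop is hard to run in parallel.

```c
#include <stddef.h>
#include <stdint.h>

/* Lookup table for the reflected CRC-32 polynomial 0xEDB88320. */
static uint32_t crc_table[256];
static int crc_table_ready = 0;

/* Build the 256-entry table once (hypothetical helper, not xorgxrdp code). */
static void crc32_init_table(void)
{
    for (uint32_t i = 0; i < 256; i++)
    {
        uint32_t c = i;
        for (int bit = 0; bit < 8; bit++)
        {
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        }
        crc_table[i] = c;
    }
    crc_table_ready = 1;
}

/* Byte-at-a-time CRC-32: one dependent table lookup per input byte. */
uint32_t crc32_bytewise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    if (!crc_table_ready)
    {
        crc32_init_table();
    }
    for (size_t i = 0; i < len; i++)
    {
        /* The next value depends on the current crc, so bytes are
           processed strictly one after another. */
        crc = crc_table[(crc ^ data[i]) & 0xFFu] ^ (crc >> 8);
    }
    return crc ^ 0xFFFFFFFFu;
}
```

Hashes such as wyhash or XXH64 consume 8 or more bytes per step with multiplies instead of lookups, which is probably where most of the difference comes from.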
When I run that page with Xorg glamor, it's using almost no CPU, hehe. I don't really see why GFX should be more than RFX. I'll test that. They should both be using the slow CRC32. We can switch to something else like a hash. Don't worry about the EGL / glamor CRC stuff. That will all get moved to the helper later.
I got working SSE2 accel for librfxcodec progressive. I still have to accel the RGB to YUV in Xorg. I think the CRC change to hash and SSE improvements should really help.
Thank you both. I remembered #167 now.
I thought CRC was required by the protocol specification, but it sounds like it isn't. We can try any fast hash if we are not bound to CRC. I think we should use libraries which have built-in SIMD acceleration. I'll try some hash algorithms including wyhash.
> When I run that page with Xorg glamor, it's using almost no CPU, hehe.
I see the CRC check is skipped when using glamor. https://github.com/neutrinolabs/xorgxrdp/blob/af43a8b36ddc8a0aa1a7d0c4eaa499948aac0294/module/rdpEgl.c#L608-L621
I have raised #301. Wyhash reduces CPU usage by 1/2 to 1/3.
> I don't really see why GFX should be more than RFX. I'll test that.
@jsorg71 It turned out this is not a regression.
This issue was reported by an enterprise user. They reported it as a regression between v0.9 and v0.10; however, they're using xorgxrdp with the wyhash patch applied in their internal build. The difference in CPU usage is almost entirely due to wyhash vs. CRC.
The CRC is still done in rdpEgl.c, it's just done on the GPU. One reason I didn't like the wyhash algorithm is that it produces a 64-bit hash. The largest you can return with the GPU is 32 bits (one pixel). I was looking at adler32 but that was getting too many false hits. Then I kinda ran out of steam on the topic. But it's ok, we can do one thing with the CPU and another with the GPU. Speed is most important here.
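To make the two points above concrete, here is a small sketch with hypothetical function names, not code from xorgxrdp or rdpEgl.c. The first function is a plain Adler-32. The second shows one common way to squeeze a 64-bit hash into 32 bits by XOR-folding the halves, which is an assumption about how a 64-bit hash could be made to fit a one-pixel GPU result, not something anyone here has tested.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain Adler-32: two running sums modulo 65521. */
uint32_t adler32_simple(const uint8_t *data, size_t len)
{
    uint32_t a = 1;
    uint32_t b = 0;

    for (size_t i = 0; i < len; i++)
    {
        a = (a + data[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/* Fold a 64-bit hash into 32 bits by XOR-ing the high and low halves.
   This halves the hash width, so collisions become more likely than
   with the full 64-bit value. */
uint32_t fold64_to_32(uint64_t h)
{
    return (uint32_t)(h ^ (h >> 32));
}
```

Because Adler-32 is only two sums, small compensating changes cancel out: the byte sequences `{0, 2, 0}` and `{1, 0, 1}` give the same checksum. Neighbouring pixel values shifting by +1, -2, +1 is the kind of change that could plausibly show up in screen content, which may be one source of the false hits.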
We can also try xxhash. It is also a fast hash algorithm, and there is an XXH32 variant that returns a 32-bit hash.
Quick test: XXH32 is slower than XXH64 and XXH128. I ran `xxhsum` on a 10 GB file. The relevant option from the `xxhsum` manual:

    -HHASHTYPE
        Hash selection. HASHTYPE means 0=XXH32, 1=XXH64, 2=XXH128,
        3=XXH3. Note that -H3 triggers --tag, which can't be skipped
        (this is to reduce risks of confusion with -H2 (XXH64)).
        Alternatively, HASHTYPE 32=XXH32, 64=XXH64, 128=XXH128. Default
        value is 1 (XXH64)

Results:

    % ls -l CentOS-Stream-ec2-9-20230605.0.x86_64.raw
    -rw-r--r-- 1 root wheel 10737418240 6月 6 2023 CentOS-Stream-ec2-9-20230605.0.x86_64.raw
    % for i in 0 1 2 3; do time xxhsum -H${i} CentOS-Stream-ec2-9-20230605.0.x86_64.raw; done
    325ef4d5 CentOS-Stream-ec2-9-20230605.0.x86_64.raw
    xxhsum -H0 CentOS-Stream-ec2-9-20230605.0.x86_64.raw 1.50s user 0.89s system 99% cpu 2.388 total
    f559c7d9b8a9d78a CentOS-Stream-ec2-9-20230605.0.x86_64.raw
    xxhsum -H1 CentOS-Stream-ec2-9-20230605.0.x86_64.raw 0.78s user 0.82s system 99% cpu 1.604 total
    01998f10a7e89bae4bc57d94b32e0867 CentOS-Stream-ec2-9-20230605.0.x86_64.raw
    xxhsum -H2 CentOS-Stream-ec2-9-20230605.0.x86_64.raw 0.54s user 0.79s system 99% cpu 1.334 total
    XXH3 (CentOS-Stream-ec2-9-20230605.0.x86_64.raw) = 4bc57d94b32e0867
    xxhsum -H3 CentOS-Stream-ec2-9-20230605.0.x86_64.raw 0.52s user 0.86s system 99% cpu 1.379 total
My processor is a Ryzen 7 5700G.

    CPU: AMD Ryzen 7 5700G with Radeon Graphics (3793.00-MHz K8-class CPU)
    Origin="AuthenticAMD" Id=0xa50f00 Family=0x19 Model=0x50 Stepping=0
    Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
    Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
    AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
    AMD Features2=0x75c237ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX,ADMSKX>
    Structured Extended Features=0x219c97a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,PQM,PQE,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,SHA>
    Structured Extended Features2=0x40069c<UMIP,PKU,OSPKE,VAES,VPCLMULQDQ,RDPID>
    Structured Extended Features3=0x10<FSRM>
    XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
    AMD Extended Feature Extensions ID EBX=0x191ef657<CLZERO,IRPerf,XSaveErPtr,RDPRU,WBNOINVD,IBPB,IBRS,STIBP,STIBP_ALWAYSON,PREFER_IBRS,SSBD>
    SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=32768
    TSC: P-state invariant, performance statistics
I did a benchmark to find out which hash algorithm is fastest. wyhash with lazy color convert is the winner. However, wyhash and xxhash64 are very close. The result may be different if I implement lazy color convert for xxhash64.
The result is:
wyhash w/lazy > wyhash ≈ xxhash64 > xxhash32
Item | Detail |
---|---|
CPU | Ryzen 7 5700G |
OS | FreeBSD 14-STABLE |
xrdp | 0.10.0-beta.2 |
xorgxrdp | 0.10.0 + wyhash / wyhash + lazy color convert / xxhash32 / xxhash64 patch |
Screen geometry | 1920x1200 |
Browser | Iridium |
Desktop Environment | Xfce4 |
Testing #301, #302 and vanilla v0.10.0.
Run `cmatrix` in a maximized terminal and watch the CPU usage with `top`.
The screen that I'm testing looks like this:
Hash type | Content | Xorg CPU usage |
---|---|---|
CRC32 | cmatrix | 51-62% |
CRC32 | random rectangles | 63-66% |
CRC32 | random rectangles (white on white) | 62-64% |
wyhash | cmatrix | 27-29% |
wyhash | random rectangles | 34-36% |
wyhash | random rectangles (white on white) | 34-35% |
wyhash w/lazy | cmatrix | 22-24% |
wyhash w/lazy | random rectangles | 27-29% |
wyhash w/lazy | random rectangles (white on white) | 26-27% |
xxhash32 | cmatrix | 30-32% |
xxhash32 | random rectangles | 35-36% |
xxhash32 | random rectangles (white on white) | 34-35% |
xxhash64 | cmatrix | 28-30% |
xxhash64 | random rectangles | 34-35% |
xxhash64 | random rectangles (white on white) | 33-34% |
Here's an additional result for xxhash64 w/lazy.
Hash type | Content | Xorg CPU usage |
---|---|---|
xxhash64 w/lazy | cmatrix | 23-25% |
xxhash64 w/lazy | random rectangles | 29-31% |
xxhash64 w/lazy | random rectangles (white on white) | 28-29% |
Wyhash has been merged (#301).
GFX (v0.10) uses significantly more CPU than v0.9.
Sample web page to reproduce:
The page draws random rectangles to a canvas, then does the same after clicking "Press to toggle" but draws white on white, so the screen updates are invisible. The hotspot is `crc_process_data`. I'm still investigating why GFX does more CRC calculation and whether we can reduce the CRC calculation or accelerate it using a faster algorithm or library such as crc32c.
@jsorg71 @Nexarian Any thoughts?
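As an illustration of why invisible updates still cost CPU, here is a rough sketch of per-tile change detection by hashing. It is not xorgxrdp's code: the 64x64 tile size, the 4 bytes per pixel, the function names, and the use of FNV-1a as a stand-in hash are all assumptions made for the example. The tile has to be hashed in full before we can tell that nothing changed, so a white-on-white redraw pays the whole hashing cost even though nothing is sent.

```c
#include <stddef.h>
#include <stdint.h>

#define TILE_SIZE 64        /* assumed tile edge, in pixels */
#define BYTES_PER_PIXEL 4   /* assumed 32-bit pixels */

/* Hash one tile of a framebuffer with 64-bit FNV-1a (stand-in hash). */
static uint64_t hash_tile(const uint8_t *fb, size_t stride, int tx, int ty)
{
    uint64_t h = 0xcbf29ce484222325ull;   /* FNV-1a offset basis */

    for (int row = 0; row < TILE_SIZE; row++)
    {
        const uint8_t *p = fb
            + ((size_t)ty * TILE_SIZE + (size_t)row) * stride
            + (size_t)tx * TILE_SIZE * BYTES_PER_PIXEL;

        for (int i = 0; i < TILE_SIZE * BYTES_PER_PIXEL; i++)
        {
            h ^= p[i];
            h *= 0x100000001b3ull;        /* FNV-1a prime */
        }
    }
    return h;
}

/* Returns 1 and updates *prev_hash if the tile differs from the last
   hashed state, 0 if the hash is unchanged and the tile can be skipped. */
int tile_changed(const uint8_t *fb, size_t stride, int tx, int ty,
                 uint64_t *prev_hash)
{
    uint64_t h = hash_tile(fb, stride, tx, ty);

    if (h == *prev_hash)
    {
        return 0;
    }
    *prev_hash = h;
    return 1;
}
```

With a scheme like this, the cost per damaged tile is dominated by the hash itself, so swapping in a faster hash should reduce Xorg CPU usage roughly in proportion.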