solus-project / linux-steam-integration

Helper for enabling better Steam integration on Linux
GNU Lesser General Public License v2.1

LD_AUDIT sacrifices some performance #15

Closed amonakov closed 7 years ago

amonakov commented 7 years ago

There's an issue in Glibc that, when LD_AUDIT is non-empty, causes all calls via PLT (i.e. normally all calls to functions implemented in shared libraries) to go via a hook that saves some registers (including vector registers that may hold passed arguments) on the stack, calls into the dynamic linker to invoke la_pltenter hooks (even if none are registered), restores the registers, invokes the original destination function, invokes la_pltexit hooks, and finally returns to the caller. Obviously this is slow and unnecessary if all the audit module wants is to redirect some libraries. The Glibc bug report is here: https://sourceware.org/bugzilla/show_bug.cgi?id=15533 (I hit this issue back then when playing with an idea similar to yours).

It appears you're setting LD_AUDIT for all child processes including games, so that slows games to some degree. If not, can you add a clarifying comment somewhere?

(edited: grammar and clarity)

ikeydoherty commented 7 years ago

That's a fairly old bug report @amonakov but thank you for bringing it up. I'm happy to do some benchmarking if you have a copy of the original test program, so we can test it against liblsi-intercept to see if this still holds. (I haven't seen any noticeable slowdowns here tbf.)

If there is still a slowdown, then we'll just patch glibc and document that, as the module only implements la_objsearch - any slow-down would be very much a bug in the libc implementation and not LSI.
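
For readers unfamiliar with the rtld-audit interface, a module that only redirects libraries can be as small as the sketch below. This is not the liblsi-intercept source: the library names in `redirect_for` are hypothetical placeholders, and only `la_version` and `la_objsearch` are real rtld-audit entry points. It defines no `la_pltenter` or `la_pltexit` hooks, which is why any per-call PLT overhead would come from the libc side.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes the la_* declarations in <link.h> */
#endif
#include <link.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical redirect rule, for illustration only: these names are
 * placeholders and are not taken from liblsi-intercept. */
static const char *redirect_for(const char *name)
{
    if (strcmp(name, "libexample-bundled.so.1") == 0)
        return "/usr/lib64/libexample-system.so.1";
    return NULL;
}

/* Mandatory handshake: returning the offered version accepts it. */
unsigned int la_version(unsigned int version)
{
    return version;
}

/* Called while the dynamic linker searches for a library (possibly more
 * than once per library); the returned path is the one that is tried. */
char *la_objsearch(const char *name, uintptr_t *cookie, unsigned int flag)
{
    (void)cookie;
    (void)flag;
    const char *replacement = redirect_for(name);
    return (char *)(replacement ? replacement : name);
}
```

Assuming the file is saved as `audit.c` (a made-up name), it could be built with `cc -shared -fPIC -o libaudit-sketch.so audit.c` and activated with `LD_AUDIT=./libaudit-sketch.so ./main`.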

amonakov commented 7 years ago

The original test program is attached to the aforementioned bug report; here's a direct link to the attachment: https://sourceware.org/bugzilla/attachment.cgi?id=7044

Note that if your toolchain enables hardening by default (-z relro -z now) you won't see the slowdown because the test program won't use PLT (but games aren't usually compiled like that).

The reason I've brought it up is exactly because this Glibc bug remains unfixed.

If you prefer to patch Glibc on your end, what would be your recommendation to people packaging this on other distros?

ikeydoherty commented 7 years ago

I'm definitely seeing a minor regression with your test case here:

time ./main
0.14user 0.00system 0:00.14elapsed 100%CPU (0avgtext+0avgdata 5904maxresident)k
0inputs+0outputs (0major+69minor)pagefaults 0swaps
time env LD_AUDIT=./libaudit.so ./main
1.15user 0.00system 0:01.16elapsed 99%CPU (0avgtext+0avgdata 9104maxresident)k
0inputs+0outputs (0major+158minor)pagefaults 0swaps

However, when I build your libaudit with the distro CFLAGS:

cc    -c -o main.o main.c
cc  -o main main.o -lm
cc    -c -o libaudit.o libaudit.c
cc "-g2 -O3 -pipe -fPIC -Wformat -Wformat-security -fno-omit-frame-pointer -fexceptions -D_FORTIFY_SOURCE=2 -fstack-protector --param ssp-buffer-size=32 -fasynchronous-unwind-tables -ftree-vectorize -feliminate-unused-debug-types -Wall -Wno-error -Wp,-D_REENTRANT" -shared -o libaudit.so libaudit.o
time ./main
0.03user 0.00system 0:00.03elapsed 100%CPU (0avgtext+0avgdata 5728maxresident)k
0inputs+0outputs (0major+67minor)pagefaults 0swaps
time env LD_AUDIT=./libaudit.so ./main
0.03user 0.00system 0:00.04elapsed 95%CPU (0avgtext+0avgdata 9296maxresident)k
0inputs+0outputs (0major+159minor)pagefaults 0swaps

Note that only libaudit is built with these CFLAGS, not the binary (which stands in for a proprietary game). Also note that changing -O3 to the normalised package -O2 makes zero difference.

If I reintroduce your -fno-builtin-sqrt flag, the regression is back:

CFLAGS='-fno-builtin-sqrt' make
cc -fno-builtin-sqrt   -c -o main.o main.c
cc -fno-builtin-sqrt -o main main.o -lm
cc -fno-builtin-sqrt   -c -o libaudit.o libaudit.c
cc "-g2 -O2 -pipe -fPIC -Wformat -Wformat-security -fno-omit-frame-pointer -fexceptions -D_FORTIFY_SOURCE=2 -fstack-protector --param ssp-buffer-size=32 -fasynchronous-unwind-tables -ftree-vectorize -feliminate-unused-debug-types -Wall -Wno-error -Wp,-D_REENTRANT" -shared -o libaudit.so libaudit.o
time ./main
0.12user 0.00system 0:00.12elapsed 100%CPU (0avgtext+0avgdata 5904maxresident)k
0inputs+0outputs (0major+69minor)pagefaults 0swaps
time env LD_AUDIT=./libaudit.so ./main
1.09user 0.00system 0:01.09elapsed 99%CPU (0avgtext+0avgdata 8736maxresident)k
0inputs+0outputs (0major+154minor)pagefaults 0swaps

So I assume this is more about symbol resolution time, and I hacked the demo to make some gtk_ calls:

+ ./main

real    0m0.128s
user    0m0.113s
sys     0m0.010s
+ env LD_AUDIT=./libaudit.so ./main

real    0m0.150s
user    0m0.135s
sys     0m0.011s

Even building everything with hardening didn't make a significant difference afterwards.

Finally, after installing your patch, I repeated the tests with a hardened toolchain (which Solus uses by default) and full relro on both the main binary and the audit lib, then replaced the audit lib with the LSI lib:

+ ./main

real    0m0.127s
user    0m0.112s
sys 0m0.011s
+ env LD_AUDIT=/usr/lib64/liblsi-intercept.so ./main

real    0m0.128s
user    0m0.113s
sys 0m0.010s

Basically, we need the rtld-audit interface, and we also need your patch. Given that LSI is aimed at distribution integrators, my hope is that they also integrate your patch (we can add this to Solus without issue). It seems your original patch thread died out; perhaps now is the time to upstream it so that all distributions benefit from it?

Distributions like Ubuntu are more willing to import an out-of-series patch to fix a bug when it has already landed in the VCS of the upstream project. :)

ikeydoherty commented 7 years ago

Oh, and as a final metric, using your installed patches and your original test:

time ./main
0.13user 0.00system 0:00.13elapsed 99%CPU (0avgtext+0avgdata 5936maxresident)k
0inputs+0outputs (0major+70minor)pagefaults 0swaps
time env LD_AUDIT=./libaudit.so ./main
0.12user 0.00system 0:00.12elapsed 99%CPU (0avgtext+0avgdata 9472maxresident)k
0inputs+0outputs (0major+163minor)pagefaults 0swaps

ikeydoherty commented 7 years ago

glibc patch import into Solus: https://dev.solus-project.com/R927:afa5b639e8a9b62618457a304d1e6fb42a9f2066

ikeydoherty commented 7 years ago

Thinking further on this, and correct me if I'm wrong, but the performance regression should only come from initial symbol resolution, thus affecting startup time and module load time, right? During the initial mapping.

Anyway, this further illustrates the need for a self-contained LSI bundle that is free from distro issues.

amonakov commented 7 years ago

> Thinking further on this, and correct me if I'm wrong, but the performance regression should only come from initial symbol resolution, thus affecting startup time and module load time, right? During the initial mapping.

No, of course not, please read main.c in the testcase (note that it deliberately calls the same function in a loop many times to highlight the issue) and the initial report. The issue is that every runtime call that goes via PLT gets slower, not just initial calls!

If only initial calls get slower, that's not a major issue for games in the first place.
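
The kind of loop described above can be sketched as follows. This is not the `main.c` attached to the bug report: it is a minimal stand-in that assumes `strlen` from libc in place of the original `sqrt`, and the function name `ns_per_call` is made up for illustration. Whether the call really goes through a lazily bound PLT slot depends on how the binary is linked (with `-z now` it should not show the slowdown, as noted earlier in the thread).

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Average CPU time, in nanoseconds, of one strlen() call, measured over
 * `iterations` back-to-back calls to the same shared-library function. */
double ns_per_call(long iterations)
{
    static const char text[] = "x";
    const char *volatile arg = text;  /* volatile: the call cannot be hoisted out of the loop */
    volatile size_t sink = 0;         /* volatile: the result cannot be discarded */

    if (iterations <= 0)
        return 0.0;

    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        sink += strlen(arg);
    clock_t end = clock();

    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)iterations;
}
```

Calling this from a small `main` that prints the result, once normally and once with `LD_AUDIT` set, would show whether every call gets slower or only the first one.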

ikeydoherty commented 7 years ago

Ah well that's not good at all. Just read properly through _dl_relocate_object, apologies, not awake that long. :)

OK so I'm going to document this issue within the README, just so integrators know the story. Obviously it would be fantastic if upstream accepts your patch (thank you for that!). FWIW LSI does allow you to turn off the intercept module, which may actually come in useful for those wanting to do benchmarks with and without the patch inside the games themselves.

FWIW I'm aware of the pressure on distributions when faced with integrating Steam, and it is becoming a heavy burden for them. This is why I'm looking to third party application systems with the view of building a specialised (ABI compatible) runtime containing a strict-mode LSI (and your glibc patch ofc!) that would effectively be a Solus-based runtime to provide the same Steam experience everywhere, even on distributions not supporting multilib.

In these third party systems we can ensure only our own libraries are used, and there is no more cross contamination, and distributions wouldn't have to worry about these issues anymore. :)

ikeydoherty commented 7 years ago

^ I've documented this in the README - if you feel it needs more clarification or details, please let me know :)

amonakov commented 7 years ago

Users with older AVX-capable CPUs, especially the famous SandyBridge generation (i5-2500 and such), should especially beware, since the penalty due to this issue is highest there. My test indicates roughly 420 extra cycles per call (this is very high!); of those, I believe 140 are two 70-cycle AVX transition penalties. I didn't try to accurately analyze the rest.
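
To put the 420-cycle figure in perspective, here is a back-of-the-envelope calculation. The clock rate (3.3 GHz, the nominal i5-2500 frequency) and the number of PLT calls per frame (10,000) are assumptions picked for illustration, not measurements from this thread.

```c
/* Seconds lost to the extra per-call work, given a call count, the extra
 * cycles per call, and the CPU clock rate in Hz. */
double plt_overhead_seconds(double calls, double extra_cycles_per_call, double clock_hz)
{
    return calls * extra_cycles_per_call / clock_hz;
}

/* Share of a frame budget (e.g. 1/60 s) consumed by that overhead. */
double frame_budget_fraction(double overhead_seconds, double frame_seconds)
{
    return overhead_seconds / frame_seconds;
}
```

Under those assumptions, 10,000 calls cost 4.2 million cycles, or about 1.27 ms at 3.3 GHz, which is roughly 7.6% of a 16.7 ms frame at 60 FPS.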

ikeydoherty commented 7 years ago

Damn - very common CPU too.

ikeydoherty commented 7 years ago

Gonna close this now as the issue is documented, Solus is patched, and we're gonna provide a Snap with a patched glibc.