x86-64 host optimizations

tkchia commented 1 year ago

On my local PC, this patch gives a small (1.4% to 1.8%) but consistent improvement in the running time of the third_party/cosmo/2/test_suite_mpi.com test program. E.g.

PASSED (323 / 323 tests (1 skipped))
RL: took 20,104,433µs wall time
RL: ballooned to 5,740kb in size
RL: needed 20,096,321µs cpu (0% kernel)
RL: caused 887 page faults (99% memcpy)
RL: 885 context switches (0% consensual)
RL: performed 0 reads and 8 write i/o operations

vs. (without the patch)

PASSED (323 / 323 tests (1 skipped))
RL: took 20,407,807µs wall time
RL: ballooned to 5,640kb in size
RL: needed 20,399,361µs cpu (0% kernel)
RL: caused 889 page faults (99% memcpy)
RL: 661 context switches (0% consensual)
RL: performed 0 reads and 8 write i/o operations

tkchia commented 1 year ago

With https://github.com/jart/blink/pull/137/commits/98079d733a0f76d5e9b2fb545df711a8512811f3 , a further whopping 7.4% improvement:

PASSED (323 / 323 tests (1 skipped))
RL: took 18,715,678µs wall time
RL: ballooned to 5,784kb in size
RL: needed 18,708,731µs cpu (0% kernel)
RL: caused 841 page faults (99% memcpy)
RL: 577 context switches (1% consensual)
RL: performed 0 reads and 8 write i/o operations

tkchia commented 1 year ago

Simply disabling the use of Intel control-flow protection instructions (https://github.com/jart/blink/pull/137/commits/309735b8fdb162908efe0a78fc592030a676d788) instead gives an improvement of about 6.9%. E.g.

PASSED (323 / 323 tests (1 skipped))
RL: took 18,778,782µs wall time
RL: ballooned to 5,696kb in size
RL: needed 18,699,182µs cpu (0% kernel)
RL: caused 843 page faults (98% memcpy)
RL: 707 context switches (1% consensual)
RL: performed 1,176 reads and 8 write i/o operations

The Intel CET instructions (e.g. endbr64) promise better protection against code-execution attacks, but the protection only takes effect if both the OS and the CPU support these instructions. It seems Linux kernel support for Intel CET is not quite in the mainline yet.

tkchia commented 1 year ago

A further slight improvement of about 0.8% in the running time (https://github.com/jart/blink/commit/fbef1348c2493ca5fd24e4efb7a29187aa9e8e19):

PASSED (323 / 323 tests (1 skipped))
RL: took 18,597,261µs wall time
RL: ballooned to 5,592kb in size
RL: needed 18,589,714µs cpu (0% kernel)
RL: caused 837 page faults (99% memcpy)
RL: 616 context switches (0% consensual)
RL: performed 0 reads and 8 write i/o operations

I also experimented with using lahf instead of pushfq + pop reg64, but this does not appear to help much.

tkchia commented 1 year ago

PASSED (323 / 323 tests (1 skipped))
RL: took 18,403,306µs wall time
RL: ballooned to 5,680kb in size
RL: needed 18,354,083µs cpu (0% kernel)
RL: caused 843 page faults (98% memcpy)
RL: 564 context switches (2% consensual)
RL: performed 1,176 reads and 8 write i/o operations

ghaerr commented 1 year ago

Hello @tkchia,

Nice job recognizing and getting rid of the endbr64 instructions and the resulting speedup. Your ALU_FAST macros show some deep knowledge of gcc's asm directive too - looks like you probably spent a bit of time getting those working :)

What's the issue with musl cross-make, is epoll-wait2 not implemented on musl?

Thank you!

tkchia commented 1 year ago

Hello @ghaerr,

> looks like you probably spent a bit of time getting those working :)

Yeah, I got bitten by a missing & constraint modifier a while back in a different project. :neutral_face:

> What's the issue with musl cross-make, is epoll-wait2 not implemented on musl?

Yes, the function is not declared in the pre-built musl cross toolchains.

A more "proper" way to fix this would probably be to do a separate ./configure pass for each of the $(ARCHITECTURES) to cross-compile to. But then again, this seems overkill, since the cross toolchains are mainly used for testing at the moment.

Thank you!
