Closed tkchia closed 1 year ago
With https://github.com/jart/blink/pull/137/commits/98079d733a0f76d5e9b2fb545df711a8512811f3 , a further whopping 7.4% improvement:
PASSED (323 / 323 tests (1 skipped))
RL: took 18,715,678µs wall time
RL: ballooned to 5,784kb in size
RL: needed 18,708,731µs cpu (0% kernel)
RL: caused 841 page faults (99% memcpy)
RL: 577 context switches (1% consensual)
RL: performed 0 reads and 8 write i/o operations
Simply disabling the use of Intel control-flow protection instructions (https://github.com/jart/blink/pull/137/commits/309735b8fdb162908efe0a78fc592030a676d788) gives instead an improvement of about 6.9%. E.g.
PASSED (323 / 323 tests (1 skipped))
RL: took 18,778,782µs wall time
RL: ballooned to 5,696kb in size
RL: needed 18,699,182µs cpu (0% kernel)
RL: caused 843 page faults (98% memcpy)
RL: 707 context switches (1% consensual)
RL: performed 1,176 reads and 8 write i/o operations
The Intel CET instructions (e.g. `endbr64`) promise better protection against code-execution attacks, but the protection only takes effect if both the OS and the CPU support these instructions. It seems Linux kernel support for Intel CET is not quite in mainline yet.
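As a rough illustration of how a build can tell whether CET instrumentation is on: GCC documents that it defines the `__CET__` macro when `-fcf-protection` is active (bit 0 for indirect-branch tracking, which is what emits `endbr64`, and bit 1 for shadow-stack support), and that `-fcf-protection=none` turns the instrumentation off. The helper names below are hypothetical and are not taken from blink; this is only a sketch of the check, not necessarily how the linked commit disables CET.

```c
/* Hypothetical helpers, not part of blink. */

/* Returns the CET feature bits the compiler was asked to emit code for.
   GCC defines __CET__ under -fcf-protection:
   bit 0 = indirect branch tracking (endbr64), bit 1 = shadow stack. */
static int cet_build_flags(void) {
#ifdef __CET__
  return __CET__;
#else
  return 0; /* e.g. compiled with -fcf-protection=none */
#endif
}

/* Maps the two low bits to the matching -fcf-protection= setting name. */
static const char *cet_describe(int flags) {
  switch (flags & 3) {
    case 1:
      return "branch";
    case 2:
      return "return";
    case 3:
      return "full";
    default:
      return "none";
  }
}
```

Printing `cet_describe(cet_build_flags())` from a test binary is one quick way to confirm which setting a given toolchain applied by default.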
A further slight improvement of about 0.8% in the running time (https://github.com/jart/blink/commit/fbef1348c2493ca5fd24e4efb7a29187aa9e8e19):
PASSED (323 / 323 tests (1 skipped))
RL: took 18,597,261µs wall time
RL: ballooned to 5,592kb in size
RL: needed 18,589,714µs cpu (0% kernel)
RL: caused 837 page faults (99% memcpy)
RL: 616 context switches (0% consensual)
RL: performed 0 reads and 8 write i/o operations
I also experimented with using `lahf` instead of `pushfq` + `pop` reg64, but this does not appear to help much.
PASSED (323 / 323 tests (1 skipped))
RL: took 18,403,306µs wall time
RL: ballooned to 5,680kb in size
RL: needed 18,354,083µs cpu (0% kernel)
RL: caused 843 page faults (98% memcpy)
RL: 564 context switches (2% consensual)
RL: performed 1,176 reads and 8 write i/o operations
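For readers unfamiliar with the two flag-capturing sequences compared in the `lahf` experiment, here is a minimal stand-alone sketch using GCC extended asm on x86-64. The function names and the mask are made up for illustration and this is not blink's actual `ALU_FAST` code. `lahf` copies only SF, ZF, AF, PF and CF into `AH`, so the overflow flag would need a separate instruction such as `seto`, whereas `pushfq` + `pop` yields the whole flags register.

```c
#include <stdint.h>

#if !defined(__x86_64__)
#error "this sketch assumes x86-64 and GCC-style inline asm"
#endif

/* SF, ZF, AF, PF, CF: the flag bits that lahf copies into AH. */
#define ARITH_FLAGS_MASK 0xD5u

/* Adds b to a, then reads the full flags register via pushfq + pop. */
static uint64_t flags_via_pushfq(uint64_t a, uint64_t b) {
  uint64_t flags;
  __asm__ volatile(
      "add %2, %1\n\t"
      "lea -128(%%rsp), %%rsp\n\t" /* skip the red zone; lea leaves flags alone */
      "pushfq\n\t"
      "pop %0\n\t"
      "lea 128(%%rsp), %%rsp"
      : "=r"(flags), "+r"(a)
      : "r"(b)
      : "cc", "memory");
  return flags;
}

/* Adds b to a, then reads the low flag byte via lahf (lands in AH). */
static uint64_t flags_via_lahf(uint64_t a, uint64_t b) {
  uint64_t rax;
  __asm__ volatile(
      "add %2, %1\n\t"
      "lahf"
      : "=a"(rax), "+r"(a)
      : "r"(b)
      : "cc");
  return (rax >> 8) & 0xff; /* AH is bits 8..15 of RAX */
}
```

After masking with `ARITH_FLAGS_MASK`, both functions report the same bits for the same inputs; the `lahf` variant avoids the stack round trip but loses the overflow flag.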
Hello @tkchia,
Nice job recognizing and getting rid of the `endbr64` instructions and the resulting speedup. Your `ALU_FAST` macros show some deep knowledge of gcc's `asm` directive too - looks like you probably spent a bit of time getting those working :)
What's the issue with musl cross-make? Is `epoll_pwait2` not implemented on musl?
Thank you!
Hello @ghaerr,
> looks like you probably spent a bit of time getting those working :)
Yeah, I got bitten by a missing `&` constraint modifier a while back in a different project. :neutral_face:
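To make the `&` (early-clobber) pitfall concrete, here is a small hypothetical x86-64 example; it is not code from blink or from the other project mentioned. The output register is written by the first instruction while an input is still needed by the second, so without `&` the compiler is allowed to place `out` and `b` in the same register, and the `mov` would then destroy `b` before the `add` reads it.

```c
#include <stdint.h>

#if !defined(__x86_64__)
#error "this sketch assumes x86-64 and GCC-style inline asm"
#endif

/* Computes a + b in two instructions.
   "=&r" marks out as early-clobber: it is written (by mov) before the
   last input (b) has been consumed, so it must not share a register
   with any input operand. */
static uint64_t sum_early_clobber(uint64_t a, uint64_t b) {
  uint64_t out;
  __asm__("mov %1, %0\n\t"
          "add %2, %0"
          : "=&r"(out)
          : "r"(a), "r"(b)
          : "cc");
  return out;
}
```

With a plain `"=r"` the code may still work by luck at one optimization level and break at another, which is what makes this class of bug hard to spot.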
> What's the issue with musl cross-make, is `epoll_pwait2` not implemented on musl?
Yes, the function is not declared in the pre-built musl cross toolchains.
A more "proper" way to fix this would probably be to do a separate `./configure` pass for each of the `$(ARCHITECTURES)` to cross-compile to. But then again, this seems overkill, since the cross toolchains are mainly used for testing at the moment.
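For context, one generic way to cope with a libc that lacks a declaration for Linux's `epoll_pwait2` is to fall back to `epoll_pwait`, which takes a millisecond timeout instead of a `struct timespec`. The sketch below only illustrates that idea; `timespec_to_epoll_ms` and `compat_epoll_pwait2` are hypothetical names, and this is not necessarily what the PR does.

```c
#include <limits.h>
#include <signal.h>
#include <sys/epoll.h>
#include <time.h>

/* Converts a timespec timeout to the millisecond form epoll_pwait expects.
   NULL means "block indefinitely" (-1). Sub-millisecond remainders are
   rounded up so the wait is never shorter than requested. Assumes a
   non-negative, normalized timespec. */
static int timespec_to_epoll_ms(const struct timespec *ts) {
  if (!ts) return -1;
  if (ts->tv_sec >= INT_MAX / 1000) return INT_MAX; /* clamp huge timeouts */
  return (int)(ts->tv_sec * 1000 + (ts->tv_nsec + 999999) / 1000000);
}

/* Hypothetical stand-in with the same parameter list as epoll_pwait2,
   built on the older epoll_pwait; it loses nanosecond precision. */
static int compat_epoll_pwait2(int epfd, struct epoll_event *events,
                               int maxevents, const struct timespec *timeout,
                               const sigset_t *sigmask) {
  return epoll_pwait(epfd, events, maxevents,
                     timespec_to_epoll_ms(timeout), sigmask);
}
```

A per-architecture `./configure` pass would instead detect whether the real function is available for each target and select it or a fallback like this one at build time.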
Thank you!
On my local PC, this patch gives a small (1.4% to 1.8%) but consistent improvement in the running time of the `third_party/cosmo/2/test_suite_mpi.com` test program. E.g. vs. (without the patch)