Closed postwait closed 6 years ago
Thank you very much for your PR, @postwait :D
I have just tested:
$ gcc -c -fPIC -o acosw.o acosw.S
$ gcc -c -fPIC -o aco.o aco.c
$ gcc -shared -fPIC acosw.o aco.o -o libaco.so
$ gcc -L ${path} -Wall -o bench_so test_aco_benchmark.c -laco
$ ldd bench_so
linux-vdso.so.1 => (0x00007ffc37ffc000)
libaco.so => ${path}/libaco.so (0x00007f4badf41000)
libc.so.6 => /lib64/libc.so.6 (0x00007f4bad959000)
/lib64/ld-linux-x86-64.so.2 (0x00007f4badd26000)
$ gcc -g -O2 acosw.S aco.c -o bench_static test_aco_benchmark.c
$ ldd bench_static
linux-vdso.so.1 => (0x00007ffe1f666000)
libc.so.6 => /lib64/libc.so.6 (0x00007fcf454af000)
/lib64/ld-linux-x86-64.so.2 (0x00007fcf4587c000)
And found that the speed of acosw
in bench_so
drops about 36% when compared with bench_static
.
And the binary of libaco itself is very tiny (15k~). So, could you please explain why you want such support?
Maybe this kind of support is fine, but there has to be a corresponding well-documented notice about such usage.
If you build the acosw.S as part of a shared object (libcall
s cannot be resolved without the PLT/GOT.
It will only be enabled (as required) when -fPIC
is added to the CFLAGS for building shared objects.
Out of curiosity, if you LD_BIND_NOW=1
in the environment of your execution, does it speed back up at all?
I have jus tested the LD_BIND_NOW=1
and the 36% drop is still there.
If you build the acosw.S as part of a shared object (lib
.so) to be linked into another executable, it will not build b/c those call
s cannot be resolved without the PLT/GOT.It will only be enabled (as required) when
-fPIC
is added to the CFLAGS for building shared objects.
Yes. But the binary of libaco itself is so tiny (15k~). So, does there exist some situation which needs such shared object? Why not just use the static linking instead?
Alas, this is the penalty of shared libraries. So you're seeing a jump from 10ns switches to 14ns switches? That isn't too surprising. It's the cost of jumping through the procedure linkage table for aco_funcp_protector
... I'll think some more about potentially speeding this up, but it could be really ugly.
In your tests you're compiling everything with -fPIC? Those changes to acosw.S should never play into the performance as those are just failure calls. I think we're just seeing the expected PLT penalty of having things like aco_yield1
aco_resume
and acosw
being in the PLT. That's expected -- it doesn't hurt anyone when it is compiled statically.
Alas, this is the penalty of shared libraries. So you're seeing a jump from 10ns switches to 14ns switches? That isn't too surprising. It's the cost of jumping through the procedure linkage table for
aco_funcp_protector
...
Yes, our acosw
is so blazing fast - 10ns ;-)
I'll think some more about potentially speeding this up, but it could be really ugly.
Yep... so, do you still want such support?
Personally, I think it is better to just static link libaco to the application instead of dynamic loading by something like libaco.so
.
In your tests you're compiling everything with -fPIC? Those changes to acosw.S should never play into the performance as those are just failure calls.
This is how I built the bench_so
:
$ gcc -c -fPIC -o acosw.o acosw.S
$ gcc -c -fPIC -o aco.o aco.c
$ gcc -shared -fPIC acosw.o aco.o -o libaco.so
$ gcc -L ${path} -Wall -o bench_so test_aco_benchmark.c -laco
$ ldd bench_so
linux-vdso.so.1 => (0x00007ffc37ffc000)
libaco.so => ${path}/libaco.so (0x00007f4badf41000)
libc.so.6 => /lib64/libc.so.6 (0x00007f4bad959000)
/lib64/ld-linux-x86-64.so.2 (0x00007f4badd26000)
Only aco.c and acosw.S are built with the -fPIC
flag.
Ok, I think it is fine to add such support. But still, a well-documented notice about the performance penalty of such usage is neccessary too.
I think the support is required. I don't see another way to support an app (like ours) that dynamically loads libraries that implement features that need to call aco_yield and aco_resume. Do you see a way around this?
Also, I see significantly less performance penalty than you see by adding -O3
along with -fPIC
.
Also, I see significantly less performance penalty than you see by adding
-O3
along with-fPIC
.
I just tried the O3
and the drop turns from 36% to 20%.
I think the support is required. I don't see another way to support an app (like ours) that dynamically loads libraries that implement features that need to call aco_yield and aco_resume. Do you see a way around this?
Agree with you.
This PR would be merged later. And a well-documented notice about the performance penalty of such usage would be added too.
Merged. Thank you very much, @postwait.
Thanks, @postwait. I would like to take some time to check it :D