Support @PLT for PIC complication

hnes commented 6 years ago

Thanks, @postwait. I would like to take some time to check it :D

hnes commented 6 years ago

Thank you very much for your PR, @postwait :D

I have just tested:

$ gcc -c -fPIC -o acosw.o acosw.S
$ gcc -c -fPIC -o aco.o aco.c
$ gcc -shared -fPIC acosw.o aco.o -o libaco.so
$ gcc -L ${path} -Wall -o bench_so test_aco_benchmark.c  -laco
$ ldd bench_so
        linux-vdso.so.1 =>  (0x00007ffc37ffc000)
        libaco.so => ${path}/libaco.so (0x00007f4badf41000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f4bad959000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f4badd26000)
$ gcc -g -O2 acosw.S aco.c -o bench_static test_aco_benchmark.c
$ ldd bench_static
        linux-vdso.so.1 =>  (0x00007ffe1f666000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fcf454af000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fcf4587c000)

And found that the speed of acosw in bench_so drops about 36% when compared with bench_static.

And the binary of libaco itself is very tiny (15k~). So, could you please explain why you want such support?

hnes commented 6 years ago

Maybe this kind of support is fine, but there has to be a corresponding well-documented notice about such usage.

postwait commented 6 years ago

If you build the acosw.S as part of a shared object (lib.so) to be linked into another executable, it will not build b/c those calls cannot be resolved without the PLT/GOT.

It will only be enabled (as required) when -fPIC is added to the CFLAGS for building shared objects.

Out of curiosity, if you LD_BIND_NOW=1 in the environment of your execution, does it speed back up at all?

hnes commented 6 years ago

I have jus tested the LD_BIND_NOW=1 and the 36% drop is still there.

hnes commented 6 years ago

If you build the acosw.S as part of a shared object (lib.so) to be linked into another executable, it will not build b/c those calls cannot be resolved without the PLT/GOT.

It will only be enabled (as required) when -fPIC is added to the CFLAGS for building shared objects.

Yes. But the binary of libaco itself is so tiny (15k~). So, does there exist some situation which needs such shared object? Why not just use the static linking instead?

postwait commented 6 years ago

Alas, this is the penalty of shared libraries. So you're seeing a jump from 10ns switches to 14ns switches? That isn't too surprising. It's the cost of jumping through the procedure linkage table for aco_funcp_protector... I'll think some more about potentially speeding this up, but it could be really ugly.

postwait commented 6 years ago

In your tests you're compiling everything with -fPIC? Those changes to acosw.S should never play into the performance as those are just failure calls. I think we're just seeing the expected PLT penalty of having things like aco_yield1 aco_resume and acosw being in the PLT. That's expected -- it doesn't hurt anyone when it is compiled statically.

hnes commented 6 years ago

Alas, this is the penalty of shared libraries. So you're seeing a jump from 10ns switches to 14ns switches? That isn't too surprising. It's the cost of jumping through the procedure linkage table for aco_funcp_protector...

Yes, our acosw is so blazing fast - 10ns ;-)

I'll think some more about potentially speeding this up, but it could be really ugly.

Yep... so, do you still want such support?

Personally, I think it is better to just static link libaco to the application instead of dynamic loading by something like libaco.so.

hnes commented 6 years ago

In your tests you're compiling everything with -fPIC? Those changes to acosw.S should never play into the performance as those are just failure calls.

This is how I built the bench_so:

$ gcc -c -fPIC -o acosw.o acosw.S
$ gcc -c -fPIC -o aco.o aco.c
$ gcc -shared -fPIC acosw.o aco.o -o libaco.so
$ gcc -L ${path} -Wall -o bench_so test_aco_benchmark.c  -laco
$ ldd bench_so
        linux-vdso.so.1 =>  (0x00007ffc37ffc000)
        libaco.so => ${path}/libaco.so (0x00007f4badf41000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f4bad959000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f4badd26000)

Only aco.c and acosw.S are built with the -fPIC flag.

hnes commented 6 years ago

Ok, I think it is fine to add such support. But still, a well-documented notice about the performance penalty of such usage is neccessary too.

postwait commented 6 years ago

I think the support is required. I don't see another way to support an app (like ours) that dynamically loads libraries that implement features that need to call aco_yield and aco_resume. Do you see a way around this?

Also, I see significantly less performance penalty than you see by adding -O3 along with -fPIC.

hnes commented 6 years ago

Also, I see significantly less performance penalty than you see by adding -O3 along with -fPIC.

I just tried the O3 and the drop turns from 36% to 20%.

I think the support is required. I don't see another way to support an app (like ours) that dynamically loads libraries that implement features that need to call aco_yield and aco_resume. Do you see a way around this?

Agree with you.

This PR would be merged later. And a well-documented notice about the performance penalty of such usage would be added too.

hnes commented 6 years ago

Merged. Thank you very much, @postwait.

hnes / libaco

Support @PLT for PIC complication #10