SMP support - Githubissues

Rover-Yu commented 7 years ago

I have written a prototype of SMP support for LKL with the POSIX host backend. I will publish it later, after cleaning up the ugly parts.

Below are some of my experiences:

Tree RCU is not a major problem for us, although it took me days to get it working. The key point is that the RCU core must be able to identify idle processors and give them a chance to run their RCU bookkeeping in time; otherwise a grace period (GP) may take a long time to complete, or even hang the whole LKL application.
The new_host_task() API may create a kernel thread that is not on the current processor, which breaks the preconditions of switch_to_host_task(). This problem is still open in the prototype, but I think it is minor.
I think LKL interrupts are not an ideal solution for high-performance or low-latency use cases. The same goes for the timer: each timer interrupt creates a thread. I guess we may need some hard work here.
I use a variable in thread-local storage to hold the current processor id of the current LKL task/thread. An LKL task may change the processor it runs on because of task migration or a scheduler affinity change, so the variable has to be updated at context-switch time.
IPI and per-CPU local timer support are necessary too, in my opinion.
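
To illustrate the thread-local processor id described above, here is a minimal sketch. It is not the prototype's code: the names lkl_sketch_switch_in(), lkl_sketch_processor_id() and SKETCH_NR_CPUS are hypothetical, and it assumes the GCC/Clang __thread storage class.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 4 /* hypothetical CPU count for this sketch */

/*
 * One copy per host thread. Because each LKL task is backed by its own
 * host thread, every task sees its own value. -1 means "not yet scheduled".
 */
static __thread int sketch_current_cpu = -1;

/* Returns the virtual CPU the calling task is currently running on. */
int lkl_sketch_processor_id(void)
{
    return sketch_current_cpu;
}

/*
 * Hypothetical hook, called on the incoming task's host thread at
 * context-switch time. After a migration or an affinity change the
 * scheduler passes a different cpu, so the cached id must be rewritten.
 */
void lkl_sketch_switch_in(int cpu)
{
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
    sketch_current_cpu = cpu;
}
```

The point of updating on every switch-in, rather than once at thread creation, is that the value would otherwise go stale as soon as the task migrates.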

With SMP support, I encountered some other interesting bugs too ...

Lastly, thanks for your great work on LKL :)

Rover-Yu commented 7 years ago

Also, the per-CPU variables are troublesome.

They need a section in the loaded program image that is page-aligned at run time. But for the shared library build, the GNU linker uses its default linker script for shared libraries, which ignores this requirement, so LKL triggers a SIGSEGV. The current solution is ugly: I manually combine the kernel linker script with the default linker script to generate the shared library.

With this hack the shared library works, but the kallsyms subsystem is broken: kallsyms uses compile-time address offsets (taken from lkl.o) to generate its internal lookup table, and the final shared library link changes them.

Broken kallsyms means that dump_stack(), panic, and oops output becomes unreadable.

BTW: even without SMP support, it seems that we still need a minor hack to make kallsyms work.

Rover-Yu commented 7 years ago

I just uploaded SMP prototype here:

https://github.com/Rover-Yu/lkl-linux

And wrote some documents about it:

https://github.com/Rover-Yu/lkl-linux/wiki

Thanks

tavip commented 7 years ago

Hi Rover,

Thank you for your work, it sounds exciting! I am currently travelling so I have not had a chance to look at it, but I will do so during the weekend.

Thanks, Tavi

thehajime commented 7 years ago

I gave it a quick look and it feels really nice!

A couple of questions for now:

Did you see any improvement from disabling fs and the NLS_* components?
Even though you said performance is similar to UP, I'm curious what the iperf3 results look like at this early stage.

As for aarch64 support, it is not upstreamed yet, but there are two ARM-related PRs which may be helpful (cc: @mxi1). It would be nice if you could tell us which toolchain you used for your test.

https://github.com/lkl/linux/issues/59 https://github.com/lkl/linux/issues/348

Having a new ops entry, tx_end, would be a great idea. I like it.

Thanks for the great patchset; I am really looking forward to seeing it completed.

mxi1 commented 7 years ago

@thehajime Sorry for the delay. If you need it, I can put together a description covering the toolchain version, the Makefile options, and how to customize binutils. These are already in the related issues, but I will organize them into instructions, which should take only a few steps, so you can easily merge them into your branch if possible.

thehajime commented 7 years ago

> @thehajime Sorry for the delay. If you need it, I can put together a description covering the toolchain version, the Makefile options, and how to customize binutils. These are already in the related issues, but I will organize them into instructions, which should take only a few steps, so you can easily merge them into your branch if possible.

I was asking @Rover-Yu about the toolchain; I just wanted to make you (@mxi1) aware of this thread.

(off topic) We are also almost done with Android support: we tested it with mptcp (https://twitter.com/thehajime/status/900596946120736770). With more cleanup of our code (and your patches), we can get that upstreamed.

Rover-Yu commented 7 years ago

@thehajime

This is the information about my gcc:

EulerOS:~ # rpm -qi gcc
Name        : gcc
Version     : 4.9.3
Release     : 154843.1
Architecture: aarch64
Install Date: Tue Jul 18 22:34:22 2017
Group       : Development/Languages
Size        : 25992206
License     : GPLv3+ and GPLv3+ with exceptions and GPLv2+ with exceptions and LGPLv2+ and BSD
Signature   : RSA/SHA1, Wed May 31 23:30:38 2017, Key ID 600317bc381d7ac3
Source RPM  : gcc-4.9.3-154843.1.src.rpm
Build Date  : Wed May 31 23:27:04 2017
Build Host  : euler-armworker2
Relocations : (not relocatable)
Packager    : http://bugs.euleros.org
Vendor      : huawei
Summary     : Various compilers (C, C++, Objective-C, Java, ...)
Description :
This is compiler for arm64.
EulerOS:~ # gcc --verbose
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/aarch64-linux-gnu/4.9.3/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release -with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,lto --enable-plugin --enable-initfini-array --disable-libgcj --without-isl --without-cloog --enable-gnu-indirect-function --build=aarch64-linux-gnu --disable-multilib
Thread model: posix
gcc version 4.9.3 20160525 (prerelease) (GCC)

It seems to be a special build by Huawei for their ARM64 servers. In my opinion, there are so many kinds of ARM toolchain configurations that it is hard to enumerate all of them in a static list. A better solution may be to use something like regular expressions here? Anyway, I am not an expert on ARM systems ...

As for performance, I am sorry that I have not tested LKL with file systems and NLS_* enabled. I guess it should not change networking performance much. I disabled them only to reduce the complexity of adding SMP support and to get shorter build times ;)

In my testbed, iperf3 shows about 1.2 Gbps of bandwidth with LKL, which is not good. The testing steps are as described in the wiki (https://github.com/Rover-Yu/lkl-linux/wiki). My hardware environment is below:

$ sudo lshw -short
H/W path        Device     Class          Description
=====================================================
                           system         Standard PC (i440FX + PIIX, 1996)
/0                         bus            Motherboard
/0/0                       memory         96KiB BIOS
/0/400                     processor      Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
/0/401                     processor      Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
/0/402                     processor      Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
/0/403                     processor      Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
/0/404                     processor      Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
/0/405                     processor      Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
/0/406                     processor      Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
/0/407                     processor      Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
/0/408                     processor      Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
/0/409                     processor      Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
/0/40a                     processor      Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
/0/40b                     processor      Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
/0/1000                    memory         10006MiB System Memory
/0/1000/0                  memory         10006MiB DIMM RAM
...

tc/veth/netns/packet sockets are not the performance bottleneck here; I ran micro-benchmarks on them and all of them reach higher numbers. I don't have suitable hardware to run LKL on DPDK.

The bad news is that adding batch operations did not help performance much either. It seems to be another problem.

That is all, thanks!

thehajime commented 7 years ago

@Rover-Yu thanks, the toolchain seems to include a fix which @mxi1 discussed (https://github.com/libos-nuse/lkl-linux/commit/5c5bd5cfb7c78cc2a33ccef631a726875c811fbb#commitcomment-22901895).

> As for performance, I am sorry that I have not tested LKL with file systems and NLS_* enabled. I guess it should not change networking performance much. I disabled them only to reduce the complexity of adding SMP support and to get shorter build times ;)

Thanks. I see your point.

> In my testbed, iperf3 shows about 1.2 Gbps of bandwidth with LKL, which is not good. The testing steps are as described in the wiki (https://github.com/Rover-Yu/lkl-linux/wiki). My hardware environment is below:

Thanks for sharing the information.

> tc/veth/netns/packet sockets are not the performance bottleneck here; I ran micro-benchmarks on them and all of them reach higher numbers. I don't have suitable hardware to run LKL on DPDK.

I don't think the current DPDK support would help much: packet sockets are enough for your test.

> The bad news is that adding batch operations did not help performance much either. It seems to be another problem.

I haven't tried any of this and am not 100% sure, but the xmit_more flag in skb might help here? (We may need to tweak the net/ subsystem so that ordinary applications (e.g., iperf) can benefit, by extending sendmmsg() etc.)

Rover-Yu commented 7 years ago

For the ARM64 port, I think your patches are more complete than my simple hack :)

For the batch operation improvement, I printf()ed the actual batch counts in the newly added batch interfaces, and the count is only 1 most of the time. So I think your suggestion makes a lot of sense!

After switching to another hardware environment, it seems that LKL/SMP crashes more easily ... which is good news :) Also, peak performance is better on this machine: 1.5 Gbps now.

I will focus on making SMP support more stable first; better performance is the next step. Also, if we use the packet socket backend, the LKL syscall programming model requires two memcpy operations: the first at the LKL syscall interface, the second at the packet socket host interface.

thehajime commented 7 years ago

> After switching to another hardware environment, it seems that LKL/SMP crashes more easily ... which is good news :) Also, peak performance is better on this machine: 1.5 Gbps now.

It's at least increasing :)

> Also, if we use the packet socket backend, the LKL syscall programming model requires two memcpy operations: the first at the LKL syscall interface, the second at the packet socket host interface.

We may also consider extending it with packet_mmap to reduce the number of copies.

Rover-Yu commented 7 years ago

The bug is fixed: lkl_start_kernel() assumed that the init process always runs on CPU0, which is no longer always true :)

Rover-Yu commented 7 years ago

SMP support on ARM64 has been added too.

Rover-Yu commented 7 years ago

Hi, do you have any suggestions or concerns about this SMP prototype? ;)

It seems to pass the basic tests now (I started the iperf client about 80K times without any error, on both x86_64 and ARM64). I also tried enabling file systems and NLS* support with it; both compile without problems. But 'make -C tools/lkl tests' still fails, I think because the linker script or build system is not ready yet.

tavip commented 7 years ago

Hi @Rover-Yu, I took a quick look and my main concern is that the SMP implementation duplicates code from the arch (x86, arm) layers. Would it be possible to implement the operations SMP requires (locks, atomics, etc.) as native ops and rely on the gcc atomics? This would make the SMP implementation architecture independent.

Rover-Yu commented 7 years ago

> Hi @Rover-Yu, I took a quick look and my main concern is that the SMP implementation duplicates code from the arch (x86, arm) layers. Would it be possible to implement the operations SMP requires (locks, atomics, etc.) as native ops and rely on the gcc atomics? This would make the SMP implementation architecture independent.

I guess we can't implement all of these operations with architecture-independent code; e.g. SMP barriers and cmpxchg operations are not supported by POSIX or even by the GNU-extended libc. Something like spin locks could be replaced by new host operations as you said. However, I suspect we may come back to the native operations once we start performance tuning later, since the kernel's own implementation is the better choice, e.g. the queued spin lock in recent kernel releases.

I also see that this indeed breaks portability; there may be a better solution that I don't know of yet :)

tavip commented 7 years ago

> I guess we can't implement all of these operations with architecture-independent code; e.g. SMP barriers and cmpxchg operations are not supported by POSIX or even by the GNU-extended libc.

I think we can implement almost all operations with gcc atomic built-ins: https://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html

Using the kernel implementation can be an option as well, I am not excluding it, but I think for most use cases a generic implementation may be good enough.
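
As a rough sketch of what such a generic implementation could look like with the __sync built-ins from the page linked above: a cmpxchg, a full barrier, and a simple test-and-set spin lock. All sketch_* names are hypothetical; this is not LKL code and makes no attempt at fairness or performance.

```c
#include <assert.h>

/* cmpxchg: store new_val if *ptr == old; always return the previous value. */
long sketch_cmpxchg(long *ptr, long old, long new_val)
{
    return __sync_val_compare_and_swap(ptr, old, new_val);
}

/* Full memory barrier. */
void sketch_smp_mb(void)
{
    __sync_synchronize();
}

typedef struct {
    volatile int locked; /* 0 = free, 1 = held */
} sketch_spinlock_t;

/* Returns 1 if the lock was taken, 0 if it was already held. */
int sketch_spin_trylock(sketch_spinlock_t *l)
{
    /* Atomic exchange with acquire semantics; returns the old value. */
    return __sync_lock_test_and_set(&l->locked, 1) == 0;
}

void sketch_spin_lock(sketch_spinlock_t *l)
{
    while (!sketch_spin_trylock(l)) {
        /* Spin on a plain read until the lock looks free, then retry. */
        while (l->locked)
            ;
    }
}

void sketch_spin_unlock(sketch_spinlock_t *l)
{
    assert(l->locked);
    /* Writes 0 with release semantics. */
    __sync_lock_release(&l->locked);
}
```

A plain test-and-set lock like this one is unfair under contention, which is the kind of gap the kernel's queued spin lock addresses.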

Rover-Yu commented 7 years ago

It seems that these "__sync_*" interfaces are marked as legacy :)

I remember that the Linux kernel community once discussed whether to use gcc's new built-in C11 atomics and memory barriers. The link is https://lwn.net/Articles/586838/. It seems that kernel atomics and the current C11 atomics have some subtle differences.

Anyway, your concern is reasonable; I will look into this in more detail.
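
For comparison, the newer gcc __atomic built-ins take an explicit memory order on every operation, which the legacy interfaces do not. A minimal sketch follows; the sketch_* names are hypothetical, and the orderings are illustrative choices, not a claim that they match the kernel's semantics.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Strong compare-and-exchange: acquire ordering on success, relaxed on
 * failure. On failure the value actually found is written to *expected.
 */
bool sketch_cmpxchg_acquire(int *ptr, int *expected, int desired)
{
    assert(ptr && expected);
    return __atomic_compare_exchange_n(ptr, expected, desired, false,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/* Publish a value: earlier writes become visible before this store. */
void sketch_store_release(int *ptr, int val)
{
    __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
}

/* Consume a value: later reads cannot be reordered before this load. */
int sketch_load_acquire(int *ptr)
{
    return __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
}

/* Atomic add that returns the new value, sequentially consistent. */
int sketch_add_return(int *ptr, int n)
{
    return __atomic_add_fetch(ptr, n, __ATOMIC_SEQ_CST);
}
```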

BTW: I just tried replacing the timer_*() interfaces in the current POSIX host operations with the timerfd syscalls, but it didn't help performance much. However, it does avoid creating a lot of helper timer threads.

speedingdaemon commented 6 years ago

Hi,

So does the latest LKL support SMP? Or do we still need the workaround summarized in Jerry Chu's paper: "One workaround is to shard the application port number space thus allowing multiple LKL instances to run simultaneously"?

Thanks!

dimakuv commented 5 years ago

Ping. Curious whether the latest LKL supports SMP?

laijs commented 5 years ago

When CPUs A and B invoke smp_call_function_single() on each other, it deadlocks.

Reason: LKL assumes that lkl_cpu_get() runs with IRQs disabled.

lkl / linux

SMP support #370