A Report on how Streamix performed on ISPW60
This document gives an overview of several performance tests done on the machine ISPW60.
Note that due to the blocking-log issue the RTS was compiled with the -DSMX_LOG_UNSAFE flag.
Hardware
The command sudo lshw -short produces the following output:
H/W path Device Class Description
===================================================
system HP Z240 Tower Workstation (L8T12AV)
/0 bus 802F
/0/0 memory 128KiB L1 cache
/0/1 memory 128KiB L1 cache
/0/2 memory 1MiB L2 cache
/0/3 memory 8MiB L3 cache
/0/4 processor Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
/0/5 memory 32GiB System Memory
/0/5/0 memory 8GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
/0/5/1 memory 8GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
/0/5/2 memory 8GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
/0/5/3 memory 8GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
/0/b memory 64KiB BIOS
/0/100 bridge Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers
/0/100/1 bridge Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16)
/0/100/1/0 display GM107GL [Quadro K620]
/0/100/1/0.1 multimedia NVIDIA Corporation
/0/100/14 bus 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller
/0/100/14/0 usb1 bus xHCI Host Controller
/0/100/14/1 usb2 bus xHCI Host Controller
/0/100/14.2 generic 100 Series/C230 Series Chipset Family Thermal Subsystem
/0/100/16 communication 100 Series/C230 Series Chipset Family MEI Controller #1
/0/100/16.3 communication 100 Series/C230 Series Chipset Family KT Redirection
/0/100/17 storage SATA Controller [RAID mode]
/0/100/1f bridge C236 Chipset LPC/eSPI Controller
/0/100/1f.2 memory Memory controller
/0/100/1f.3 multimedia 100 Series/C230 Series Chipset Family HD Audio Controller
/0/100/1f.4 bus 100 Series/C230 Series Chipset Family SMBus
/0/100/1f.6 eno1 network Ethernet Connection (2) I219-LM
/0/6 scsi0 storage
/0/6/0.0.0 /dev/sda disk 1024GB MTFDDAK1T0TBN-1A
/0/6/0.0.0/1 /dev/sda1 volume 511MiB Windows FAT volume
/0/6/0.0.0/2 /dev/sda2 volume 953GiB EXT4 volume
/1 power High Efficiency
The command lscpu produces the following output:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Stepping: 3
CPU MHz: 800.176
CPU max MHz: 4000.0000
CPU min MHz: 800.0000
BogoMIPS: 6816.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
Simple Test Runs
To start things off, a simple test run was performed with the following Streamix network:
Src = extern box smx_src_rand( out data )
Snk = box smx_snk_null( in data )
A = Src.Snk
connect A
This program creates two threads, where the first creates random messages and the second destroys them.
The program was run three times where in each run each thread was executed 10'000'000 times.
[2019-09-13 08:49:25.245740] [NOTICE] [net_Src_0] terminate net (loop count: 10000000, loop rate: 296161, wall time: 33.765305)
[2019-09-13 08:49:25.245765] [NOTICE] [net_Snk_1] terminate net (loop count: 10000001, loop rate: 296162, wall time: 33.765297)
[2019-09-13 08:49:25.245788] [NOTICE] [main] end main thread (wall time: 33.765709)
[2019-09-13 09:00:45.814958] [NOTICE] [net_Src_0] terminate net (loop count: 10000000, loop rate: 288089, wall time: 34.711493)
[2019-09-13 09:00:45.814981] [NOTICE] [net_Snk_1] terminate net (loop count: 10000001, loop rate: 288088, wall time: 34.711521)
[2019-09-13 09:00:45.815004] [NOTICE] [main] end main thread (wall time: 34.711777)
[2019-09-13 09:01:36.458231] [NOTICE] [net_Src_0] terminate net (loop count: 10000000, loop rate: 281976, wall time: 35.463926)
[2019-09-13 09:01:36.458255] [NOTICE] [net_Snk_1] terminate net (loop count: 10000001, loop rate: 281976, wall time: 35.463948)
[2019-09-13 09:01:36.458296] [NOTICE] [main] end main thread (wall time: 35.464232)
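For reference, the pattern measured here is a single-producer/single-consumer channel. The sketch below is not the Streamix RTS implementation but a minimal pthread equivalent in C; the channel capacity of 8, the integer message type, and the box internals are assumptions made purely for illustration.

```c
/* Minimal single-producer/single-consumer sketch of the measured pattern.
 * This is NOT the Streamix RTS code; channel capacity and message type
 * are assumptions for illustration only. */
#include <pthread.h>
#include <stdlib.h>

#define CHANNEL_LEN 8           /* assumed channel capacity */
#define LOOP_COUNT  10000000L   /* loop count used in the test runs */

static int buf[CHANNEL_LEN];
static int head, tail, count;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

/* corresponds to the Src net: produce random messages */
static void *producer( void *arg )
{
    (void)arg;
    for( long i = 0; i < LOOP_COUNT; i++ ) {
        int msg = rand();
        pthread_mutex_lock( &lock );
        while( count == CHANNEL_LEN )
            pthread_cond_wait( &not_full, &lock );
        buf[head] = msg;
        head = ( head + 1 ) % CHANNEL_LEN;
        count++;
        pthread_cond_signal( &not_empty );
        pthread_mutex_unlock( &lock );
    }
    return NULL;
}

/* corresponds to the Snk net: read and destroy messages */
static void *consumer( void *arg )
{
    (void)arg;
    for( long i = 0; i < LOOP_COUNT; i++ ) {
        pthread_mutex_lock( &lock );
        while( count == 0 )
            pthread_cond_wait( &not_empty, &lock );
        (void)buf[tail];
        tail = ( tail + 1 ) % CHANNEL_LEN;
        count--;
        pthread_cond_signal( &not_full );
        pthread_mutex_unlock( &lock );
    }
    return NULL;
}

int main( void )
{
    pthread_t src, snk;
    pthread_create( &src, NULL, producer, NULL );
    pthread_create( &snk, NULL, consumer, NULL );
    pthread_join( src, NULL );
    pthread_join( snk, NULL );
    return 0;
}
```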
Next, the number of threads was scaled to the number of physical cores (4) with the following program:
Src = extern box smx_src_rand( out data )
Snk = box smx_snk_null( in data )
A = Src.Snk
B = A|A
connect B
This program creates two pairs of two threads, where the first thread of a pair creates random messages and the second destroys them.
The program was run three times where in each run each thread was executed 10'000'000 times.
Next, the number of threads was scaled to the number of hardware threads (8) with the following program:
Src = extern box smx_src_rand( out data )
Snk = box smx_snk_null( in data )
A = Src.Snk
B = A|A|A|A
connect B
This program creates four pairs of two threads, where the first thread of a pair creates random messages and the second destroys them.
The program was run three times where in each run each thread was executed 10'000'000 times.
Next, the number of threads was scaled to twice the number of hardware threads (16) with the following program:
Src = extern box smx_src_rand( out data )
Snk = box smx_snk_null( in data )
A = Src.Snk
B = A|A|A|A
connect B|B
This program creates eight pairs of two threads, where the first thread of a pair creates random messages and the second destroys them.
The program was run three times where in each run each thread was executed 10'000'000 times.
Finally, the number of threads was scaled to four times the number of hardware threads (32) with the following program:
Src = extern box smx_src_rand( out data )
Snk = box smx_snk_null( in data )
A = Src.Snk
B = A|A|A|A
connect B|B|B|B
This program creates 16 pairs of two threads, where the first thread of a pair creates random messages and the second destroys them.
The program was run once where each thread was executed 10'000'000 times.
The following table provides an overview of the results:
|                   | 2 TH  | 4 TH  | 8 TH  | 16 TH | 32 TH |
|-------------------|-------|-------|-------|-------|-------|
| Walltime [s]      | 34.6  | 33.0  | 44.7  | 92.4  | 194.5 |
| Walltime / TH [s] | 17.3  | 8.3   | 5.6   | 5.8   | 6.1   |
| msg rate [kHz]    | 289.0 | 606.1 | 894.9 | 865.8 | 822.6 |
| TH rate [kHz]     | 289.0 | 303.0 | 223.7 | 108.2 | 51.4  |
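The derived rows in the table are consistent with the following relations, where p = TH / 2 denotes the number of Src/Snk pairs and each pair exchanges 10'000'000 messages (Walltime / TH simply divides the walltime by the number of threads):

msg rate = p * 10'000'000 / walltime
TH rate = msg rate / p

For 8 TH, for example, p = 4, msg rate = 4 * 10'000'000 / 44.7 s ≈ 894.9 kHz, and TH rate ≈ 894.9 / 4 ≈ 223.7 kHz.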
TT
The focus of these tests was to establish a minimal execution period without losing any messages.
For this purpose the simple network from before was used but wrapped by temporal firewalls:
Src = extern box smx_src_rand( out data )
Snk = box smx_snk_null( in data )
connect tt[10us](Src.Snk)
This program creates three nets, a source, a sink, and a temporal firewall connecting the two.
Each net was executed 1'000'000 times with the execution rate set to different time periods:
10us (10s walltime) < 0.005% DL misses (less than 50)
20us (20s walltime) < 0.001% DL misses (less than 10)
40us (40s walltime) < 0.0005% DL misses (less than 5)
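The sketch below illustrates the kind of periodic loop with deadline-miss accounting that such a temporal firewall has to execute. It is not the RTS implementation; the clock choice, the re-alignment policy on a miss, and the placeholder for the forwarding work are assumptions for illustration.

```c
/* Sketch of a 10us periodic loop with deadline-miss counting.
 * NOT the RTS temporal-firewall code; clock choice and re-alignment
 * policy on a miss are assumptions for illustration only. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

static void timespec_add_ns( struct timespec *t, long ns )
{
    t->tv_nsec += ns;
    while( t->tv_nsec >= NSEC_PER_SEC ) {
        t->tv_nsec -= NSEC_PER_SEC;
        t->tv_sec++;
    }
}

int main( void )
{
    const long period_ns = 10000;     /* 10us, as in the first test */
    const long iterations = 1000000;  /* matches the TT loop count */
    long dl_miss = 0;
    struct timespec next, now;

    clock_gettime( CLOCK_MONOTONIC, &next );
    for( long i = 0; i < iterations; i++ ) {
        /* forward the messages here (the actual work of the firewall) */
        timespec_add_ns( &next, period_ns );
        clock_gettime( CLOCK_MONOTONIC, &now );
        if( now.tv_sec > next.tv_sec
                || ( now.tv_sec == next.tv_sec && now.tv_nsec > next.tv_nsec ) ) {
            dl_miss++;    /* the deadline has already passed */
            next = now;   /* re-align to avoid cascading misses */
        }
        else {
            clock_nanosleep( CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL );
        }
    }
    printf( "deadline misses: %ld (%.4f%%)\n",
            dl_miss, 100.0 * dl_miss / iterations );
    return 0;
}
```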
Next, the number of threads was increased to 7 with the following program:
Src = extern box smx_src_rand( out data )
Snk = box smx_snk_null( in data )
A = tt[10us](Src.Snk)
B = A|A|A
connect B
This program creates seven nets, three pairs of source and sink, and one temporal firewall connecting the pairs.
Each net was executed 1'000'000 times with the execution rate set to different time periods:
10us (10s walltime) < 0.01% DL misses (less than 100)
20us (20s walltime) < 0.002% DL misses (less than 20)
40us (40s walltime) < 0.0005% DL misses (less than 5)
The results were a little worse than in the case of 3 threads.
This might be caused by the poor scalability of temporal firewalls:
TFs with the same execution rate are combined into one to avoid spawning too many threads.
The cost of this, however, is that the load on a TF increases proportionally to the number of connecting nets.
Given that a TF only has to forward messages, this should only become a problem when dealing with hundreds of threads.
The example presented here is rather special because the other nets are extremely simple and do not consume much CPU power.
ZMQ
This test series evaluates the performance of ZMQ.
The following network was used to determine the read and write speed of ZMQ:
Src1 = extern box smx_src_rand( out data )
Snk1 = extern box smx_snk_zmq( in data, in topic open )
Src2 = extern box smx_src_zmq( out data, out topic )
Snk2 = box smx_snk_dump( in data, in topic )
connect Src1.Snk1|Src2.Snk2
This program produces random data and feeds it to the ZMQ sink (smx_snk_zmq), which publishes the data to a port on localhost.
At the same time, the ZMQ source (smx_src_zmq) reads the data back from the same port on localhost.
Everything is executed as fast as possible.
In a first test the PUB/SUB protocol of ZMQ was used.
The program produced 10'000'000 data points and successfully fed the data to ZMQ in 30s (i.e. at a data-rate of 331k messages per second).
The consuming net was not able to keep up with this rate and lost 19% of the data.
The results are the same for the following endpoint configurations:
bind: tcp://*:50030, connect: tcp://localhost:50030
bind: ipc:///tmp/feeds/0, connect: ipc:///tmp/feeds/0
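A likely explanation for the loss is the semantics of ZMQ PUB sockets: a PUB socket never blocks, and once the high-water mark towards a subscriber is reached, further messages are silently dropped. The sketch below shows the publishing side in plain libzmq C, using the tcp endpoint from above; it is not the smx_snk_zmq box, and the payload and loop count are placeholders.

```c
/* Sketch of the PUB side with libzmq (NOT the smx_snk_zmq box).
 * A PUB socket never blocks: when a subscriber's high-water mark is
 * reached, further messages for that subscriber are silently dropped. */
#include <assert.h>
#include <zmq.h>

int main( void )
{
    void *ctx = zmq_ctx_new();
    void *pub = zmq_socket( ctx, ZMQ_PUB );
    int rc = zmq_bind( pub, "tcp://*:50030" );
    assert( rc == 0 );

    /* the SUB side connects to tcp://localhost:50030 and subscribes to
     * everything with zmq_setsockopt( sub, ZMQ_SUBSCRIBE, "", 0 ) */
    for( long i = 0; i < 10000000L; i++ ) {
        /* placeholder payload; the real network sends the random data */
        rc = zmq_send( pub, "data", 4, 0 );  /* returns immediately, may drop */
        assert( rc == 4 );
    }

    zmq_close( pub );
    zmq_ctx_term( ctx );
    return 0;
}
```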
In a second test the PIPELINE protocol of ZMQ was used.
In this case the two endpoint configurations from above produced different results:
bind: tcp://*:50030, connect: tcp://localhost:50030
The program produced 10'000'000 data points and successfully fed the data to ZMQ in 44s (i.e. at a data-rate of 227k messages per second).
In contrast to the PUB/SUB example, however, no messages were lost and 10'000'000 data points were successfully consumed.
bind: ipc:///tmp/feeds/0, connect: ipc:///tmp/feeds/0
The program produced 10'000'000 data points and successfully fed the data to ZMQ in 47s (i.e. at a data-rate of 213k messages per second).
On most test runs (3 out of 4) less than 500 data points (<0.005%) were lost.
It is interesting to note that when using the ipc transport, some messages are lost and the program is slightly slower compared to the tcp transport (localhost).
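For comparison, the PIPELINE protocol corresponds to ZMQ's PUSH/PULL socket pair. In contrast to PUB, a PUSH socket blocks once the high-water mark is reached, so a slow consumer throttles the producer instead of causing drops, which is consistent with the lossless tcp result above. Below is a minimal libzmq sketch with both ends in one process (again not the smx boxes; payload and loop count are placeholders):

```c
/* Sketch of the PIPELINE (PUSH/PULL) variant with libzmq (NOT the smx
 * boxes). A PUSH socket blocks when the pipe is full, so a slow consumer
 * slows down the producer instead of causing message loss. */
#include <assert.h>
#include <zmq.h>

int main( void )
{
    void *ctx  = zmq_ctx_new();
    void *push = zmq_socket( ctx, ZMQ_PUSH );
    void *pull = zmq_socket( ctx, ZMQ_PULL );
    int rc = zmq_bind( push, "tcp://*:50030" );
    assert( rc == 0 );
    rc = zmq_connect( pull, "tcp://localhost:50030" );
    assert( rc == 0 );
    /* for the ipc variant only the endpoint strings change, e.g.
     * zmq_bind( push, "ipc:///tmp/feeds/0" ) */

    char buf[16];
    for( long i = 0; i < 10000000L; i++ ) {
        int sent = zmq_send( push, "data", 4, 0 );        /* blocks when the pipe is full */
        int rcvd = zmq_recv( pull, buf, sizeof( buf ), 0 );
        assert( sent == 4 && rcvd == 4 );
    }

    zmq_close( pull );
    zmq_close( push );
    zmq_ctx_term( ctx );
    return 0;
}
```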