baryluk opened 5 months ago
Thanks for the report.
> but I was not expecting 10× worse performance.
Neither was I. In my environment with a Ryzen 9 7950X 16-core CPU, the throughput of the single-binary mscp with one connection is about 430MB/s, while the throughput of a normal build is over 1GB/s.
ryzen1 ~/w/m/build > ldd ~/mscp.linux.x86_64.static
not a dynamic executable
ryzen1 ~/w/m/build > ~/mscp.linux.x86_64.static -n 1 ~/5g.img localhost:tmp/
[===============================================] 100% 5.0GB/5.0GB 428.2MB/s 00:13
ryzen1 ~/w/m/build > ldd ./mscp
linux-vdso.so.1 (0x00007fffa1957000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fe6df326000)
libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007fe6deee2000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fe6deec6000)
libgssapi_krb5.so.2 => /lib/x86_64-linux-gnu/libgssapi_krb5.so.2 (0x00007fe6dee72000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe6dec49000)
/lib64/ld-linux-x86-64.so.2 (0x00007fe6df49b000)
libkrb5.so.3 => /lib/x86_64-linux-gnu/libkrb5.so.3 (0x00007fe6deb7c000)
libk5crypto.so.3 => /lib/x86_64-linux-gnu/libk5crypto.so.3 (0x00007fe6deb4d000)
libcom_err.so.2 => /lib/x86_64-linux-gnu/libcom_err.so.2 (0x00007fe6deb47000)
libkrb5support.so.0 => /lib/x86_64-linux-gnu/libkrb5support.so.0 (0x00007fe6deb39000)
libkeyutils.so.1 => /lib/x86_64-linux-gnu/libkeyutils.so.1 (0x00007fe6deb32000)
libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 (0x00007fe6deb1e000)
ryzen1 ~/w/m/build > ./mscp -n 1 ~/5g.img localhost:tmp/
[===============================================] 100% 5.0GB/5.0GB 1.1GB/s 00:05
Does the 10x performance degradation happen on other machines? I suspect the Threadripper could be a factor, but I cannot verify that because I don't have one.
The single-binary version of mscp uses musl libc for portability, and it is known that musl libc's memory handling causes performance degradation compared with glibc (ref1, ref2).
@upa I will test on some other systems soon.
I will also build locally with glibc and with musl (either on Debian or in a Docker container), but with the same compiler and flags, and see whether that is the cause.
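Roughly what I have in mind, assuming the project builds with a plain cmake flow; the package names and options below are illustrative, not tested:
# glibc build on Debian, pinning the optimization flags
cmake -B build-glibc -DCMAKE_C_FLAGS=-O2 && cmake --build build-glibc
# musl build with the same flags, done inside an Alpine container
docker run --rm -v "$PWD:/src" -w /src alpine:3.19 sh -c \
  'apk add build-base cmake openssl-dev zlib-dev && cmake -B build-musl -DCMAKE_C_FLAGS=-O2 && cmake --build build-musl'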
It could be that musl's memory allocator or pthread support is subpar (glibc probably scales a bit better to more threads and cores), but I would not expect that to matter with only <10 threads.
But the fact that the binary is not showing any thread at 100% does suggest some lock contention (possibly in the allocator).
I will do some profiling with perf later.
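Roughly what I have in mind, assuming perf is available on the box (the mscp invocation just repeats the local copy from above):
# sample call stacks while the slow static binary runs
perf record -g -- ~/mscp.linux.x86_64.static -n 1 ~/5g.img localhost:tmp/
# then look for futex/malloc hot spots in the profile
perf report --sort=symbol
# or watch live for contention
perf top -g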
Nice project. I was going to write something like this myself this weekend (I already have a script to copy a lot of files in parallel, but also needed one to send a single big file in parallel, since my network scales to 25Gbps+), but then a quick Google search turned up the PERC papers.
Tested, and it is not too good:
mscp v0.2.1
Using just normal scp over localhost (IPv4) I am getting about 430MB/s (scp sending, sshd receiving), or 413MB/s (ssh sending, scp receiving). All files are on tmpfs in memory. Does not look like a bottleneck on the sshd side:
Same results with forcing -o Cipher=aes128-gcm@openssh.com, ca. 300MB/s.
AMD Threadripper 2950X (Zen+), 16-core (32-thread) CPU, ca. 3.2-4.2GHz
OpenSSH 1:9.6p1-3
OpenSSL 3.2.1-3
Netcat loopback over ::1 (/dev/zero | nc; nc > /dev/null), 1.1GB/s
iperf3 over a single TCP stream on ::1, 22-39 Gbps (without and with the -Z option)
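For reference, the loopback baselines above can be reproduced with something along these lines (exact nc flags depend on the netcat variant; port, block size and count are arbitrary):
# netcat loopback: receiver discards, sender streams zeros
nc -6 -l 5001 > /dev/null &
dd if=/dev/zero bs=1M count=5000 | nc ::1 5001
# iperf3, single TCP stream over ::1, without and with zero-copy
iperf3 -s -D
iperf3 -c ::1
iperf3 -c ::1 -Z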
Then tested the deb 0.2.1-1~noble for Ubuntu, and got 2.3GB/s easily with the default 7 threads, and about 2.8GB/s with a manual -n 10 (could do more to a remote system, but that is above 25Gbps already, and the other machine with a 100Gbps NIC on my network is currently offline). So the issue clearly looks to be a problem with the prebuilt binary. Yes, there is a warning in the readme, but I was not expecting 10× worse performance.