upa / mscp

mscp: transfer files over multiple SSH (SFTP) connections
GNU General Public License v3.0
128 stars 13 forks

Prebuilt binary has very very very poor performance #22

Open baryluk opened 5 months ago

baryluk commented 5 months ago

Nice project. I was going to write something like this myself this weekend (I already have a script to copy many files in parallel, but I also needed one to send a single big file in parallel, since my network now scales to 25Gbps+), but then a quick Google search quickly turned up this project and the PERC papers.

I tested it, and the results are not good:

mscp v0.2.1

$ ~/mscp.linux.x86_64.static /tmp/usr-share.tar localhost:/tmp/usr-share.tar2
Password: 
[=========>                                                  ]  19%  2.9GB/14.6GB  282.4MB/s

$ rm -f /tmp/usr-share.tar2

$ ~/mscp.linux.x86_64.static -v /tmp/usr-share.tar localhost:/tmp/usr-share.tar2
bitrate limit: 0 bps
Password: 
thread[0]: connecting to localhost
thread[1]: connecting to localhost
thread[2]: connecting to localhost
thread[3]: connecting to localhost
thread[4]: connecting to localhost
thread[5]: connecting to localhost
thread[6]: connecting to localhost
[=====================================>          ]  83% 12.1GB/14.6GB  313.5MB/s

Using just normal scp over localhost (IPv4) I am getting about 430MB/s (scp sending, sshd receiving), or 413MB/s (ssh sending, scp receiving).

All files on tmpfs in memory.

It does not look like a bottleneck on the sshd side (see attached screenshot).

Same results when forcing -o Cipher=aes128-gcm@openssh.com: ca. 300MB/s.

AMD Threadripper 2950X (Zen+), 16 core (32 threads) CPU, ca. 3.2-4.2GHz

OpenSSH 1:9.6p1-3

OpenSSL 3.2.1-3

Netcat loopback over ::1 (/dev/zero |nc; nc>/dev/null), 1.1GB/s

iperf3 over single tcp on ::1, 22-39 Gbps (without and with -Z option)

Then I tested the deb package (0.2.1-1~noble, for Ubuntu) and easily got 2.3GB/s with the default 7 threads, and about 2.8GB/s with a manual -n 10 (I could push more to a remote system, but that is already above 25Gbps, and the other machine on my network with a 100Gbps NIC is currently offline).

So the issue clearly looks like a problem with the prebuilt binary. Yes, there is a warning in the README, but I was not expecting 10× worse performance.

upa commented 5 months ago

Thanks for the report.

but I was not expecting 10× worse performance.

Neither was I. In my environment with a Ryzen 9 7950X 16-core CPU, the single-binary mscp with one connection reaches about 430MB/s, while a normal build exceeds 1GB/s.

ryzen1 ~/w/m/build > ldd ~/mscp.linux.x86_64.static
    not a dynamic executable
ryzen1 ~/w/m/build > ~/mscp.linux.x86_64.static -n 1 ~/5g.img localhost:tmp/
[===============================================] 100%  5.0GB/5.0GB  428.2MB/s  00:13 
ryzen1 ~/w/m/build > ldd ./mscp
    linux-vdso.so.1 (0x00007fffa1957000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fe6df326000)
    libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007fe6deee2000)
    libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fe6deec6000)
    libgssapi_krb5.so.2 => /lib/x86_64-linux-gnu/libgssapi_krb5.so.2 (0x00007fe6dee72000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe6dec49000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fe6df49b000)
    libkrb5.so.3 => /lib/x86_64-linux-gnu/libkrb5.so.3 (0x00007fe6deb7c000)
    libk5crypto.so.3 => /lib/x86_64-linux-gnu/libk5crypto.so.3 (0x00007fe6deb4d000)
    libcom_err.so.2 => /lib/x86_64-linux-gnu/libcom_err.so.2 (0x00007fe6deb47000)
    libkrb5support.so.0 => /lib/x86_64-linux-gnu/libkrb5support.so.0 (0x00007fe6deb39000)
    libkeyutils.so.1 => /lib/x86_64-linux-gnu/libkeyutils.so.1 (0x00007fe6deb32000)
    libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 (0x00007fe6deb1e000)

ryzen1 ~/w/m/build > ./mscp -n 1 ~/5g.img localhost:tmp/ 
[===============================================] 100%  5.0GB/5.0GB    1.1GB/s  00:05 

Does the 10x performance degradation happen on other machines? I suspect the Threadripper could be a factor, but I cannot confirm that because I don't have one.

The single binary version of mscp uses musl libc for portability, and musl libc's memory handling is known to cause performance degradation compared with glibc (ref1, ref2).

baryluk commented 5 months ago

@upa I will test on some other systems soon.

I will also build locally with glibc and with musl (either on Debian or in a Docker container), using the same compiler and flags, and see if that is the cause.

It could be that musl's memory allocator or pthread support is subpar (glibc probably scales a bit better to more threads and cores), but I would not expect it to perform this poorly with only <10 threads.

But the fact that the binary does not show any thread at 100% does suggest some lock contention (possibly in the allocator).

I will do some profiling with perf later.