shadowsocks / shadowsocks-org

www.shadowsocks.org
MIT License

Discussion: UDP protocol performance optimization #194

Closed zonyitoo closed 2 years ago

zonyitoo commented 2 years ago

Motivation

QUIC, a UDP-based protocol, is now becoming popular on the Internet, so UDP packet relay performance deserves greater consideration in shadowsocks.

As we all know, the shadowsocks UDP protocol (AEAD) creates a packet in the following steps:

  1. Generate a random IV or salt
  2. Derive a key with HKDF-SHA1
  3. Encrypt the whole data payload with the chosen AEAD cipher

Recently I have done some benchmarks on the cost of each step. All these tests are written in Rust.

1. Generate a random IV or salt

#[bench]
fn bench_random_32b_iv(b: &mut Bencher) {
    let mut iv = [0u8; 32];
    let mut rng = rand::thread_rng();
    b.iter(|| {
        rng.fill(&mut iv);
    });
}

#[bench]
fn bench_random_32b_iv_shadowsocks(b: &mut Bencher) {
    let mut iv = [0u8; 32];
    b.iter(|| {
        random_iv_or_salt(&mut iv);
    });
}

The test result is:

test bench_random_32b_iv             ... bench:          17 ns/iter (+/- 0)
test bench_random_32b_iv_shadowsocks ... bench:          21 ns/iter (+/- 1)

The random_iv_or_salt takes about 4 ns more because it also needs to verify the generated IV.
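For illustration, the verification step might look roughly like this (a hypothetical sketch; the actual check in shadowsocks-crypto may differ):

```rust
// Hypothetical sketch: reject a salt that is trivially weak (e.g. all
// zeros), which is one plausible reason random_iv_or_salt costs a few
// extra nanoseconds over a plain RNG fill.
fn salt_looks_random(salt: &[u8]) -> bool {
    salt.iter().any(|&b| b != 0)
}

fn main() {
    assert!(!salt_looks_random(&[0u8; 32]));
    assert!(salt_looks_random(&[5u8; 32]));
}
```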

2. Derive a key with HKDF-SHA1

#[bench]
fn bench_hkdf_32b_key(b: &mut Bencher) {
    const SUBKEY_INFO: &'static [u8] = b"ss-subkey";
    const KEY: &[u8] = b"12345678901234567890123456789012";

    let mut iv = [0u8; 32];
    let mut key = [0u8; 32];
    let mut rng = rand::thread_rng();
    b.iter(|| {
        rng.fill(&mut iv);

        let hk = Hkdf::<Sha1>::new(Some(&iv), KEY);
        hk.expand(SUBKEY_INFO, &mut key).expect("hkdf-sha1");
    });
}

The test result is:

test bench_hkdf_32b_key ... bench:       1,117 ns/iter (+/- 83)

From the result in 1. we can see that the HKDF-SHA1 algorithm takes most of the time.

3. Encrypt the whole data payload with the chosen AEAD cipher

#[bench]
fn bench_udp_packet_1500(b: &mut Bencher) {
    const KEY: &[u8] = b"12345678901234567890123456789012";

    let mut iv = [0u8; 32];

    let mut input_packet = [0u8; 1500];
    rand::thread_rng().fill_bytes(&mut input_packet);

    b.iter(|| {
        random_iv_or_salt(&mut iv);
        let mut cipher = Cipher::new(CipherKind::AES_256_GCM, KEY, &iv);
        cipher.encrypt_packet(&mut input_packet);
    });
}

The Cipher I was using here is from shadowsocks-crypto, which is the actual library used in shadowsocks-rust. The test result is:

test bench_udp_packet_1500 ... bench:       2,218 ns/iter (+/- 328)

Analysis

Please forget about the absolute numbers of each test. When we compare the results of 3. and 2., we can easily draw a conclusion: the key derivation process takes roughly 50% of the time when creating a UDP packet. So if we can optimize this process, we may get up to a 50% performance improvement!
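The arithmetic behind the roughly-50% figure, using the ns/iter numbers from the benchmarks above:

```rust
fn main() {
    // ns/iter figures taken from the benchmarks above.
    let kdf_ns = 1117.0_f64; // HKDF-SHA1 subkey derivation (includes ~17 ns RNG)
    let packet_ns = 2218.0_f64; // full 1500-byte AES-256-GCM packet
    let share = kdf_ns / packet_ns;
    // Key derivation accounts for roughly half of the per-packet cost.
    assert!(share > 0.45 && share < 0.55);
    println!("KDF share of packet cost: {:.1}%", share * 100.0);
}
```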

There seems to be no way to optimize it without changing the protocol. Here are some possible designs:

  1. UDP Session. A client should first handshake with a server to create a session with a new derived key, and then encrypt packets with that derived key.
  2. ... (discussion)
madeye commented 2 years ago

Generate derived keys in background, and save them into a cache for future usage?

dev4u commented 2 years ago

I am also looking for a way to implement a PING/PONG exchange between the client and the server, so that a temporary session key can be passed along.

zonyitoo commented 2 years ago

Generate derived keys in background, and save them into a cache for future usage?

Well, then we may have to add synchronization between the derive key task and UDP association tasks. The lock may become a new bottleneck.
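For illustration, a minimal sketch of what such a cache might look like (SubkeyCache and its API are hypothetical, not from any shadowsocks codebase); every association task would contend on the same lock on the hot path:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical cache mapping salt -> derived subkey, shared by all UDP
// association tasks. The Mutex here is exactly the potential bottleneck
// mentioned above.
struct SubkeyCache {
    map: Mutex<HashMap<[u8; 32], [u8; 32]>>,
}

impl SubkeyCache {
    fn get_or_insert(&self, salt: [u8; 32], derive: impl FnOnce() -> [u8; 32]) -> [u8; 32] {
        let mut map = self.map.lock().unwrap(); // serializes all associations
        *map.entry(salt).or_insert_with(derive)
    }
}

fn main() {
    let cache = SubkeyCache { map: Mutex::new(HashMap::new()) };
    let key = cache.get_or_insert([0u8; 32], || [7u8; 32]);
    assert_eq!(key, [7u8; 32]);
}
```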

Another solution may be using a faster KDF, for example, HMAC-Blake3:

#[bench]
fn bench_blake3(b: &mut Bencher) {
    const SUBKEY_INFO: &[u8] = b"ss-subkey";
    const KEY: &[u8] = b"12345678901234567890123456789012";

    let mut iv = [0u8; 32];
    let mut key = [0u8; 32];
    let mut rng = rand::thread_rng();
    b.iter(|| {
        rng.fill(&mut iv);

        let hk = SimpleHkdf::<blake3::Hasher>::new(Some(&iv), KEY);
        hk.expand(SUBKEY_INFO, &mut key).expect("hkdf-blake3");
    });
}

The result shows:

test bench_blake3       ... bench:         915 ns/iter (+/- 79)
test bench_hkdf_32b_key ... bench:       1,115 ns/iter (+/- 56)

The HMAC algorithm should be the key for optimization.

zonyitoo commented 2 years ago

Generate derived keys in background, and save them into a cache for future usage?

There are several obvious flaws in this method:

  1. Since the bottleneck is CPU consumption, pre-generating IVs and keys in the background won't save any CPU usage, so the overall performance won't change.
  2. Keys still have to be derived synchronously when receiving packets from the remote (on both sslocal and ssserver).

So I don't think this will solve the problem effectively.

dev4u commented 2 years ago

My opinion is not very professional, so please bear with me. I have combined ss-rust with the Google 2FA (TOTP) algorithm to implement a time-based "random session key". The advantage of Google 2FA is that the same time factor and secret always produce the same random number, and the algorithm is simple. So once both sides have agreed on a secret, no key needs to be transmitted when a session is established; each side computes the same temporary key independently...

But this approach has a problem: the time factor. Different devices keep their own clocks. Even though I use NTP to keep the server's time as synchronized as possible (within ±2 seconds of the phone's time), near the boundary of a time step the clock drift can still cause the two sides to derive different keys.

I have tried some optimizations, but the results are still not ideal, mainly when the user is moving at high speed: because the phone frequently switches base stations, already-established connections are not closed, so traffic keeps using the old key, the server fails to decrypt it, and the quality of service suffers. Of course, for UDP this impact would be much smaller.

dev4u commented 2 years ago

What I am working on now is how to securely transmit the time factor used to generate the key. My interim solution is to append it directly to the end of the ciphertext, but since this changes the packet format, I am not inclined toward this approach. The timing of the transmission also needs to be random enough to avoid creating a detectable pattern.

madeye commented 2 years ago

I did a quick calculation based on your benchmark:

10^6 us / 2.2 us * 1500 bytes = 681 MB/s bandwidth per CPU core

So, it looks to me that, for most of our users, the bottleneck is their internet speed...

zonyitoo commented 2 years ago

Yes, that's true. I ran some simple speed tests locally with iperf3:

iperf3 -c => sslocal (tunnel 5202) => ssserver => iperf3 -s

On my laptop (i7-9750H), when testing with iperf3 -c 127.0.0.1 -p 5202 -Rub 5.0G -t 300, result shows:

Connecting to host 127.0.0.1, port 5202
Reverse mode, remote host 127.0.0.1 is sending
[  5] local 127.0.0.1 port 56587 connected to 127.0.0.1 port 5202
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec   590 MBytes  4.95 Gbits/sec  0.037 ms  357/38225 (0.93%)
[  5]   1.00-2.00   sec   594 MBytes  4.98 Gbits/sec  0.016 ms  141/38285 (0.37%)
[  5]   2.00-3.00   sec   587 MBytes  4.92 Gbits/sec  0.028 ms  565/38259 (1.5%)
[  5]   3.00-4.00   sec   595 MBytes  5.00 Gbits/sec  0.055 ms  46/38279 (0.12%)
[  5]   4.00-5.00   sec   594 MBytes  4.98 Gbits/sec  0.018 ms  137/38259 (0.36%)
[  5]   5.00-6.00   sec   579 MBytes  4.85 Gbits/sec  0.020 ms  1123/38274 (2.9%)
[  5]   6.00-7.00   sec   554 MBytes  4.65 Gbits/sec  0.016 ms  2697/38274 (7%)
[  5]   7.00-8.00   sec   589 MBytes  4.94 Gbits/sec  0.023 ms  454/38262 (1.2%)
[  5]   8.00-9.00   sec   557 MBytes  4.68 Gbits/sec  0.023 ms  2478/38266 (6.5%)

and both sslocal and ssserver CPU usage reached above 87%. But the TCP test (iperf3 -c 127.0.0.1 -p 5202 -R -t 300) results show that:

Connecting to host 127.0.0.1, port 5202
Reverse mode, remote host 127.0.0.1 is sending
[  5] local 127.0.0.1 port 60416 connected to 127.0.0.1 port 5202
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   806 MBytes  6.76 Gbits/sec
[  5]   1.00-2.00   sec   843 MBytes  7.07 Gbits/sec
[  5]   2.00-3.00   sec   839 MBytes  7.04 Gbits/sec
[  5]   3.00-4.00   sec   835 MBytes  7.00 Gbits/sec
[  5]   4.00-5.00   sec   759 MBytes  6.37 Gbits/sec
[  5]   5.00-6.00   sec   806 MBytes  6.76 Gbits/sec
[  5]   6.00-7.00   sec   831 MBytes  6.97 Gbits/sec
[  5]   7.00-8.00   sec   822 MBytes  6.90 Gbits/sec
[  5]   8.00-9.00   sec   817 MBytes  6.85 Gbits/sec
[  5]   9.00-10.00  sec   773 MBytes  6.49 Gbits/sec
[  5]  10.00-11.00  sec   765 MBytes  6.42 Gbits/sec
[  5]  11.00-12.00  sec   732 MBytes  6.14 Gbits/sec
[  5]  12.00-13.00  sec   810 MBytes  6.79 Gbits/sec
[  5]  13.00-14.00  sec   712 MBytes  5.97 Gbits/sec
[  5]  14.00-15.00  sec   798 MBytes  6.69 Gbits/sec

CPU consumption was 128% for sslocal and 82% for ssserver. TCP channels can reach a higher bandwidth limit than UDP associations.

You are correct that average users won't hit these extreme cases, so we can put these numbers aside and focus on: how to make UDP associations reach the same performance as TCP channels. Or maybe UDP associations should perform even better than TCP channels, because the UDP tests don't have protocol overheads like ACKs, ...

database64128 commented 2 years ago

I have been doing experiments and benchmarks of changing the current Shadowsocks AEAD protocol. If standardized, the spec will be maintained by the Shadowsocks.NET organization.

Shadowsocks 2022 Edition

Goals

Non-Goals

PSK

The popular VPN protocol WireGuard provides a simple userspace program wg (usually packaged in wireguard-tools by distributions) to generate cryptographically-secure 32-byte private keys and PSKs. The keys are encoded in base64 for convenience.

Instead of asking the user to provide a password, Shadowsocks 2022 takes the same approach. A user can use wg genkey or wg genpsk (genkey and genpsk point to the same underlying implementation) to generate a base64-encoded 32-byte PSK. This change drops the dependency on EVP_BytesToKey.

Subkey Derivation

HKDF_SHA1 is replaced by BLAKE3's key derivation mode. A randomly generated 32-byte salt is appended to the PSK to be used as key material.

session_subkey := blake3::derive_key(context: "shadowsocks 2022 session subkey", key_material: key + salt)

I believe BLAKE3's key derivation mode alone without HKDF is secure enough for this purpose.
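A sketch of how the key material would be assembled under this scheme, showing only the concatenation step (the final call would be blake3::derive_key("shadowsocks 2022 session subkey", &material) from the blake3 crate):

```rust
// Build the key material for the proposed derivation: the 32-byte PSK
// followed by the 32-byte random salt. The result would then be fed to
// blake3::derive_key with the fixed context string.
fn session_key_material(psk: &[u8; 32], salt: &[u8; 32]) -> Vec<u8> {
    let mut material = Vec::with_capacity(64);
    material.extend_from_slice(psk);
    material.extend_from_slice(salt);
    material
}

fn main() {
    let material = session_key_material(&[1u8; 32], &[2u8; 32]);
    assert_eq!(material.len(), 64);
    assert!(material[..32].iter().all(|&b| b == 1));
}
```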

Required Method

Method 2022-blake3-chacha20-poly1305 MUST be implemented by all implementations. 2022 reflects the fast-changing and flexible nature of the protocol.

TCP

ChaCha20-Poly1305

2022-blake3-chacha20-poly1305 uses ChaCha20-Poly1305 with derived subkeys for TCP sessions. The construction is the same as Shadowsocks AEAD.

Protocol

tcp request header := 1B type + 64-bit unix epoch timestamp [+ atyp + address + port] + 2B padding length [+ padding]
tcp response header := 1B type + 64-bit unix epoch timestamp [+ request salt]
tcp stream := 32B salt [+ AEAD(length) + AEAD(message)] [+...]

HeaderTypeClientStream = 0
HeaderTypeServerStream = 1
MinPaddingLength = 0
MaxPaddingLength = 900

The first message MUST be the header or start with the header.

Replay Protection

Both server and client MUST record all incoming salts from the last 30 seconds. When a new TCP session is started, the first received message is decrypted and its timestamp MUST be checked against system time. If the time difference is within 30 seconds, the salt is checked against all stored salts. If no repeated salt is found, the salt is added to the pool and the session is successfully established.
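A minimal sketch of the salt pool described above (the names and bookkeeping are mine; a production filter would also need bounded memory and smarter sweeping):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Record incoming salts with their arrival time; reject any salt seen
// within the configured window (30 seconds in the spec above).
struct SaltPool {
    seen: HashMap<[u8; 32], Instant>,
    window: Duration,
}

impl SaltPool {
    /// Returns true if the salt is fresh (and records it), false on replay.
    fn check_and_insert(&mut self, salt: [u8; 32]) -> bool {
        let now = Instant::now();
        self.seen.retain(|_, t| now.duration_since(*t) <= self.window);
        if self.seen.contains_key(&salt) {
            return false;
        }
        self.seen.insert(salt, now);
        true
    }
}

fn main() {
    let mut pool = SaltPool { seen: HashMap::new(), window: Duration::from_secs(30) };
    assert!(pool.check_and_insert([1u8; 32]));
    assert!(!pool.check_and_insert([1u8; 32])); // repeated salt is rejected
}
```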

UDP

XChaCha20-Poly1305

The XChaCha20-Poly1305 construction can safely encrypt a practically unlimited number of messages with the same key, without any practical limit to the size of a message (up to ~ 2^64 bytes).

As an alternative to counters, its large nonce size (192-bit) allows random nonces to be safely used.

The official Go implementation of ChaCha20-Poly1305 provides XChaCha20-poly1305. RustCrypto's chacha20poly1305 crate also provides XChaCha20-Poly1305.

Protocol

udp client header := 8B client session id + 8B packet id + 1B type + 64-bit unix epoch timestamp + 2B padding length [+ padding] [+ atyp + address + port]
udp server header := 8B server session id + 8B packet id + 1B type + 64-bit unix epoch timestamp + 8B client session id + 2B padding length [+ padding] [+ atyp + address + port]
udp packet := 24B nonce + AEAD(message)

HeaderTypeClientPacket = 0
HeaderTypeServerPacket = 1

2022-blake3-chacha20-poly1305 uses XChaCha20-Poly1305 with PSK as the key and a random nonce for each message. The session ID identifies a UDP session. The packet ID is a counter for sliding window replay protection.

Session ID based Routing and Sliding Window Replay Protection

An implementation SHOULD implement a NAT table using session ID as the key. A NAT entry SHOULD at least store the following information:

Upon receiving an encrypted packet, the packet is decrypted using the first 24 bytes as nonce. The header is verified by checking the timestamp against system time. Then the session ID in the header is used to look up its NAT entry. If the lookup is successful, pass the packet ID to the sliding window filter to complete verification.

The last seen time and address MUST be updated after packet verification. Updating last seen address ensures that, when the client changes network, the session won't be interrupted.
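The sliding window filter is not spelled out here, but a common construction (similar in spirit to WireGuard's anti-replay window; this 64-entry bitmap version is my own sketch) looks like:

```rust
// Track the highest packet ID seen plus a 64-bit bitmap of recently
// seen IDs below it; accept each ID at most once, reject IDs older
// than the window.
struct ReplayWindow {
    highest: u64,
    bitmap: u64, // bit i set => (highest - i) already seen
}

impl ReplayWindow {
    fn check(&mut self, id: u64) -> bool {
        if id > self.highest {
            let shift = id - self.highest;
            self.bitmap = if shift >= 64 { 0 } else { self.bitmap << shift };
            self.bitmap |= 1;
            self.highest = id;
            true
        } else {
            let offset = self.highest - id;
            if offset >= 64 {
                return false; // too old, outside the window
            }
            let mask = 1u64 << offset;
            if self.bitmap & mask != 0 {
                return false; // replay
            }
            self.bitmap |= mask;
            true
        }
    }
}

fn main() {
    let mut w = ReplayWindow { highest: 0, bitmap: 1 };
    assert!(w.check(5));
    assert!(!w.check(5)); // duplicate rejected
    assert!(w.check(3));  // in-window, first occurrence accepted
}
```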

Optional Methods

Reduced-round variants of ChaCha20-Poly1305

2022-blake3-chacha8-poly1305 and 2022-blake3-chacha12-poly1305 are optional methods that use reduced-round variants of ChaCha20-Poly1305 and XChaCha20-Poly1305.

According to this paper, 8 rounds of ChaCha should be secure, while yielding a 2.5x speedup. I asked @zonyitoo to benchmark these reduced-round variants provided by RustCrypto. Results:

test bench_crypto_chacha12_ietf_poly1305_encrypt  ... bench:       1,064 ns/iter (+/- 151)
test bench_crypto_chacha20_ietf_poly1305_encrypt  ... bench:       1,376 ns/iter (+/- 146)
test bench_crypto_chacha8_ietf_poly1305_encrypt   ... bench:         993 ns/iter (+/- 209)
test bench_crypto_xchacha12_ietf_poly1305_encrypt ... bench:       1,116 ns/iter (+/- 117)
test bench_crypto_xchacha20_ietf_poly1305_encrypt ... bench:       1,436 ns/iter (+/- 181)
test bench_crypto_xchacha8_ietf_poly1305_encrypt  ... bench:         962 ns/iter (+/- 93)
test bench_ring_aes_128_gcm_encrypt               ... bench:         346 ns/iter (+/- 42)
test bench_ring_aes_256_gcm_encrypt               ... bench:         455 ns/iter (+/- 49)
test bench_ring_chacha20_ietf_poly1305_encrypt    ... bench:         801 ns/iter (+/- 92)

Unfortunately, RustCrypto still hasn't quite caught up with ring in terms of performance, and ring does not currently provide implementations of these reduced-round variants.

I have not been able to find any well-maintained Go implementations of these reduced-round variants.

AES

2022-blake3-aes-256-gcm is an optional method that replaces ChaCha20-Poly1305 with AES-256-GCM for TCP. For UDP, unfortunately, there's no AES counterpart of XChaCha20-Poly1305. We have two options:

  1. Simply use AES-256-GCM in place of XChaCha20-Poly1305 and accept the requirement/risk that the same PSK cannot be used to encrypt more than 2^32 messages.
  2. Use a separate header for the session ID and packet ID. These two happen to fit in a single AES block, so we can simply encrypt the separate header with AES. Since the session ID is unique for each session and the packet ID is used as a counter, this method should be secure.

Separate Header Design

packet header := 8B session id + 8B packet id
message header := 1B type + 64-bit unix epoch timestamp + 2B padding length [+ padding] [+ atyp + address + port]
AEAD key (session subkey) := blake3::derive_key(context: "shadowsocks 2022 session subkey", key_material: key + session_id)
AEAD nonce := packet_header[4:16]
udp_packet := aes-ecb(packet header) + AEAD(message header + payload)

This design gives a 2.1x speedup, as shown in the benchmarks below. Note that the benchmarks were run on a processor without vaes and vpclmulqdq instructions. On newer processors with these instructions, AES-GCM is expected to be twice as fast.
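For illustration, the 16-byte separate header could be assembled like this (the byte order is my assumption; the draft above does not fix endianness):

```rust
// 8B session id + 8B packet id, 16 bytes total: exactly one AES block,
// which is what allows the header to be encrypted with a single
// AES-ECB block operation.
fn packet_header(session_id: u64, packet_id: u64) -> [u8; 16] {
    let mut header = [0u8; 16];
    header[..8].copy_from_slice(&session_id.to_be_bytes());
    header[8..].copy_from_slice(&packet_id.to_be_bytes());
    header
}

fn main() {
    let header = packet_header(0x0102030405060708, 1);
    // Per the draft, bytes [4..16) of the header serve as the 96-bit AEAD nonce.
    let nonce = &header[4..16];
    assert_eq!(nonce.len(), 12);
}
```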

Benchmarks

1. Header benchmarks

https://github.com/database64128/cubic-go-playground/blob/main/shadowsocks/udpheader/udpheader_test.go

BenchmarkGenSaltHkdfSha1-8                372712          3283 ns/op        1264 B/op         18 allocs/op
BenchmarkGenSaltBlake3-8                  770426          1554 ns/op           0 B/op          0 allocs/op
BenchmarkAesEcbHeaderEncryption-8       99224395            12.84 ns/op        0 B/op          0 allocs/op
BenchmarkAesEcbHeaderDecryption-8       100000000           11.79 ns/op        0 B/op          0 allocs/op

2. Full UDP packet construction benchmarks

https://github.com/database64128/cubic-go-playground/blob/main/shadowsocks/udp_test.go

BenchmarkShadowsocksAEADAes256GcmEncryption-8                 252819          4520 ns/op        2160 B/op         23 allocs/op
BenchmarkShadowsocksAEADAes256GcmWithBlake3Encryption-8       320718          3367 ns/op         896 B/op          5 allocs/op
BenchmarkDraftSeparateHeaderAes256GcmEncryption-8            2059383           590.5 ns/op         0 B/op          0 allocs/op
BenchmarkDraftXChaCha20Poly1305Encryption-8                   993336          1266 ns/op           0 B/op          0 allocs/op

Acknowledgement

I would like to thank @zonyitoo, @xiaokangwang, and @nekohasekai for their input on the design of the protocol.

database64128 commented 2 years ago

Open Questions

  1. @zonyitoo Are you willing to trade complexity (adding a separate header) for a fairly significant boost (2.1x) of performance?

Reference Implementations

  1. Shadowsocks-NET/outline-ss-server Progress: All done, other than UDP multiplexing.

/cc @madeye @Mygod @riobard from ss org /cc @fortuna @bemasc from Outline

riobard commented 2 years ago

@zonyitoo

how to make UDP associations reach the same performance as TCP channels. Or maybe UDP associations should perform even better than TCP channels, because the UDP tests don't have protocol overheads like ACKs

I'm sorry but UDP tunnel in userspace won't reach the same performance as TCP because stream-based abstraction enjoys extensive optimization from the kernel while packet-based abstraction doesn't.

zonyitoo commented 2 years ago

Ah, thanks for your work.

Are you willing to trade complexity (adding a separate header) for a fairly significant boost (2.1x) of performance?

It doesn't seem to add too much complexity on implementation, so I am Ok with it.

I wouldn't suggest using xchacha20-ietf-poly1305 separately for the UDP protocol, because there is currently no fast implementation in Rust (or Go, or other languages except C).

In terms of UDP session, how to generate a globally unique session ID?

zonyitoo commented 2 years ago

@zonyitoo

how to make UDP associations reach the same performance as TCP channels. Or maybe UDP associations should perform even better than TCP channels, because the UDP tests don't have protocol overheads like ACKs

I'm sorry but UDP tunnel in userspace won't reach the same performance as TCP because stream-based abstraction enjoys extensive optimization from the kernel while packet-based abstraction doesn't.

Well yes, I did some tests with iperf3 directly on the lo interface:

TCP, iperf3 -c 127.0.0.1 -p 5201 -R

Connecting to host 127.0.0.1, port 5201
Reverse mode, remote host 127.0.0.1 is sending
[  5] local 127.0.0.1 port 54944 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  7.47 GBytes  64.1 Gbits/sec
[  5]   1.00-2.00   sec  7.73 GBytes  66.4 Gbits/sec
[  5]   2.00-3.00   sec  7.46 GBytes  64.0 Gbits/sec
[  5]   3.00-4.00   sec  7.84 GBytes  67.3 Gbits/sec
[  5]   4.00-5.00   sec  7.84 GBytes  67.3 Gbits/sec
[  5]   5.00-6.00   sec  7.87 GBytes  67.6 Gbits/sec
[  5]   6.00-7.00   sec  7.77 GBytes  66.8 Gbits/sec
[  5]   7.00-8.00   sec  7.83 GBytes  67.2 Gbits/sec
[  5]   8.00-9.00   sec  7.60 GBytes  65.3 Gbits/sec
[  5]   9.00-10.00  sec  7.87 GBytes  67.6 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  77.3 GBytes  66.4 Gbits/sec                  sender
[  5]   0.00-10.00  sec  77.3 GBytes  66.4 Gbits/sec                  receiver

UDP, iperf3 -c 127.0.0.1 -p 5201 -Rub 0

Connecting to host 127.0.0.1, port 5201
Reverse mode, remote host 127.0.0.1 is sending
[  5] local 127.0.0.1 port 62256 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec  3.22 GBytes  27.7 Gbits/sec  0.001 ms  0/211709 (0%)
[  5]   1.00-2.00   sec  3.27 GBytes  28.1 Gbits/sec  0.001 ms  0/214696 (0%)
[  5]   2.00-3.00   sec  3.28 GBytes  28.2 Gbits/sec  0.001 ms  0/215589 (0%)
[  5]   3.00-4.00   sec  3.21 GBytes  27.6 Gbits/sec  0.001 ms  178/211124 (0.084%)
[  5]   4.00-5.00   sec  3.27 GBytes  28.1 Gbits/sec  0.001 ms  0/214772 (0%)
[  5]   5.00-6.00   sec  3.27 GBytes  28.1 Gbits/sec  0.001 ms  0/214951 (0%)
[  5]   6.00-7.00   sec  3.25 GBytes  27.9 Gbits/sec  0.001 ms  0/213482 (0%)
[  5]   7.00-8.00   sec  3.27 GBytes  28.1 Gbits/sec  0.001 ms  0/214980 (0%)
[  5]   8.00-9.00   sec  3.26 GBytes  28.0 Gbits/sec  0.001 ms  0/214555 (0%)
[  5]   9.00-10.00  sec  3.19 GBytes  27.4 Gbits/sec  0.001 ms  905/210659 (0.43%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  32.5 GBytes  27.9 Gbits/sec  0.000 ms  0/2136530 (0%)  sender
[  5]   0.00-10.00  sec  32.5 GBytes  27.9 Gbits/sec  0.001 ms  1083/2136517 (0.051%)  receiver

Conclusion: UDP only gets about 50% of the bandwidth of TCP. Hmm....

But @riobard, the lo device can actually handle 27.9 Gbps of UDP bandwidth, so the UDP tunnel (or association) implementation we are building here, which achieves only 3 Gbps (without loss) or 5 Gbps (with 5% packet loss), is far from reaching the limit of the network device.

If we can lower the cost of each UDP packet, the bandwidth should at least be level with the current TCP implementation (7 Gbps).

how to make UDP associations reach the same performance as TCP channels.

The UDP associations could have the same performance (7Gbps on my laptop) as the TCP channels.

riobard commented 2 years ago

@zonyitoo Don't test lo. Test real ethernet devices, preferably over WiFi, and see what's the real-world impact.

database64128 commented 2 years ago

I wouldn't suggest using xchacha20-ietf-poly1305 separately for the UDP protocol, because there is currently no fast implementation in Rust (or Go, or other languages except C).

XChaCha20-Poly1305 is not an IETF standard. And it should be very easy to wrap a fast ChaCha20-Poly1305 implementation into XChaCha20-Poly1305 by adding an HChaCha20 layer.

In terms of UDP session, how to generate a globally unique session ID?

The session ID is 64-bit long. I wouldn't worry about the probability of collision of two randomly generated 64-bit integers. If you don't want random numbers, it's still safe to use a counter as session ID, as long as you stick with 2022-blake3-xchacha20-poly1305. With 2022-blake3-aes-256-gcm, however, we don't ever want two pieces of identical plaintext encrypted by plain non-AEAD AES.

riobard commented 2 years ago

@database64128 Is IETF C20P1305 slow and/or broken? If not, please refrain from adding even more choices/confusion to the current mess. We're already having a hard time explaining which cipher is the best choice for average users. The mistakes of the horribly long list of stream ciphers must be avoided.

database64128 commented 2 years ago

@riobard I think in the spec we can advise implementations to gate optional ciphers behind optional features. Advanced users and developers can build their own binary by enabling the optional features.

The spec only requires implementing one method 2022-blake3-chacha20-poly1305, which uses IETF ChaCha20-Poly1305 for TCP, and XChaCha20-Poly1305 for UDP. We cannot use ChaCha20-Poly1305 for UDP, because each packet uses a randomly generated nonce.

zonyitoo commented 2 years ago

@zonyitoo Don't test lo. Test real ethernet devices, preferably over WiFi, and see what's the real-world impact.

BTW, we should focus more on the protocol itself and how to lower the CPU cost of the UDP protocol. Devices with low computational capability, like mobile phones or routers, will gain a lot of benefit if we can make the protocol faster.

riobard commented 2 years ago

@database64128 No, the intended goal is to reduce the number of optional ciphers, ideally leaving only one mandatory cipher (which is IETF C20P1305), so there's no need for average users to even think about the choice. If you read the current spec carefully, that's exactly what it says:

Compliant Shadowsocks implementations must support AEAD_CHACHA20_POLY1305. Implementations for devices with hardware AES acceleration should also implement AEAD_AES_128_GCM and AEAD_AES_256_GCM.

The fact that various implementations decided to add other AEAD ciphers is very unfortunate, as it creates more confusion for little benefit. You have failed to explain why it is necessary to create another cipher, and I don't see any benefit in changing the status quo.

riobard commented 2 years ago

@zonyitoo I'd like to but there's only so much you could do with UDP in userspace.

zonyitoo commented 2 years ago

@database64128 No, the intended goal is to reduce the number of optional ciphers, ideally leaving only one mandatory cipher (which is IETF C20P1305), so there's no need for average users to even think about the choice. If you read the current spec carefully, that's exactly what it says:

Compliant Shadowsocks implementations must support AEAD_CHACHA20_POLY1305. Implementations for devices with hardware AES acceleration should also implement AEAD_AES_128_GCM and AEAD_AES_256_GCM.

The fact that various implementations decided to add other AEAD ciphers is very unfortunate, as it creates more confusion for little benefit. You have failed to explain why it is necessary to create another cipher, and I don't see any benefit in changing the status quo.

Well, I think what @database64128's proposal says is that the TCP protocol uses only chacha20-ietf-poly1305 and the UDP protocol uses only xchacha20-ietf-poly1305. So users won't need to choose a method; they are mandatory and fixed in this proposal.

database64128 commented 2 years ago

We're already having a hard time explaining which cipher is the best choice for average users. The mistakes of the horribly long list of stream ciphers must be avoided.

No, the intended goal is to reduce the number of optional ciphers, ideally leaving only one mandatory cipher (which is IETF C20P1305), so there's no need for average users to even think about the choice.

@riobard I don't see any problem with letting the user select from existing Shadowsocks AEAD ciphers. The only practical difference between existing Shadowsocks AEAD ciphers is probably performance.

My spec only has ONE mandatory method: 2022-blake3-chacha20-poly1305. The use of ChaCha20-Poly1305 for TCP and XChaCha20-Poly1305 for UDP is an implementation detail, and is transparent to users.

To give users some flexibility, some optional methods are suggested, only because they are just as secure and can yield significant performance boosts. For example, switching from ChaCha20-Poly1305 to AES-256-GCM increases the maximum TCP throughput by 25%, and the separate header proposal for UDP is twice as fast as the mandatory XChaCha20-Poly1305 construction. Please keep in mind that performance boosts translate to less energy consumption on resource-constrained devices.

riobard commented 2 years ago

@zonyitoo No, he's also changing the key derivation procedure, which breaks backward compatibility for no benefit. To end users, it will just be one more entry in the (already long enough) list of available ciphers. The technical details are irrelevant to the discussion of reducing optional choices (and thus complexity).

@database64128 Two points:

I don't see any problem with letting the user select from existing Shadowsocks AEAD ciphers.

That's your opinion and I disagree. Complexity is the root of all evil in software. That's why TLS 1.3 is getting rid of most options and there's not even a choice of cipher in WireGuard.

switching from ChaCha20-Poly1305 to AES-256-GCM increases the maximum TCP throughput by 25%

This only happens on devices with hardware-accelerated AES instructions, and even on those devices the software needs to be carefully designed to properly use those instructions. On iOS (arguably one of the easier platforms to support), it was very late (IIRC 2021 at the earliest) before client apps (e.g. Surge) could figure out how to correctly make use of AES acceleration.

And the performance boost isn't really worth the additional complexity and potential implementation defects (AES/GCM is notoriously difficult to get right). Optimized implementation of C20P1305 is more than enough on the majority of modern devices.

So no, I just don't see the benefits of the proposal.

database64128 commented 2 years ago

No, he's also changing the key derivation procedure, which breaks backward compatibility for no benefit.

What backward compatibility are you talking about? The ability for a user to choose an arbitrary password? I don't see how maintaining such "compatibility" could provide any benefits.

That's your opinion and I disagree. Complexity is the root of all evil in software. That's why TLS 1.3 is getting rid of most options and there's not even a choice of cipher in WireGuard.

And here I am, getting rid of the uncertainty of user-provided passwords, so we don't have to worry about choosing a secure key derivation algorithm for passwords. HKDF_SHA1 is replaced by BLAKE3's key derivation mode because HKDF_SHA1 is slow, and because I don't want to use anything marked as "obsolete" in the year 2022.

This only happens on devices with hardware-accelerated AES instructions, and even on those devices the software needs to be carefully designed to properly use those instructions.

This is a very cynical take on the issue. And if you are not confident about using AES, just stick to the default ChaCha20-Poly1305.

zonyitoo commented 2 years ago

This only happens on devices with hardware-accelerated AES instructions, and even on those devices the software needs to be carefully designed to properly use those instructions. On iOS (arguably one of the easier platforms to support), it was very late (IIRC 2021 at the earliest) before client apps (e.g. Surge) could figure out how to correctly make use of AES acceleration.

Err.. Since we are talking about a new protocol for the future, I think we can assume that end users are using devices with hardware-accelerated AES instructions (mostly aarch64 and x86_64).

test bench_ring_aes_128_gcm_encrypt               ... bench:         346 ns/iter (+/- 42)
test bench_ring_aes_256_gcm_encrypt               ... bench:         455 ns/iter (+/- 49)
test bench_ring_chacha20_ietf_poly1305_encrypt    ... bench:         801 ns/iter (+/- 92)

These tests are done on my laptop (x86_64) and you can see the difference.

Optimized implementation of C20P1305 is more than enough on the majority of modern devices.

Well, from the tests shown above, even the optimized C20P1305 (801 ns) is still considerably slower than aes-256-gcm (455 ns).

No, he's also changing the key derivation procedure, which breaks backward compatibility for no benefit.

Hmm? I don't think so; the current AEAD protocol should remain unchanged. All of the current discussion applies only to the new protocol. Version 1 (stream protocol) and version 3 (AEAD protocol) will not be changed. @riobard

Since SHA1 is marked as cryptographically broken, this is a good chance to replace it with a modern hash function. I am OK with the proposal to choose BLAKE3, because it is actually faster than HKDF-SHA1 in tests.

test bench_blake3            ... bench:         148 ns/iter (+/- 14)
test bench_ring_hkdf_32b_key ... bench:         839 ns/iter (+/- 47)
test bench_hkdf_blake3       ... bench:         857 ns/iter (+/- 91)

The 1st test is blake3::derive_key with [32-byte key] + [32-byte randomly generated IV]. The 2nd test is the current key derivation method, HKDF-SHA1. The 3rd test is HKDF-BLAKE3.

As for EVP_BytesToKey, maybe we can keep it unchanged; there is no obvious benefit in forcing users to provide a key of exactly the length the cipher requires. If users generate their keys with secure tools like genpsk, we can provide a "key" field to read it directly without EVP_BytesToKey. @database64128

zonyitoo commented 2 years ago

Maybe we should set aside the topic about using exactly 1 chosen cipher or keep the 3 selected ones in version 3 (AEAD protocol). We should discuss more about the design of the protocol itself.

database64128 commented 2 years ago

As for EVP_BytesToKey, maybe we can keep it unchanged; there is no obvious benefit to forcing users to provide a key of exactly the length the cipher requires. If users generate their key with secure tools like genpsk, we can provide a "key" field that is read directly without EVP_BytesToKey.

I disagree.

zonyitoo commented 2 years ago

Ah, it uses MD5, I just remembered.

Alright, how about using another KDF to replace it? Generating a key by hand is not user-friendly in most cases.

database64128 commented 2 years ago

Generating one by hand is not user friendly in most cases.

It's actually much more user-friendly than the current best practice of generating a password in your password manager GUI and then copy-pasting it into your config files.

Now all you need is a cryptographically-secure 32-byte key. You can generate one with a one-liner in your favorite shell, which anyone running a Shadowsocks server should be familiar with. Or you can use existing userspace programs like wg, which does nothing more than a simple getrandom(2) system call. Or if you are feeling generous, add a quick command to sslocal and ssserver.

When we ask a user to input a password, they may not bother to actually generate a secure one. But when we ask that they must provide a base64-encoded 32-byte key, it's very unlikely that any weak key gets used, unless the user very much intends to do so.
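The generate-and-validate flow described above can be sketched in a few lines of Go stdlib code. The function names `newPSK`/`parsePSK` and the use of standard Base64 encoding are illustrative assumptions, not part of any spec:

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// newPSK generates a cryptographically secure 32-byte key via the OS
// CSPRNG and returns it Base64-encoded, the form a "key" field would take.
func newPSK() (string, error) {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(key), nil
}

// parsePSK rejects any value that does not decode to exactly 32 bytes,
// so weak or truncated keys fail loudly at startup instead of silently
// weakening the cipher.
func parsePSK(s string) ([]byte, error) {
	key, err := base64.StdEncoding.DecodeString(s)
	if err != nil {
		return nil, err
	}
	if len(key) != 32 {
		return nil, fmt.Errorf("key must be 32 bytes, got %d", len(key))
	}
	return key, nil
}

func main() {
	psk, _ := newPSK()
	fmt.Println(psk) // a 44-character Base64 string
}
```

This is essentially what `wg genkey` does for WireGuard: a single `getrandom(2)`-backed read plus an encoding step.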

zonyitoo commented 2 years ago

When we ask a user to input a password, they may not bother to actually generate a secure one. But when we ask that they must provide a base64-encoded 32-byte key, it's very unlikely that any weak key gets used, unless the user very much intends to do so.

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=, :P

Now all you need is a cryptographically-secure 32-byte key. You can generate one with a one-liner in your favorite shell, which anyone running a Shadowsocks server should be familiar with.

I do agree with you, but we just cannot dictate what users do. Users can generate a 32-byte key with any tool.

For users who know exactly what they are doing, they can generate a Base64-encoded key and pass it to the "key" field, while other, average users can choose to provide a "password" of printable characters.

Providing one takes no great effort, and the derivation only runs once when the process starts.

How about Argon2, which has been proven to be a cryptographically secure KDF (password hash)?

database64128 commented 2 years ago

Maybe you should ask Jason A. Donenfeld why WireGuard does not derive its keys from passwords. :P

I do agree with you but we just cannot ask users what to do.

Currently we require that users provide the correct text representation of the method name. Is that something difficult for an average user? Should we lower that requirement so any weird typo or misspelling can be auto-corrected? /s

Currently we require that users use the correct format of JSON for config files. Should we lower the requirement so any text file is accepted? /s

For users who know exactly what they are doing, they can generate a Base64-encoded key and pass it to the "key" field, while other, average users can choose to provide a "password" of printable characters.

Only server admins are supposed to generate keys. You probably should not run your own Shadowsocks server if you can't even get the keys right.

riobard commented 2 years ago

You're mixing several issues into this one thread and it's really difficult to respond individually.

You are getting trapped in the thinking that more performance is always good, regardless of complexity or availability. You like WireGuard so much; now answer this question: why doesn't WireGuard use hardware-accelerated AES-GCM?

The Shadowsocks project has never aimed for maximum performance, but acceptable performance that is easy to achieve across different hardware architectures, operating systems, and software libraries. If you really want maximum performance, you guys should stop working on the toy stuff on hand right now and dig into the OS kernel to figure out how to eliminate wasted cycles shuffling bytes from/to userspace. THAT would be a more interesting challenge.

Yes, MD5 is broken. Yes, even SHA1 is broken. Yes, Blake3 is fast! Yes, Argon2 is even better! But HMAC is proven to be sound, and HMAC-SHA1 as a KDF is perfectly fine. It might be slow, but for the primary use case of making Shadowsocks TCP connections, a KDF going from 1ms to 0.1ms won't make any difference.

Which leads to my final question: what's the point of making a completely new version of Shadowsocks at all? What benefits does it bring to the table? You would think client apps would all happily buy into this new scheme and drop the old v1 and v3 versions? No, it would just be a burden for client app authors: more dependencies, more time writing code and tests, and more confusion for end users selecting yet another option in the long list of "methods".

database64128 commented 2 years ago

V2Ray's VMess protocol also does not accept a password, and instead uses UUIDs. I have not seen any complaints from server admins or users about it. 😀

Arguably one of the biggest mistakes of Shadowsocks is that it's usually not secure by default. Therefore, it's one of the main goals of this proposal that the protocol MUST be secure by default. Replacing the user-defined password with a fixed-length key is a step in the right direction, in my opinion.

database64128 commented 2 years ago

You are getting trapped in the thinking that more performance is always good, regardless of complexity or availability.

What made you think so? The default and mandatory method of this proposal uses (X)ChaCha20-Poly1305, not AES-GCM.

I believe that extra performance without adding too much complexity or sacrificing security is something nice to have. That's why I included these OPTIONAL methods that use AES-GCM.

What benefits does it bring to the table?

It's all listed in the spec, but in case you didn't read carefully:

riobard commented 2 years ago

There are always nice things to have. There are also costs associated with them. If you don't see the costs, well, good luck having your ideas adopted by others. Most of the bullet points you mentioned have been extensively discussed before with various proposed solutions. Guess why none actually gets much traction?

zonyitoo commented 2 years ago

No, it would just be a burden for client app authors to include more dependencies, more time to write code and test, and more confusion to end users to select yet another option in the long list of “methods”.

Well, just like the WPA standards in Wi-Fi: there can be at least WPA, WPA/WPA2, WPA2, WPA2/WPA3, and WPA3 in the selection list when you are configuring a Wi-Fi hotspot. So, to lower the cost for end users of understanding exactly what the differences between these methods are, WPA3 shouldn't exist at all? No, of course not.

There is no technology that can claim it will always be the best.

The Shadowsocks project has never aimed for maximum performance, but acceptable performance that is easy to achieve on different hardware architectures/operating systems/software libraries.

Well, you are right, but you didn't get my point. If we can make it faster, why shouldn't we put it on the discussion list and see what is keeping us from the best performance? There will be compromises, of course, but first we should identify the obstacles.

zonyitoo commented 2 years ago

Again, I suggest we set these things aside and discuss the protocol itself.

Guess why none actually gets much traction?

Because there is no one acting as a manager to keep the discussion going.

database64128 commented 2 years ago

Guess why none actually gets much traction?

Because they try to achieve these goals without changing the protocol, which is impossible.

Because no one actually tried to implement these possible solutions. They remained as ideas and never got implemented.

Because no one has written a complete spec. It's just thoughts and ideas all over the place.

And no offense, but it's probably because members of the Shadowsocks organization actively discourage people from proposing any changes to the protocol or outright reject even the idea of changing the protocol.

riobard commented 2 years ago

Well, I'm not against improvement. I'm just pointing out that improvements for improvements' own sake won't get you much further down the road. We've already discussed too much, and it's obvious you don't get my points. If you truly believe what you're doing is necessary, just go for it, and let users/app developers decide whether they like your idea.

shanoaice commented 2 years ago

Well, let's set aside the compatibility and added-complexity arguments. Is this proposal (Shadowsocks 2022) cryptographically secure? I think that is the key question of this discussion.

database64128 commented 2 years ago

Shadowsocks-NET/outline-ss-server is now ready for use. To test it, build the client and the server:

```shell
go build ./cmd/outline-ss-client
go build ./cmd/outline-ss-server
```

Benchmark Results on i5-7400

| outline-ss-{server,client} | TCP | UDP (packet loss ≤0.5%) |
| --- | --- | --- |
| aes-256-gcm | 7.85 Gbps | 1.83 Gbps |
| chacha20-poly1305 | 6.20 Gbps | 1.83 Gbps |
| 2022-blake3-aes-256-gcm | 10.9 Gbps | 1.80 Gbps |
| 2022-blake3-chacha20-poly1305 | 7.99 Gbps | 1.80 Gbps |

| aes-256-gcm | TCP | UDP (packet loss ≤0.5%) |
| --- | --- | --- |
| ss-rust 20220114 with crypto2 | 7.28 Gbps | 1.98 Gbps |
| ss-rust 20220222 | 7.55 Gbps | 1.76 Gbps |
| ss-rust 20220307 | 7.55 Gbps | 1.77 Gbps |
| ss-rust 20220317 | 8.02 Gbps | 1.79 Gbps |

database64128 commented 2 years ago

Shadowsocks 2022's TCP is faster than Shadowsocks AEAD because the 0x3fff chunk payload size limit is dropped. (Shadowsocks-NET/outline-ss-server@f1db52706cd6a8949b90cac02fdab8e09d35ff8a)
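To see why dropping the cap helps, here is a rough Go sketch of the fixed framing cost the AEAD chunking pays. The `[2-byte length][tag][payload][tag]` shape follows the Shadowsocks AEAD TCP chunk format; `aeadOverhead` is a hypothetical helper for counting bytes, not real relay code:

```go
package main

import "fmt"

// maxChunk is the Shadowsocks AEAD per-chunk payload cap (0x3fff)
// that Shadowsocks 2022 drops for TCP.
const maxChunk = 0x3fff

// tagLen is the Poly1305/GCM authentication tag length.
const tagLen = 16

// aeadOverhead returns how many chunks and total wire bytes are needed
// to carry n payload bytes with AEAD chunk framing, where each chunk is
// [2-byte encrypted length][tag][encrypted payload][tag].
func aeadOverhead(n int) (chunks, wireBytes int) {
	for n > 0 {
		c := n
		if c > maxChunk {
			c = maxChunk
		}
		chunks++
		wireBytes += 2 + tagLen + c + tagLen
		n -= c
	}
	return
}

func main() {
	// A single 1 MiB write must be split into 65 chunks under the
	// 0x3fff cap, paying 34 bytes of framing (and two AEAD seals)
	// per chunk; without the cap it could be one chunk.
	chunks, wire := aeadOverhead(1 << 20)
	fmt.Println(chunks, wire-(1<<20)) // 65 2210
}
```

The byte overhead is small; the bigger win is fewer AEAD invocations and fewer small writes per unit of payload.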

Shadowsocks 2022's UDP turns out to be a little bit slower than Shadowsocks AEAD, probably because:

madeye commented 2 years ago

I'm okay with @database64128's proposal, at least it solves the replay issue forever.

Maybe @zonyitoo can first try to add 2022-blake3-chacha20-poly1305 in shadowsocks-rust and do some benchmarks.

I don't see the benefit of adding ciphers like 2022-blake3-aes-128-gcm and 2022-blake3-aes-256-gcm; encryption performance is not a problem anymore in 2022. Even my grandma got a new phone with an 8-core ARMv8 CPU.

zonyitoo commented 2 years ago

Haha. But if everyone uses devices with modern CPUs, shouldn't we use AES rather than ChaCha20?

zonyitoo commented 2 years ago

Well then, please @database64128 move the proposal to a separate issue and assign a SIP00X to it.

This issue could be closed.

database64128 commented 2 years ago

The previous UDP benchmark method is flawed. It appears iperf3 becomes its own bottleneck when a data rate is specified with -b. These new rounds of benchmarks were run with -b 0 to send UDP packets at full speed.

| outline-ss-{server,client} | UDP |
| --- | --- |
| aes-256-gcm | 8.37 Gbps |
| chacha20-poly1305 | 6.97 Gbps |
| 2022-blake3-aes-256-gcm | 11.5 Gbps |
| 2022-blake3-chacha20-poly1305 | 8.30 Gbps |

Mygod commented 2 years ago

Why can't you guys keep shadowsocks slow so that the rest of us could have some bandwidth during busy times?

sadge

zonyitoo commented 2 years ago

Why can't you guys keep shadowsocks slow so that the rest of us could have some bandwidth during busy times?

sadge

Hahaha, LOL

zonyitoo commented 2 years ago

Some thoughts about the chosen method chacha20-poly1305:

The ChaCha20 stream cipher was designed for fast software performance. But since we are talking about a new protocol, and we all agree that most shadowsocks users are on modern devices with advanced processors:

Modern devices should support SIMD and AES instructions, which means AES-GCM will be a lot faster than ChaCha20-Poly1305 (at least 50%). So why should we care about software performance?

Why should we keep using this chacha20-poly1305 as the default method in this new protocol?

I suggest the new protocol (2022) should use aes-256-gcm as the default (must support) method.

On the other hand, in the original proposal, the UDP protocol has to use xchacha20-poly1305, which is quite inconsistent for implementors! I have to use different ciphers for TCP and UDP! And there is no fast enough implementation of xchacha20-poly1305 in the Rust world.
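For reference, sealing a datagram with AES-256-GCM needs nothing beyond the Go stdlib (the same stack outline-ss-server builds on). This is a minimal sketch; `sealPacket`/`openPacket` and the `[nonce][ciphertext || tag]` layout are illustrative only, not the actual Shadowsocks 2022 wire format:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// sealPacket encrypts one datagram payload with AES-256-GCM under a
// fresh random nonce and returns [nonce][ciphertext || tag].
func sealPacket(key, payload []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // a 32-byte key selects AES-256
	if err != nil {
		return nil, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Passing nonce as dst prepends it to the sealed output.
	return aead.Seal(nonce, nonce, payload, nil), nil
}

// openPacket reverses sealPacket, authenticating the tag in the process.
func openPacket(key, pkt []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	n := aead.NonceSize()
	if len(pkt) < n {
		return nil, fmt.Errorf("packet too short")
	}
	return aead.Open(nil, pkt[:n], pkt[n:], nil)
}

func main() {
	key := make([]byte, 32) // all-zero key, for illustration only
	pkt, _ := sealPacket(key, []byte("hello"))
	plain, _ := openPacket(key, pkt)
	fmt.Printf("%s\n", plain) // prints "hello"
}
```

On CPUs with AES-NI/PMULL, `cipher.NewGCM` transparently uses the hardware path, which is the performance gap being discussed here.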

zonyitoo commented 2 years ago

A proposal of method expression:

[SPEC_VER]#[SPEC_METHOD]

The symbol # is chosen to distinguish it from -, which is commonly used in method names like aes-256-gcm. And # is commonly used to represent ordinal numbers.

With this expression standard, we can make sure implementations can recognize and use a specific protocol version in the future, and older implementations can properly reject unrecognized versions.

Backward compatibility:

1# and 3# can also be used for stream protocol (deprecated) and AEAD protocol, like

If # is missing from the method, it should be parsed as stream / AEAD, as current implementations do.
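The proposed parsing rule can be sketched in a few lines of Go. The function name `parseMethod` and the convention of returning version 0 for legacy methods are assumptions for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// parseMethod splits a "[SPEC_VER]#[SPEC_METHOD]" string. When '#' is
// absent, version 0 here stands for "legacy" and the whole string is
// resolved to a stream/AEAD method name, as current implementations do.
func parseMethod(s string) (version int, method string, err error) {
	ver, m, found := strings.Cut(s, "#")
	if !found {
		return 0, s, nil // legacy stream/AEAD method name
	}
	if _, err := fmt.Sscanf(ver, "%d", &version); err != nil {
		return 0, "", fmt.Errorf("unrecognized spec version %q", ver)
	}
	return version, m, nil
}

func main() {
	v, m, _ := parseMethod("3#aes-256-gcm")
	fmt.Println(v, m) // 3 aes-256-gcm
	v, m, _ = parseMethod("aes-256-gcm")
	fmt.Println(v, m) // 0 aes-256-gcm
}
```

Returning an error for an unknown version prefix gives older implementations the rejection behavior described above.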